Dr McDermott will demonstrate the large-data capabilities of polars / r-polars and duckdb using the well-known NYC Taxi data set. We recommend downloading the data in advance, either with shell commands (using the AWS command-line tools)
mkdir -p nyc-taxi/year=2012
aws s3 cp s3://voltrondata-labs-datasets/nyc-taxi/year=2012 nyc-taxi/year=2012 \
--recursive --no-sign-request
or from R (using an appropriate Arrow build)
suppressMessages({
library(arrow) # install.packages('arrow', repos = c('https://apache.r-universe.dev'))
library(dplyr)
})
data_path <- "nyc-taxi/year=2012" # Or set your own preferred path
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi/year=2012") |>
write_dataset(data_path, partitioning = "month")
The files are available on the shared morrow server under /opt/data/nyctaxi/year=2012
and take up 8.4 GB compressed.
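
Whichever download route you use, a quick sanity check can confirm that the files arrived. The following is a minimal sketch, not part of the original instructions. It assumes the default local path nyc-taxi/year=2012 from the commands above and that the data are stored as Parquet files. `DATA_DIR` and `count_parquet` are hypothetical names introduced here for illustration.

```shell
# Assumed location: the default local path used by the download commands above.
# Override by setting DATA_DIR before running.
DATA_DIR="${DATA_DIR:-nyc-taxi/year=2012}"

# Hypothetical helper: count the Parquet files anywhere under a directory.
count_parquet() {
  find "$1" -type f -name '*.parquet' | wc -l | tr -d ' '
}

if [ -d "$DATA_DIR" ]; then
  echo "Parquet files: $(count_parquet "$DATA_DIR")"
  du -sh "$DATA_DIR"   # total on-disk size of the download
else
  echo "No data found at $DATA_DIR"
fi
```

If the reported size is far below the figure quoted above, the download was probably interrupted and is worth re-running.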