If you are beginning in R, chances are that you have used read.csv() to import CSV files into R. While this function works perfectly fine, it can only read one file at a time. Hence, new R programmers often read multiple files successively and combine the data afterward.
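As a minimal sketch of that one-file-at-a-time pattern, the example below uses base R only. The file names, columns, and temporary directory are made up for illustration.

```r
# Hypothetical example files written to a temporary directory
tmp <- tempfile("csv_demo_")
dir.create(tmp)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
  file.path(tmp, "file1.csv"),
  row.names = FALSE
)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
  file.path(tmp, "file2.csv"),
  row.names = FALSE
)

# Read each file on its own...
df1 <- read.csv(file.path(tmp, "file1.csv"))
df2 <- read.csv(file.path(tmp, "file2.csv"))

# ...then stack the rows into a single data frame
combined <- rbind(df1, df2)
nrow(combined)
#> [1] 4
```

Every additional file needs another read.csv() call and another argument to rbind(), which is where the approach starts to get tedious.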
While this works fine with only a few files, it becomes tedious as the number of files to read increases. A better approach is to build a vector of file names and read them all at once. For quite a while, I have been using map_df() from the purrr package in combination with read_csv() from readr.
# Create a vector of file names
files <- c("file1.csv", "file2.csv", "file3.csv", "file4.csv", "file5.csv")
# Read and combine all data files into a single data frame
big_df <- map_df(files, read_csv)
Since the release of readr 2.0.0, the read_csv() function can directly take a vector of file paths as input, eliminating the need for the map_df() function. Hence, we can now read multiple files as follows:
# Read and combine all data files into a single data frame without using the
# map_df function
big_df <- read_csv(files)
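When several files are stacked in one call, it can help to keep track of which file each row came from. The sketch below assumes readr 2.0.0 or later and uses the id argument of read_csv() for that. The demo files and the column name source_file are hypothetical.

```r
library(readr)

# Hypothetical demo files in a temporary directory
tmp <- tempfile("csv_id_")
dir.create(tmp)
write_csv(data.frame(x = 1:2), file.path(tmp, "file1.csv"))
write_csv(data.frame(x = 3:4), file.path(tmp, "file2.csv"))
demo_files <- file.path(tmp, c("file1.csv", "file2.csv"))

# `id` names a column that stores the path each row was read from
big_df <- read_csv(demo_files, id = "source_file", show_col_types = FALSE)

# Count the rows contributed by each file
table(basename(big_df$source_file))
```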
In this short blog post, I wanted to benchmark the speed difference between map_df(files, read_csv) and read_csv(files). To do so, let’s first generate some data files.
Photo by Marc Sendra Martorell on Unsplash
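library(tidyverse) # attaches readr (read_csv) and purrr (map_df), among others
library(nycflights13)
# Write one CSV file per carrier: .x is that carrier's rows, .y is the carrier code
purrr::iwalk(
  split(flights, flights$carrier),
  ~ data.table::fwrite(.x, glue::glue("/tmp/flights_{.y}.csv"))
)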
files <- fs::dir_ls(path = "/tmp", glob = "*flights*csv")
files
#> /tmp/flights_9E.csv /tmp/flights_AA.csv /tmp/flights_AS.csv /tmp/flights_B6.csv
#> /tmp/flights_DL.csv /tmp/flights_EV.csv /tmp/flights_F9.csv /tmp/flights_FL.csv
#> /tmp/flights_HA.csv /tmp/flights_MQ.csv /tmp/flights_OO.csv /tmp/flights_UA.csv
#> /tmp/flights_US.csv /tmp/flights_VX.csv /tmp/flights_WN.csv /tmp/flights_YV.csv
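We can look at what the data look like. Note that, for this file, the carrier column is guessed as numeric (the code 9E is read as the number 9), which is why the column type is set explicitly in the benchmark further down.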
read_csv(files[[1]])
#> # A tibble: 18,460 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <dbl> <dbl> <dbl>    <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1  2013     1     1      810        810       0    1048    1037      11       9
#>  2  2013     1     1     1451       1500      -9    1634    1636      -2       9
#>  3  2013     1     1     1452       1455      -3    1637    1639      -2       9
#>  4  2013     1     1     1454       1500      -6    1635    1636      -1       9
#>  5  2013     1     1     1507       1515      -8    1651    1656      -5       9
#>  6  2013     1     1     1530       1530       0    1650    1655      -5       9
#>  7  2013     1     1     1546       1540       6    1753    1748       5       9
#>  8  2013     1     1     1550       1550       0    1844    1831      13       9
#>  9  2013     1     1     1552       1600      -8    1749    1757      -8       9
#> 10  2013     1     1     1554       1600      -6    1701    1734     -33       9
#> # … with 18,450 more rows, 9 more variables: flight <dbl>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
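Before timing anything, it can be worth confirming that both approaches return the same data. The sketch below does this on two small, made-up files rather than the flights data. The file contents and object names are assumptions for illustration, and it relies on readr 2.0.0 or later plus purrr (with dplyr installed, which map_df() needs).

```r
library(readr)
library(purrr)

# Hypothetical demo files in a temporary directory
tmp <- tempfile("csv_check_")
dir.create(tmp)
write_csv(
  data.frame(carrier = c("AA", "AA"), dep_delay = c(5, -3)),
  file.path(tmp, "flights_AA.csv")
)
write_csv(
  data.frame(carrier = c("UA", "UA"), dep_delay = c(0, 12)),
  file.path(tmp, "flights_UA.csv")
)
demo_files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Use the same explicit column types for both approaches
types <- cols(carrier = col_character(), dep_delay = col_double())
via_map <- map_df(demo_files, read_csv, col_types = types)
via_readr <- read_csv(demo_files, col_types = types)

# Compare the dimensions, then each column in turn
same_shape <- identical(dim(via_map), dim(via_readr))
same_values <- all(unlist(Map(identical, as.list(via_map), as.list(via_readr))))
same_shape && same_values
```

Comparing column by column sidesteps any differences in the attributes readr attaches to the returned tibble, such as the column specification.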
Now that data files have been successfully created, we can compare the two reading options.
res <- microbenchmark::microbenchmark(
map_df_read_csv = map_df(files, read_csv, col_types = cols(carrier = col_character())),
read_csv = read_csv(files, col_types = cols(carrier = col_character())),
times = 100
)
res
#> Unit: milliseconds
#>             expr      min       lq     mean   median       uq      max neval
#>  map_df_read_csv 392.6330 499.1544 488.1987 506.1408 510.7469 542.6262   100
#>         read_csv 164.3161 168.4232 178.9344 172.2544 177.2635 293.7781   100
autoplot(res)
On this data set, using read_csv() directly is close to three times faster than the map_df(files, read_csv) combination (a median of about 172 ms versus about 506 ms).
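To put a number on the difference, the ratio of the two median timings can be computed directly. This short sketch hard-codes the medians reported in the benchmark output above. The object names are only illustrative.

```r
# Median timings in milliseconds, copied from the benchmark output above
median_ms <- c(map_df_read_csv = 506.1408, read_csv = 172.2544)

# How many times longer the map_df() approach takes
speedup <- median_ms[["map_df_read_csv"]] / median_ms[["read_csv"]]
round(speedup, 2)
#> [1] 2.94
```

With the res object still in memory, the same medians should be available from the median column of summary(res).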
Session info
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os Linux Mint 21
#> system x86_64, linux-gnu
#> ui X11
#> language en_CA:en
#> collate en_CA.UTF-8
#> ctype en_CA.UTF-8
#> tz America/Montreal
#> date 2022-08-10
#> pandoc 2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> ! package * version date (UTC) lib source
#> P assertthat 0.2.1 2019-03-21 [?] RSPM
#> P backports 1.4.1 2021-12-13 [?] RSPM
#> P bit 4.0.4 2020-08-04 [?] RSPM
#> P bit64 4.0.5 2020-08-30 [?] RSPM
#> P broom 1.0.0 2022-07-01 [?] RSPM (R 4.2.1)
#> P cachem 1.0.6 2021-08-19 [?] RSPM (R 4.2.0)
#> P callr 3.7.1 2022-07-13 [?] RSPM (R 4.2.1)
#> P cellranger 1.1.0 2016-07-27 [?] RSPM
#> P cli 3.3.0 2022-04-25 [?] RSPM (R 4.2.0)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
#> P colorspace 2.0-3 2022-02-21 [?] RSPM (R 4.2.0)
#> P crayon 1.5.1 2022-03-26 [?] RSPM (R 4.2.0)
#> P data.table 1.14.2 2021-09-27 [?] RSPM
#> P DBI 1.1.3 2022-06-18 [?] RSPM (R 4.2.0)
#> P dbplyr 2.2.1 2022-06-27 [?] RSPM (R 4.2.0)
#> P devtools 2.4.4 2022-07-20 [?] RSPM (R 4.2.0)
#> P digest 0.6.29 2021-12-01 [?] RSPM
#> P dplyr * 1.0.9 2022-04-28 [?] RSPM (R 4.2.0)
#> P ellipsis 0.3.2 2021-04-29 [?] RSPM
#> P evaluate 0.15 2022-02-18 [?] RSPM (R 4.2.0)
#> P extrafont 0.18 2022-04-12 [?] RSPM (R 4.2.0)
#> P extrafontdb 1.0 2012-06-11 [?] RSPM (R 4.2.0)
#> P fansi 1.0.3 2022-03-24 [?] RSPM (R 4.2.0)
#> P farver 2.1.1 2022-07-06 [?] RSPM (R 4.2.1)
#> P fastmap 1.1.0 2021-01-25 [?] RSPM
#> P forcats * 0.5.1 2021-01-27 [?] RSPM
#> P fs 1.5.2 2021-12-08 [?] RSPM
#> P gargle 1.2.0 2021-07-02 [?] RSPM
#> P generics 0.1.3 2022-07-05 [?] RSPM (R 4.2.1)
#> P ggplot2 * 3.3.6 2022-05-03 [?] RSPM (R 4.2.0)
#> P ggpmthemes * 0.0.2 2022-08-08 [?] Github (pmassicotte/ggpmthemes@993d61e)
#> P glue 1.6.2 2022-02-24 [?] RSPM (R 4.2.0)
#> P googledrive 2.0.0 2021-07-08 [?] RSPM
#> P googlesheets4 1.0.0 2021-07-21 [?] RSPM
#> P gtable 0.3.0 2019-03-25 [?] RSPM
#> P haven 2.5.0 2022-04-15 [?] RSPM (R 4.2.0)
#> P hms 1.1.1 2021-09-26 [?] RSPM
#> P htmltools 0.5.3 2022-07-18 [?] RSPM (R 4.2.1)
#> P htmlwidgets 1.5.4 2021-09-08 [?] RSPM (R 4.2.0)
#> P httpuv 1.6.5 2022-01-05 [?] RSPM (R 4.2.0)
#> P httr 1.4.3 2022-05-04 [?] RSPM (R 4.2.0)
#> P jsonlite 1.8.0 2022-02-22 [?] RSPM (R 4.2.0)
#> P knitr 1.39 2022-04-26 [?] RSPM (R 4.2.0)
#> P later 1.3.0 2021-08-18 [?] RSPM (R 4.2.0)
#> P lifecycle 1.0.1 2021-09-24 [?] RSPM
#> P lubridate 1.8.0 2021-10-07 [?] RSPM
#> P magrittr 2.0.3 2022-03-30 [?] RSPM (R 4.2.0)
#> P memoise 2.0.1 2021-11-26 [?] CRAN (R 4.2.0)
#> P microbenchmark 1.4.9 2021-11-09 [?] RSPM (R 4.2.0)
#> P mime 0.12 2021-09-28 [?] RSPM
#> P miniUI 0.1.1.1 2018-05-18 [?] RSPM (R 4.2.0)
#> P modelr 0.1.8 2020-05-19 [?] RSPM
#> P munsell 0.5.0 2018-06-12 [?] RSPM
#> P nycflights13 * 1.0.2 2021-04-12 [?] RSPM (R 4.2.0)
#> P pillar 1.8.0 2022-07-18 [?] RSPM (R 4.2.1)
#> P pkgbuild 1.3.1 2021-12-20 [?] CRAN (R 4.2.0)
#> P pkgconfig 2.0.3 2019-09-22 [?] RSPM
#> P pkgload 1.3.0 2022-06-27 [?] RSPM (R 4.2.0)
#> P prettyunits 1.1.1 2020-01-24 [?] RSPM
#> P processx 3.7.0 2022-07-07 [?] RSPM (R 4.2.1)
#> P profvis 0.3.7 2020-11-02 [?] RSPM (R 4.2.0)
#> P promises 1.2.0.1 2021-02-11 [?] RSPM (R 4.2.0)
#> P ps 1.7.1 2022-06-18 [?] RSPM (R 4.2.0)
#> P purrr * 0.3.4 2020-04-17 [?] RSPM
#> P R6 2.5.1 2021-08-19 [?] RSPM
#> P Rcpp 1.0.9 2022-07-08 [?] RSPM (R 4.2.1)
#> P readr * 2.1.2 2022-01-30 [?] RSPM (R 4.2.0)
#> P readxl 1.4.0 2022-03-28 [?] RSPM (R 4.2.0)
#> P remotes 2.4.2 2021-11-30 [?] CRAN (R 4.2.0)
#> renv 0.15.5 2022-05-26 [1] RSPM (R 4.2.0)
#> P reprex 2.0.1 2021-08-05 [?] RSPM
#> P rlang 1.0.4 2022-07-12 [?] RSPM (R 4.2.1)
#> P rmarkdown 2.14 2022-04-25 [?] RSPM (R 4.2.0)
#> P rstudioapi 0.13 2020-11-12 [?] RSPM
#> P Rttf2pt1 1.3.10 2022-02-07 [?] RSPM (R 4.2.0)
#> P rvest 1.0.2 2021-10-16 [?] RSPM
#> P scales 1.2.0 2022-04-13 [?] RSPM (R 4.2.0)
#> P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.2.0)
#> P shiny 1.7.2 2022-07-19 [?] RSPM (R 4.2.1)
#> P stringi 1.7.8 2022-07-11 [?] RSPM (R 4.2.1)
#> P stringr * 1.4.0 2019-02-10 [?] RSPM
#> P tibble * 3.1.8 2022-07-22 [?] RSPM (R 4.2.1)
#> P tidyr * 1.2.0 2022-02-01 [?] RSPM (R 4.2.0)
#> P tidyselect 1.1.2 2022-02-21 [?] RSPM (R 4.2.0)
#> P tidyverse * 1.3.2 2022-07-18 [?] RSPM (R 4.2.0)
#> P tzdb 0.3.0 2022-03-28 [?] RSPM (R 4.2.0)
#> P urlchecker 1.0.1 2021-11-30 [?] RSPM (R 4.2.0)
#> P usethis 2.1.6 2022-05-25 [?] RSPM (R 4.2.0)
#> P utf8 1.2.2 2021-07-24 [?] RSPM
#> P vctrs 0.4.1 2022-04-13 [?] RSPM (R 4.2.0)
#> P vroom 1.5.7 2021-11-30 [?] RSPM
#> P withr 2.5.0 2022-03-03 [?] RSPM (R 4.2.0)
#> P xfun 0.31 2022-05-10 [?] RSPM (R 4.2.0)
#> P xml2 1.3.3 2021-11-30 [?] RSPM
#> P xtable 1.8-4 2019-04-21 [?] RSPM (R 4.2.0)
#> P yaml 2.3.5 2022-02-21 [?] RSPM (R 4.2.0)
#>
#> [1] /media/LaCie16TB/work/r-blog/renv/library/R-4.2/x86_64-pc-linux-gnu
#> [2] /usr/local/lib/R/library
#>
#> P ── Loaded and on-disk path mismatch.
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────