Reading multiple CSV files using readr

R
Data manipulation
Author

Philippe Massicotte

Published

February 15, 2022

If you are beginning in R, chances are that you have used read.csv() to import CSV files into R. While this function works perfectly fine, it can only read one file at a time. Hence, new R programmers often read multiple files successively and combine the data afterward.


# Read all the data files
df1 <- read.csv("file1.csv")
df2 <- read.csv("file2.csv")
df3 <- read.csv("file3.csv")
df4 <- read.csv("file4.csv")
df5 <- read.csv("file5.csv")

# Combine all the data frames together
big_df <- rbind(df1, df2, df3, df4, df5)

Although this works fine when you have only a few files, it becomes tedious as the number of files grows. A better approach is to build a vector of file names and read them all at once. For quite a while, I have been using a combination of map_df() from the purrr package and read_csv() from the readr package.

library(purrr)
library(readr)

# Create a vector of file names
files <- c("file1.csv", "file2.csv", "file3.csv", "file4.csv", "file5.csv")

# Read and combine all data files into a single data frame
big_df <- map_df(files, read_csv)
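
The same read-then-stack idea can also be sketched in base R, without purrr, using lapply() and do.call(). This is only an illustration: the temporary files and the names tmp_dir, base_files and base_df are made up for the example.

```r
# Write three small throwaway files (hypothetical data, for illustration only)
tmp_dir <- tempfile("base_demo_")
dir.create(tmp_dir)
for (i in 1:3) {
  write.csv(
    data.frame(id = i, value = i * 10),
    file.path(tmp_dir, paste0("file", i, ".csv")),
    row.names = FALSE
  )
}

# List the files instead of typing each name by hand
base_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, then stack the list into one data frame
base_df <- do.call(rbind, lapply(base_files, read.csv))
base_df
```

Listing the files programmatically means the code does not need to change when a new file is added to the folder.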

Since the release of readr 2.0.0, the read_csv() function can directly take a vector of files as input, eliminating the need to use the map_df() function. Hence, we can now read multiple files as follows:

# Read and combine all data files into a single data frame without using the
# map_df function
big_df <- read_csv(files)
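
Once several files are combined, it is often useful to know which file each row came from. As a minimal sketch, assuming readr 2.0.0 or later, the id argument of read_csv() adds a column holding the source path. The temporary files and the column name source_file are made up for illustration.

```r
library(readr)

# Two small throwaway files (hypothetical data, for illustration only)
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write_csv(data.frame(x = 1:2, y = c("a", "b")), file.path(tmp_dir, "part1.csv"))
write_csv(data.frame(x = 3:4, y = c("c", "d")), file.path(tmp_dir, "part2.csv"))

demo_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# id = "source_file" adds a column with the path of the file each row came from
demo_df <- read_csv(demo_files, id = "source_file", show_col_types = FALSE)
demo_df
```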

In this short blog post, I wanted to benchmark the speed difference between map_df(files, read_csv) and read_csv(files). To do so, let’s first generate some data files.

library(nycflights13)

# Write one CSV file per carrier: .x is the data for one carrier, .y its code
purrr::iwalk(
  split(flights, flights$carrier),
  ~ data.table::fwrite(.x, glue::glue("/tmp/flights_{.y}.csv"))
)

files <- fs::dir_ls(path = "/tmp", glob = "*flights*csv")
files
#> /tmp/flights_9E.csv /tmp/flights_AA.csv /tmp/flights_AS.csv /tmp/flights_B6.csv 
#> /tmp/flights_DL.csv /tmp/flights_EV.csv /tmp/flights_F9.csv /tmp/flights_FL.csv 
#> /tmp/flights_HA.csv /tmp/flights_MQ.csv /tmp/flights_OO.csv /tmp/flights_UA.csv 
#> /tmp/flights_US.csv /tmp/flights_VX.csv /tmp/flights_WN.csv /tmp/flights_YV.csv

We can take a look at the first file to see what the data look like.

read_csv(files[[1]])
#> # A tibble: 18,460 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <dbl> <dbl> <dbl>    <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1  2013     1     1      810        810       0    1048    1037      11       9
#>  2  2013     1     1     1451       1500      -9    1634    1636      -2       9
#>  3  2013     1     1     1452       1455      -3    1637    1639      -2       9
#>  4  2013     1     1     1454       1500      -6    1635    1636      -1       9
#>  5  2013     1     1     1507       1515      -8    1651    1656      -5       9
#>  6  2013     1     1     1530       1530       0    1650    1655      -5       9
#>  7  2013     1     1     1546       1540       6    1753    1748       5       9
#>  8  2013     1     1     1550       1550       0    1844    1831      13       9
#>  9  2013     1     1     1552       1600      -8    1749    1757      -8       9
#> 10  2013     1     1     1554       1600      -6    1701    1734     -33       9
#> # … with 18,450 more rows, 9 more variables: flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
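
In the output above, carrier is reported as <dbl> with the value 9, although the carrier code in this file is 9E: the column type was guessed as numeric. Setting the column type explicitly with col_types avoids relying on that guess. Here is a minimal sketch on a throwaway file (the file, its two columns and the name one_carrier are made up for illustration):

```r
library(readr)

# A tiny file whose carrier code looks like the start of a number
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("carrier,flight", "9E,3538", "9E,4105"), tmp_file)

# Declare the type of carrier up front instead of relying on type guessing
one_carrier <- read_csv(tmp_file, col_types = cols(carrier = col_character()))
one_carrier$carrier
```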

Now that data files have been successfully created, we can compare the two reading options.

res <- microbenchmark::microbenchmark(
  map_df_read_csv = map_df(files, read_csv, col_types = cols(carrier = col_character())),
  read_csv = read_csv(files, col_types = cols(carrier = col_character())),
  times = 100
)

res
#> Unit: milliseconds
#>             expr      min       lq     mean   median       uq      max neval
#>  map_df_read_csv 392.6330 499.1544 488.1987 506.1408 510.7469 542.6262   100
#>         read_csv 164.3161 168.4232 178.9344 172.2544 177.2635 293.7781   100

autoplot(res)

Using read_csv() directly is roughly three times faster than the map_df(files, read_csv) combination on these files (a median of about 172 ms versus 506 ms).
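
The size of the speed-up can be checked by taking the ratio of the two median timings. The values below are copied by hand from the benchmark table above, and the names median_ms and speed_up are only illustrative.

```r
# Median timings in milliseconds, copied from the microbenchmark output
median_ms <- c(map_df_read_csv = 506.1408, read_csv = 172.2544)

# How many times faster read_csv(files) is than map_df(files, read_csv)
speed_up <- median_ms[["map_df_read_csv"]] / median_ms[["read_csv"]]
round(speed_up, 1)
```

On this run the ratio comes out a little under three. Timings will vary with the machine, the disk, and the size and number of files.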

Session info
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       Linux Mint 21
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_CA:en
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Montreal
#>  date     2022-08-10
#>  pandoc   2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  ! package        * version date (UTC) lib source
#>  P assertthat       0.2.1   2019-03-21 [?] RSPM
#>  P backports        1.4.1   2021-12-13 [?] RSPM
#>  P bit              4.0.4   2020-08-04 [?] RSPM
#>  P bit64            4.0.5   2020-08-30 [?] RSPM
#>  P broom            1.0.0   2022-07-01 [?] RSPM (R 4.2.1)
#>  P cachem           1.0.6   2021-08-19 [?] RSPM (R 4.2.0)
#>  P callr            3.7.1   2022-07-13 [?] RSPM (R 4.2.1)
#>  P cellranger       1.1.0   2016-07-27 [?] RSPM
#>  P cli              3.3.0   2022-04-25 [?] RSPM (R 4.2.0)
#>    codetools        0.2-18  2020-11-04 [2] CRAN (R 4.2.1)
#>  P colorspace       2.0-3   2022-02-21 [?] RSPM (R 4.2.0)
#>  P crayon           1.5.1   2022-03-26 [?] RSPM (R 4.2.0)
#>  P data.table       1.14.2  2021-09-27 [?] RSPM
#>  P DBI              1.1.3   2022-06-18 [?] RSPM (R 4.2.0)
#>  P dbplyr           2.2.1   2022-06-27 [?] RSPM (R 4.2.0)
#>  P devtools         2.4.4   2022-07-20 [?] RSPM (R 4.2.0)
#>  P digest           0.6.29  2021-12-01 [?] RSPM
#>  P dplyr          * 1.0.9   2022-04-28 [?] RSPM (R 4.2.0)
#>  P ellipsis         0.3.2   2021-04-29 [?] RSPM
#>  P evaluate         0.15    2022-02-18 [?] RSPM (R 4.2.0)
#>  P extrafont        0.18    2022-04-12 [?] RSPM (R 4.2.0)
#>  P extrafontdb      1.0     2012-06-11 [?] RSPM (R 4.2.0)
#>  P fansi            1.0.3   2022-03-24 [?] RSPM (R 4.2.0)
#>  P farver           2.1.1   2022-07-06 [?] RSPM (R 4.2.1)
#>  P fastmap          1.1.0   2021-01-25 [?] RSPM
#>  P forcats        * 0.5.1   2021-01-27 [?] RSPM
#>  P fs               1.5.2   2021-12-08 [?] RSPM
#>  P gargle           1.2.0   2021-07-02 [?] RSPM
#>  P generics         0.1.3   2022-07-05 [?] RSPM (R 4.2.1)
#>  P ggplot2        * 3.3.6   2022-05-03 [?] RSPM (R 4.2.0)
#>  P ggpmthemes     * 0.0.2   2022-08-08 [?] Github (pmassicotte/ggpmthemes@993d61e)
#>  P glue             1.6.2   2022-02-24 [?] RSPM (R 4.2.0)
#>  P googledrive      2.0.0   2021-07-08 [?] RSPM
#>  P googlesheets4    1.0.0   2021-07-21 [?] RSPM
#>  P gtable           0.3.0   2019-03-25 [?] RSPM
#>  P haven            2.5.0   2022-04-15 [?] RSPM (R 4.2.0)
#>  P hms              1.1.1   2021-09-26 [?] RSPM
#>  P htmltools        0.5.3   2022-07-18 [?] RSPM (R 4.2.1)
#>  P htmlwidgets      1.5.4   2021-09-08 [?] RSPM (R 4.2.0)
#>  P httpuv           1.6.5   2022-01-05 [?] RSPM (R 4.2.0)
#>  P httr             1.4.3   2022-05-04 [?] RSPM (R 4.2.0)
#>  P jsonlite         1.8.0   2022-02-22 [?] RSPM (R 4.2.0)
#>  P knitr            1.39    2022-04-26 [?] RSPM (R 4.2.0)
#>  P later            1.3.0   2021-08-18 [?] RSPM (R 4.2.0)
#>  P lifecycle        1.0.1   2021-09-24 [?] RSPM
#>  P lubridate        1.8.0   2021-10-07 [?] RSPM
#>  P magrittr         2.0.3   2022-03-30 [?] RSPM (R 4.2.0)
#>  P memoise          2.0.1   2021-11-26 [?] CRAN (R 4.2.0)
#>  P microbenchmark   1.4.9   2021-11-09 [?] RSPM (R 4.2.0)
#>  P mime             0.12    2021-09-28 [?] RSPM
#>  P miniUI           0.1.1.1 2018-05-18 [?] RSPM (R 4.2.0)
#>  P modelr           0.1.8   2020-05-19 [?] RSPM
#>  P munsell          0.5.0   2018-06-12 [?] RSPM
#>  P nycflights13   * 1.0.2   2021-04-12 [?] RSPM (R 4.2.0)
#>  P pillar           1.8.0   2022-07-18 [?] RSPM (R 4.2.1)
#>  P pkgbuild         1.3.1   2021-12-20 [?] CRAN (R 4.2.0)
#>  P pkgconfig        2.0.3   2019-09-22 [?] RSPM
#>  P pkgload          1.3.0   2022-06-27 [?] RSPM (R 4.2.0)
#>  P prettyunits      1.1.1   2020-01-24 [?] RSPM
#>  P processx         3.7.0   2022-07-07 [?] RSPM (R 4.2.1)
#>  P profvis          0.3.7   2020-11-02 [?] RSPM (R 4.2.0)
#>  P promises         1.2.0.1 2021-02-11 [?] RSPM (R 4.2.0)
#>  P ps               1.7.1   2022-06-18 [?] RSPM (R 4.2.0)
#>  P purrr          * 0.3.4   2020-04-17 [?] RSPM
#>  P R6               2.5.1   2021-08-19 [?] RSPM
#>  P Rcpp             1.0.9   2022-07-08 [?] RSPM (R 4.2.1)
#>  P readr          * 2.1.2   2022-01-30 [?] RSPM (R 4.2.0)
#>  P readxl           1.4.0   2022-03-28 [?] RSPM (R 4.2.0)
#>  P remotes          2.4.2   2021-11-30 [?] CRAN (R 4.2.0)
#>    renv             0.15.5  2022-05-26 [1] RSPM (R 4.2.0)
#>  P reprex           2.0.1   2021-08-05 [?] RSPM
#>  P rlang            1.0.4   2022-07-12 [?] RSPM (R 4.2.1)
#>  P rmarkdown        2.14    2022-04-25 [?] RSPM (R 4.2.0)
#>  P rstudioapi       0.13    2020-11-12 [?] RSPM
#>  P Rttf2pt1         1.3.10  2022-02-07 [?] RSPM (R 4.2.0)
#>  P rvest            1.0.2   2021-10-16 [?] RSPM
#>  P scales           1.2.0   2022-04-13 [?] RSPM (R 4.2.0)
#>  P sessioninfo      1.2.2   2021-12-06 [?] CRAN (R 4.2.0)
#>  P shiny            1.7.2   2022-07-19 [?] RSPM (R 4.2.1)
#>  P stringi          1.7.8   2022-07-11 [?] RSPM (R 4.2.1)
#>  P stringr        * 1.4.0   2019-02-10 [?] RSPM
#>  P tibble         * 3.1.8   2022-07-22 [?] RSPM (R 4.2.1)
#>  P tidyr          * 1.2.0   2022-02-01 [?] RSPM (R 4.2.0)
#>  P tidyselect       1.1.2   2022-02-21 [?] RSPM (R 4.2.0)
#>  P tidyverse      * 1.3.2   2022-07-18 [?] RSPM (R 4.2.0)
#>  P tzdb             0.3.0   2022-03-28 [?] RSPM (R 4.2.0)
#>  P urlchecker       1.0.1   2021-11-30 [?] RSPM (R 4.2.0)
#>  P usethis          2.1.6   2022-05-25 [?] RSPM (R 4.2.0)
#>  P utf8             1.2.2   2021-07-24 [?] RSPM
#>  P vctrs            0.4.1   2022-04-13 [?] RSPM (R 4.2.0)
#>  P vroom            1.5.7   2021-11-30 [?] RSPM
#>  P withr            2.5.0   2022-03-03 [?] RSPM (R 4.2.0)
#>  P xfun             0.31    2022-05-10 [?] RSPM (R 4.2.0)
#>  P xml2             1.3.3   2021-11-30 [?] RSPM
#>  P xtable           1.8-4   2019-04-21 [?] RSPM (R 4.2.0)
#>  P yaml             2.3.5   2022-02-21 [?] RSPM (R 4.2.0)
#> 
#>  [1] /media/LaCie16TB/work/r-blog/renv/library/R-4.2/x86_64-pc-linux-gnu
#>  [2] /usr/local/lib/R/library
#> 
#>  P ── Loaded and on-disk path mismatch.
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────