Analyzing the programming languages used in R packages

It is easy to integrate other programming languages within R. For instance, Rcpp and reticulate can be used to interface R with C++ and Python, respectively. In this post, I analyze the programming languages used in the R packages published on CRAN. I downloaded all published packages and used cloc (v1.82) to count the number of lines of code in each package. Disclaimer: cloc does not only count lines of code for programming languages. It also counts lines written in markup languages such as Markdown. In this post, I make no distinction between language types. At the time of writing, approximately 14 700 packages were analyzed.

Setup

Let’s load all the packages needed for this analysis.

library(tidyverse)
library(rvest)
library(glue)
library(curl)
library(fs)
library(furrr)
library(tidytext)
library(ggpmthemes) # devtools::install_github("pmassicotte/ggpmthemes")

theme_set(theme_exo())

# Set up the number of cores to use with furrr
# (multisession replaces the now-deprecated multiprocess strategy)
plan(multisession, workers = availableCores() - 1)

Download the CRAN packages

The first step consists of downloading all the packages to my hard drive. I use rvest and curl for this operation.

links <- read_html("https://cran.r-project.org/src/contrib/") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  enframe(name = NULL, value = "link") %>%
  filter(str_ends(link, fixed(".tar.gz"))) %>%
  mutate(destfile = glue("g:/r-packages/{link}")) %>%
  mutate(link = glue("https://cran.r-project.org/src/contrib/{link}"))

# Download packages
links %>%
  future_pmap(~ curl_download(url = ..1, destfile = ..2), .progress = TRUE)
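
Downloading roughly 14 700 archives takes a while, so it can be handy to make this step restartable. The following is only a minimal sketch and was not part of the original analysis: pending_downloads() is a hypothetical helper that keeps the rows whose destination file is not on disk yet. It uses base R and a temporary directory with made-up URLs so that it can be run on its own.

```r
# Hypothetical helper (not in the original post): keep only the rows of a
# links table whose destination file does not exist yet.
pending_downloads <- function(links) {
  links[!file.exists(as.character(links$destfile)), , drop = FALSE]
}

# Toy example using a temporary directory instead of g:/r-packages/
tmp <- tempfile("r-packages-")
dir.create(tmp)

toy_links <- data.frame(
  link = c(
    "https://example.org/a_1.0.tar.gz",
    "https://example.org/b_2.1.tar.gz"
  ),
  destfile = file.path(tmp, c("a_1.0.tar.gz", "b_2.1.tar.gz")),
  stringsAsFactors = FALSE
)

# Pretend that the first archive has already been downloaded
file.create(toy_links$destfile[1])

todo <- pending_downloads(toy_links)
print(todo$link)
```

With the real table, the filtered result could be piped into the same future_pmap() call as above, so that an interrupted run only fetches what is still missing.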

Count lines of code

To count lines of code, cloc (v1.82) needs to be available on your PATH. Once installed, you can use cloc to analyze a directory containing source code files. Furthermore, results generated by cloc can be exported to a CSV file. Here, one CSV file is generated for each package.

links <- links %>%
  mutate(loc_csv = glue("{tools::file_path_sans_ext(destfile)}.csv"))

extract_loc <- function(destfile, loc_csv) {
  untar(destfile, exdir = "g:/r-packages/")
  pkg_dir <- untar(destfile, exdir = "g:/r-packages/", list = TRUE)
  pkg_dir <- glue("g:/r-packages/{pkg_dir}", pkg_dir = pkg_dir[[1]])

  system(
    glue(
      "cloc-1.82.exe {pkg_dir} --out={loc_csv} --csv"
    )
  )
}

# Count the number of code lines and write a CSV file.
links %>%
  future_pmap(~ extract_loc(..2, ..3), .progress = TRUE)
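
The call above pastes the paths directly into the command string, which works as long as they contain no spaces. As a hedged alternative, the sketch below builds the same two options used above (--out and --csv) as an argument vector with quoted paths. cloc_args() is a hypothetical helper, the paths are made up, and the executable name "cloc" is an assumption (my own call uses cloc-1.82.exe).

```r
# Hypothetical helper (not in the original post): build the argument vector
# for cloc, quoting the paths so that spaces do not split them.
cloc_args <- function(pkg_dir, loc_csv) {
  c(shQuote(pkg_dir), paste0("--out=", shQuote(loc_csv)), "--csv")
}

# Made-up paths containing a space
cmd_args <- cloc_args("r-packages/my pkg", "r-packages/my pkg_1.0.tar.csv")
print(cmd_args)

# Assumption: the executable is reachable as "cloc" on the PATH.
# Uncomment to actually run it:
# system2("cloc", cmd_args)
```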

Then, we can read all the generated CSV files.

csv_files <- dir_ls("g:/r-packages/", regexp = "\\.csv$")

pkg <-
  future_map_dfr(csv_files,
    function(file) {
      data.table::fread(
        file,
        header = FALSE,
        skip = 1,
        col.names = c("file", "language", "blank", "comment", "code")
      ) %>%
        mutate(pkg = !!file)
    }
  ) %>%
  as_tibble() %>%
  filter(language != "SUM") %>%
  extract(
    col = pkg,
    into = c("pkg_name", "version"),
    regex = "G:/r-packages/(.*)_(.*)\\.tar\\.csv"
  )

pkg
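
To make the file name parsing more concrete, here is a small base R sketch of the same idea: everything before the underscore is taken as the package name and everything between the underscore and .tar.csv as the version. The two paths are made-up examples, and regexec() is used here instead of tidyr::extract() only to keep the snippet free of dependencies.

```r
# Made-up example paths following the <name>_<version>.tar.csv convention
paths <- c(
  "G:/r-packages/dplyr_0.8.3.tar.csv",
  "G:/r-packages/data.table_1.12.2.tar.csv"
)

# Group 1: package name, group 2: version (literal dots are escaped)
pattern <- "^.*/(.*)_(.*)\\.tar\\.csv$"
parts <- regmatches(paths, regexec(pattern, paths))

# Each element of parts holds the full match followed by the two groups
parsed <- data.frame(
  pkg_name = vapply(parts, `[`, character(1), 2),
  version = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)

print(parsed)
```

Since R package names cannot contain an underscore, splitting on it is unambiguous even for names that contain dots, such as data.table.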

Languages used in R packages

A total of 108 languages are used in CRAN R packages. The following graph shows the 16 most used, ranked by the number of packages in which they appear.

languages <- pkg %>% 
  count(language, sort = TRUE, name = "n_package")

languages
#> # A tibble: 108 x 2
#>    language     n_package
#>    <chr>            <int>
#>  1 R                14689
#>  2 Markdown          5710
#>  3 HTML              3680
#>  4 C                 2162
#>  5 C++               2041
#>  6 C/C++ Header      1867
#>  7 Bourne Shell       500
#>  8 CSS                459
#>  9 TeX                401
#> 10 JavaScript         370
#> # ... with 98 more rows

languages %>% 
  top_n(16, n_package) %>% 
  mutate(language = fct_reorder(language, n_package)) %>% 
  ggplot(aes(x = language, y = n_package)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  xlab(NULL) +
  ylab("Number of packages") +
  labs(title = str_wrap("Number of R packages using the top 16 most used programming languages", 40)) +
  labs(subtitle = glue("Based on {n_distinct(pkg$pkg_name)} packages")) +
  labs(caption = "Data: https://cran.r-project.org/src/contrib/")

Count the number of LOC per language

Which languages account for the most code in R packages? Unsurprisingly, R is #1, with a total of 22,822,548 lines of R code in published R packages. That is pretty impressive! The next graph shows the total number of lines of code per language across all CRAN packages.

most_popular <- pkg %>% 
  group_by(language) %>% 
  summarise(total_loc = sum(code)) %>% 
  filter(dense_rank(desc(total_loc)) <= 16) %>% 
  mutate(language = fct_reorder(language, total_loc, .fun = sum)) 

most_popular %>%
  ggplot(aes(x = language, y = total_loc)) +
  geom_col() +
  coord_flip() +
  xlab(NULL) +
  ylab("Number of lines of code") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)), labels = scales::comma) +
  labs(
    title = str_wrap("Top 16 programming languages used in R packages", 40),
    subtitle = glue("Based on {n_distinct(pkg$pkg_name)} packages"),
    caption = "Data: https://cran.r-project.org/src/contrib/"
  )
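
The two rankings shown so far measure different things: the first counts how many packages contain a language, the second sums the lines of code written in it. The toy example below, with invented numbers and base R only, is a sketch of how the two can disagree.

```r
# Invented data: one row per package and language, as in the pkg table
toy <- data.frame(
  pkg_name = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  language = c("R", "C++", "R", "R", "C++"),
  code = c(100, 5000, 200, 300, 7000),
  stringsAsFactors = FALSE
)

# Metric 1: number of packages in which each language appears
n_package <- table(toy$language)

# Metric 2: total number of lines of code per language
total_loc <- tapply(toy$code, toy$language, sum)

print(n_package)
print(total_loc)
```

In this made-up table, R appears in more packages, while C++ accounts for more lines.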

Programming languages used in R packages

The following graph shows the 18 packages with the highest number of lines of code, broken down by language.

pkg %>% 
  group_by(pkg_name) %>% 
  add_tally(code) %>% 
  ungroup() %>% 
  filter(dense_rank(desc(n)) <= 18) %>% 
  mutate(language = reorder_within(language, code, pkg_name)) %>% 
  mutate(pkg_name = fct_reorder(pkg_name, code, sum, .desc = TRUE)) %>% 
  ggplot(aes(x = language, y = code)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ pkg_name, scales = "free", ncol = 3) +
  scale_x_reordered() +
  scale_y_log10(expand = expansion(mult = c(0, 0.2))) +
  xlab(NULL) +
  ylab("Number of lines of code") +
  labs(title = str_wrap("Programming languages used in packages with the highest number of lines of code", 40)) +
  labs(subtitle = "Packages are arranged in descending order of the total number of lines of code") +
  labs(caption = "Data: https://cran.r-project.org/src/contrib/")

The tidyverse

The following graph shows the programming languages used by the packages included in the tidyverse ecosystem.

pkg %>%
  filter(
    pkg_name %in% c(
      "ggplot2",
      "dplyr",
      "tidyr",
      "readr",
      "purrr",
      "tibble",
      "stringr",
      "forcats",
      "lubridate",
      "hms", 
      "feather", 
      "haven",
      "jsonlite",
      "readxl",
      "rvest",
      "xml2",
      "modelr",
      "broom"
    )
  ) %>%
  mutate(language = reorder_within(language, code, pkg_name)) %>%
  mutate(pkg_name = fct_reorder(pkg_name, code, sum, .desc = TRUE)) %>%
  ggplot(aes(x = language, y = code)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~pkg_name, scales = "free", ncol = 3) +
  scale_x_reordered() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  # scale_y_continuous(labels = scales::scientific) +
  xlab(NULL) +
  ylab("Number of lines of code") +
  labs(title = "Programming languages used in the tidyverse") +
  labs(subtitle = "Packages are arranged in descending order of the total number of lines of code") +
  labs(caption = "Data: https://cran.r-project.org/src/contrib/")

R, C++ and Python

R, C++ and Python are the three programming languages I use the most. I wanted to know what proportion of the lines of code in CRAN packages is written in each of them. For this exercise, I grouped C and C++ code (including headers) together, and the percentages are computed relative to the combined code of these three languages. It is interesting to see that C/C++ represents approximately 40% of these lines. Thanks to Rcpp!

big_three <- pkg %>% 
  filter(language %in% c("R", "C++", "C", "C/C++ Header", "Python")) %>% 
  mutate(language = case_when(
    language %in% c("C++", "C", "C/C++ Header") ~ "C++",
    TRUE ~ language
  )) %>% 
  group_by(language) %>% 
  summarise(percent = sum(code) / sum(.$code))

big_three %>%
  mutate(language = fct_reorder(language, percent)) %>%
  ggplot(aes(x = language, y = percent)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::percent, expand = expansion(mult = c(0, 0.2))) +
  xlab(NULL) +
  ylab("Percentage of lines of code") +
  labs(
    title = str_wrap("Percentage of R, C++ and Python used in R packages", 40),
    subtitle = glue("Based on {n_distinct(pkg$pkg_name)} packages"),
    caption = "Data: https://cran.r-project.org/src/contrib/"
  )
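
To illustrate how the percentages above are obtained, here is a toy version of the same computation with invented numbers and base R only. Note that the denominator is the combined code of the three language groups, so other languages (HTML in this example) do not enter the calculation.

```r
# Invented line counts per language
toy <- data.frame(
  language = c("R", "C", "C++", "C/C++ Header", "Python", "HTML"),
  code = c(500, 150, 200, 50, 100, 1000),
  stringsAsFactors = FALSE
)

# Keep only the languages of interest
keep <- toy$language %in% c("R", "C++", "C", "C/C++ Header", "Python")
big <- toy[keep, ]

# Regroup C, C++ and header files under a single "C++" label
big$language[big$language %in% c("C", "C/C++ Header")] <- "C++"

# Share of each group relative to the combined total of the three groups
share <- tapply(big$code, big$language, sum) / sum(big$code)

print(share)
```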

