Analysis of speaker genders at the 2018 ArcticNet Annual Scientific Meeting

Last week I attended the ArcticNet Annual Scientific Meeting 2018 in Ottawa. I was pleasantly surprised to see a large proportion of women at the conference. After the panel discussion on women in northern science, which took place on Thursday night, I decided to see if I could use R to scan the PDF of the scientific program and estimate how many men and women were giving scientific presentations.

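For this exercise, I will use the pdftools package to read the program PDF and the tidyverse for the text processing and plotting.

library(pdftools)
library(tidyverse) # map(), str_match_all(), tibble(), separate(), ggplot(), ...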

dat <- pdf_text("http://www.arcticnetmeetings.ca/asm2018/docs/PROGRAM_ASM2018_FINAL.pdf")
dat <- iconv(dat, from = "UTF-8", to = "ASCII//TRANSLIT")

res <- map(dat[18:51], function(x) { # Only scan pages 18-51

  # Use a simple regular expression to extract two consecutive uppercase
  # words at the end of a line. This could be better.
  # str_match_all(dat[[21]], "\\W+([A-Z\\s\\-]{1,})\\r\\n")
  # Remove the \\r from the pattern when running on Linux
  str_match_all(x, "\\W+([[:upper:]]{3,}\\W+[[:upper:]]{3,})\\r\\n")[[1]][, 2]
  
}) %>%
  unlist() %>%
  tibble(full_name = .) %>%
  filter(!str_detect(full_name, "SESSION")) %>%
  separate(full_name, into = c("first_name", "last_name"))
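
To see what the regular expression actually captures, here is a minimal sketch on a made-up page of text. The program lines and speaker names are hypothetical and do not come from the ArcticNet program. The pattern also uses `\\r?\\n` instead of `\\r\\n`, which is my own assumption so that the example runs with both Windows and Linux line endings.

```r
library(stringr)

# Made-up program lines: none of these titles or names come from the real PDF.
lines <- c(
  "10:30  Sea ice trends in the Beaufort Sea  JANE DOE",
  "10:45  Permafrost thaw and carbon  JOHN SMITH",
  "Room 201  TOPICAL SESSION",
  "11:00  Snow cover on tundra  MARIE-EVE TREMBLAY",
  "11:15  Lake ice phenology  LI WEI"
)
page <- paste0(paste(lines, collapse = "\n"), "\n")

# Same pattern as in the pipeline, except that the carriage return is optional.
pattern <- "\\W+([[:upper:]]{3,}\\W+[[:upper:]]{3,})\\r?\\n"

# Column 2 of the match matrix holds the captured group (the two words).
candidates <- str_match_all(page, pattern)[[1]][, 2]

# Drop session headings, as the pipeline does with filter().
names_found <- candidates[!str_detect(candidates, "SESSION")]
names_found
```

On this toy input the hyphenated first name is cut down to its second half, and the speaker with a two-letter first name is missed entirely because each word must have at least three uppercase letters. Both cases illustrate why the pattern could be better.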

If I did things correctly, there were 386 speakers at the conference.

## # A tibble: 386 x 2
##    first_name last_name
##    <chr>      <chr>    
##  1 LOUIS      TETU     
##  2 MICHAEL    DELAUNAY 
##  3 MICHAEL    ROSS     
##  4 DESAI      SHAN     
##  5 CATHERINE  BURKE    
##  6 BRIAN      MOORMAN  
##  7 ACHIM      ROTH     
##  8 MATTHEW    ASPLIN   
##  9 DUSTIN     WHALEN   
## 10 ALESSIA    GUZZI    
## # ... with 376 more rows

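Then, using the gender package, we can infer each speaker's gender from their first name. The result has 347 rows rather than 386, presumably because first names that the package cannot match are left out.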

library(gender)

# gender() accepts a whole vector of first names, so no nesting is needed
res <- gender(res$first_name) %>%
  mutate(gender = str_to_title(gender))

res
## # A tibble: 347 x 6
##    name      proportion_male proportion_female gender year_min year_max
##    <chr>               <dbl>             <dbl> <chr>     <dbl>    <dbl>
##  1 ADRIAN             0.927             0.0732 Male       1932     2012
##  2 AGATHE             0                 1      Female     1932     2012
##  3 ALEJANDRO          0.992             0.0075 Male       1932     2012
##  4 ALESSIA            0                 1      Female     1932     2012
##  5 ALEXANDER          0.992             0.0079 Male       1932     2012
##  6 ALEXANDRA          0.0037            0.996  Female     1932     2012
##  7 ALEXANDRA          0.0037            0.996  Female     1932     2012
##  8 ALEXANDRE          0.964             0.0355 Male       1932     2012
##  9 ALEXIS             0.145             0.855  Female     1932     2012
## 10 ALISON             0.0045            0.996  Female     1932     2012
## # ... with 337 more rows
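
The table has fewer rows than the list of speakers, and each gender comes from a proportion. The sketch below mimics that logic in base R, as an illustration rather than the gender package's actual implementation. The lookup table is a stand-in: its three proportions are copied from the output above, while the 0.5 cut-off, the inner join, and the placeholder name `UNLISTED` are my assumptions.

```r
# Stand-in lookup table; the proportions are copied from the output above.
lookup <- data.frame(
  name = c("ADRIAN", "ALESSIA", "ALEXIS"),
  proportion_male = c(0.927, 0, 0.145)
)

# Hypothetical speakers; "UNLISTED" is a placeholder for a name with no match.
speakers <- data.frame(
  first_name = c("ALEXIS", "ADRIAN", "UNLISTED", "ALESSIA")
)

# merge() performs an inner join by default, so unmatched names disappear.
matched <- merge(speakers, lookup, by.x = "first_name", by.y = "name")

# Assumed rule: label a name "Male" when more than half of its bearers are male.
matched$gender <- ifelse(matched$proportion_male > 0.5, "Male", "Female")

# Number of speakers lost because their first name was not in the table.
n_dropped <- nrow(speakers) - nrow(matched)
matched
n_dropped
```

Any count based on such a lookup therefore covers only the speakers whose first names were found, and a name like ALEXIS is assigned to the majority gender even though a noticeable share of its bearers are men.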

Finally, we can create a plot to visualize the data.

res %>%
  count(gender) %>%
  ggplot(aes(x = gender, y = n)) +
  geom_col() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  ylab("Number of speakers") +
  scale_y_continuous(breaks = seq(0, 200, by = 20)) +
  labs(title = "Number of male and female presenters at the ArcticNet 2018 meeting") +
  labs(caption = "Data source: http://www.arcticnetmeetings.ca/asm2018/docs/PROGRAM_ASM2018_FINAL.pdf")
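
The counts behind a bar chart like this one can also be turned into percentages, which makes the comparison easier to read. Here is a minimal base R sketch. The `gender` vector is made up for illustration and does not reproduce the conference data.

```r
# Made-up gender labels (4 women, 6 men); not the conference results.
gender <- rep(c("Female", "Male"), times = c(4, 6))

# Count each label, then convert the counts to percentages.
counts <- table(gender)
shares <- round(100 * prop.table(counts), 1)
shares
```

With the real data, the same two lines applied to `res$gender` would give the share of women and men among the speakers whose names were classified.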

This simple analysis suggests that there were about as many women as men presenting at the conference, at least among the speakers whose first names could be classified. The next graph shows the most frequent first names among the speakers.

res %>% 
  group_by(gender) %>% 
  count(name) %>% 
  arrange(desc(n)) %>% 
  slice(1:5) %>% 
  ggplot(aes(x = reorder(name, n), y = n)) +
  geom_col() + 
  facet_wrap(~gender, scales = "free_y") +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  ylab("Number of speakers") +
  labs(title = "Most frequent speaker names at the ArcticNet 2018 meeting") +
  theme(axis.title.y = element_blank())
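
The top-five selection works because `slice()` operates within each group of a grouped data frame, even though `arrange()` sorts the whole table. Here is a small sketch with made-up names (not taken from the program), keeping only the single most frequent name per gender:

```r
library(dplyr)

# Made-up data: one row per speaker.
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  name   = c("ANNA", "ANNA", "MARIE", "ERIC", "ERIC", "ERIC", "PAUL")
)

top <- toy %>%
  group_by(gender) %>%   # later steps keep this grouping
  count(name) %>%        # one row per gender and name, with its frequency n
  arrange(desc(n)) %>%   # most frequent names first, across the whole table
  slice(1)               # first row within each gender

top
```

Replacing `slice(1)` with `slice(1:5)` gives the five most frequent names per gender, as in the plot above. Ties are broken by row order, so a name that shares fifth place may or may not make the cut.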