Scraping Wikipedia Table: Genocides

how-to
wrangling
Author

Nathan Craig

Published

November 1, 2023

Abstract

Wikipedia is full of interesting tables. This post works through scraping a Wikipedia table on genocides for use in R. The approach of scraping using rvest is compared to use of the ChatGPT large language model. I find that rvest is much better and likely applicable to a wider range of Wikipedia tables. Also examined is the use of regular expressions to remove Wikipedia footnotes from table cells, removal of commas in string representations of numbers along with their conversion to numeric objects, as well as some ggplot considerations. Plotting details involve making a sorted and stacked bar chart (along with a perplexing snag), bar width and spacing issues, wrapping of long text fields, and abbreviating large axis values using scales::cut_short_scale().

Introduction

Recently I read an article in Coding the Past about using ChatGPT to wrangle Wikipedia tables into a format usable in R. I tried the method but found it didn’t work well with tables that had blank cells. Many Wikipedia tables have blank cells, so this is likely to be a common issue. Following a post on pipe dreams, I turned instead to rvest for scraping Wikipedia table data. In my experience, scraping the data is a much better option than messing around with ChatGPT.

I selected the list of genocides as the table to retrieve.

library(ggplot2)

Select the table and scrape the object

url <- "https://en.wikipedia.org/wiki/List_of_genocides"
url_bow <- polite::bow(url)

Inspect the page and look for the table element with the class wikitable, which the CSS selector table.wikitable matches.

ind_html <-
  polite::scrape(url_bow) |>   # scrape web page
  rvest::html_nodes("table.wikitable") |>  # pull out specific table
  rvest::html_table(fill = TRUE) 
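As a small offline check of how these calls treat blank cells, the sketch below parses a made-up HTML fragment instead of the live page. The fragment and its values are invented for illustration, and the sketch assumes rvest 1.0 or later, where html_elements() is the current name for html_nodes().

```r
# Hypothetical HTML fragment with a blank cell, standing in for a Wikipedia table.
html <- '
<table class="wikitable">
  <tr><th>Event</th><th>From</th><th>To</th></tr>
  <tr><td>Event A[1]</td><td>2002</td><td>2003</td></tr>
  <tr><td>Event B</td><td>2014</td><td></td></tr>
</table>'

page <- rvest::minimal_html(html)

# html_elements() returns every match; html_table() turns each match into a data frame.
tables <- page |>
  rvest::html_elements("table.wikitable") |>
  rvest::html_table()

toy <- tables[[1]]
dim(toy)  # rows and columns; the blank cell should not shift the columns
```

If the dimensions come back as two rows by three columns, the blank cell was kept in place rather than collapsing the row.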

The rvest function returns a list of one item, so we need to index the first item in the list to get the data frame. We do so using the double square bracket [[]] index operator, which returns the item itself (here a data frame) rather than a list that contains it.

ind_html[[1]] |>
  head() 
| Event | Location | Period | Period | Estimated killings | Estimated killings | Proportion of group killed |
|---|---|---|---|---|---|---|
| Event | Location | From | To | Lowest | Highest | Proportion of group killed |
| Rohingya genocide[N 1] | Rakhine StateMyanmar | 2016 | Present | 9,000–13,700[9] | 43,000[10] | Before the 2015 Rohingya refugee crisis and the military crackdown in 2016 and 2017, the Rohingya population in Myanmar was around 1.0 to 1.3 million, chiefly in the northern Rakhine townships, which were 80–98% Rohingya. Since 2015, over 900,000 Rohingya refugees have fled to south-eastern Bangladesh alone, and more to other surrounding countries, and major Muslim nations. More than 100,000 Rohingyas in Myanmar are confined in camps for internally displaced persons. |
| Iraqi Turkmen genocide[N 2] | Islamic State-controlled territory in northern Iraq | 2014 | 2017 | 3,500 | 8,400 | |
| Genocide of Yazidis by the Islamic State[N 3] | Islamic State-controlled territory in northern Iraq and Syria | 2014 | 2019 | 2,100[18] | 5,000[19] | |
| Darfur genocide[N 4] | Darfur, Sudan | 2003 | Present | 98,000[22] | 500,000[23] | |
| Effacer le tableau[N 5] | North Kivu, Democratic Republic of the Congo | 2002 | 2003 | 60,000[26][24] | 70,000[26] | 40% of the Eastern Congo’s Pygmy population killed[N 6] |
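The difference between single and double brackets is easy to check on a toy list. The sketch below uses an invented one-item list shaped like the html_table() result; the names and values are placeholders.

```r
# A toy list holding one data frame, shaped like the html_table() result.
tables <- list(data.frame(Event = c("A", "B"), Lowest = c("1,000", "2,000")))

class(tables[1])    # single brackets keep the list wrapper
class(tables[[1]])  # double brackets return the data frame itself
```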

Remove footnotes and commas and convert strings into numbers

This worked well, but there are still lingering characters from footnotes. We want to remove the brackets and the numbers inside them, but leave the rest of the text. These can be removed with regular expressions…so time to turn to StackOverflow. We’ll start with an isolated example, ensure that works properly, and then attempt to apply it to the entire data frame.

test <- "2,100[17]"

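To do this, we’ll need to escape the square brackets with a double backslash (\\): inside an R string, two backslashes are needed to pass a single backslash through to the regular expression.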

test2 <- gsub("\\[([^]]+)\\]", "",test)
test2
[1] "2,100"
lapply(ind_html[[1]], function(x) gsub("\\[([^]]+)\\]", "",x)) |>
  as.data.frame() |> 
  head()
| Event | Location | Period | Period.1 | Estimated.killings | Estimated.killings.1 | Proportion.of.group.killed |
|---|---|---|---|---|---|---|
| Event | Location | From | To | Lowest | Highest | Proportion of group killed |
| Rohingya genocide | Rakhine StateMyanmar | 2016 | Present | 9,000–13,700 | 43,000 | Before the 2015 Rohingya refugee crisis and the military crackdown in 2016 and 2017, the Rohingya population in Myanmar was around 1.0 to 1.3 million, chiefly in the northern Rakhine townships, which were 80–98% Rohingya. Since 2015, over 900,000 Rohingya refugees have fled to south-eastern Bangladesh alone, and more to other surrounding countries, and major Muslim nations. More than 100,000 Rohingyas in Myanmar are confined in camps for internally displaced persons. |
| Iraqi Turkmen genocide | Islamic State-controlled territory in northern Iraq | 2014 | 2017 | 3,500 | 8,400 | |
| Genocide of Yazidis by the Islamic State | Islamic State-controlled territory in northern Iraq and Syria | 2014 | 2019 | 2,100 | 5,000 | |
| Darfur genocide | Darfur, Sudan | 2003 | Present | 98,000 | 500,000 | |
| Effacer le tableau | North Kivu, Democratic Republic of the Congo | 2002 | 2003 | 60,000 | 70,000 | 40% of the Eastern Congo’s Pygmy population killed |
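To make the pattern less opaque, here is a short sketch that runs it over a few cell values copied from the table output. It drops the capture group from the original pattern, since nothing is reused in the replacement; that simplification is my own and should behave the same way.

```r
# Values copied from the scraped table: a note marker, stacked footnotes, and a clean cell.
cells <- c("Rohingya genocide[N 1]", "60,000[26][24]", "3,500")

# \\[ and \\] match literal brackets; [^]]+ matches one or more characters
# that are not a closing bracket, so each footnote is matched separately.
cleaned <- gsub("\\[[^]]+\\]", "", cells)
cleaned
# [1] "Rohingya genocide" "60,000"            "3,500"
```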

So far so good: the footnote characters are gone, so let’s save this as a data frame. However, there is a bit more work to do before this can be graphed.

df <- lapply(ind_html[[1]], function(x) gsub("\\[([^]]+)\\]", "",x)) |> 
  as.data.frame()  

as.numeric() coerces strings that contain a comma to NA, which isn’t useful, so the commas need to be removed.

as.numeric(test2)
[1] NA

We can use gsub() to remove the , and then convert to numeric.

gsub(",", "", test2) |> 
  as.numeric()
[1] 2100

We need to remove the commas from any column that holds numbers, but keep them in columns that are text. The low-end estimate column also has some ranges separated by an en dash (for example 9,000–13,700), which as.numeric() cannot parse, so we need to take the number either before or after that character. Since these are supposed to be low estimates, we’ll keep the number before the dash and remove everything from the dash to the end.

low <- gsub("[–-].*$", "", df$Estimated.killings)        # keep the low end of any range
df$Estimated.killings <- as.numeric(gsub(",", "", low))  # drop commas, then convert
sample(df$Estimated.killings, 20)
 [1]   13000 4204000  200000    3500  100000   80000      60  480000   83000
[10]      NA  200000  120000      40   68000  600000 3000000   34000   50000
[19] 1386734  100000
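The cleaning steps so far (footnotes, ranges, commas, conversion) can be bundled into one function so they can be applied to any numeric column. This is a minimal sketch: parse_low_estimate() is a hypothetical helper name, not something from the original post or from a package, and the inputs are values seen in the table plus the stray header text.

```r
# Hypothetical helper: turn a scraped count such as "9,000–13,700[9]" into a number.
parse_low_estimate <- function(x) {
  x <- gsub("\\[[^]]+\\]", "", x)   # strip footnote markers
  x <- gsub("[–-].*$", "", x)       # keep the part before a hyphen or en dash
  x <- gsub(",", "", x)             # drop thousands separators
  suppressWarnings(as.numeric(x))   # anything non-numeric becomes NA
}

parse_low_estimate(c("9,000–13,700[9]", "2,100[18]", "3,500", "Lowest"))
# [1] 9000 2100 3500   NA
```

For the high-end column one would presumably keep the part after the dash instead, which needs a different pattern.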

Plotting the table: sorted bar chart and long x values

We can plot loss of life by event using a bar chart.

The plot presents two fairly common challenges.

  1. Some of the label text is very long. As described in this post and the documentation, we can pass label_wrap_gen() to the labels argument of scale_x_discrete().

  2. The numbers are very large and would be better represented in abbreviated form. This can be achieved with the scales package by setting the scale_cut argument of label_number() to cut_short_scale(). The documentation and a tidyverse blog post provide some other options.

df |> 
  dplyr::filter(!is.na(Estimated.killings)) |>  # remove NA values
  ggplot(aes(x = reorder(Event,Estimated.killings), y = Estimated.killings)) +
    geom_bar(stat = "identity", position=position_dodge(.5)) +
    scale_x_discrete(labels = label_wrap_gen(50))+
    scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
    coord_flip()
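Label functions can be called directly on a few values to preview what the axis text will look like before building the whole plot. The sketch below does that for both challenges. It uses scales::label_wrap(), which is a swapped-in alternative to the label_wrap_gen() call in the plot (ggplot2 documents label_wrap_gen() as a facet labeller), and the width of 20 is an arbitrary choice for illustration.

```r
# Build the two label functions once; each takes a vector and returns character labels.
wrap <- scales::label_wrap(20)
abbreviate_count <- scales::label_number(scale_cut = scales::cut_short_scale())

# Long event names are broken onto several lines.
cat(wrap("Genocide of Yazidis by the Islamic State"), "\n")

# Large counts are shortened with suffixes such as K and M.
abbreviate_count(c(3500, 500000, 4204000))
```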

One snag worth noting: if we reorder based on location instead, the bars no longer come out sorted (I tried this). The issue is that there are two locations where more than one genocide is listed in the table.

df |> dplyr::count(Location) |> dplyr::arrange(n) |> tail()
|    | Location | n |
|---|---|---|
| 42 | Ukraine and the heavily Ukrainian-populated northern Kuban, in the Soviet Union | 1 |
| 43 | Uruguay | 1 |
| 44 | Van Diemen’s Land (now Australia) | 1 |
| 45 | Zanzibar (now part of Tanzania) | 1 |
| 46 | Ottoman Empire (now Turkey, Syria and Iraq) | 2 |
| 47 | German-occupied Europe | 3 |

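According to this answer over at the RStudio forums, ggplot sums the values within each of these locations, because the bars for a repeated location are stacked. reorder(), on the other hand, orders the levels by the mean of the values unless told otherwise, so bar position and bar height are based on different summaries. In the case of German-occupied Europe, there are three genocides summed, so the value is exceptionally large. However, Ottoman Empire (now Turkey, Syria and Iraq) is also not displaying properly. It just isn’t as easy to spot.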

df |> 
  dplyr::filter(!is.na(Estimated.killings)) |>  # remove NA values
  ggplot(aes(x = reorder(Location,Estimated.killings), y = Estimated.killings)) +
    geom_bar(stat = "identity") +
    scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
    scale_x_discrete(labels = label_wrap_gen(40))+
    coord_flip()
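The mismatch between ordering and bar height can be reproduced on a toy data frame. The values below are invented; the point is only that reorder() ranks by the mean of each level by default, while passing FUN = sum ranks by the total.

```r
# Toy data: location "A" has three events, the others one each (values invented).
toy <- data.frame(
  Location = c("A", "A", "A", "B", "C"),
  killings = c(10, 10, 10, 25, 5)
)

# Default FUN = mean: A is ranked by its average (10), so it sits below B (25).
levels(reorder(toy$Location, toy$killings))
# [1] "C" "A" "B"

# FUN = sum: A is ranked by its total (30), matching the stacked bar height.
levels(reorder(toy$Location, toy$killings, FUN = sum))
# [1] "C" "B" "A"
```

Under that reading, using reorder(Location, Estimated.killings, FUN = sum) in the plot should sort the bars by their stacked totals, though I have not verified this against the full table.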

Citation

BibTeX citation:
@online{craig2023,
  author = {Craig, Nathan},
  title = {Scraping {Wikipedia} {Table:} {Genocides}},
  date = {2023-11-01},
  url = {https://nmc.quarto.pub/nmc/posts/2023-11-01-wikipedia-table-scrape.html},
  langid = {en},
  abstract = {Wikipedia is full of interesting tables. This post works
    through scraping a Wikipedia table on genocides for use in R. The
    approach of scraping using `rvest` is compared to use of the ChatGPT
    large language model. I find that `rvest` is much better and likely
    applicable to a wider range of Wikipedia tables. Also examined is the
    use of regular expressions to remove Wikipedia footnotes from table
    cells, removal of commas in string representations of numbers along
    with their conversion to numeric objects, as well as some `ggplot`
    considerations. Plotting details involve making a sorted and stacked
    bar chart (along with a perplexing snag), bar width and spacing
    issues, wrapping of long text fields, and abbreviating large axis
    values using `scales::cut\_short\_scale()`.}
}
For attribution, please cite this work as:
Craig, Nathan. 2023. “Scraping Wikipedia Table: Genocides.” November 1, 2023. https://nmc.quarto.pub/nmc/posts/2023-11-01-wikipedia-table-scrape.html.