Reading and Visualizing GDELT Data

how-to
GDELT
wrangling

A quick look at how to incorporate GDELT visualizations or create your own using xts and dygraphs.

Author

Nathan Craig

Published

October 10, 2022

The Global Database of Events, Language, and Tone (GDELT) offers a TV News explorer. I’ve tinkered with using this to look at themes like Critical Race Theory and explored a bit how to access GDELT data via the API. Previously, I visualized the data using ggplot2, but recently I began using the dygraphs library and thought it might be interesting to apply it to GDELT data. Representing GDELT searches using dygraphs is the main focus of this post. First, let’s dig into simply embedding one of the graphs produced by GDELT.

Embedding

GDELT provides nice interactive visualizations and Quarto offers a simple mechanism to embed such a visualization using iframe along with fenced divs. If you just want something quick and dirty, embedding is a fine way to go. However, graph customization is very limited.

It is worth noting that the GDELT TV News search interface offers two output options: Summary Overview Dashboard and Comparison Visualization.

The Summary Overview Dashboard returns individual station mentions of a single search term. It allows one to see similarities and differences between networks regarding the mention of a specific search term. The Comparison Visualization allows one to look at similarities and differences between search terms where all news networks are merged. Each tells a different story about the data. We will look at both options.

Output

Figure 1: Critical Race Theory showing coverage by station.
Figure 2: Comparison visualization showing differences and similarities in the coverage of “Critical Race Theory” and “Evolution” over time.

Note while Figure 1 and Figure 2 show generally similar trends, the two graphs are not synchronized in any way. To synchronize the two graphs, we’ll need to make our own. Before we get into that, let’s look at how to embed a visualization generated by GDELT.

How To Embed

To get the embed code, click on the EXPORT hamburger menu and select “Embed This Chart”.

Select the output text and paste it into a Quarto document in source mode.

That text needs to be copied and formatted as two separate parts: first an HTML chunk containing a CSS class for .respviz, and then a fenced div containing the iframe. In the example below, please note that the request URL was shortened for readability.


```{=html}

<style>
.respviz { height: 400px; }
@media screen and (min-width : 0px) and (max-width : 767px){ .respviz { height: 250px; } }
</style>

```

``` column-page
::: {#fig-crt .column-page}

<iframe src="<<URL>>" scrolling="no" width=100% frameborder="0" class="respviz"></iframe>

Critical Race Theory

:::
```

Voilà: with a fully formed URL, this should produce an embedded graph like the ones shown at the top of this page. Now, let’s turn to how we can create our own visualizations that can be customized.

Constructing Visualizations

Making one’s own visualization of the data isn’t that much more complicated than embedding, and there is the added benefit of having access to a wider range of customizations.

Comparison Searches

This works very well when the GDELT search interface Output Type is set to Summary Overview Dashboard. However, when the Output Type is set to Comparison Visualization, the Export -> Save as JSON option is not available. I’m not sure why this is, but at present, building a comparison graph requires downloading a CSV file. We’ll do that for a search where the Output Type is set to Comparison Visualization.

Next, we read this file and convert it to a format that dygraphs can read. First, we make a data frame.

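Code
library(readr)      # read_csv()
library(dplyr)      # rename(), mutate()
library(lubridate)  # my()

# Name the columns by position and parse the month-year strings into dates
df2 <- read_csv("crt-evolution-comparison.csv") |> 
  rename(date = 1,
         "Critical Race Theory" = 2,
         "Evolution" = 3) |> 
  mutate(date = my(date))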

I found that reading from a CSV was different from reading from the API above. To convert the data frame to something dygraphs accepts, I was initially using as.ts(read.zoo(df_wide)). However, when reading from CSV, many of the 0 readings were converted to NA. This can be overcome with z2[is.na(z2)] = 0, but that step isn’t necessary if one simply uses read.zoo() on its own. I therefore used the zoo object directly, which seems to work fine.

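Code
library(zoo)  # read.zoo()

# Convert the data frame to a zoo series indexed by the date column
ts2 <- read.zoo(df2)
Code
library(dygraphs)  # dygraph(), dyLegend()

# Plot both series with a 500-pixel-wide legend
dygraph(ts2) |> 
  dyLegend(width = 500)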
Figure 4: GDELT volume timeline comparing mentions of “Critical Race Theory” and Evolution
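
The conversion and plotting steps can be tried without the real export. The sketch below is only an illustration: the toy data frame, its values, the chart titles, and the group name "gdelt" are all made up for the example. It fills missing readings with zoo's na.fill() instead of the index assignment mentioned earlier, and it uses the group argument of dygraph(), which is one way to get the synchronized zooming that the embedded GDELT charts lack.

```r
library(zoo)       # read.zoo(), na.fill()
library(dygraphs)  # dygraph(), dyLegend(), dyRangeSelector()

# Hypothetical toy data standing in for the real GDELT export
toy <- data.frame(
  date = as.Date(c("2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01")),
  "Critical Race Theory" = c(0.00, 0.05, NA, 0.20),
  "Evolution" = c(0.03, 0.02, 0.03, 0.02),
  check.names = FALSE
)

# read.zoo() treats the first column as the time index by default
toy_zoo <- read.zoo(toy)

# Fill missing readings with 0
toy_filled <- na.fill(toy_zoo, 0)

# Charts that share a `group` value keep their x-axis zoom in sync
# when they are rendered in the same document
chart_both <- dygraph(toy_filled, main = "Both search terms", group = "gdelt") |>
  dyLegend(width = 500) |>
  dyRangeSelector()

# drop = FALSE keeps the column name so the series label survives
chart_crt <- dygraph(toy_filled[, "Critical Race Theory", drop = FALSE],
                     main = "Critical Race Theory only", group = "gdelt")
```

Printing chart_both and chart_crt in the same Quarto document should render two linked charts; zooming one adjusts the other.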

Descriptive Statistics

Having visualized the data, there are some questions one might want to ask about their structure. Does one station cover “Critical Race Theory” more than another? Is “Critical Race Theory” discussed more often than Evolution?

Does one station cover CRT more often than another?

Let’s narrow the question slightly. For the period of time covered by GDELT, are there differences in the average and total coverage of CRT by network? We can begin by generating a few simple descriptive statistics and plotting these.

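Code
library(dplyr)  # group_by(), summarise()
library(knitr)  # kable()

# df is the long-format CRT data (columns include network and value)
df_crt_summary <- group_by(df, network) |> 
  summarise(
    sum = sum(value),
    mean = mean(value),
    sd = sd(value)
  )
df_crt_summary |> kable(digits = 3)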
Table 1: Summary statistics of CRT coverage by network

| network | sum | mean | sd |
|---|---|---|---|
| BLOOMBERG | 0.292 | 0.000 | 0.002 |
| CNBC | 0.570 | 0.000 | 0.003 |
| CNN | 42.394 | 0.008 | 0.064 |
| CSPAN | 44.640 | 0.008 | 0.067 |
| CSPAN2 | 57.415 | 0.011 | 0.064 |
| CSPAN3 | 37.049 | 0.007 | 0.050 |
| FBC | 76.825 | 0.014 | 0.091 |
| FOXNEWS | 167.729 | 0.031 | 0.151 |
| MSNBC | 59.909 | 0.011 | 0.067 |

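Code
library(ggplot2)

# (a) Mean coverage by network; size is a constant, so it is set outside aes()
ggplot(df_crt_summary, aes(y = mean, x = network, color = network)) +
  geom_point(size = 4) +
  theme(legend.position = "none")

# (b) Mean coverage with error bars of one standard deviation
ggplot(df_crt_summary, aes(y = mean, x = network, color = network)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd)) +
  theme(legend.position = "none")

# (c) Total coverage by network
ggplot(df_crt_summary, aes(y = sum, x = network, color = network)) +
  geom_point(size = 4) +
  theme(legend.position = "none")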
(a) Average coverage
(b) Average and standard deviation coverage
(c) Sum coverage
Figure 5: CRT summary value plots

Based on the average and sum plots, it looks like FOXNEWS reported on CRT more frequently than other networks. The average plot is harder to read: while FOXNEWS has the highest average coverage (Figure 5 (a)), the differences are less clear once standard deviation is included (Figure 5 (b)). This is because of the very large variation in FOXNEWS coverage. Differences in coverage appear clearer when looking at total coverage; here we can see that FOXNEWS covered CRT more than twice as much as any other network (Figure 5 (c)). Intriguingly, FBC, which is Fox Business, has the second highest average and second highest total coverage of CRT.
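
The "more than twice" claim can be checked directly from the totals in Table 1. The snippet below is a small arithmetic check, not part of the original analysis; the vector name crt_sum is made up, and the values are copied by hand from the sum column.

```r
# Totals copied from the `sum` column of Table 1
crt_sum <- c(
  BLOOMBERG = 0.292, CNBC = 0.570, CNN = 42.394,
  CSPAN = 44.640, CSPAN2 = 57.415, CSPAN3 = 37.049,
  FBC = 76.825, FOXNEWS = 167.729, MSNBC = 59.909
)

# Rank networks from most to least total coverage
ranked <- sort(crt_sum, decreasing = TRUE)

# Ratio of the top network's total to the runner-up's total
top_ratio <- ranked[[1]] / ranked[[2]]
names(ranked)[1:2]
round(top_ratio, 2)
```

FOXNEWS and FBC come out first and second, and the ratio between them is about 2.18.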

Is Critical Race Theory covered more often than Evolution?

Looking at Figure 4, it seems as though there are differences in the coverage of CRT and Evolution over time. Generally, Evolution is mentioned more frequently, but in 2021 there was a marked increase in mentions of CRT. We can quantify the global differences in Table 2.

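Code
library(tidyr)  # pivot_longer()

# Reshape to long format, then summarise each search term
df2 |> pivot_longer(2:3, names_to = "search", values_to = "value") |> 
  group_by(search) |> 
  summarize(
    # Sum is scaled by the number of days covered, times 100
    sum_norm = (sum(value)/as.numeric(difftime(max(df2$date), min(df2$date), units="days")))*100,
    mean = mean(value),
    sd = sd(value)
  ) |> kable(digits = 3)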
Table 2: Summary statistics comparing mentions of CRT and Evolution in the GDELT TV News Archive.

| search | sum_norm | mean | sd |
|---|---|---|---|
| Critical Race Theory | 0.053 | 0.016 | 0.062 |
| Evolution | 0.093 | 0.028 | 0.016 |

Both the time-normalized sum and the mean coverage of Evolution are higher than those of CRT, but the standard deviation of CRT coverage is much higher than that of Evolution. This is because of the very high amplitude coverage that occurred in 2021 and continued into early 2022.
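
For readers who want the time-normalized sum spelled out, here is a minimal sketch of the same calculation as a standalone function. The function name time_normalized_sum and the toy values are made up for illustration; the arithmetic (total divided by the number of days spanned, times 100) follows the sum_norm expression used in the code for Table 2.

```r
# Hypothetical helper: total value per day of the observation window, times 100
time_normalized_sum <- function(values, start, end) {
  days <- as.numeric(difftime(as.Date(end), as.Date(start), units = "days"))
  sum(values) / days * 100
}

# Toy example: three monthly readings spanning 59 days
toy_values <- c(0.1, 0.2, 0.3)
toy_dates  <- as.Date(c("2021-01-01", "2021-02-01", "2021-03-01"))

toy_result <- time_normalized_sum(toy_values, min(toy_dates), max(toy_dates))
toy_result
```

Dividing by the length of the window is what makes sums from periods of different lengths comparable.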

Code
# Pre-2021 Coverage
df2 |> 
  filter(date < "2021-01-01") |> 
  pivot_longer(2:3, names_to = "search", values_to = "value") |> 
  group_by(search) |> 
  summarize(
    # Sum is temporally scaled
    sum_norm = (sum(value)/as.numeric(difftime("2021-01-01", min(df2$date), units="days")))*100,
    mean = mean(value),
    sd = sd(value)
  ) |> kable(digits = 4)
Code
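# Post-2021 Coverage
# Note: `>` leaves out any row dated exactly 2021-01-01 (the pre-2021 filter
# leaves it out too); use `>=` to keep it in this period
df2 |> 
  filter(date > "2021-01-01") |> 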
  pivot_longer(2:3, names_to = "search", values_to = "value") |> 
  group_by(search) |> 
  summarize(
    # Sum is temporally scaled
    sum_norm = (sum(value)/as.numeric(difftime(max(df2$date),"2021-01-01", units="days")))*100,
    mean = mean(value),
    sd = sd(value)
  ) |> kable(digits = 4)
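Table 3: Pre-2021 and post-2021 coverage

(a) Pre-2021 Coverage

| search | sum_norm | mean | sd |
|---|---|---|---|
| Critical Race Theory | 0.0016 | 0.0005 | 0.0029 |
| Evolution | 0.0959 | 0.0292 | 0.0168 |

(b) Post-2021 Coverage

| search | sum_norm | mean | sd |
|---|---|---|---|
| Critical Race Theory | 0.3966 | 0.1204 | 0.1320 |
| Evolution | 0.0666 | 0.0202 | 0.0072 |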

Time series analysis (Kitagawa 2020) and time series forecasting (Hyndman and Athanasopoulos 2021) are enormous fields that I barely understand. However, I do know that when the pre-event state has low amplitude and low variability for a long period of time, then for the purpose of event analysis the mean and standard deviation of that signal can serve as a predicted counterfactual (Huntington-Klein 2022). Therefore, we should be able to compare the before state to the after state based on mean and standard deviation.
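
One simple way to make that comparison concrete is to express the change in mean coverage in units of the pre-2021 standard deviation. This is only a rough sketch and an assumption on my part about how to operationalize the idea, not a method taken from the cited references; the object names are made up, and the numbers are copied from Table 3.

```r
# Values copied from Table 3
pre <- data.frame(
  search = c("Critical Race Theory", "Evolution"),
  mean   = c(0.0005, 0.0292),
  sd     = c(0.0029, 0.0168)
)
post <- data.frame(
  search = c("Critical Race Theory", "Evolution"),
  mean   = c(0.1204, 0.0202)
)

# Shift in mean coverage, measured in pre-2021 standard deviations
shift_in_pre_sd <- (post$mean - pre$mean) / pre$sd
names(shift_in_pre_sd) <- pre$search
round(shift_in_pre_sd, 1)
```

By this measure, mean CRT coverage after 2021 sits roughly 41 pre-period standard deviations above its earlier level, while Evolution moves by about half a standard deviation in the other direction.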

This is useful, but in this case the post-2021 standard deviation of CRT coverage (0.1320) is larger than its mean (0.1204). This is because the signal is larger but still erratic over time. To talk about accumulated coverage over time, we can use a time-normalized sum.

References

Huntington-Klein, Nick. 2022. The Effect: An Introduction to Research Design and Causality. Boca Raton: CRC Press. https://theeffectbook.net/ch-EventStudies.html.
Hyndman, Rob J., and George Athanasopoulos. 2021. Forecasting: Principles and Practice. 3rd edition. Melbourne, Australia: OTexts. https://otexts.com/fpp3/index.html.
Kitagawa, Genshiro. 2020. Introduction to Time Series Modeling with Applications in R. 2nd edition. Boca Raton: CRC Press.

Citation

BibTeX citation:
@online{craig2022,
  author = {Craig, Nathan},
  title = {Reading and {Visualizing} {GDELT} {Data}},
  date = {2022-10-10},
  url = {https://nmc.quarto.pub/nmc/posts/2022-10-10-gdelt-visualizations},
  langid = {en}
}
For attribution, please cite this work as:
Craig, Nathan. 2022. “Reading and Visualizing GDELT Data.” October 10, 2022. https://nmc.quarto.pub/nmc/posts/2022-10-10-gdelt-visualizations.