Text Analysis of Biden and Trump Speeches During the 2020 Presidential Election
Introduction
The United States presidential election is one of the most closely followed political events in the world. As such, many people study election data in the hopes of both making predictions and informing the public about the current state of the race. In this blog post, we share our analysis of data from a key component of the election process: speeches given by the candidates. In particular, we analyzed the text of speeches given by Joe Biden and Donald Trump in the lead-up to the 2020 election. The primary questions we hoped to answer were:
- What are the most common words and phrases used by Trump and Biden?
- What are the relationships between those words/phrases?
- How did the frequency of these words/phrases change over time?
We answered these questions through a series of visualizations that display the results of our text analysis.
Data
We acquired our data from rev.com, a site that hosts transcripts of the Trump and Biden speeches given during the 2020 presidential election cycle. We compiled a list of URLs for the different speeches and then iterated over them, opening each page and scraping the speech into a text file. We also used Python to clean up the text files, removing text that appeared in the transcripts but was not actually spoken in the speeches. A rough sketch of such a scraper is shown below.
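For illustration, here is a minimal sketch of the scraping step written in R with the rvest package; the URL and the CSS selector are placeholders rather than the ones we actually used, since the right selector depends on rev.com's page structure at the time of scraping.
library(rvest)

# Placeholder URL; the real list of speech URLs was compiled from rev.com
speech_urls <- c("https://www.rev.com/blog/transcripts/example-speech")

for (url in speech_urls) {
  page <- read_html(url)
  speech_text <- page %>%
    html_elements("p") %>%   # placeholder selector; depends on the page's HTML
    html_text2() %>%
    paste(collapse = "\n")
  # save each speech to its own text file, named after the URL slug
  writeLines(speech_text, paste0(basename(url), ".txt"))
}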
Visualizations
To address our three questions, we created three types of visualizations, one per question. To identify the most frequent words in the candidates' speeches, we created word clouds in which font size corresponds to word frequency. To identify relationships between words, we created network graphs in which edge width corresponds to the "closeness" of words within the documents (we define "closeness" in the network section). Lastly, we created line graphs to track changes in word frequency over time.
Word Frequency Wordclouds
The word clouds highlight the most frequently used words in Donald Trump's and Joe Biden's speeches. We used stop words to remove words that appear frequently but carry little information; common English stop words include "I", "she'll", and "the". We also created a vector of our own stop words and added it to the built-in stop_words data frame. Analyzing each candidate's most frequent words gives insight into the main issues they hoped to address and their core policies. For example, "covid" is frequent in Joe Biden's speeches because eradicating the virus in the US was one of his main campaign policies, and "china" is frequent in Trump's speeches because China is a major foreign-trade rival of the US.
# Load required libraries
library(tidyverse)   # readr, dplyr, stringr, ggplot2
library(tidytext)    # unnest_tokens() and the stop_words data frame
library(wordcloud2)  # interactive word clouds

# Create a list of custom stop words that should be added
# (the transcripts use curly apostrophes)
word <- c("i’ve", "it’s", "i’m", "he’s", "let’s",
          "won’t", "we’re", "you’re", "that’s", "we’ve", "don’t", "you’ve")
lexicon <- rep("custom", times = length(word))

# Create a data frame from the two vectors above
mystopwords <- data.frame(word, lexicon)

# Add the data frame to the stop_words df from tidytext, converting the
# straight apostrophes there to the curly ones used in the transcripts
stop_words2 <- dplyr::bind_rows(stop_words, mystopwords) %>%
  mutate(word = str_replace_all(word, "'", "’"))

# Function to draw a word cloud from a CSV of speech text
dscore <- function(data) {
  # read the data into speeches and remove missing rows
  speeches <- read_csv(data) %>%
    filter(text != "Missing")
  # tokenize into words, remove stop words, and count frequencies
  word_frequencies <- speeches %>%
    unnest_tokens(output = word, input = text) %>%
    anti_join(stop_words2, by = "word") %>%
    count(word, sort = TRUE)
  wordcloud2(word_frequencies, shape = "star")
}
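The function can then be called once per candidate. The filenames below are hypothetical, and each CSV is assumed to have a text column:
# Hypothetical filenames; each CSV is assumed to contain a `text` column
dscore("trump_speeches.csv")
dscore("biden_speeches.csv")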
Donald Trump WordCloud
Joe Biden WordCloud
Network Visualizations
To understand the relationships between speech words, we looked at two types of words: popular election topics, such as climate change, health care, and COVID-19; and the most common words across all speeches. For each of these sets, we needed to define a metric for "closeness". To do this, we emulated a co-occurrence network analysis of Game of Thrones. Specifically, we defined the closeness of two words (within the full dataset) as the number of times the words occur within d words of each other in a single speech, where d is a parameter specifying this word distance. For the first set of words (the popular election topics), we chose d = 50, a relatively high value: these words are more specific in nature than those in the other set, so a higher d provides enough data to observe significant relationships. For the most common words, on the other hand, we chose d = 10: since these words occur much more frequently than the popular election topics, a smaller d refines the selection and avoids noisy data.
Text Mining
To mine the data in this way, we wrote a Python script that iterates over the speeches. For each speech, we iterated over all of its words, and for each relevant word we collected the relevant words nearby (those within d positions of the word being considered). Some sample code is provided:
import re
from collections import defaultdict

output = []                              # each element is one line of the CSV output
output.append("word1,word2,weight")      # CSV header
output_dictionary = defaultdict(int)     # maps frozenset (of two words) to a count

for input_file in input_files:           # input_files: paths to the speech text files
    with open(input_file) as f:
        text = f.read()
    # keep only letters, apostrophes, and spaces
    regex = re.compile("[^a-zA-Z' ]")
    text = regex.sub('', text)
    words = text.split(" ")
    for i in range(len(words)):
        word = words[i].lower()
        if word not in top_words:        # top_words: the set of relevant words to consider
            continue
        # acquire nearby relevant words
        nearby_words = []
        for j in range(i - dist, i + dist + 1):  # check all j within `dist` of word
            if 0 <= j < len(words) and j != i:   # j must be a valid index other than i
                nearby = words[j].lower()        # lowercase before comparing, as above
                if nearby in top_words and nearby != word:
                    nearby_words.append(nearby)
        # increase pair weights accordingly
        for nearby_word in nearby_words:
            # frozensets are hashable and unordered, so {a, b} == {b, a}
            to_hash = frozenset([word, nearby_word])
            output_dictionary[to_hash] += 1

for pair_set in output_dictionary.keys():
    pair_list = list(pair_set)
    weight = output_dictionary[pair_set]
    output.append(pair_list[0] + ',' + pair_list[1] + ',' + str(weight))
output = '\n'.join(output)

with open(output_file, 'w') as f:        # output_file: destination CSV path
    f.write(output)
Note that this approach double counts connections: each co-occurrence of a pair is counted once from each endpoint. This does not matter for our purposes, because every weight is scaled by the same factor of two, so relative comparisons between relationships are unaffected.
Network Analysis for Popular Election Topics
Once the data was extracted, we used R's igraph and ggnetwork packages to display the networks. Our first set of networks focuses on references to popular election topics we found on usnews.com. We mined the text as described above and then displayed our results in R. We provide the code for our first network:
library(igraph)      # graph_from_data_frame()
library(ggnetwork)   # ggnetwork(), geom_edges(), geom_nodes(), theme_blank()

# Keep only the 16 strongest connections
biden_network_data <- read_csv("biden_speeches_network_data.csv") %>%
  arrange(desc(weight)) %>%
  top_n(16, weight)

# "Closeness" is symmetric, so the graph is undirected
biden_network <- graph_from_data_frame(biden_network_data, directed = FALSE)

ggplot(data = ggnetwork(biden_network),
       aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "lightgray", aes(size = weight)) +  # no arrows: edges are undirected
  geom_nodes() +
  geom_nodelabel(aes(label = name)) +
  theme_blank() +
  theme(legend.position = "none") +
  ggtitle("Topic Relationships in Biden's Speeches")
One interesting takeaway from these visualizations is the more focused quality of Biden's speeches. While Trump frequently references guns, trade, and Iran in the context of other topics, Biden does so much less. A second observation is that both candidates connect similar main topics, including health care, education, COVID-19, and the economy. For Biden, climate change is also one of these topics; this is not the case for Trump.
Network Analysis for Most Commonly Used Words
Our second set of networks focuses on the most commonly used words across all speeches. The same procedure was used as above.
One observation from these networks is how central the word 'people' is to both candidates' speeches. It is the most commonly used word by both candidates, so it makes sense for it to be central in these networks. Furthermore, Biden's use of common words is more focused, with nearly all of the strongest edges involving the word 'people'. Trump's use of these words, on the other hand, is more spread out.
Speech Analysis Over Time
# Line graphs 1-2: analysis of the word 'america' over time
biden_linegraph_words3 %>%
  filter(word == "america") %>%
  ggplot(aes(x = as.Date(file, "%m-%d-%y"), y = n)) +
  geom_line(color = "blue", size = 0.5) +
  geom_point(color = "blue") +
  labs(title = "Biden's use of the word 'America' over time.",
       x = "Date", y = "Word Frequency per Day") +
  theme_minimal()

trump_linegraph_words3 %>%
  filter(word == "america") %>%
  ggplot(aes(x = as.Date(file, "%m-%d-%y"), y = n)) +
  geom_line(color = "red", size = 0.5) +
  geom_point(color = "red") +
  labs(title = "Trump's use of the word 'America' over time.",
       x = "Date", y = "Word Frequency per Day") +
  theme_minimal()
For the line graphs, we converted our text files into two data frames, one for Trump and one for Biden. We captured word frequencies much as in the word-cloud code above (following the approach from the Emily Dickinson word-cloud lab) and then removed stop words. After removing some extraneous text, we filtered for the words 'america', 'president', and 'virus', as shown above, and created six line graphs in total (three per candidate) with the date on the x-axis and the frequency of each word on the y-axis.
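The data frames themselves (biden_linegraph_words3 and its Trump counterpart) are not shown above. The sketch below is one way to build such a per-speech word-count table; it assumes one plain-text transcript per file, named by date (e.g. 10-15-20.txt), and the directory and object names are hypothetical.
# Hypothetical layout: one transcript per file, named by date (e.g. "10-15-20.txt")
biden_files <- list.files("biden_speeches", full.names = TRUE)

biden_linegraph_words <- tibble(
    file = basename(tools::file_path_sans_ext(biden_files)),
    text = sapply(biden_files, function(f) paste(readLines(f), collapse = " "))
  ) %>%
  unnest_tokens(output = word, input = text) %>%   # one row per word
  anti_join(stop_words2, by = "word") %>%          # drop stop words
  count(file, word)                                # frequency per speech
The date can then be recovered from the file column with as.Date(file, "%m-%d-%y"), as in the plotting code above.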
We can see that Biden's use of the word 'America' remained relatively consistent over the last stage of his campaign. A noticeable outlier toward the end corresponds to his victory speech.
We can also see that Trump used the word 'America' about as often as Biden did during his campaign. However, there is a noticeable decrease after Joe Biden defeated him in the general election.
Biden used the word 'president' with similar frequency, and again we see a noticeable outlier corresponding to his victory speech.
There is no notable difference between how often Biden and Trump used the word 'president', though as before, usage drops off somewhat after the general election.
Trump and Biden both used the term 'virus' a similar amount during their public appearances. Biden mentioned it heavily at the start of the COVID-19 pandemic and again in his victory speech. Trump mentioned it fairly consistently until his defeat by Biden, after which he rarely talked about it.
Limitations, Pitfalls, and Future Research
There were a number of limitations to our analyses. For the network analysis in particular, tuning the parameters always required compromise. No value of d was perfect: any choice captured some real connections while also picking up noisy or false ones. For example, candidates like to list topics in groups even when those topics are completely unrelated, and our text mining incorrectly establishes links between them. We also faced the challenge of choosing the right number of edges to display: too many edges produce a cluttered visualization, while too few omit information. Our compromises accepted both of these problems to some degree.
The main limitation of the word cloud is that it gives more visual emphasis to longer words than to shorter ones, distorting the apparent frequencies: long, heavily emphasized words look more frequent than they are. As the Joe Biden word cloud shows, word clouds also become less effective when word frequencies are less varied; because words are colored by frequency, most of the words in Biden's cloud appear in the same blue.
For the line graphs, the Joe Biden speeches available for scraping covered a larger timeframe than the Donald Trump speeches. Although we had roughly the same number of speeches for each candidate, the results might differ if we looked at the same timeframe for both.
While we feel we adequately answered our specific questions, it is difficult to draw inferences beyond them to larger questions, such as how the candidates' different speech patterns contributed to the 2020 election. One interesting idea for future research is to analyze the sentiment of the text, to determine the tone of the language used and how it shifted over time.
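As a rough starting point for that direction, tidytext's sentiment lexicons could be joined against per-speech word counts like those sketched in the line-graph section. The sketch below assumes the hypothetical biden_linegraph_words table from that section, and the Bing lexicon is just one choice among several.
# Net sentiment (positive minus negative word counts) per speech,
# using the Bing lexicon bundled with tidytext
biden_sentiment <- biden_linegraph_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(file, sentiment, wt = n) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)
Plotting net_sentiment against the speech dates, as we did for the word frequencies, would then show how each candidate's tone shifted over the campaign.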