Using Latent Dirichlet Allocation (LDA) to model the topics discussed during US Presidential Election debates held between 1960 and 2016, and to analyze how those topics vary
The paper can be downloaded here, and the code used is accessible on GitHub.
Topic models have often been trained on political texts, allowing an understanding of thematic party preferences and political bias. One such type of political text is the transcript of election debates between candidates. This paper researches different topic models to analyze the transcripts of US election debates hosted between 1960 and 2016. These debates took place both at the party level, to determine party nominees, and at the general level, between the final presidential candidates. In total, 137 debates were used for the text analysis and topic modeling. The topic modeling produced the distributions (weights) of 9 key topics across all the debates, characterizing each debate by its topics and their weights. The 9 topics found to best characterize the debates were: the economy, healthcare, immigration, foreign policy, education, war & terrorism, energy, environment and social issues. To obtain the topic scores, supervised machine learning was employed in tandem with Latent Dirichlet Allocation (LDA) models, a form of topic modeling.
A form of unsupervised learning, topic models display the distributions of a pre-set number of differentiable topics within a corpus of documents. Each debate transcript is scraped from an online source, The American Presidency Project, and is treated as a corpus, with each sentence spoken by a participant treated as a document within that corpus. To investigate the distribution of the topics discussed by candidates, the moderator comments are dropped and the Natural Language ToolKit (NLTK) is used to clean the text. A combination of text analysis techniques and topic models is then used to extract the weighting of each topic discussed in the 137 US presidential and party-level debates. A weighting represents the proportion of the entire debate spent discussing a topic: if a topic has a weight of 0.25 for a debate, then 25% of that debate was spent discussing that topic. This allows each debate to be characterized by a set of topics, and the resulting data shows how much each topic was discussed by the participants at each debate. The analyses and results presented in the paper are based on the variations in topic weightings between debates. With these topic scores, insights can be drawn about party preferences and topic trends between election cycles.
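The preparation step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: it assumes transcript lines shaped like "SPEAKER: text", uses a hypothetical set of moderator labels, and swaps in a regex sentence splitter and a tiny hand-written stopword list in place of NLTK so that the example is self-contained. The names split_turns and transcript_to_documents are made up for this sketch.

```python
import re

# Hypothetical speaker labels treated as moderators; real transcripts vary.
MODERATOR_LABELS = {"MODERATOR"}

# Tiny illustrative stopword list; NLTK's stopword corpus would normally be used.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "we", "our", "that", "this", "it", "for", "on", "be", "will"}

# Assumed line format: an upper-case speaker label, a colon, then the speech.
SPEAKER_TURN = re.compile(r"^([A-Z][A-Z .'-]+):\s*(.*)$")


def split_turns(transcript):
    """Yield (speaker, text) pairs from lines shaped like 'SPEAKER: text'."""
    for line in transcript.splitlines():
        match = SPEAKER_TURN.match(line.strip())
        if match:
            yield match.group(1).strip(), match.group(2)


def transcript_to_documents(transcript):
    """Drop moderator turns and return one token list per participant sentence."""
    documents = []
    for speaker, text in split_turns(transcript):
        if speaker in MODERATOR_LABELS:
            continue  # only candidate speech is modeled
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            tokens = [t for t in re.findall(r"[a-z]+", sentence.lower())
                      if t not in STOPWORDS]
            if tokens:
                documents.append(tokens)
    return documents
```

Each returned token list corresponds to one document in the debate's corpus, which is the unit a topic model would then be trained on.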
Summary Statistics of the Raw Data
Using the within-topic word probabilities output by gensim's LdaMulticore, the probability and weighting of each topic in each debate were calculated. These values were used to determine the share of discussion devoted to each topic for each type of debate during each election cycle. From this data, graphs were created to show how the political landscape has changed.
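One plausible way to turn per-sentence topic probabilities into a single set of weights per debate is sketched below. It assumes each sentence's distribution is a list of (topic_id, probability) pairs, the shape returned by gensim's get_document_topics, and it weights each sentence by its token count. The length weighting and the function name debate_topic_weights are assumptions made for illustration; the paper may aggregate differently.

```python
def debate_topic_weights(doc_topics, doc_lengths, num_topics):
    """Combine per-sentence topic distributions into debate-level weights.

    doc_topics:  one list of (topic_id, probability) pairs per sentence
    doc_lengths: token count of each sentence, used as its weight (an assumption)
    num_topics:  number of topics in the model (9 in the paper)
    """
    totals = [0.0] * num_topics
    for distribution, length in zip(doc_topics, doc_lengths):
        for topic_id, probability in distribution:
            totals[topic_id] += probability * length

    grand_total = sum(totals)
    if grand_total == 0:
        return totals  # empty debate: all weights stay at zero
    # Normalize so the weights sum to 1 and read as shares of the debate.
    return [total / grand_total for total in totals]


# Two sentences of equal length: one split evenly between topics 0 and 1,
# the other entirely about topic 0.
weights = debate_topic_weights(
    doc_topics=[[(0, 0.5), (1, 0.5)], [(0, 1.0)]],
    doc_lengths=[2, 2],
    num_topics=2,
)
```

Here topic 1 ends up with a weight of 0.25, which under the interpretation above means that a quarter of this toy debate was spent on that topic.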
Summary of the data produced by the LDA
Graph 1 (below) shows the weight of each topic across the four debate categories. Healthcare, energy and the environment have received the least attention. Further, immigration is significantly lower during presidential debates relative to other debates. The same holds for healthcare and social issues during Republican debates, where their topic shares are lower than the mean.
Graph 4 below, retrieved from the paper, shows the top 3 topics discussed at the Presidential Debates between 2000 and 2016. These topics (the economy, foreign policy, and war & terrorism) have accounted for an average of 70.236% of all discussions.