A Look at Trump and Clinton’s Tweets Using Tweepy – Part 2: Term Frequency

This post is part of a series.
Part 1 can be found here.
Part 3 can be found here.

Word Frequency: Who Talks About What?

Next, I was curious about the topics discussed by both candidates. I used a word frequency counter for this. Initially I looked at the word frequencies in aggregate for the users’ last 2,000 tweets. This proved to be less interesting than segmenting the tweets month-by-month. The month-by-month method helped to show some topic shifts over time, particularly among Donald’s tweets. Keep in mind that this analysis was performed in mid-August.

There’s certainly a more elegant way to bucket these tweets, but I wanted something quick:

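import datetime
import tweepy

# `api` is assumed to be an authenticated tweepy.API instance created earlier (not shown here)
userID = "realDonaldTrump"
setsize = 2000
# Tweets seen per month, in the same order as fNameList (August first)
monthTweetCounter = [0,0,0,0,0]
fNameList = ['tweetsfrom' + userID + 'August.json',
             'tweetsfrom' + userID + 'July.json',
             'tweetsfrom' + userID + 'June.json',
             'tweetsfrom' + userID + 'May.json',
             'tweetsfrom' + userID + 'April.json']

Augf = open(fNameList[0], 'w')
Julf = open(fNameList[1], 'w')
Junf = open(fNameList[2], 'w')
Mayf = open(fNameList[3], 'w')
Aprf = open(fNameList[4], 'w')

# NOTE: the original listing opens these files but does not show the write step.
# Writing the tweet text, one tweet per line, is an assumption that matches
# how the files are read back line by line later on.
for status in tweepy.Cursor(api.user_timeline, id = userID).items(setsize):
    # >= so that a tweet posted exactly at midnight lands in the new month
    if(status.created_at >= datetime.datetime(2016, 8, 1)):
        monthTweetCounter[0] += 1
        Augf.write(status.text.replace('\n', ' ') + '\n')
    elif(status.created_at >= datetime.datetime(2016, 7, 1)):
        monthTweetCounter[1] += 1
        Julf.write(status.text.replace('\n', ' ') + '\n')
    elif(status.created_at >= datetime.datetime(2016, 6, 1)):
        monthTweetCounter[2] += 1
        Junf.write(status.text.replace('\n', ' ') + '\n')
    elif(status.created_at >= datetime.datetime(2016, 5, 1)):
        monthTweetCounter[3] += 1
        Mayf.write(status.text.replace('\n', ' ') + '\n')
    elif(status.created_at >= datetime.datetime(2016, 4, 1)):
        monthTweetCounter[4] += 1
        Aprf.write(status.text.replace('\n', ' ') + '\n')

# Close the files so that everything is flushed to disk
for f in (Augf, Julf, Junf, Mayf, Aprf):
    f.close()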
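For readers who want the tidier version hinted at above, here is a minimal sketch of one way to do the bucketing: key each tweet by its (year, month) pair instead of writing one branch per month. This is not the code used for this post. The function name `bucket_by_month` and the sample data are made up for illustration, and the tweets are represented as plain `(created_at, text)` pairs rather than Tweepy status objects.

```python
import datetime
from collections import defaultdict


def bucket_by_month(tweets):
    """Group (created_at, text) pairs under a (year, month) key."""
    buckets = defaultdict(list)
    for created_at, text in tweets:
        buckets[(created_at.year, created_at.month)].append(text)
    return buckets


# Made-up sample data standing in for statuses fetched from the API.
sample = [
    (datetime.datetime(2016, 8, 3, 14, 5), "first sample tweet"),
    (datetime.datetime(2016, 8, 1, 0, 0), "second sample tweet"),
    (datetime.datetime(2016, 7, 30, 9, 30), "third sample tweet"),
]

buckets = bucket_by_month(sample)
# Number of tweets per month, the equivalent of monthTweetCounter
month_counts = {key: len(texts) for key, texts in buckets.items()}
print(month_counts)
```

With this approach a new month needs no new branch or file handle, and the per-month counts fall out of the bucket sizes.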


This task requires a considerable amount of pre-processing to eliminate all of the unimportant terms. I considered including the pre-processing, but I think it would serve to confuse and distract readers. My methodology here was a tweaked version of Marco Bonzanini’s. If you wish to perform a similar analysis, I’d recommend his post on pre-processing. In addition to filtering generic “stop-words”, I chose to pull out some Twitter-specific terms (like “RT”) and some candidate-specific noise.
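To give a rough idea of what that filtering involves, here is a heavily simplified sketch. It is not the pre-processing used for this post, nor Marco Bonzanini’s tokenizer: the regular expression, the tiny stop list, and the helper `is_noise` are all assumptions made for illustration. Only the names `preprocess` and `stop` are borrowed from the code further down, and the sample tweet is made up.

```python
import re

# Small illustrative stop list; a real run would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}
# Twitter-specific noise such as the retweet marker.
TWITTER_NOISE = {"rt", "via", "amp"}
stop = STOP_WORDS | TWITTER_NOISE

# Order matters: URLs, mentions and hashtags are matched before plain words.
TOKEN_RE = re.compile(r"https?://\S+|@\w+|#\w+|\w+(?:['’]\w+)*")


def preprocess(text, lowercase=False):
    """Split a tweet into tokens, optionally lowercasing them."""
    tokens = TOKEN_RE.findall(text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens


def is_noise(term):
    """True for stop words, links, hashtags and mentions."""
    return term in stop or term.startswith(("http://", "https://", "#", "@"))


# Made-up sample tweet
tweet = "RT @example: Thank you to the great people of Ohio! #Example https://example.com/x"
terms = [t for t in preprocess(tweet, True) if not is_noise(t)]
print(terms)  # ['thank', 'you', 'great', 'people', 'ohio']
```

Candidate-specific noise could be handled the same way, by adding those terms to the stop set.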

Below is the code segment that follows pre-processing. I take the month-by-month files created above, remove any hashtag terms or responses to other accounts, run the remaining terms through the filters detailed above, and then output the 5 month-by-month bar charts:

from collections import Counter
import vincent

# preprocess() and stop come from the pre-processing step, which is not shown here
userID = "realDonaldTrump"
fNameList = ['tweetsfrom' + userID + 'August.json',
             'tweetsfrom' + userID + 'July.json',
             'tweetsfrom' + userID + 'June.json',
             'tweetsfrom' + userID + 'May.json',
             'tweetsfrom' + userID + 'April.json']

for fname in fNameList:
    with open(fname, 'r') as f:
        # One counter per monthly file
        count_all = Counter()
        for line in f:
            tweet = line
            # Count hashtags only
            terms_hash = [term for term in preprocess(tweet, True)
                  if term.startswith('#')]
            # Count terms only (no hashtags, no mentions)
            terms_only = [term for term in preprocess(tweet, True)
                  if term not in stop and
                  not term.startswith(('#', '@'))]

            # Update the counter
            count_all.update(terms_only)

    # Build a bar chart of the 20 most frequent terms for this month
    word_freq = count_all.most_common(20)
    labels, freq = zip(*word_freq)
    data = {'data': freq, 'x': labels}
    bar = vincent.Bar(data, iter_idx='x')


The Output

Using this method, I produced 5 bar charts for each candidate, one per month. Keep in mind that the candidates do not tweet at the same rate (as discussed in Part 1), so comparing the raw numbers between candidates each month would be largely pointless. Here are the bar charts themselves. In a moment I’ll discuss some of the more interesting findings:

Trump August: [bar chart]

Trump July: [bar chart]

Trump June: [bar chart]

Trump May: [bar chart]

Trump April: [bar chart]

Clinton August: [bar chart]

Clinton July: [bar chart]

Clinton June: [bar chart]

Clinton May: [bar chart]

Clinton April: [bar chart]
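The charts above show raw counts, which is why the numbers should not be compared between the two candidates. One way to put them on a common scale, which I did not do for this post, would be to divide each count by the number of tweets in that month. Here is a minimal sketch; the function `relative_frequencies` and all of the numbers are made up to show the arithmetic only.

```python
from collections import Counter


def relative_frequencies(term_counts, n_tweets):
    """Convert raw term counts into occurrences per 100 tweets."""
    return {term: 100.0 * count / n_tweets
            for term, count in term_counts.items()}


# Made-up counts and tweet totals, not taken from the charts.
counts_a = Counter({"great": 30, "thank": 20})
counts_b = Counter({"together": 45, "families": 15})

rates_a = relative_frequencies(counts_a, n_tweets=200)
rates_b = relative_frequencies(counts_b, n_tweets=600)
print(rates_a)  # {'great': 15.0, 'thank': 10.0}
print(rates_b)  # {'together': 7.5, 'families': 2.5}
```

In this made-up case the second account has the larger raw count (45 against 30) but the lower rate per 100 tweets, which is exactly the distortion that raw counts hide.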

1) They talk about themselves and each other more than anything else

By a landslide, both candidates use the term “Hillary” more than anything else. Trump’s frequent attacks on “Crooked Hillary Clinton” ensure that all 3 of these individual terms score highly each month. The exception here is April, which predates the “Crooked” nickname. In April, Trump was still contending with Ted Cruz and John Kasich, which explains why their names rank on his list.

Hillary’s Twitter account is used frequently to quote her own speeches. Before I removed additional punctuation, open and close quotation marks ranked highly in Hillary’s frequent terms. Since she attributes these quotes to herself at the end of each, her own name ranks highly in her frequent terms. Trump’s last name frequently overtakes “Hillary” as her most used term, which suggests that her campaign is comparably negative in its focus.

2) Clinton talks about “women”, “families”, and “together”; Trump talks about “great”, “thank”, and “media”

For each candidate, I selected three terms that they return to repeatedly. My interpretations of them are subjective.

Hillary seems to be referring to the demographics she already does well with or wants to further secure: women and (working) families. I imagine the frequent use of “together” is to create a sense of ownership over one’s political opinion. The “I’m With Her” slogan is proven to have a similar effect. Voters are more likely to actually get to the polls if they feel a sense of identity with their voting choice, rather than a simple preference.

For Donald, “great” is self-evident, as it is part of his campaign slogan: “Make America Great Again”. “America” and “Americans” unsurprisingly ranked highly for both candidates. “Thank” has been used frequently after victories or meetings with other significant political figures. I included “media” because it ranked highly in August (alongside “dishonest”), as Donald received particularly negative attention from news organizations that month.

3) Hillary responds to the shooting in Orlando in June with “gun” and “lgbt”; Trump singles out Elizabeth Warren in May

The shooting in Orlando created a shift in Hillary’s Twitter topics. She chose to focus on the issues of gun control and hate crimes against the LGBT community, although the dialogue on this particular event has been very multifaceted. The Orlando shooting did not create a shift in Donald’s frequent terms (“ISIS” appears in July, not June), though I’m sure he addressed the topic.

Evidence of Donald’s publicized criticism of Elizabeth Warren appears in the May chart, where “Elizabeth” and “Warren” rank prominently, more so even than “Bernie”. “Goofy” (his nickname for Elizabeth) appears as well that month.

