Joey Livorno | [email protected] | 4.24.2020
- Introduction
- Background Summary
- Project Proposal
- Data Gathering and Cleaning
- Exploratory Data Analysis
- Linguistic Analysis
- Machine Learning Analysis
- Problems
- Conclusion
- Sources
Twitter is a great place to consume a diverse wealth of content. No other platform today provides users with such a streamlined feed of news and entertainment in the way that Twitter does, which I believe is because of the "live-action" culture of the app. Because people and corporations are constantly tweeting updates about the topics we care about, Twitter has become the four-billion-dollar company that it is today.
While Twitter has become such a great source of media consumption, it has not come without its drawbacks. The ease of use has also led to widespread misinformation and abuse. One of the most notorious Twitter users just so happens to be the President of the United States. Whether you are for or against Donald Trump, it is no secret that he has used his Twitter influence not only to share false information at times, but more importantly to endorse his friends and disparage his adversaries. This is a political tactic that is used by many public figures, but is most notably seen with the President. In this project, my goal is to analyze the contents of Donald Trump's tweets and make discoveries regarding the overall trends of his tweeting habits.
Donald Trump was elected 45th President of the United States in November of 2016. Although he largely came into the public eye in 2015 when he began his campaign, Donald Trump has been a well-known celebrity for a very long time. His Twitter career began in 2009, though, just three years after Twitter was created. Since then, Trump has amassed over 78 million followers on the app and has tweeted about fifty thousand times. Needless to say, Trump's Twitter influence has become rather significant, but the way in which he uses this influence has been subject to controversy.
According to PolitiFact, Donald Trump is guilty of spreading misinformation on Twitter, especially in recent times during the COVID-19 pandemic. Trump also uses his influence to disparage his enemies, which can be seen in the many "nicknames" he has given political opponents. Some notable examples of this are "Sleepy" Joe Biden, "Crazy" Bernie Sanders (or Nancy Pelosi), and Elizabeth "Pocahontas" Warren. Rather than attack his opponents on the merits of their ideas, he quite literally resorts to name calling. This has become a common theme of his political career, as he often resorts to simple buzzwords--slogans, almost--to represent his positions on various matters. This can most certainly be seen in his tweets, so being aware of the influence he has is quite relevant.
With all this in mind, my project proposal is simply to analyze the content of Trump's Twitter feed to find any striking trends in his habits. If there is any merit to the common assumption that Trump is a notorious Twitter user, then the data should in theory support this in one way or another. More specifically, my plan is to analyze the data from three perspectives. First, I will conduct an exploratory data analysis, which will look at the overall trends in his data and try to link them to his influence. Second, I will conduct a more linguistic analysis of the data, which will be characterized by a sentiment analysis. Finally, I will construct two machine learning models that will predict various aspects of his tweets. The first will predict whether or not a tweet was composed by Trump, and the second will predict which sentiment range a tweet falls into. In order to do this, though, my first task is to obtain the data and clean it.
The first step of my project was to gather the data. The obvious first source I investigated was the Twitter API, the interface that allows users to pull Twitter data directly from the platform. There were several problems with this approach, though. First, the data is very unorganized, with each row containing about 190 attributes! This is much more than what I needed. Another problem is that with basic API access, one can only obtain about 3,200 tweets. I needed more data than this to draw sound conclusions, so I decided to look elsewhere for my data. I found a website called the Trump Twitter Archive, which is a free-to-use website created by Brendan Brown that collects all of Trump's Twitter data and provides it in a clear and easily downloadable format. I also needed a random sample of tweets that I could compare to Trump's, so for this I used a dataset from followthehashtag.com. This website also provided its data for free, and I was able to obtain about 200,000 tweets from April 14-16, 2016.
Once I had obtained the data, I had to prepare each of the sets for analysis. For the Trump data, I used TextBlob, a library that allows for relatively simple implementations of Natural Language Processing. I used its sentiment analysis to create the polarity values that I would use for my analysis. Next, I placed each of the tweets in categories that described their sentiment on a higher level. Low sentiment values were given the label 'L', low-neutral values 'LN', and so on. Finally, I added miscellaneous attributes to the data that would make it easier to process, such as the isolated year and a tokenized version of the text.
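The labeling step described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function names are hypothetical, and the cut points between the four labels (-0.5, 0.0, 0.5) are assumed purely for illustration, since the report does not state which thresholds were used. The only TextBlob call relied on is `TextBlob(text).sentiment.polarity`, which returns a float between -1.0 and 1.0.

```python
def polarity_of(text):
    """Return TextBlob's polarity score for a tweet, a float in [-1.0, 1.0]."""
    # Imported lazily so the labeling helper below works without TextBlob installed.
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def polarity_label(polarity):
    """Map a polarity score to one of the four coarse labels L, LN, HN, H.

    The cut points (-0.5, 0.0, 0.5) are assumptions for illustration only.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be between -1.0 and 1.0")
    if polarity < -0.5:
        return "L"   # low
    if polarity < 0.0:
        return "LN"  # low-neutral
    if polarity < 0.5:
        return "HN"  # high-neutral
    return "H"       # high
```

With these helpers, `polarity_label(polarity_of(tweet_text))` would give the category for a single tweet.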
I then followed a similar path with the randomly sampled data. First, I got rid of all the columns except for the tweet text and added the label 'NT', meaning 'Not Trump.' After this, I copied the dataframe and assigned polarity values to these copied rows. Finally, I created a copy of the Trump data that only contained the text content and the label 'T', meaning 'Trump', and concatenated the two. Because there were only about forty thousand Trump tweets, I used only about a fifth of the random Twitter data; otherwise the two classes would have been badly imbalanced and my machine learning models would have been skewed toward the larger class.
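The balancing and concatenation can be sketched in plain Python. This is only an illustration under assumptions: the project worked with dataframes, whereas this version uses lists of `(text, label)` pairs, and the function name and `seed` parameter are hypothetical.

```python
import random


def balance_and_label(trump_texts, other_texts, seed=0):
    """Downsample the larger set, label both, and return shuffled (text, label) rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n = min(len(trump_texts), len(other_texts))
    trump_sample = rng.sample(list(trump_texts), n)
    other_sample = rng.sample(list(other_texts), n)
    rows = [(text, "T") for text in trump_sample]
    rows += [(text, "NT") for text in other_sample]
    rng.shuffle(rows)  # avoid all 'T' rows preceding all 'NT' rows
    return rows
```

Sampling both classes down to the size of the smaller one keeps the two labels equally frequent, which is what makes a 50% accuracy baseline meaningful later on.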
In this section, I looked at the data as a whole to find any striking features. The first thing I noted was that Trump's overall polarity was surprisingly high. It was by no means high in absolute terms, but it was in the high-neutral range, and given the discourse about Trump's Twitter, I would've expected it to be lower. What was even more surprising was that the average polarity of the random users was lower than that of the President. While this was quite shocking, I think the most likely reason for my shock is a bias that is prevalent in media today. Simply put, we pay more attention to incendiary material. If the President tweets something nice on a holiday, no one cares, but if he tweets something sassy about Nancy Pelosi, everyone, for better or worse, eats it up. It's because of these tweets--the outliers--that I would expect his polarity to be much lower, but they are just that: outliers.
Other than that, everything was more or less as expected. Overall, his influence has definitely grown over time, with very sharp increases around 2016, which again is when he came into the public eye as a Presidential candidate. This is evidenced by the trends in both the retweets and the favorites he receives on the app.
In this portion of my project, I focused on conducting the sentiment analysis of Trump's tweets. To do this, I analyzed the sentiment of his tweets by year, both in their complete form and when broken into smaller divisions based on keywords. I created these smaller divisions by making subgroups of the dataframes that contained certain words. For example, for the 'russia' subgroup, I included the words 'russia', 'russian', 'moscow', and 'putin'. I did this in a similar fashion for China and Iran.
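A keyword subgroup and its yearly average sentiment could be computed along these lines. This is a sketch under assumptions: the field names `year`, `text`, and `polarity` are hypothetical, the matching is a simple case-insensitive substring check (which may over-match), and only the Russia keyword list is shown because it is the only one spelled out above.

```python
from collections import defaultdict
from statistics import mean

# Keyword list taken from the text; the China and Iran lists are not given.
KEYWORDS = {
    "russia": ["russia", "russian", "moscow", "putin"],
}


def mentions_any(text, keywords):
    """True if the lowercased text contains any keyword as a substring."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def mean_polarity_by_year(tweets, keywords=None):
    """Average polarity per year, optionally restricted to a keyword subgroup.

    `tweets` is an iterable of dicts with 'year', 'text', and 'polarity' keys.
    """
    by_year = defaultdict(list)
    for tweet in tweets:
        if keywords is None or mentions_any(tweet["text"], keywords):
            by_year[tweet["year"]].append(tweet["polarity"])
    return {year: mean(values) for year, values in sorted(by_year.items())}
```

Calling `mean_polarity_by_year(tweets)` gives the overall yearly trend, while `mean_polarity_by_year(tweets, KEYWORDS["russia"])` restricts it to the Russia subgroup.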
Overall, Trump's sentiment has always fluctuated around that high-neutral area, with 2009 and 2010 being outlier years. I would count these outliers as negligible, though, as each of those years contains far fewer tweets than the others. I found the trends regarding Russia to be the most interesting: he spoke quite negatively about the country in 2016 and 2017, but the sentiment of these tweets rose dramatically from 2018 onward. I suspect this is due to the Mueller investigation, which began in May 2017 and was well underway by 2018, and which perhaps caused the focus of Donald Trump's Twitter wrath to shift from the country to the FBI. China and Iran were not as conclusive, but I was surprised that China was not as low as the other countries given the rhetoric we would expect from the President regarding that nation.
I think the machine learning models that I created gave me the most concrete information regarding my original project ambitions. I began this process by making a pipeline that consisted of a TfidfVectorizer and a Multinomial Naive Bayes classifier. I then created the dictionary of parameters that I would use in my tests, and fed both of these to a GridSearchCV object. Finally, I fit the GridSearchCV to my Trump/Not Trump data and printed the results. With an accuracy of 93%, I can confidently say that this run was successful. The absolute baseline would be 50% since there are two roughly balanced options (T and NT), though I would say the human baseline would be a bit higher, especially for someone who is politically literate, or at least a follower of Trump on Twitter.
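The setup described above can be sketched with scikit-learn as follows. The parameter grid, the `build_search` helper, and the toy data are assumptions for illustration: the report does not list the parameters that were actually searched, and the placeholder strings below merely stand in for real tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_search(cv=5):
    """Grid search over a TF-IDF + Multinomial Naive Bayes pipeline."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("nb", MultinomialNB()),
    ])
    # Hypothetical grid: keys are '<step name>__<parameter name>'.
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "nb__alpha": [0.1, 1.0],
    }
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")


# Synthetic placeholder documents, not real tweets.
texts = [
    "alpha beta gamma", "alpha gamma beta beta", "gamma alpha", "beta alpha alpha",
    "delta epsilon zeta", "zeta delta", "epsilon epsilon delta", "zeta epsilon delta delta",
]
labels = ["T"] * 4 + ["NT"] * 4

search = build_search(cv=2)  # tiny cv only because the toy set is tiny
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

Because `GridSearchCV` refits the best pipeline on all of the data by default, `search.predict(new_texts)` can be called directly afterwards. The same structure would apply to the polarity-group model by passing the L/LN/HN/H labels instead of T/NT.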
The next model I ran was constructed in the same way as the first, other than the fact that I gave it different parameters. This time, I fit the model to the polarity group data (L, LN, HN, H) and got a new set of results. This run was a little less successful than the first, maxing out at about 70% accuracy. Because this was also much higher than the baseline--it was 25% this time because there were four options--I would still classify it as successful, especially because it would be much harder for a human to accomplish this task, as the work of TextBlob may seem abstract at a glance. With that in mind, this run almost acts as a testament to the effectiveness of TextBlob as a reliable sentiment analysis tool.
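The 25% figure assumes the four labels are equally common. If they are not, the share of the most frequent label is a stricter baseline, since a model could reach it by always guessing that label. The short sketch below computes both; the function names and the example label counts are hypothetical and are not taken from the project's data.

```python
from collections import Counter


def chance_baseline(labels):
    """Accuracy of uniform random guessing: 1 / number of distinct labels."""
    return 1 / len(set(labels))


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Invented label distribution, for illustration only.
example = ["HN"] * 6 + ["H"] * 2 + ["LN"] * 1 + ["L"] * 1
print(chance_baseline(example))
print(majority_baseline(example))
```

Comparing the model's accuracy against whichever of the two baselines is higher gives a fairer picture of how much it has actually learned.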
I experienced many issues and obstacles over the course of my project, and while most of them were overcome, I had to work around several and, at times, completely drop the feature that caused them. One such problem I experienced early on was figuring out what exactly my project would do. I worried that I wouldn't have enough in my analysis section at first, which is why I decided to add a machine learning portion. Another problem I experienced was choosing which matters I would analyze with machine learning. While there were many things I would have liked to analyze, due to time constraints I had to limit it to two. One such matter would have been classifying Trump's tweets against those of another prominent politician. The Trump Twitter Archive also includes the tweets of many other public figures, so gathering the data would not have been difficult, but I simply did not have the time to process it, create another model, and then analyze the results.
A more technical problem, and one where I chose to drop the feature entirely, arose when I attempted to use regex to find and replace emojis in the tweets with their text representation. Emojis do exactly what their name suggests: they convey emotion. With this in mind, it didn't make sense for me to just ignore them, and I even found a dictionary on GitHub that had the necessary conversions. I couldn't get the method that would replace them to work, though, and so I had to abandon this plan.
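For reference, one way the replacement might be approached is sketched below. It is not the method that was attempted in the project: the three-entry mapping is a hypothetical stand-in for the GitHub dictionary, and the function name is made up for illustration. The idea is to build a single regex alternation from the dictionary keys and let `re.sub` look up each match.

```python
import re

# Hypothetical stand-in for the emoji-to-text dictionary found on GitHub.
EMOJI_TO_TEXT = {
    "\U0001F600": "grinning_face",
    "\U0001F621": "angry_face",
    "\U0001F44D": "thumbs_up",
}

# Longest keys first so multi-codepoint emojis win over their prefixes.
_EMOJI_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOJI_TO_TEXT, key=len, reverse=True))
)


def replace_emojis(text):
    """Replace each known emoji with its text name, separated by spaces."""
    replaced = _EMOJI_PATTERN.sub(lambda m: f" {EMOJI_TO_TEXT[m.group(0)]} ", text)
    return " ".join(replaced.split())  # collapse the extra whitespace
```

Emojis that are missing from the dictionary are left untouched, so the quality of the result depends entirely on how complete the mapping is.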
Overall, I would say that my project was successful. I found that a lot of the tweeting habits that we associate with Trump can be supported by the trends of his Twitter data, but more importantly I developed a lot of personal skills in terms of data processing and analysis that will help me in my future career.
While Twitter is a wonderful app that has done a lot to better the world by providing the public with valuable information, users have also used it to cause harm in both direct and indirect ways. When someone like you or me tweets, the text rarely harms anyone directly, but when someone with the influence of Donald Trump tweets, that message can have serious repercussions for groups of people. It is important for us to be aware of this influence so that if it is misused, we can be informed and check the person who is in the position of power.
- https://www.gobankingrates.com/money/business/how-much-is-twitter-worth/
- https://www.vanityfair.com/news/2020/03/twitter-manipulation-policy-biden-trump
- https://developer.twitter.com/en/docs
- https://www.politifact.com/factchecks/2020/apr/16/donald-trump/donald-trump-falsely-claims-nancy-pelosi-deleted-v/
- https://www.rollcall.com/2019/08/26/trumps-nicknames-ranked-as-he-locks-in-on-2020-foes-and-foils/
- http://www.trumptwitterarchive.com
- http://followthehashtag.com
- https://textblob.readthedocs.io/en/dev/