NLP on the 2018 Texas Race using the Reddit API

Objective

Using the tools learned in General Assembley's Data Science Immersive course, we will collect data via an API request and then building a binary classification predictior. In particular, we will apply this in the realm of Reddit. Using NLP, what characteristics of a post on Reddit contribute most to which subreddit it belongs to? Our data will focus on Texas Senate race between Ted Cruz (R) and Beto O'Rourke (D).

Python Packages Used:

Pandas
Numpy
Scikit-Learn

Data Collection

Build function to scrape reddit with inputs of subreddit name and number of pages to pull Reddit's API is limited to the last 1000 post for any subreddit, so how active a subreddit is can determine how frequently it is necessary to pull pages.

Clean data into useful corpus and assign each with a label of it's subreddit source. This creates domain specific training data that then can be used to classify new data.

Modeling

Classification

Use GridSearch and Pipelines to find best predictive model Run through different model types and GridSearch hyperparameters to tune for best performance
Metric Targeted: Accuracy

Findings

We can apply our model to a different Reddit subreddit to gauge the pulse and discussion of untagged posts. We will reference r/Texas with politic flags to see what this 'neutral' space is discussing more in terms of candidates.

After supplying r/texas data into the model, 69% of posts were geared toward Beto while 31% was geared towards Cruz. This initially indicates an advantage for Beto being the center of discussion but the sentiment for each post needs to be factored in as well.

Conclusion

Although we had a model that performed at a accuracy of 94%, the context of the discussion was heavily negative for both candidates for classification with a high probability.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Project_3_Reddit_JAB.ipynb		Project_3_Reddit_JAB.ipynb
README.md		README.md
presentation.pdf		presentation.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NLP on the 2018 Texas Race using the Reddit API

Objective

Data Collection

Modeling

Findings

Conclusion

About

Releases

Packages

Languages

babyakja/subreddit_nlp_tx_senate_race

Folders and files

Latest commit

History

Repository files navigation

NLP on the 2018 Texas Race using the Reddit API

Objective

Data Collection

Modeling

Findings

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages