The main goal of this project is to collect data by scraping Reddit and then build a binary classifier that identifies which subreddit a given post came from.
For the project I chose two closely related subreddits:
- Dog lovers' subreddit - r/dogs: https://www.reddit.com/r/dogs/
- Dog haters' subreddit - r/Dogfree: https://www.reddit.com/r/Dogfree
This model could be used to identify dog lovers and dog haters from their posts on other social networks and websites, and to serve them targeted ads.
- Data was gathered from Reddit's API using the Python requests library; the API returns each page's content as JSON (a collection sketch follows the column table below). Collection was automated with cron on AWS.
- The final dataset contains 2,245 rows and 3 columns:
Column | Type | Description |
---|---|---|
subreddit | string | which subreddit the post belongs to (dogs/Dogfree) |
title | string | title of the post |
selftext | string | body text of the post, if present |
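The notebooks themselves are not reproduced in this README, but a minimal sketch of the collection step might look like the following, assuming Reddit's public `.json` listing endpoints. The function name `fetch_posts`, the page count, and the User-Agent string are illustrative, not the notebook's actual code.

```python
import time
import requests

HEADERS = {"User-Agent": "subreddit-classifier example script"}

def fetch_posts(subreddit, pages=4):
    """Fetch several pages of posts from a subreddit's public JSON endpoint."""
    url = f"https://www.reddit.com/r/{subreddit}.json"
    after, rows = None, []
    for _ in range(pages):
        resp = requests.get(url, headers=HEADERS,
                            params={"after": after, "limit": 100})
        resp.raise_for_status()
        data = resp.json()["data"]
        for child in data["children"]:
            post = child["data"]
            rows.append({
                "subreddit": post["subreddit"],
                "title": post["title"],
                "selftext": post.get("selftext", ""),
            })
        after = data["after"]  # pagination token; None when the listing ends
        if after is None:
            break
        time.sleep(2)          # stay polite to Reddit's rate limits
    return rows

posts = fetch_posts("dogs") + fetch_posts("Dogfree")
```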
- Code folder - Contains all Jupyter notebooks:
- 01_Data_collection.ipynb
- 02_Preprocessing_and_Modeling.ipynb
- Datasets folder - Contains .csv files of the data collected from Reddit's API
- Images folder - Contains images
- AWS folder - Contains the collection script that ran on an AWS EC2 instance, plus a stats file
- Presentation folder - Contains the presentation.
The project consists of the following parts:
- Gathering data from Reddit's API
- Cleaning the data
- Exploratory data analysis (EDA)
- Text preprocessing
- Modeling with grid-searched hyperparameters (see the sketch after this list)
- Evaluating the models
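As a sketch of the cleaning, preprocessing, and modeling steps, the pipeline below combines title and selftext into one text feature, vectorizes it with TF-IDF, and grid-searches a logistic regression. The file name `Datasets/posts.csv` and the specific hyperparameter grid are assumptions for illustration; the actual notebooks may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("Datasets/posts.csv")  # hypothetical file name

# Combine title and selftext into one text feature; missing selftext becomes "".
X = (df["title"] + " " + df["selftext"].fillna("")).values
y = (df["subreddit"] == "dogs").astype(int)  # 1 = r/dogs, 0 = r/Dogfree

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("logreg", LogisticRegression(max_iter=1000)),
])
params = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [2000, 5000],
    "logreg__C": [0.1, 1, 10],
}
grid = GridSearchCV(pipe, params, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```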
Even though the two subreddits are closely related, all three models (logistic regression, naive Bayes, and random forest) scored at least 1.7x better than the baseline accuracy. The best model is logistic regression: it predicts where a given post came from with about 95.4% accuracy, i.e., roughly 19 out of 20 posts.
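Continuing from the modeling sketch above (reusing `grid`, `X_test`, and `y_test`), the evaluation step could compare the best model against a majority-class baseline like so:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Baseline: always predict the majority class in the test set.
baseline = y_test.value_counts(normalize=True).max()

preds = grid.predict(X_test)
print(f"baseline accuracy: {baseline:.3f}")
print(f"model accuracy:    {accuracy_score(y_test, preds):.3f}")
print(confusion_matrix(y_test, preds))
```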