Project 3 : SubReddit Classifier

Problem statement

A book publisher has come up with a new and interesting idea for a cookbook and is looking to publish it. With so much material (such as cooking and baking recipes) available online, they want to check whether people are still interested in reading cookbooks, i.e., whether publishing one will be profitable.

This project compares the top 1000 posts from two subreddits, Bookclub and Cooking, on Reddit, an American social news aggregation, web content rating, and discussion website. Bookclub is a discussion forum about all things related to books: authors, genres, and book recommendations. Cooking, as the name suggests, is a discussion forum about recipes, preparation, cuisines, and all things related to cooking.

In this project, the two subreddits are explored to see whether people in Bookclub and Cooking are discussing cookbooks, despite so many cooking and baking recipes being available online.

Executive Summary

Data Collection & Cleaning

Data is extracted from the Reddit website by web scraping (a technique for automatically accessing and extracting large amounts of information from a website) the Cooking and Bookclub subreddits, and then cleaned. The links to the Jupyter notebooks are as follows:

Click here to open notebook Cooking
Click here to open notebook Bookclub
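
The notebooks linked above hold the actual scraping code. As a rough illustration only, the sketch below assumes the posts were pulled from Reddit's public JSON listings, where each page nests its posts under `data.children` and carries an `after` token pointing to the next page. The function names `extract_posts` and `collect_posts`, and the injected `fetch_page` callable, are hypothetical and not taken from the notebooks.

```python
def extract_posts(listing):
    """Pull the subreddit name and body text out of one listing page."""
    posts = []
    for child in listing["data"]["children"]:
        post = child["data"]
        posts.append({
            "subreddit": post.get("subreddit", ""),
            "selftext": post.get("selftext", ""),
        })
    return posts


def collect_posts(fetch_page, limit=1000):
    """Follow the `after` token from page to page until `limit` posts are gathered.

    `fetch_page` is a hypothetical callable: it takes the current `after`
    token (None for the first page) and returns the parsed JSON listing.
    """
    posts = []
    after = None
    while len(posts) < limit:
        listing = fetch_page(after)
        page = extract_posts(listing)
        if not page:
            break  # empty page: nothing more to read
        posts.extend(page)
        after = listing["data"].get("after")
        if after is None:
            break  # no further pages
    return posts[:limit]
```

Passing the page fetcher in as an argument keeps the parsing logic testable without network access; in a real run it would wrap an HTTP request.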

Exploratory Data Analysis

This step involves the following:

Import and Read data - Reading the csv file

Data Dictionary - Specified below

Data Visualization - Creating histogram and word cloud

Baseline Accuracy - Calculation

Data Dictionary :

Feature	Type	Description
subreddit	int64	specifies the type of subreddit
selftext	object	post of the subreddit
length	int64	length of the post
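
The baseline accuracy listed in the steps above is the score a model would get by always predicting the most common class. A minimal sketch, assuming `subreddit` is the integer label column from the data dictionary; the label values in the usage line are made up for illustration and are not the project's data.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Share of the most frequent class: the accuracy of always guessing it."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Illustrative labels only: three posts of class 1, two of class 0.
print(baseline_accuracy([1, 1, 1, 0, 0]))
```

Any model worth keeping should score clearly above this number on the test set.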

Pre Processing

This step involves the following methods :

Tokenizing - splitting data into distinct chunks

Removing Stopwords - Removing commonly used words (stop words), as they take up space and processing time

Lemmatizing - Returning the base/dictionary form of a word
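
The three steps above can be chained into one function. The sketch below uses only the standard library so that it runs anywhere: the stop-word set and the lemma lookup are tiny hand-written stand-ins (assumptions for illustration), whereas a real run would typically rely on an NLP library's stop-word list and lemmatizer. This README does not say which library the project used.

```python
import re

# Tiny illustrative stand-ins, not the lists used in the project.
STOPWORDS = {"a", "am", "an", "and", "are", "i", "in", "is", "of", "the", "to"}
LEMMAS = {"recipes": "recipe", "books": "book", "chapters": "chapter", "reading": "read"}


def preprocess(text):
    """Tokenize, drop stop words, then map each token to its base form."""
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]  # removing stop words
    return [LEMMAS.get(t, t) for t in tokens]           # lemmatizing (lookup stand-in)


print(preprocess("I am reading the recipes in the books"))
```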

Modeling

This step creates three models and compares them.

Logistic Regression Model

Naive Bayes Model

Decision Tree Model
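
To show the idea behind one of these models, here is a from-scratch multinomial Naive Bayes with add-one smoothing. It is a teaching sketch, not the project's code: this README does not state which implementation or vectorizer was used, and the class name `TinyNaiveBayes` and the toy posts are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration.
docs = [["recipe", "oil", "meat"], ["ingredient", "recipe"],
        ["chapter", "read"], ["book", "read", "author"]]
labels = ["Cooking", "Cooking", "Bookclub", "Bookclub"]
model = TinyNaiveBayes().fit(docs, labels)
print(model.predict(["read", "book"]))
```

Working in log space avoids numerical underflow when posts are long, and the smoothing keeps a single unseen word from zeroing out a class.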

Comparing Models

Train and Test Scores:

Model	Train Score	Test Score
Logistic Regression Model	0.9980769230769231	0.9942418426103646
Naive Bayes Model	0.9871794871794872	0.9846449136276392
Decision Tree Model	0.9993589743589744	0.9328214971209213
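
A quick way to read this table is to look at the gap between train and test score, since a large gap suggests overfitting. The snippet only reuses the numbers from the table above.

```python
# Scores copied from the table above: (train, test).
scores = {
    "Logistic Regression": (0.9980769230769231, 0.9942418426103646),
    "Naive Bayes": (0.9871794871794872, 0.9846449136276392),
    "Decision Tree": (0.9993589743589744, 0.9328214971209213),
}

gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test gap = {gap:.4f}")

largest_gap_model = max(gaps, key=gaps.get)
```

With these numbers the Decision Tree has by far the largest gap (about 0.067), which points to overfitting, while the other two models score almost the same on unseen data as on training data.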

Confusion Matrix Result:

Model	False Positives	False Negatives
Logistic Regression Model	0	3
Naive Bayes Model	8	0
Decision Tree Model	20	15
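
The confusion-matrix counts can be cross-checked against the test scores reported earlier in this section: accuracy = 1 - (FP + FN) / n, so n = (FP + FN) / (1 - accuracy). The test-set size is not stated in this README; the value computed below is derived from the two tables, not quoted from the notebooks.

```python
# (false positives, false negatives, test accuracy) from the two tables in this section.
results = {
    "Logistic Regression": (0, 3, 0.9942418426103646),
    "Naive Bayes": (8, 0, 0.9846449136276392),
    "Decision Tree": (20, 15, 0.9328214971209213),
}


def implied_test_size(fp, fn, accuracy):
    """Solve accuracy = 1 - (fp + fn) / n for n."""
    return (fp + fn) / (1 - accuracy)


sizes = {name: round(implied_test_size(*vals)) for name, vals in results.items()}
print(sizes)
```

All three models imply the same test-set size, so the two tables are consistent with each other.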

Inferential Visualizations

Creating a decision tree with labels

Creating word clouds for the subreddits Bookclub and Cooking

Conclusions and Recommendations

Interpreting the nodes of the decision tree shows that approximately 330 posts in Cooking contain the word "book", and most of the posts include book-related words such as "chapter" and "read". On the other hand, very few posts in Bookclub contain cooking-related words such as "recipe", "meat", "ingredient", or "oil". Based on this data, it can be assumed that posts in the Cooking forum may have referred to recipes from a book, and that posts in Bookclub may be referring to recipes in a cookbook, but this cannot be confirmed.

Since Reddit is a popular forum, the top 1000 posts from the subreddits Cooking and Bookclub were explored. This analysis can be extended to more posts in the same subreddits, to other related subreddits, and to other popular online discussion forums to gain better insight into cookbooks, which will help in drawing a conclusion about the profitability of publishing a cookbook worldwide.
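
The keyword observation above can also be checked directly, without reading tree nodes, by counting how many posts in each subreddit mention a given word. A minimal sketch, assuming posts are available as (subreddit, selftext) pairs; the function name `posts_mentioning` and the sample posts are hypothetical.

```python
import re
from collections import Counter


def posts_mentioning(posts, keyword):
    """Count, per subreddit, the posts containing `keyword` as a whole word.

    A trailing "s" is allowed, so "book" also matches "books", but not
    compounds such as "cookbook".
    """
    pattern = re.compile(rf"\b{re.escape(keyword)}s?\b", re.IGNORECASE)
    counts = Counter()
    for subreddit, text in posts:
        if pattern.search(text):
            counts[subreddit] += 1
    return counts


# Sample posts invented for illustration.
sample = [
    ("Cooking", "Found this recipe in an old book"),
    ("Cooking", "Best oil for frying?"),
    ("Bookclub", "Which books are you reading?"),
]
print(posts_mentioning(sample, "book"))
```

Running the same count for "cookbook" would address the publisher's question more directly than the generic word "book".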
