A research project that uses machine learning to test whether a comment's subreddit can be predicted from the features of that comment.
This is the Reddit post where I found my corpus. I could not find the author's name, only the username they post under, u/Stuck_In_The_Matrix. Credit for this corpus belongs to them and to anyone who helped them assemble it.
The Reddit Corpus as a direct site link
Directory and File Structure -
- README - You are here... maybe... hopefully...
- Progress Report - updates on my project throughout the semester
- Project Plan - my project plan and its details
- Presentation - my PowerPoint presentation slides in PDF form. Because it is a PDF, there are no animations, so I should point out that there is a title underneath the meme on slide 2.
- Final Report - self-explanatory.
- Data_Samples Directory - contains data samples produced while processing the data
  - AskReddit 1000 samples - a relic from my first attempts to process the data
  - 30000 Comment Samples with Karma Score above 50 - for viewing and interacting with the data I was primarily working with
  - The data frame I used to tune my models
- CRC Directory - contains the files I used when running grid search on my models (CRC).
  - Script for the grid search
  - SBatch file - the sample batch file I used
  - Text file containing the output of the first job
- Legacy Notebooks Directory - contains notebooks from every phase of the project that are no longer in active use (though still referenced)
  - project-explore.ipynb - my first attempt at processing the data for phase 1.
  - Phase 2 Notebook - better processing, cleanup of the data, and preliminary machine learning
  - Phase 3 Notebook - the meat of the machine learning analysis
  - NLTK Tokenization - I run NLTK tokenization in a separate notebook to save run time in the other notebooks.
  - Presentation Data - a notebook I used to generate some of the data in my presentation.
- Current Phase Directory - contains the notebooks being actively worked on or, if none are, the notebooks from the most recent phase.
  - Final Notebook - my final pieces of work; the only notebook in the Current Phase directory.
- Images Directory - contains the figures and images to be used in the final report
Here is my Guestbook