GitHub - JohnRStarr/Smoking-Gun-Classification

Smoking Gun Classification

About:
This repository is my term project for LING1340: Data Science for Linguists. In this project, I'll be building a classifier that sorts through emails to flag useful information for investigators. The data set I will train my classifier on is a set of released emails from around 150 users (mostly senior employees) at the former American energy company Enron. These emails are useful for many kinds of natural language processing, but because they come from a notoriously corrupt company, they also allow for the domain-specific training that I'm looking for.

Why?:
I was inspired to take this on when I was looking through a catalogue of large data sets. The interesting thing, though, is that most of these emails aren't really what you would expect given the lore that surrounds insider trading or cooking the books. They're mostly like, "my kid wants a football for Christmas." But I, like most people, want the juicy stuff! So I want to make a classifier that can effectively separate the signal from the noise.

Data Specifications:

version: May, 2015
size: 1.7GB, nearly 3,500 folders, over 500,000 emails
history: The dataset was originally published by the Federal Energy Regulatory Commission during its investigation of Enron. The data was then worked on by the CALO Project (A Cognitive Assistant that Learns and Organizes), a project of the AIC (Artificial Intelligence Center), which is part of SRI International (formerly the Stanford Research Institute). The data was later purchased and remodeled by researchers at MIT in the 2010s. I found and accessed the data through the website of former Carnegie Mellon professor William W. Cohen, linked below.
organization:
- relative path: maildir/lastname-firstinitial (specific to user)/email directory (specific to user)
  - maildir/arora-h/all_documents
  - maildir/arora-h/deleted_items
  - maildir/arora-h/discussion_threads
  - ...
- other notes: per William Cohen
  'The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.'
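
To make the directory layout above concrete, here is a minimal sketch of how the emails could be read with Python's standard library. It assumes the archive has been unpacked into a local `maildir/` folder and that each file holds one plain-text message. The function names `iter_emails` and `summarize` are hypothetical and are not taken from the project notebooks.

```python
from email import message_from_bytes
from email.policy import default
from pathlib import Path


def iter_emails(maildir):
    """Yield (user, folder, message) for each file under maildir/<user>/<folder>/."""
    root = Path(maildir)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) < 3:
            continue  # skip files that are not inside a user's email directory
        user = parts[0]                  # e.g. "arora-h"
        folder = "/".join(parts[1:-1])   # e.g. "all_documents"
        message = message_from_bytes(path.read_bytes(), policy=default)
        yield user, folder, message


def summarize(message):
    """Return the sender, subject, and plain-text body of a parsed message."""
    body = message.get_body(preferencelist=("plain",))
    return {
        "from": message.get("From", ""),
        "subject": message.get("Subject", ""),
        # Assumption: the declared charset is one Python can decode.
        "body": body.get_content() if body is not None else "",
    }
```

Calling `summarize(message)` on each item yielded by `iter_emails("maildir")` gives one dictionary per email, which is a convenient shape for loading into a table before any classification work. Reading every file as bytes and letting the `email` package handle decoding avoids guessing at text encodings up front.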

Resources:

to read more about the Enron scandal: https://en.wikipedia.org/wiki/Enron#2001_Accounting_scandals
to access the data set: http://www.cs.cmu.edu/~enron/

Folders and files (27 commits):

- data_sample/cash-m
- .gitignore
- LICENSE.md
- Project_Report_1.ipynb
- Project_Report_2.ipynb
- README.md
- progress-report.md
- project-plan.md
