JohnRStarr/Smoking-Gun-Classification
Smoking Gun Classification

Sean Steinle | [email protected]


About:
This repository is my term project for LING1340: Data Science for Linguists. In this project, I'll be building a classifier that can sort through emails to flag useful information for investigators. The data set that I will train my classifier on is a set of released emails from around 150 users, mostly senior employees, at the former American energy company Enron. These emails are useful for many kinds of natural language processing, but because they come from a notoriously corrupt company, they also allow for the domain-specific training that I'm looking for.


Why?:
I was inspired to take this on when I was looking through a catalogue of large data sets. The interesting thing, though, is that most of these emails aren't really what you would expect given the lore that surrounds insider trading or cooking the books. They're mostly like, "My kid wants a football for Christmas." But I, like most people, want the juicy stuff! So I want to make a classifier that can effectively separate the signal from the noise.


Data Specifications:

  • version: May, 2015
  • size: 1.7GB, nearly 3,500 folders, over 500,000 emails
  • history: The dataset was originally published by the Federal Energy Regulatory Commission (FERC) during its investigation of Enron. The data was then worked on by CALO (A Cognitive Assistant that Learns and Organizes), a project within the AIC (Artificial Intelligence Center), which is part of SRI (Stanford Research Institute). The data was later purchased and remodeled by researchers at MIT in the 2010s. I found and accessed the data through the website of former Carnegie Mellon professor William W. Cohen, linked below.
  • organization:
    • relative path: maildir/lastname-firstinitial (specific to user)/email directory (specific to user)
      • maildir/arora-h/all_documents
      • maildir/arora-h/deleted_items
      • maildir/arora-h/discussion_threads
      • ...
    • other notes: per William Cohen:
      'The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.'
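
The directory layout above can be walked with a few lines of Python. This is only a minimal sketch, not the project's actual loading code: the function name `iter_emails` is hypothetical, the `latin-1` decoding is an assumption chosen so that no file fails to decode, and it assumes the archive has been extracted locally into a `maildir` folder.

```python
from email import message_from_string
from email.message import Message
from pathlib import Path
from typing import Iterator, Tuple


def iter_emails(maildir: Path) -> Iterator[Tuple[str, str, Message]]:
    """Yield (user, folder, message) for each file under maildir/<user>/<folder>/.

    `iter_emails` is a hypothetical helper name used for illustration only.
    """
    for user_dir in sorted(p for p in maildir.iterdir() if p.is_dir()):
        for folder_dir in sorted(p for p in user_dir.iterdir() if p.is_dir()):
            # Folders may contain nested subfolders, so search recursively.
            for path in sorted(p for p in folder_dir.rglob("*") if p.is_file()):
                # Assumption: latin-1 maps every byte to a character,
                # so decoding never raises on unusual bytes.
                text = path.read_text(encoding="latin-1")
                yield user_dir.name, folder_dir.name, message_from_string(text)
```

For example, `for user, folder, msg in iter_emails(Path("maildir")):` gives access to headers such as `msg["Subject"]` and to the body through `msg.get_payload()`, with `user` taking values like `arora-h` and `folder` values like `all_documents`.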

Resources:
