Email Spam Filter using scikit-learn, IPython and a Naive Bayes Classifier

The tool can do the following:

  • Output the most frequent words with their frequencies, based on user preference
  • Extract features by processing and simplifying the raw data
  • Train the naive Bayes classifier
  • Output the ham/spam confusion matrix
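
As a rough illustration of the first capability, the word-frequency output could look like the sketch below. The function name `most_common_words`, the `top_n` parameter (standing in for the user preference), and the sample emails are all hypothetical and not taken from the repository.

```python
from collections import Counter
import re


def most_common_words(emails, top_n):
    """Return the top_n most frequent alphabetic words across all emails."""
    counts = Counter()
    for text in emails:
        # Lowercase the text and keep alphabetic tokens only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(top_n)


# Hypothetical sample data, for illustration only.
sample_emails = [
    "Win money now",
    "Meeting notes attached",
    "Win a free prize now",
]
top_words = most_common_words(sample_emails, 2)
print(top_words)
```

Each entry in the result is a `(word, frequency)` pair, so the list can be printed directly or reused as a vocabulary for feature extraction.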

Because the machine used had limited capabilities, the training and test sets were a small portion of a much larger dataset. The filter was tuned for the smaller dataset, but it can also run on the larger one, provided each label vector has the correct number of files and marks which emails in it are spam.
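
One way to keep a label vector in step with the files is to derive it from the file list itself, so its length always matches the number of emails. The sketch below assumes spam files can be recognized by a filename prefix; the `spm` prefix, the function name `build_labels`, and the file names are hypothetical and may not match how the dataset is actually organized.

```python
def build_labels(filenames, spam_prefix="spm"):
    """Return one label per file: 1 for spam, 0 for ham.

    Assumption: spam files are identified by a filename prefix.
    """
    return [1 if name.startswith(spam_prefix) else 0 for name in filenames]


# Hypothetical file names, for illustration only.
train_files = ["msg1.txt", "msg2.txt", "spm1.txt", "spm2.txt"]
train_labels = build_labels(train_files)

# The label vector must have exactly one entry per email file.
assert len(train_labels) == len(train_files)
print(train_labels)
```

Switching to the larger dataset then only requires pointing the file list at the larger folder, rather than editing hard-coded counts.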

Link to the smaller Test-Train dataset used

Link to the whole 50 MB dataset

Main Parts

  • Part 1 - Most Common Words Extraction
  • Part 2 - Feature Extraction
  • Part 3 - Combining the Labeled Feature Vectors of All Training Emails into a Single Two-Dimensional Matrix
  • Part 4 - Defining and Training the Naive Bayes Classifier
  • Part 5 - Testing the Trained Model on the Defined Test Set
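
Parts 2 through 5 can be sketched end to end with scikit-learn as below. This is a minimal sketch under stated assumptions, not the repository's code: it uses `CountVectorizer` to build the word-count matrix (the project may build its matrix by hand), it assumes the multinomial variant `MultinomialNB` (the README does not say which naive Bayes variant is used), and the emails and labels are invented sample data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical sample data: label 0 = ham, label 1 = spam.
train_emails = [
    "meeting agenda for monday",
    "project notes attached",
    "win free money now",
    "free prize claim now",
]
train_labels = [0, 0, 1, 1]
test_emails = ["monday project meeting", "claim free money"]
test_labels = [0, 1]

# Parts 2 and 3: turn every training email into a word-count vector and
# stack the vectors into one two-dimensional matrix (emails x words).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# Part 4: define and train the classifier (assumed variant: multinomial).
model = MultinomialNB()
model.fit(X_train, train_labels)

# Part 5: encode the test set with the same vocabulary, predict, and
# summarize the results as a ham/spam confusion matrix.
X_test = vectorizer.transform(test_emails)
predicted = model.predict(X_test)
matrix = confusion_matrix(test_labels, predicted, labels=[0, 1])
print(matrix)
```

With `labels=[0, 1]`, rows of the matrix are the true classes (ham, then spam) and columns are the predicted classes, so the diagonal counts the correctly classified emails.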