ana-2511/CodeClauseInternship_SpamClassifier


About Dataset

Introduction

This is a CSV file containing information on 5172 randomly picked email files and their labels for spam or not-spam classification.

About the Dataset

The CSV file contains 5172 rows, one per email, and 3002 columns. The first column holds the email name. Names are numbered rather than taken from the recipients' names, to protect privacy. The last column holds the label to predict: 1 for spam, 0 for not spam. The remaining 3000 columns are the 3000 most common words across all the emails, after excluding non-alphabetical characters and words. Each cell stores the count of that column's word in that row's email. Information on all 5172 emails is thus stored in one compact dataframe rather than in separate text files.
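The layout described above (name column first, word-count columns in the middle, label column last) can be read with a few lines of Python. This is a minimal sketch using only the standard library. The header names `Email No.` and `Prediction`, the three sample words, and the helper `load_emails` are all hypothetical, chosen for illustration, and the real file has 3000 word columns rather than three.

```python
import csv
import io

# Hypothetical miniature of the file: same layout, only three word columns.
# "Email No." and "Prediction" are assumed header names, not confirmed ones.
SAMPLE_CSV = """Email No.,the,free,meeting,Prediction
Email 1,4,0,2,0
Email 2,1,3,0,1
Email 3,2,0,1,0
"""


def load_emails(handle):
    """Return (word_names, count_rows, labels) from a CSV in this layout."""
    reader = csv.reader(handle)
    header = next(reader)
    word_names = header[1:-1]  # drop the name column and the label column
    count_rows, labels = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        count_rows.append([int(value) for value in row[1:-1]])
        labels.append(int(row[-1]))  # 1 = spam, 0 = not spam
    return word_names, count_rows, labels


words, X, y = load_emails(io.StringIO(SAMPLE_CSV))
print(words)
print(X[1], y[1])
```

To read an actual file, the same function can be passed an open file handle instead of the `io.StringIO` wrapper.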

Project Highlights:

🧠 Trained machine learning models to classify emails (5172 in the dataset) as spam or not spam.
🤖 Implemented six algorithms: Logistic Regression, SVM, KNN, Random Forest, Naive Bayes, and Decision Tree Classifier.
📊 Logistic Regression and Random Forest led the way, each at 97.53% accuracy.

Key Takeaways:

In spam detection, precision is paramount; the results here are reported as accuracy. The best performers, Logistic Regression and Random Forest, each reached 97.53%.

🔍 Algorithmic Success:

* Logistic Regression: 97.53% accuracy
* Random Forest: 97.53% accuracy
* SVM, KNN, Naive Bayes, and Decision Tree Classifier: each above 90% accuracy
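The README names the algorithms and their accuracy but does not show how they were trained or scored. As an illustration only, the sketch below implements one of the listed families, Naive Bayes, from scratch on word-count rows and computes accuracy on held-out rows. The multinomial variant, the Laplace smoothing value, the function names, and the toy data are all assumptions. This is not the project's code, and the toy accuracy says nothing about the reported figures.

```python
import math


def train_nb(X, y, alpha=1.0):
    """Fit a multinomial Naive Bayes model on word-count rows X with 0/1 labels y.

    Assumes both classes appear at least once in y.
    """
    n_words = len(X[0])
    model = {}
    for cls in (0, 1):
        rows = [row for row, label in zip(X, y) if label == cls]
        totals = [sum(col) for col in zip(*rows)]  # per-word counts in this class
        denom = sum(totals) + alpha * n_words      # Laplace-smoothed normaliser
        model[cls] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_like": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def predict_nb(model, row):
    """Return the class with the highest log-posterior score for one row."""
    scores = {
        cls: params["log_prior"]
        + sum(count * ll for count, ll in zip(row, params["log_like"]))
        for cls, params in model.items()
    }
    return max(scores, key=scores.get)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data; columns stand for counts of: free, win, meeting, report.
X_train = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 2, 3], [0, 1, 3, 2]]
y_train = [1, 1, 0, 0]
X_test = [[4, 1, 0, 0], [0, 0, 1, 2]]
y_test = [1, 0]

model = train_nb(X_train, y_train)
preds = [predict_nb(model, row) for row in X_test]
print(preds, accuracy(y_test, preds))
```

Accuracy here is the share of test emails labelled correctly, the same kind of figure as the percentages quoted above.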
