This package contains a stripped down version of the SpamBayes classifier, with the following changes:

  • The classifier and tokenizer code has been kept. All other code has been removed.
  • The tokenizer has been stripped down and simplified. In particular all code designed specifically for email parsing has been removed.
  • The ClassifierDb class has been reduced to a simple dict subclass. The custom pickling code has been removed, as have all database backends.
  • The remaining code has been updated and made compatible with Python 3.
  • An orthogonal sparse bigram (OSB) transformation has been added.
  • Unicode handling has been improved.
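
As a rough illustration of what an orthogonal sparse bigram transformation does, the sketch below pairs each token with each of the next few tokens and records how many tokens were skipped in between. It is a generic sketch of the technique, not sbclassifier's own implementation: the function name osb_pairs, the window parameter and the tuple layout are all assumptions made for this example.

```python
def osb_pairs(tokens, window=5):
    """Yield (first, skipped, second) features from a token sequence.

    Hypothetical helper for illustration only. Each token is paired with
    every later token that lies inside a sliding window of `window`
    tokens; `skipped` counts the tokens between the two.
    """
    tokens = list(tokens)
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            j = i + gap
            if j >= len(tokens):
                break
            yield (first, gap - 1, tokens[j])


if __name__ == "__main__":
    for feature in osb_pairs("buy cheap traffic now".split(), window=3):
        print(feature)
```

Because the skip count is part of each feature, "cheap traffic" and "cheap ... traffic" are counted as related but distinct evidence.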

What's it good for?

I use sbclassifier to protect websites against contact form spam.

It is already useful with a training set of only a handful each of spam and non-spam messages. Once the training set grows beyond about 20 messages of each type, I am happy to let it filter out the most obvious spam.

Usage

The above script will print out:

0.902
[('*H*', 0.104), ('*S*', 0.908), ('can', 0.155), ('for', 0.845), ('service', 0.845), ('traffic', 0.845), ('and', 0.908)]

sbclassifier assigns 90% probability to this unknown message being spam. It can also produce a sequence of (word, probability) pairs that reveals the tokens that were important in this calculation.
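
The '*H*' and '*S*' entries in the output are consistent with the chi-squared combining scheme used by upstream SpamBayes, where the final score is (S - H + 1) / 2. The sketch below recomputes the three numbers from the five rounded token probabilities printed above. It assumes sbclassifier kept the upstream combining rule unchanged; the function names chi2q and combine are made up for this example, and it uses plain log sums rather than upstream's underflow-safe arithmetic.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with `dof` degrees of
    freedom exceeds `x2`. Only valid for even `dof`."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (H, S, score) for a list of per-token spam probabilities.

    Assumes every probability is strictly between 0 and 1.
    """
    n = len(probs)
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # evidence for ham
    return h, s, (s - h + 1.0) / 2.0


if __name__ == "__main__":
    # Token probabilities copied from the output above (rounded values).
    clues = [0.155, 0.845, 0.845, 0.845, 0.908]
    h, s, score = combine(clues)
    print(round(h, 3), round(s, 3), round(score, 3))
```

Since the inputs are rounded to three decimals, the results should land close to the printed 0.104, 0.908 and 0.902 rather than match them digit for digit.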

More information

The SpamBayes source repository contains a wealth of information on how and why the classifier works the way it does, as does the SpamBayes wiki.

Copyright

Copyright (C) 2002-2013 Python Software Foundation; All Rights Reserved

The Python Software Foundation (PSF) holds copyright on all material in this project. You may use it under the terms of the PSF license; see LICENSE.txt.