Implementation of the voted perceptron as described by Yoav Freund and Robert E. Schapire in "Large Margin Classification Using the Perceptron Algorithm" (1999).

In summary: the general concept is similar to a boosting algorithm. Training proceeds as in the standard perceptron, but instead of overwriting the previous weight vector on each update, the algorithm stores every intermediate weight vector in an array, repeating this for T epochs (T is a hyperparameter chosen by validation testing on a held-out split of the training set). Each stored weight vector carries its own vote weight, determined by its survival time: vectors that last longer without a misclassification receive heavier weights, and vice versa. The weighted sum of the stored vectors' predictions on a test point is then thresholded by sign() for the final prediction.
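The training and prediction steps above can be sketched roughly as follows. This is a minimal illustration of the voted-perceptron idea, not the repo's actual code; function names and the NumPy-based layout are assumptions.

```python
import numpy as np

def train_voted_perceptron(X, y, T=1):
    """Run T perceptron epochs over (X, y), with labels y in {-1, +1}.
    Instead of keeping only the final weight vector, store each
    intermediate vector together with its survival count."""
    n, d = X.shape
    v = np.zeros(d)  # current weight vector
    c = 1            # how many points the current vector has survived
    models = []
    for _ in range(T):
        for i in range(n):
            if y[i] * np.dot(v, X[i]) <= 0:      # misclassified
                models.append((v.copy(), c))     # retire the old vector with its weight
                v = v + y[i] * X[i]              # standard perceptron update
                c = 1
            else:
                c += 1                           # survived another point
    models.append((v, c))                        # keep the last vector too
    return models

def predict_voted(models, x):
    """Weighted vote of every stored perceptron, thresholded by sign."""
    s = sum(c * np.sign(np.dot(v, x)) for v, c in models)
    return 1 if s >= 0 else -1
```

Longer-lived vectors contribute larger `c` values, so they dominate the vote, matching the weighting scheme described above.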
`main()` produces and evaluates the voted perceptron model's performance on training sets of various sizes. The training sets are generated by `test_file_creator.py`, which is available for modification (`main()` may also need filepath modifications if so).
The training files are `Xtrain.csv` and `Ytrain.csv`. Each row in `Xtrain.csv` represents a point in vector form; columns represent feature dimensions. Each row in `Ytrain.csv` holds the ground-truth label (either 0 or 1) for the corresponding point in `Xtrain.csv`.
Note: while the labels in `Ytrain.csv` are 0 and 1, they are converted to -1 and 1 during training to follow the implementation by Freund and Schapire; the output is converted back to 0 and 1. Future `Ytrain.csv` files can use -1 and 1 without changing the code (though you may want to comment out the label conversion for a marginal performance gain, especially with larger datasets). Test sets are not included, but must follow the same format as the training files.
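The label round-trip described above can be sketched like this. The helper names are hypothetical, and the `np.loadtxt` delimiter is an assumption about the CSV layout; only the file names come from this README.

```python
import numpy as np

def to_signed(y01):
    """Map {0, 1} labels to {-1, +1} for training."""
    return np.where(np.asarray(y01) == 0, -1, 1)

def to_binary(y_signed):
    """Map {-1, +1} predictions back to {0, 1} for output."""
    return np.where(np.asarray(y_signed) == -1, 0, 1)

# Hypothetical loading step, assuming comma-separated files:
# X = np.loadtxt("Xtrain.csv", delimiter=",")
# y = to_signed(np.loadtxt("Ytrain.csv", delimiter=","))
```

Note that `to_signed` maps an already-signed -1 label's 0-comparison to +... actually it leaves -1/+1 input unchanged only for +1; in practice, skipping the conversion entirely (as suggested above) is the right move for files that already use -1 and 1.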
This folder holds a collection of the test data sets produced by `test_file_creator.py` when called in the `main()` function of `run.py`. It includes the test sets and the predictions.
Creates training sets using 5%, 10%, 20%, 50%, and 100% of the first 90% of the original training set, and uses the last 10% of the training data as a test set for local evaluation.
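A minimal sketch of that splitting scheme, assuming the data is already loaded into NumPy arrays (the function name and signature are hypothetical, not the actual `test_file_creator.py` interface):

```python
import numpy as np

def make_splits(X, y, fractions=(0.05, 0.10, 0.20, 0.50, 1.00)):
    """Hold out the last 10% of the data as a local test set, and build
    training subsets from leading fractions of the first 90%."""
    cut = int(0.9 * len(X))                     # boundary between train pool and test set
    X_pool, y_pool = X[:cut], y[:cut]           # first 90%: training pool
    test = (X[cut:], y[cut:])                   # last 10%: local test set
    subsets = [(X_pool[:int(f * cut)], y_pool[:int(f * cut)])
               for f in fractions]              # 5%, 10%, 20%, 50%, 100% of the pool
    return subsets, test
```

Each subset can then be written out (e.g. with `np.savetxt`) in the same `Xtrain.csv` / `Ytrain.csv` format described above.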