README

NOTE: See Project Report for theory and implementation details.


This repository contains scripts and data for a machine learning project by Sindhuula Selvaraju and Zachary Silver.
It contains two directories:
    1. Data : Raw and parsed data files
    2. Code : Scripts to parse and classify data and compute prediction accuracy

Usage:
    parse_data.py : python parse_data.py <input_file> <output_file> <stopword_file>
    assign_feature_weights.py : python assign_feature_weights.py <input_file> <output_file> <words_file>

Points to note:
    1. Preprocessing/parsing of our raw data involves removing frequently occurring words and phrases and making sure that keywords necessary for classification get a higher weight.
    2. The feature vector is formed much as it was in our homeworks, but our features are strings instead of integers.
    3. We're still using the model and predict files to keep track of possible points of failure.

    4. Our neural network is built from stacked "perceptrons." Each
    perceptron takes a vector as input and produces a single number as
    output. The number of perceptrons depends on the number of nodes in
    our hidden layers. For example, a hidden layer with 10 nodes consists
    of 10 perceptrons, each taking the previous layer's outputs as its
    input vector and producing one node's value. These outputs then
    become the inputs to a final perceptron that gives the final output
    value.
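
The stacked-perceptron layout described above can be sketched roughly as
follows. This is only an illustration, not the code in the Code directory:
the function names (perceptron, layer, forward, random_layer), the sigmoid
activation, the random weights, and the 4-value input are all assumptions
made for the example. See the Project Report for the actual model.

```python
import math
import random


def perceptron(inputs, weights, bias):
    """One perceptron: takes an input vector and returns a single number."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Sigmoid activation is an assumption made for this sketch.
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A layer with n nodes is n perceptrons reading the same input vector."""
    return [perceptron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, hidden_layers, output_unit):
    """Feed the input through each hidden layer, then one final perceptron."""
    activations = inputs
    for weight_rows, biases in hidden_layers:
        activations = layer(activations, weight_rows, biases)
    out_weights, out_bias = output_unit
    return perceptron(activations, out_weights, out_bias)


def random_layer(n_inputs, n_nodes, rng):
    """Hypothetical helper: untrained random weights for one layer."""
    weight_rows = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                   for _ in range(n_nodes)]
    biases = [rng.uniform(-1, 1) for _ in range(n_nodes)]
    return weight_rows, biases


rng = random.Random(0)
# One hidden layer with 10 nodes, i.e. 10 perceptrons over a 4-value input.
hidden = [random_layer(4, 10, rng)]
# The final perceptron reads the 10 hidden outputs and gives one value.
output_unit = ([rng.uniform(-1, 1) for _ in range(10)], 0.0)
prediction = forward([1.0, 0.0, 0.5, 0.25], hidden, output_unit)
print(prediction)
```

With a sigmoid activation the final value always lies between 0 and 1, so
it could be thresholded to pick a class. Training the weights is not
shown here.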