This is an ML model built on S&P 500 stock data, which is pulled using the Yahoo Finance API.
This is a Python-based project. I highly recommend using the Anaconda platform, since it makes managing Python modules easy.
- A working Python environment and basic Python knowledge
- Basic API knowledge
- Python modules:
  - NumPy
  - pandas
  - pandas-datareader
  - Matplotlib
  - BeautifulSoup4
  - sklearn (scikit-learn)
  - yfinance (yahoo-finance)
Supervised learning with three output classes (buy, hold, sell)
- cross_validation (moved to `sklearn.model_selection` in scikit-learn 0.20 and later); allows us to create shuffled training and testing samples. This is important because it keeps us from testing the algorithm on the same data we used for training.
- LinearSVC, KNeighborsClassifier, RandomForestClassifier; classifiers used to predict.
- VotingClassifier; lets all 3 classifiers above vote on which class each feature set belongs to.
- Remove unnecessary data; we only need the adj_close column since we want to predict based on previous closing values.
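
The pieces above can be combined roughly as follows. This is a minimal sketch, not the project's actual training code: the data is a random synthetic stand-in, the estimator names (`lsvc`, `knn`, `rfor`) and the 25% test split are assumptions, and `train_test_split` is imported from `sklearn.model_selection`, which replaced the older `cross_validation` module.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 300 days x 10 features,
# with labels drawn from {-1, 0, 1} for sell, hold, buy.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.integers(-1, 2, size=300)

# Shuffle and split so the model is never tested on its training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# Hard voting: each classifier predicts a class and the majority wins.
# LinearSVC has no predict_proba, so soft voting is not an option here.
clf = VotingClassifier(
    estimators=[
        ("lsvc", LinearSVC()),
        ("knn", KNeighborsClassifier()),
        ("rfor", RandomForestClassifier(random_state=0)),
    ],
    voting="hard",
)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print("accuracy:", accuracy)
```

Because the labels here are random, the printed accuracy only shows that the pipeline runs; it says nothing about real stock data.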
- Generate a correlation table to see if you can identify any relationships.
- Fill in the missing data with 0. Some companies may not have existed or gone public yet during the time period we chose.
- Our features are the day-over-day price changes (in percent) for all companies, so the raw prices are normalized into percentage changes.
- Some normalized values will be infinite because of the 0 values we filled in previously; convert these to NaNs and drop them later.
- Our labels will be 1, 0, and -1, which indicate buy, hold, and sell.
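
The preprocessing steps above might look like this in pandas. It is a sketch under stated assumptions: the price table is toy data with made-up tickers, and the labeling rule (next-day change of the hypothetical target ticker `AAA` compared against a 2% threshold) is illustrative only, since the list does not state how labels are derived.

```python
import numpy as np
import pandas as pd

# Toy adjusted-close table (assumption): rows are days, columns are tickers.
# "CCC" has no price on the first two days, as if it went public later.
prices = pd.DataFrame(
    {
        "AAA": [100.0, 103.0, 101.0, 105.0, 104.0, 100.0, 101.0, 104.0],
        "BBB": [50.0, 50.5, 49.0, 49.5, 51.0, 50.0, 52.0, 51.5],
        "CCC": [np.nan, np.nan, 20.0, 21.0, 20.5, 21.0, 22.0, 21.0],
    }
)

# Correlation table between companies (missing values are skipped pairwise).
corr = prices.corr()

# Fill missing prices with 0, then turn prices into day-over-day changes.
filled = prices.fillna(0)
features = filled.pct_change()

# A change computed from a filled 0 is infinite (x / 0) or NaN (0 / 0);
# convert the infinities to NaN so both can be dropped together.
features = features.replace([np.inf, -np.inf], np.nan)


def to_label(change, threshold=0.02):
    """Map a future price change to 1 (buy), -1 (sell), or 0 (hold)."""
    if change > threshold:
        return 1
    if change < -threshold:
        return -1
    return 0


# Hypothetical target: the next-day change of "AAA".
future_change = prices["AAA"].shift(-1) / prices["AAA"] - 1

# Join features and target, then drop every row that contains a NaN.
dataset = features.join(future_change.rename("future_change")).dropna()
X = dataset[list(prices.columns)].to_numpy()
y = dataset["future_change"].apply(to_label).to_numpy()
print(X.shape, y)
```

Dropping rows with any NaN is the simplest reading of "drop them later"; with hundreds of tickers it can discard many rows, so a real run may need a gentler strategy.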
This model's accuracy varies roughly between 37% and 49%, depending on the company we choose to predict. The results are not very satisfying, and there are several possible reasons. The model is built using data from 505 different companies. Some companies certainly have strong correlations with each other; in general, however, different companies behave differently, and it is not easy to come up with a single general model for all 505. To improve accuracy, I recommend grouping companies into industry categories (such as tech, pharmaceuticals, and banking) and training a separate model for each category.