
jongwoojeff/machine-learning-with-stock-data


machine-learning-with-stock-data

This is an ML model built on S&P 500 stock data. The data is pulled through the Yahoo Finance API.

Getting Started

This is a Python-based project. I highly recommend using the Anaconda platform, since it makes Python modules easy to manage.

Prerequisites

  1. A working Python environment and fundamental Python knowledge
  2. Basic API knowledge
  3. Python modules
    • NumPy
    • pandas
    • pandas-datareader
    • Matplotlib
    • BeautifulSoup4
    • scikit-learn (imported as sklearn)
    • yfinance (Yahoo Finance)

Machine Learning Details

Model Type

Supervised classification with three output classes (buy, hold, sell)

Methods used

  1. cross_validation (sklearn.model_selection in current scikit-learn releases); creates shuffled training and testing samples. This matters because it keeps us from testing the algorithm on the same data we used for training.
  2. LinearSVC, KNeighborsClassifier, RandomForestClassifier; the classifiers used to predict.
  3. VotingClassifier; lets all three classifiers above vote on which class each feature set belongs to.
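The three methods above can be combined roughly as follows. This is a minimal sketch, not the project's actual code: the random data, the estimator names ("lsvc", "knn", "rfor"), the 25% test size, and the fixed seeds are all assumptions made for illustration, and train_test_split is imported from sklearn.model_selection, where current scikit-learn keeps it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 300 days x 10 features,
# with labels drawn from {-1, 0, 1} for sell, hold, buy.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.choice([-1, 0, 1], size=300)

# Shuffled split so the model is never scored on rows it was trained on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# Hard voting: each classifier casts one vote per sample, the majority wins.
clf = VotingClassifier(
    estimators=[
        ("lsvc", LinearSVC()),
        ("knn", KNeighborsClassifier()),
        ("rfor", RandomForestClassifier(random_state=0)),
    ],
    voting="hard",
)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(f"accuracy on held-out rows: {accuracy:.2f}")
```

Because the labels here are random, the printed accuracy only shows that the pipeline runs; it says nothing about real stock data.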

Feature Engineering

  1. Remove unnecessary data; we only need the adj_close column, since we want to predict from previous closing values.
  2. Generate a correlation table to see whether any relationships between companies stand out.
  3. Fill in the missing data with 0. Some companies may not have existed or gone public yet during the time period we chose.
  4. Our features are the day-over-day price changes (in percent) for all companies, so each price is normalized to its change from the previous day.
  5. Some normalized values will be infinite because of the 0 values filled in earlier (a change from 0 to any price divides by zero); convert these to NaN and drop them later.
  6. Our labels are 1, 0, and -1, which indicate buy, hold, and sell.
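The steps above can be sketched with pandas on a toy table. Everything specific here is an assumption for illustration: the tickers AAA, BBB, and NEW, their prices, the choice of AAA as the target, and in particular the 2% threshold and the next-day rule used to derive the labels, which the description above does not specify.

```python
import numpy as np
import pandas as pd

# Toy adjusted-close table (assumption): rows are days, columns are tickers.
# "NEW" has no price on the first day, as if it had not yet gone public.
prices = pd.DataFrame(
    {
        "AAA": [10.0, 11.0, 12.1, 12.1, 11.0],
        "BBB": [50.0, 49.0, 49.0, 51.0, 52.0],
        "NEW": [np.nan, 20.0, 21.0, 21.0, 22.0],
    }
)

# Step 2: correlation table between tickers.
correlations = prices.corr()

# Step 3: fill missing prices with 0.
prices = prices.fillna(0)

# Step 4: day-over-day percentage change for every ticker.
features = prices.pct_change()

# Step 5: a move from 0 to a real price is infinite; mark it NaN and drop the row.
features = features.replace([np.inf, -np.inf], np.nan).dropna()

THRESHOLD = 0.02  # hypothetical cutoff, not taken from the project
TARGET = "AAA"    # hypothetical ticker to predict

# Step 6: label each day by the target's move on the following day.
next_change = prices[TARGET].pct_change().shift(-1)
labels = pd.Series(0, index=prices.index)
labels[next_change > THRESHOLD] = 1    # buy
labels[next_change < -THRESHOLD] = -1  # sell

# Keep only days with clean features and a known next-day move.
valid = features.index.intersection(next_change.dropna().index)
X = features.loc[valid]
y = labels.loc[valid]
print(X)
print(y)
```

Only two of the five days survive in this toy table: the first two rows are dropped by step 5 and the last day has no following day to label.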

Evaluation

This model's accuracy varies roughly between 37% and 49%, depending on the company we choose to predict. The results are not very satisfying, and there could be several reasons for this. The model is built on data from 505 different companies. Some of those companies are certainly related and strongly correlated with each other; in general, however, different companies behave differently, and it is not easy to come up with a single general model for all 505. To improve accuracy, I recommend grouping companies into industry categories (such as tech, pharmaceuticals, and banking) and generating a separate model for each category.
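One way to start on the grouping recommended above is to bucket tickers by category before training. The sketch below is only an illustration: the SECTOR_BY_TICKER mapping, its tickers, and the group_by_sector helper are hypothetical, and a real mapping would have to come from an actual listing of S&P 500 constituents.

```python
from collections import defaultdict

# Hypothetical ticker-to-sector mapping, invented for this example.
SECTOR_BY_TICKER = {
    "AAA": "tech",
    "BBB": "banking",
    "CCC": "tech",
    "DDD": "pharma",
    "EEE": "banking",
}


def group_by_sector(sector_by_ticker):
    """Return {sector: sorted tickers} so each group can get its own model."""
    groups = defaultdict(list)
    for ticker, sector in sector_by_ticker.items():
        groups[sector].append(ticker)
    return {sector: sorted(tickers) for sector, tickers in groups.items()}


groups = group_by_sector(SECTOR_BY_TICKER)
for sector, tickers in sorted(groups.items()):
    # A per-sector model would be trained here on these tickers' features only.
    print(sector, tickers)
```

Each resulting list can then be used to select feature columns, so that a model for one category is trained only on companies from that category.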

Acknowledgement
