Home

Welcome to our Design Document!

This project was completed by Alec Loftus, Mohammed Alsawi, Erin Maley, and Rohan Sethi in Dr. Wheeler's COMP383/483 class. The project was provided by Dr. Peter Kekenes-Huskey for his lab work at the Stritch School of Medicine.

Overview:

We know that a protein's shape is vital to its function. We also know that the pH of the surrounding environment affects a protein's ability to fold: the polarity of each amino acid contributes to folding and is itself affected by pH. In addition, some 30% of the proteins that are coded for never fold at all. Whether a protein tends to fold, both natively and at different pH values, can generally be predicted from sequence alone, which is what this project seeks to do.

Dr. Peter Kekenes-Huskey and his lab have provided datasets of proteins and their folding propensities at pH 3.0 and pH 7.0. By generating training and testing datasets from these for machine learning models, we will be able to evaluate how accurately each model predicts whether a protein will fold, given that we know the proteins should fold at pH 3.0 and should be unfolded at pH 7.0.

To start, all features will be included in the testing. After F1 scores have been collected, the datasets may be modified to test prediction accuracy with more or fewer features, depending on what we or Dr. Kekenes-Huskey think could be important for the machine learning models. By the end, we will have a better understanding of which features provide the most reliable information for the most accurate model of protein folding prediction.
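The feature comparison described above could be run as a leave-one-feature-out loop. The sketch below is only an illustration under stated assumptions: the data is synthetic (generated with scikit-learn's `make_classification`), the feature names are hypothetical placeholders rather than the lab's actual columns, and a random forest stands in for whichever model is being evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical feature names; the real datasets define their own columns.
FEATURES = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e", "feat_f"]

# Synthetic stand-in data: 200 samples, one column per placeholder feature.
X, y = make_classification(
    n_samples=200, n_features=len(FEATURES), n_informative=3,
    n_redundant=0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def f1_for_columns(columns):
    """Train on the given column indices only and return the test-set F1."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train[:, columns], y_train)
    return f1_score(y_test, model.predict(X_test[:, columns]))


all_columns = list(range(len(FEATURES)))
baseline_f1 = f1_for_columns(all_columns)  # every feature included

# Drop one feature at a time and record the F1 score without it.
ablation = {}
for index, name in enumerate(FEATURES):
    kept = [c for c in all_columns if c != index]
    ablation[name] = f1_for_columns(kept)

print(f"all features: F1 = {baseline_f1:.3f}")
for name, score in ablation.items():
    print(f"without {name}: F1 = {score:.3f} (change {score - baseline_f1:+.3f})")
```

A large drop in F1 when a feature is removed suggests the model relies on it. A single split is noisy, so repeating the loop over several random seeds would give a steadier picture.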

Context:

While protein folding can be predicted from sequence, knowing which tool or tools do it best for a given set of relevant features is a step that does not always generalize. Taking the datasets provided by Dr. Kekenes-Huskey and evaluating how well various models predict protein folding is the first step toward the larger problem of fully predicting the conformations of folded proteins. Accurately predicting folding implies an at least reasonably accurate assessment of the protein's degrees of rotation when it is able to fold, which would make predicting full conformation from sequence alone more feasible than it already is.

Goals and Non-Goals:

Primary Goals:

Create scripts for training and testing a couple of ML models on synthesized protein datasets
Have documented accuracy reports for various models
Consider other relevant information that may be nice to have in test data (more of a scientific question/goal than programming)

Non-Goals:

Testing deep learning models on the datasets (would be nice to include for comparison, but not essential for success)
We will not determine anything more specific about protein folding than whether a protein is 'folded' according to the set threshold (i.e. we won't label the degree or category of folding), along with each ML model's accuracy on that binary label.
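To make the binary label concrete, here is a minimal sketch of thresholding a folding propensity into 'folded' or 'unfolded'. The threshold value of 0.5, the function name, and the choice to count a value exactly at the threshold as folded are all assumptions made for illustration; the actual threshold is set by the lab's data.

```python
FOLDED_THRESHOLD = 0.5  # hypothetical value, not the lab's actual cutoff


def label_folded(propensity: float, threshold: float = FOLDED_THRESHOLD) -> int:
    """Return 1 ('folded') if the propensity meets the threshold, else 0."""
    return 1 if propensity >= threshold else 0


propensities = [0.12, 0.50, 0.91]  # made-up example values
labels = [label_folded(p) for p in propensities]
print(labels)  # [0, 1, 1]
```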

Proposed Solution:

This project is less about developing a full pipeline that runs a number of scripts/tools and produces usable data. Instead, it serves as a means of testing machine learning models and gauging their aptitude for predicting protein folding propensity at different pH values. Data will be initialized and prepared for each model as needed and as specified for testing. Separate subsets of the data will be created for training, distinct from the data we use to test each model.
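One way to keep training and testing data separate is a stratified split, which also preserves the folded/unfolded ratio in both subsets. The sketch below assumes synthetic data from `make_classification` and an 80/20 split; both are placeholders, not the project's chosen settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the provided datasets (100 samples, 5 features).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Split row indices rather than rows so the separation can be checked directly.
indices = np.arange(len(y))
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=y, random_state=0
)

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# No sample appears in both subsets.
assert set(train_idx).isdisjoint(test_idx)

print(f"train: {len(train_idx)} samples, fraction folded = {y_train.mean():.2f}")
print(f"test:  {len(test_idx)} samples, fraction folded = {y_test.mean():.2f}")
```

Fixing `random_state` makes the split reproducible, so every model is trained and tested on the same rows.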

Scikit-learn will be the primary ML tool used to train the models and to assess the accuracy of their predictions. F1 scores will be generated with scikit-learn's built-in metric functions. Repeating the tests with various parts of the data will help us understand the weight of each feature in predicting protein folding. Keeping a record of every prediction made by each tool, along with its F1 score for each subset of data and any feature alterations, will make the results verifiable and help determine which tool(s) work best at predicting protein folding under the circumstances outlined for us by the lab providing the project.
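As a rough sketch of this workflow, the example below trains four scikit-learn classifiers, scores each with `precision_score`, `recall_score`, and `f1_score`, and writes one record per model to CSV. The model list follows our reading of the KNN, RF, DT, and SVM abbreviations in the milestones table; the hyperparameters, the synthetic data, and the record field names are assumptions for illustration only.

```python
import csv
import io

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real features come from the lab's datasets.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default-ish hyperparameters, chosen only to keep the example short.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
}

records = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    records.append({
        "model": name,
        "feature_set": "all",  # hypothetical tag for the feature subset used
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    })

# Write the records as CSV (to an in-memory buffer here; a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["model", "feature_set", "precision", "recall", "f1"]
)
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so storing all three values lets a reader verify each score independently.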

Milestones:

Week	Alec Loftus	Mohammed Alsawi	Erin Maley	Rohan Sethi	Deadlines/Events
Week 1 (3/13-3/17)	Read literature; Work on Design Doc	Begin slide work	Prepare questions for PKH	Read through code files	First group meeting!
Week 2 (3/20-3/24)	Weekly Milestones table/slides; Make Design Document	Implementation Plan slides; Assist with Proposed Solution	Introduction Slides; Prepare questions for PKH	Implementation Plan slides; Assist with Goals & Non-Goals	Meeting w/ PKH Tuesday; Initial Presentation
Week 3 (3/27-3/31)	Keep Wiki/README.md updated; Pull/push everything by EOD Friday for Repo check; Prepare 5-min presentation	Work on script for KNN & RF (training)	Comment through .ipynb code; Prepare presentation; Work on script to check accuracy	Work on DT & SVM scripts (training); Split datasets	5-min Presentation; Repo Check #1
Week 4 (4/3-4/7)	Test Mohammed's training code; Work on script to generate more false/robust data	Test Erin's accuracy code	Test Rohan's training code	Work on script to generate more false/robust data; Update README to include information on the ML models used	EASTER BREAK
Week 5 (4/10-4/14)	Work on deep learning scripts/testing; All pushes by EOD Friday for Repo Check #2	Check all code thus far has been commented and pushed; Make and give presentation	Update wiki/weekly milestones; Make and give presentation	Train using deep learning models	5-min Presentation; Repo Check #2
Week 6 (4/17-4/21)	Update README and comment code; Pulls for Repo Check #3; Work on Deep-learning training models	Test all code for training	Continue work on deep learning models code; Begin work on Final Presentation	Comment code; Check in with PKH	Cross-team hacking; Repo Check #3
Week 7 (4/24-4/28)	Test Erin's deep learning code final time; Work on Presentation	Work on Final Presentation	Check all code; Overview presentation work	Work on Final Presentation	Final Presentation
Week 8 (5/1-5/5)	Ensure all pulls go through; Update the Wiki final time	Work on App Note; Update README	Work on App Note	Make sure all code has been commented	Final App Note; Final Project Code turned in
