The goal of this project is for you to practice statistical analysis using the iterative data analysis process. For this project, you will use the Housing Prices dataset we have chosen for you. Download the train.csv
dataset, then use your statistical analysis skills to analyze it. The goal of your analysis is to identify the features of houses that most affect the sale price.
You will be working individually for this project, but we'll be guiding you along the process and helping you as you go.
The technical requirements for this project are as follows:
-
Try to apply everything you have learned so far about data analysis (in creative ways if you can) such as data cleaning, data manipulation, data visualization, and various statistical analysis methods.
-
Apply the iterative data analysis process -- setting expectations, collecting information, and reacting to data / revising expectations.
-
Conduct your analysis in a Jupyter Notebook using Pandas, NumPy, SciPy, Matplotlib, Seaborn, Plotly, and other Python libraries you have learned, as necessary.
The following deliverables should be pushed to your GitHub repo for this project:
-
A Jupyter Notebook (statistical-analysis.ipynb) containing your Python code, outputs, and data visualizations. Make sure to explain each of your steps in Markdown cells or Python comments.
-
[optional] A README.md file containing any additional information.
-
Explore data and understand what the fields mean.
-
Examine the relationships between the sale price and other features in the dataset. Use data visualization techniques to help you gain an intuitive understanding of the relationships.
-
Make an informed guess about which features should be investigated in depth.
-
Data cleaning & manipulation. Apply the following techniques as appropriate:
- Adjust skewed data distribution.
- Remove columns with a high proportion of missing values.
- Remove records with missing values.
- Feature reduction.
- Convert categorical data to numerical.
-
Compute field relationship scores with the chosen statistical model.
-
Present your findings in statistical summary and/or data visualizations.
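As a rough illustration of the cleaning and scoring steps listed above, here is a minimal sketch. It makes several assumptions that are not part of this brief: the column names (`LivingArea`, `Quality`, `PoolQuality`, `SalePrice`) are hypothetical stand-ins, the data is synthetic rather than loaded from train.csv, the 50% missing-value threshold is arbitrary, and Pearson correlation is only one possible choice for the "chosen statistical model". Treat it as a starting point to adapt, not as the expected solution.

```python
import numpy as np
import pandas as pd


def clean_housing(df, target="SalePrice", max_missing=0.5):
    """Apply the cleaning steps from the list above to a copy of df."""
    out = df.copy()
    # Remove columns with a high proportion of missing values
    # (the 0.5 threshold is an assumption, tune it for your data).
    missing_share = out.isna().mean()
    out = out.loc[:, missing_share <= max_missing]
    # Remove records that still contain missing values.
    out = out.dropna()
    # Adjust the skewed target distribution with log(1 + x).
    out = out.assign(**{target: np.log1p(out[target])})
    # Convert categorical columns to numerical indicator columns.
    return pd.get_dummies(out, dtype=float)


def relationship_scores(df, target="SalePrice"):
    """Rank features by absolute Pearson correlation with the target."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Tiny synthetic stand-in for train.csv (all values are made up).
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(500, 3000, n)
raw = pd.DataFrame({
    "LivingArea": area,
    "Quality": rng.choice(["low", "high"], n),
    "PoolQuality": [np.nan] * n,  # entirely missing, so it gets dropped
    "SalePrice": 50_000 + 100 * area + rng.normal(0, 5_000, n),
})
raw.loc[0, "LivingArea"] = np.nan  # one incomplete record

cleaned = clean_housing(raw)
scores = relationship_scores(cleaned)
print(scores)
```

With the real dataset you would replace the synthetic frame with `pd.read_csv("train.csv")` and check each step against your expectations before moving on, in keeping with the iterative process described above.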
-
Technical Requirements: Did you deliver a project that met all the technical requirements? Given what the class has covered so far, did you build something that was reasonably complex?
-
Creativity: Did you add a personal spin or creative element to your project submission? Did you incorporate domain knowledge or a unique perspective into your analysis?
-
Code Quality: Did you follow code style guidance and best practices covered in class?
-
Total: Your instructors will give you a total score on your project between:
| Score | Expectations |
| --- | --- |
| 0 | Does not meet expectations |
| 1 | Meets expectations, good job! |
| 2 | Exceeds expectations, you wonderful creature, you! |
This will be useful as an overall gauge of whether you met the project goals, but the more important scores are described in the specs above, which can help you identify where to focus your efforts for the next project!