Project2
In this project you will develop a logistic regression model that predicts whether a passenger survived the sinking of the Titanic, using a mix of categorical and continuous independent variables.
Note: RStudio is the recommended tool for this project, because it was used to define the project.
Dataset: Download the train and test datasets from this link: Titanic dataset
- Delete the following variables, which are not needed for the analysis: i. PassengerId ii. Ticket iii. Name iv. Cabin v. Embarked vi. SibSp vii. Parch viii. Fare
Deliverable: Provide a screenshot of your resulting data frame.
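The column-deletion step above could be sketched as follows. This is only an illustration: the tiny `train` data frame is made up so the snippet runs on its own, and in practice it would come from something like `read.csv()` on the downloaded training file (the file name is not specified here).

```r
# Toy stand-in for the training data; all values are made up for illustration.
train <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 3, 2),
  Name        = c("A", "B", "C", "D"),
  Sex         = c("male", "female", "female", "male"),
  Age         = c(22, 38, NA, 35),
  SibSp       = c(1, 1, 0, 0),
  Parch       = c(0, 0, 0, 0),
  Ticket      = c("t1", "t2", "t3", "t4"),
  Fare        = c(7.25, 71.28, 7.92, 13.00),
  Cabin       = c("", "C85", "", ""),
  Embarked    = c("S", "C", "S", "S")
)

# Variables the assignment asks to delete.
drop_cols <- c("PassengerId", "Ticket", "Name", "Cabin",
               "Embarked", "SibSp", "Parch", "Fare")

# Keep every column whose name is not in the drop list.
train <- train[, !(names(train) %in% drop_cols), drop = FALSE]

print(names(train))
```

Filtering by name rather than by column position keeps the step valid even if the column order in the file differs.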
- Convert the following variables to factors: i. Pclass ii. Survived
Deliverable: Provide a screenshot of the structure of the data frame (for example, the output of str()).
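A minimal sketch of the factor conversion, assuming the data frame already contains only the retained columns. The `train` data frame below is made-up toy data so the snippet is self-contained.

```r
# Toy stand-in for the reduced training data (made-up values).
train <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, 26, 35)
)

# Treat the outcome and the passenger class as categories, not numbers.
train$Survived <- as.factor(train$Survived)
train$Pclass   <- as.factor(train$Pclass)

# str() shows the type of each column; this is the structure to screenshot.
str(train)
```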
- Analyze your data frame to find the values that are NA.
a) How many variables have NAs, and how many rows have NAs?
b) Use one of the following methods to handle the NAs:
i) Deleting the rows with NAs. ii) Replacing NAs with the mean of the variable. iii) Replacing NAs with the minimum of the variable.
_Justify your choice._
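The NA analysis and the three handling options above might look like the sketch below. The toy `train` data frame is made up, and it assumes the NAs sit in a numeric column (`Age` here); which option to keep is the choice the assignment asks you to justify.

```r
# Toy stand-in for the training data with some missing ages (made-up values).
train <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1)),
  Pclass   = factor(c(3, 1, 3, 2, 1)),
  Sex      = c("male", "female", "female", "male", "female"),
  Age      = c(22, NA, 26, NA, 40)
)

# a) Count NAs per column, then the number of variables and rows affected.
na_per_column <- colSums(is.na(train))
vars_with_na  <- sum(na_per_column > 0)
rows_with_na  <- sum(!complete.cases(train))
print(na_per_column)

# b-i) Delete the rows that contain any NA.
train_dropped <- na.omit(train)

# b-ii) Replace NAs with the mean of the observed values.
train_mean <- train
train_mean$Age[is.na(train_mean$Age)] <- mean(train$Age, na.rm = TRUE)

# b-iii) Replace NAs with the minimum of the observed values.
train_min <- train
train_min$Age[is.na(train_min$Age)] <- min(train$Age, na.rm = TRUE)
```

Deleting rows shrinks the sample, while mean or minimum replacement keeps every row at the cost of distorting the variable's distribution.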
- Fit the logistic regression with the glm() function, using all the variables in your data frame.
Deliverable: Provide a screenshot of the summary of the logistic regression.
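As a rough sketch of the model-fitting step, the snippet below fits a logistic regression with `glm()` and `family = binomial`. The helper `make_toy()` is hypothetical and only simulates Titanic-like data so the example runs without the real files; the coefficients it produces say nothing about the actual dataset.

```r
# Hypothetical helper: simulate a small Titanic-like data frame (made-up data).
make_toy <- function(n, seed) {
  set.seed(seed)
  df <- data.frame(
    Pclass = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex    = factor(sample(c("female", "male"), n, replace = TRUE),
                    levels = c("female", "male")),
    Age    = round(runif(n, 1, 70))
  )
  # Arbitrary survival mechanism, chosen only to give the model some signal.
  lin <- 2 - 0.7 * as.integer(df$Pclass) - 2 * (df$Sex == "male") - 0.01 * df$Age
  df$Survived <- factor(rbinom(n, 1, plogis(lin)), levels = c(0, 1))
  df
}

train <- make_toy(200, seed = 1)

# "Survived ~ ." uses every other column in the data frame as a predictor.
model <- glm(Survived ~ ., data = train, family = binomial)

# The summary table is what the deliverable asks for.
print(summary(model))
```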
- Run a prediction using the model you fitted with glm() in the previous step.
Deliverable: Provide a screenshot of your output. Note that you don’t have to capture everything.
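One way to run the prediction is `predict()` with `type = "response"`, which returns probabilities rather than log-odds. The sketch below uses a hypothetical `make_toy()` helper that simulates data, so the numbers are illustrative only.

```r
# Hypothetical helper: simulate a small Titanic-like data frame (made-up data).
make_toy <- function(n, seed) {
  set.seed(seed)
  df <- data.frame(
    Pclass = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex    = factor(sample(c("female", "male"), n, replace = TRUE),
                    levels = c("female", "male")),
    Age    = round(runif(n, 1, 70))
  )
  lin <- 2 - 0.7 * as.integer(df$Pclass) - 2 * (df$Sex == "male") - 0.01 * df$Age
  df$Survived <- factor(rbinom(n, 1, plogis(lin)), levels = c(0, 1))
  df
}

train <- make_toy(200, seed = 1)
model <- glm(Survived ~ ., data = train, family = binomial)

# Predicted probability of Survived == 1 for each training row.
train_prob <- predict(model, newdata = train, type = "response")

# head() shows only the first few values, enough for a screenshot.
print(head(train_prob))
```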
- Show a graph that compares your model's predictions with the actual values. Note that you will need the ggplot2 library for this part.
_**Deliverable:**_ Provide a screenshot of your graph.
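The assignment does not prescribe a chart type, so the following is just one possible sketch: a histogram of predicted probabilities, colored by the actual outcome, with a dashed line at 0.5. The data are simulated by a hypothetical `make_toy()` helper, and the plotting code is guarded so the snippet still runs where ggplot2 is not installed.

```r
# Hypothetical helper: simulate a small Titanic-like data frame (made-up data).
make_toy <- function(n, seed) {
  set.seed(seed)
  df <- data.frame(
    Pclass = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex    = factor(sample(c("female", "male"), n, replace = TRUE),
                    levels = c("female", "male")),
    Age    = round(runif(n, 1, 70))
  )
  lin <- 2 - 0.7 * as.integer(df$Pclass) - 2 * (df$Sex == "male") - 0.01 * df$Age
  df$Survived <- factor(rbinom(n, 1, plogis(lin)), levels = c(0, 1))
  df
}

train <- make_toy(200, seed = 1)
model <- glm(Survived ~ ., data = train, family = binomial)

# Pair each actual outcome with the model's fitted probability.
plot_df <- data.frame(
  actual         = train$Survived,
  predicted_prob = fitted(model)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  plt <- ggplot(plot_df, aes(x = predicted_prob, fill = actual)) +
    geom_histogram(binwidth = 0.05, position = "identity", alpha = 0.6) +
    geom_vline(xintercept = 0.5, linetype = "dashed") +
    labs(x = "Predicted probability of survival",
         y = "Number of passengers",
         fill = "Actual Survived")
  print(plt)
}
```

If the model separates the classes well, actual survivors cluster to the right of the dashed line and non-survivors to the left.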
- Repeat the first three steps (deleting variables, converting to factors, handling NAs) on the test data set.
Deliverable: Same as above. Skip the justification of the NA-handling method if you are using the same method.
- Use the logistic model you fitted on the training data to run a prediction on the test data frame.
Deliverable: Provide a screenshot of your predicted probabilities.
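Predicting on the test data reuses the model fitted on the training data and passes the test data frame as `newdata`. In this sketch both data frames are simulated by a hypothetical `make_toy()` helper. The test frame must have the same predictor columns and factor levels as the training frame, which is why the same preprocessing is applied to both.

```r
# Hypothetical helper: simulate a small Titanic-like data frame (made-up data).
make_toy <- function(n, seed) {
  set.seed(seed)
  df <- data.frame(
    Pclass = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex    = factor(sample(c("female", "male"), n, replace = TRUE),
                    levels = c("female", "male")),
    Age    = round(runif(n, 1, 70))
  )
  lin <- 2 - 0.7 * as.integer(df$Pclass) - 2 * (df$Sex == "male") - 0.01 * df$Age
  df$Survived <- factor(rbinom(n, 1, plogis(lin)), levels = c(0, 1))
  df
}

train <- make_toy(200, seed = 1)
test  <- make_toy(50, seed = 2)

# Drop the outcome from the toy test frame to show that predict() only
# needs the predictor columns.
test$Survived <- NULL

model <- glm(Survived ~ ., data = train, family = binomial)

# Predicted survival probability for each test row.
test_prob <- predict(model, newdata = test, type = "response")
print(head(test_prob))
```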
- Create a vector filled with “0” whose length equals the number of rows in your test data frame. For example, if your test data frame has 10 rows, make the vector length 10.
Deliverable: Provide a screenshot of your vector.
- Set the vector element to “1” wherever the predicted probability is greater than 0.5.
Deliverable: Provide a screenshot of your new vector, then bind the new vector to your original test data.
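The last two steps (a zero vector, then thresholding at 0.5 and binding) can be sketched as below. The `test` data frame and the `test_prob` probabilities are made up for illustration, the column name `Predicted` is an arbitrary choice, and the sketch assumes numeric 0/1 values are acceptable.

```r
# Toy stand-in for the preprocessed test data (made-up values).
test <- data.frame(
  Pclass = factor(c(3, 1, 2, 3)),
  Sex    = c("male", "female", "male", "female"),
  Age    = c(34, 47, 62, 27)
)

# Hypothetical predicted probabilities, one per test row.
test_prob <- c(0.12, 0.91, 0.50, 0.67)

# Start with a vector of zeros, one entry per row of the test data frame.
pred_class <- rep(0, nrow(test))

# Set the entry to 1 wherever the probability is strictly greater than 0.5.
pred_class[test_prob > 0.5] <- 1

# Bind the predictions to the test data as a new column.
test_with_pred <- cbind(test, Predicted = pred_class)
print(test_with_pred)
```

Note that a probability of exactly 0.5 stays at 0, because the condition is "greater than 0.5".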
Final Deliverable: Provide a zip of your code, workspace, and project documentation.
Useful tutorials
Overview
Basic Concepts
- What is Data Mining?
- Data mining goals
- Data objects and statistical concepts
- Machine Learning techniques
- Applications
- Related Technologies
Machine Learning Algorithms
- Association rules
- Classification
- Prediction
- Clustering
Machine Learning tool tutorials
Assignment
Advanced Topics
Data warehouse and OLAP