Project 2

Project – Regression Analysis.

In this project you will develop a logistic regression model that classifies whether passengers survived the sinking of the Titanic, using a mix of categorical and continuous independent variables.

Note: RStudio is the recommended tool for this project, because it was used to design the project.

Link to dataset: Download the train and test datasets from this link: Titanic dataset

Part 1 - Training algorithm

Data Preparation

Delete the following variables, which are not needed for the analysis: i. PassengerId ii. Ticket iii. Name iv. Cabin v. Embarked vi. SibSp vii. Parch viii. Fare

Deliverable: Provide a screenshot of your resulting data frame.
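The column removal above can be sketched as follows. This is a minimal illustration on a made-up four-row data frame, not the real data: the values are invented, and the column names `Pclass`, `Sex`, and `Age` are assumptions based on the commonly distributed Titanic CSV. With the real file you would load the training data first and apply the same subsetting.

```r
# Toy stand-in for the training data (all values are made up for illustration).
train <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 3, 2),          # assumed column name
  Name        = c("A", "B", "C", "D"),
  Sex         = c("male", "female", "female", "male"),  # assumed column name
  Age         = c(22, 38, NA, 35),      # assumed column name
  SibSp       = c(1, 1, 0, 0),
  Parch       = c(0, 0, 0, 0),
  Ticket      = c("T1", "T2", "T3", "T4"),
  Fare        = c(7.25, 71.28, 7.92, 8.05),
  Cabin       = c("", "C85", "", ""),
  Embarked    = c("S", "C", "S", "S")
)

# Columns the assignment asks to remove.
drop_cols <- c("PassengerId", "Ticket", "Name", "Cabin",
               "Embarked", "SibSp", "Parch", "Fare")

# Keep every column whose name is not listed in drop_cols.
train <- train[, !(names(train) %in% drop_cols)]

names(train)
```

Subsetting by name rather than by column position keeps the step valid even if the column order in the file differs.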

Convert the following variables to factors: i. Pclass ii. Survived

Deliverable: Provide a screenshot of the structure of the data frame.
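One way to do the factor conversion is sketched below on a small made-up data frame. The values are invented, and `Sex` and `Age` are assumed to be the columns left after the deletion step.

```r
# Toy data frame standing in for the reduced training data (made-up values).
train <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, NA, 35)
)

# Convert the two variables named in the assignment to factors.
train$Pclass   <- as.factor(train$Pclass)
train$Survived <- as.factor(train$Survived)

# str() prints the structure, including which columns are now factors.
str(train)
```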

Analyze your data frame to find the values that are NA.

a) How many variables have NAs, and how many rows have NAs?

b) Use one of the following methods to handle the NAs:

i) Deleting the rows with NAs. ii) Replacing NAs with the mean of the variable. iii) Replacing NAs with the minimum of the variable.

_Justify your choice._
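The NA analysis and the three handling options can be sketched like this. The data frame is made up, and it assumes `Age` is the only column with missing values, which may not hold for the real data.

```r
# Toy data frame with missing ages (made-up values).
train <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1)),
  Pclass   = factor(c(3, 1, 3, 2, 1)),
  Sex      = c("male", "female", "female", "male", "female"),
  Age      = c(22, NA, 26, NA, 35)
)

# a) Count NAs per variable, then count variables and rows affected.
na_per_variable <- colSums(is.na(train))
vars_with_na    <- sum(na_per_variable > 0)
rows_with_na    <- sum(!complete.cases(train))

# b) i) Delete every row that contains an NA.
train_dropped <- na.omit(train)

# b) ii) Replace NAs with the mean of the observed values.
train_mean <- train
train_mean$Age[is.na(train_mean$Age)] <- mean(train$Age, na.rm = TRUE)

# b) iii) Replace NAs with the minimum of the observed values.
train_min <- train
train_min$Age[is.na(train_min$Age)] <- min(train$Age, na.rm = TRUE)
```

Deleting rows shrinks the sample, while mean or minimum replacement keeps every row but changes the distribution of the variable. That trade-off is the kind of point the justification can address.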

Logistic Regression

Run a logistic regression with the glm() function, using Survived as the response and all the other variables in your data frame as predictors.

Deliverable: Provide a screenshot of the summary of the logistic regression.
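A minimal sketch of the model fit is shown below. Because the real data is not reproduced here, it runs on simulated passengers: the survival mechanism, the coefficients, and the column names `Sex` and `Age` are assumptions made only so that the example runs.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned training data (not real Titanic values).
train <- data.frame(
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)

# Made-up survival mechanism, used only to generate a 0/1 response.
lin <- 1.5 - 0.6 * as.integer(train$Pclass) -
  1.2 * (train$Sex == "male") - 0.01 * train$Age
train$Survived <- factor(rbinom(n, 1, plogis(lin)))

# family = binomial makes glm() fit a logistic regression;
# "Survived ~ ." uses every other column as a predictor.
model <- glm(Survived ~ ., data = train, family = binomial)

summary(model)
```

With a factor response, glm() treats the first level ("0") as failure, so the fitted probabilities refer to Survived being "1".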

Run a prediction on the training data using the logistic regression model you fitted in the previous step.

Deliverable: Provide a screenshot of your output. Note that you don’t have to capture everything.
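The prediction step might look like the sketch below, again on simulated data (the data-generating code is an assumption for illustration). The 0.5 cutoff used to turn probabilities into classes is also an assumption, since the assignment does not name a threshold.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned training data (not real Titanic values).
train <- data.frame(
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
lin <- 1.5 - 0.6 * as.integer(train$Pclass) -
  1.2 * (train$Sex == "male") - 0.01 * train$Age
train$Survived <- factor(rbinom(n, 1, plogis(lin)))

model <- glm(Survived ~ ., data = train, family = binomial)

# type = "response" returns probabilities instead of log-odds.
prob <- predict(model, newdata = train, type = "response")

# Assumed cutoff of 0.5 to convert probabilities into 0/1 predictions.
pred_class <- ifelse(prob > 0.5, 1, 0)

# head() shows only the first rows, which is enough for a screenshot.
head(data.frame(actual = train$Survived, prob = prob, predicted = pred_class))
```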

Show a graph that compares your model's predictions with the actual values. Note that you will need the ggplot2 library for this part.

_**Deliverable:**_ Provide a screenshot of your graph.
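One possible comparison graph is a boxplot of predicted probability grouped by actual outcome, sketched below on simulated data. The choice of a boxplot is an assumption; the assignment does not prescribe a plot type, and the simulated data stands in for the real training set.

```r
library(ggplot2)

set.seed(1)
n <- 200

# Simulated stand-in for the cleaned training data (not real Titanic values).
train <- data.frame(
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
lin <- 1.5 - 0.6 * as.integer(train$Pclass) -
  1.2 * (train$Sex == "male") - 0.01 * train$Age
train$Survived <- factor(rbinom(n, 1, plogis(lin)))

model <- glm(Survived ~ ., data = train, family = binomial)

# Pair each actual outcome with its predicted probability.
plot_df <- data.frame(
  actual = train$Survived,
  prob   = predict(model, type = "response")
)

# A well-separating model shows higher probabilities for actual = 1.
p <- ggplot(plot_df, aes(x = actual, y = prob)) +
  geom_boxplot() +
  labs(x = "Actual Survived", y = "Predicted probability of survival")

print(p)
```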

Part 2 - Testing algorithm with new data

Repeat the three Data Preparation steps from Part 1 (deleting variables, converting to factors, and handling NAs) on the test data set.

Deliverable: Same as above. Skip the justification of your NA-handling method if you are using the same method as in Part 1.

Use the logistic model you created in Part 1 to run a prediction on the test data frame.

Deliverable: Provide a screenshot of your predicted probabilities.
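Applying the Part 1 model to new data can be sketched as follows. The helper `simulate_passengers()` is hypothetical and exists only to generate stand-in train and test frames. With the real data, the test frame must have the same column names and factor levels as the training frame.

```r
set.seed(42)

# Hypothetical helper: generates made-up passengers with fixed factor levels.
simulate_passengers <- function(n) {
  data.frame(
    Pclass = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex    = factor(sample(c("female", "male"), n, replace = TRUE),
                    levels = c("female", "male")),
    Age    = round(runif(n, 1, 70))
  )
}

# Training data with a made-up survival mechanism.
train <- simulate_passengers(200)
lin <- 1.5 - 0.6 * as.integer(train$Pclass) -
  1.2 * (train$Sex == "male") - 0.01 * train$Age
train$Survived <- factor(rbinom(nrow(train), 1, plogis(lin)))

# Test data has predictors only, prepared the same way as the training data.
test <- simulate_passengers(50)

# Fit on the training data, then predict probabilities for the test data.
model     <- glm(Survived ~ ., data = train, family = binomial)
test_prob <- predict(model, newdata = test, type = "response")

head(test_prob)
```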

Create a vector filled with 0 whose length equals the number of rows in your test data frame. For example, if your test data frame has 10 rows, the vector should have length 10.

Deliverable: Provide a screenshot of your vector.
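A short sketch of the zero vector, using a made-up 10-row test data frame to match the example above. The name `pred_vector` is an assumption chosen for illustration.

```r
# Toy test data frame with 10 rows (made-up values).
test <- data.frame(Age = c(22, 38, 26, 35, 54, 2, 27, 14, 4, 58))

# rep() repeats 0 once per row, so the vector length tracks the data frame.
pred_vector <- rep(0, nrow(test))

pred_vector
length(pred_vector)
```

Using `nrow(test)` instead of a hard-coded length keeps the vector the right size if rows are removed during NA handling.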