This repository has been archived by the owner on Sep 24, 2020. It is now read-only.

Assignment 1

dgupt097 edited this page Sep 15, 2018 · 22 revisions

The basics of R:

R is well suited to working with data. With very little R code you can do a lot of data manipulation.

This exercise covers some of the basics of data analysis and manipulation with R. In this assignment, we will look at how likely a passenger on the Titanic was to survive the sinking, based on their ticket class, using basic R and a simple bar chart.

  1. Setting up your environment: Download and install R and RStudio. “RStudio provides popular open source and enterprise-ready professional software for the R statistical computing environment.”
  2. Getting the data: Go to Kaggle competitions and download the Titanic train and test datasets. You have to sign up to Kaggle, or you will not have access to the competitions.
  3. Reading CSV files: Read the train and test csv files you have downloaded from Kaggle. To read a csv file use the following code: train <- read.csv("train.csv", header = TRUE) and, in the same way, test <- read.csv("test.csv", header = TRUE)
  4. Adding a variable (Survived) to a data frame: After reading the csv files you will notice that test.csv and train.csv have 10 and 11 variables respectively (the exact counts can differ with the version of the dataset you download); test.csv lacks the Survived column. In this exercise you will create a new data frame called test.survived by adding that variable to test, so that it has the same variables as train, using the following code: test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,])
  5. Combining data frames: Combine the train dataset and the test.survived dataset using the following code: data.combined <- rbind(train, test.survived)
  6. Viewing the structure: View the structure of your data frame, including the data type of each variable, using the following code: str(data.combined)
  7. Changing a variable's data type: Change the data types of the Pclass and Survived variables to factors with the following code: data.combined$Survived <- as.factor(data.combined$Survived) and data.combined$Pclass <- as.factor(data.combined$Pclass)

Part 2: For the second part of this assignment we will be using an R package, ggplot2, for creating graphs.
  8. Installing ggplot2: In RStudio, go to the Packages tab and click Install. Enter ggplot2 in the Packages field and click Install. Alternatively, run install.packages("ggplot2") in the console.
  9. Plotting: Load the package with library(ggplot2). In this part we will create a bar chart with ggplot from the train dataset that shows how many passengers survived or perished on the Titanic, by ticket class: ggplot(train, aes(x = Pclass, fill = factor(Survived))) + geom_bar() + xlab("Pclass") + ylab("Total Count") + labs(fill = "Survived") You should expect a stacked bar chart with one bar per ticket class, coloured by survival outcome.
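
Steps 3 to 7 can be tried out before downloading anything. The following is a minimal sketch, assuming two tiny made-up data frames named train and test in place of the real csv files; the values are invented for illustration, and only the column names Survived, Pclass and Sex come from the Kaggle dataset. It also tabulates the counts that the bar chart in step 9 draws.

```r
# Toy stand-ins for train.csv and test.csv (values are invented for illustration).
train <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 2, 3),
  Sex      = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Pclass = c(1, 3),
  Sex    = c("female", "male"),
  stringsAsFactors = FALSE
)

# Step 4: give test a Survived column so both data frames have the same variables.
test.survived <- data.frame(
  Survived = rep("None", nrow(test)),
  test[, ],
  stringsAsFactors = FALSE
)

# Step 5: stack the rows; rbind matches the columns by name.
data.combined <- rbind(train, test.survived)

# Step 7: treat Survived and Pclass as categorical variables.
data.combined$Survived <- as.factor(data.combined$Survived)
data.combined$Pclass   <- as.factor(data.combined$Pclass)

# Step 6: inspect the structure of the combined data frame.
str(data.combined)

# Counts per ticket class and outcome, i.e. the heights of the bar segments.
counts <- table(Pclass = train$Pclass, Survived = train$Survived)
# Row-wise proportions: the share of each class that survived or perished.
props <- prop.table(counts, margin = 1)
print(props)
```

The column added in step 4 has to be spelled exactly like the one in train (Survived, with a capital S); otherwise rbind stops with an error because the column names do not match.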

Useful functions and concept definitions for this assignment

  1. data.frame() : A data frame is a list of variables with the same number of rows and unique row names, given class "data.frame".
  2. rep(x, times) : Repeats the value of x the number of times given by times.
  3. rbind: Combines data frames by rows (its counterpart, cbind, combines them by columns).
  4. str: Compactly displays the internal structure of an R object.
  5. factor: Conceptually, factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.
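
The functions listed above can be seen working together in a short, self-contained sketch. The data frames first and second and their columns id and group are hypothetical names chosen for illustration; they are not part of the Titanic dataset.

```r
# rep: repeat a value a given number of times.
labels <- rep("None", times = 3)

# data.frame: columns of equal length, one row per observation.
first  <- data.frame(id = 1:2, group = c("a", "b"))
second <- data.frame(id = 3:5, group = labels)

# rbind: stack data frames that share the same column names, row by row.
both <- rbind(first, second)

# factor: store a variable as a categorical one with a fixed set of levels.
both$group <- factor(both$group)
print(levels(both$group))

# str: compact summary of the columns and their types.
str(both)
```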

Part B.

Give a 2 to 3 line definition of each of the following concepts:

  1. data.frame
  2. rep
  3. rbind
  4. factor
  5. str