Assignment 1
The basics of R:
R is excellent for working with data. With very little R code you can do a lot of data manipulation.
This exercise covers some of the basics of data analysis and manipulation with R. In this assignment, we look at how likely a passenger on the Titanic was to survive the sinking based on their ticket class, using R basics and a simple histogram-style bar chart.
- Setting up your environment: Download R and RStudio. “RStudio provides popular open source and enterprise-ready professional software for the R statistical computing environment.”
- Go to Kaggle competitions and get the Titanic test and train datasets. You have to sign up to Kaggle to access the competitions.
- Reading CSV files: Read the train and test CSV files you downloaded from Kaggle. To read the CSV files, use the following code:
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)
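To see what read.csv does without needing the Kaggle files, here is a minimal sketch that writes a tiny stand-in CSV to a temporary file and reads it back. The rows and the names csv_path and toy are made up for illustration; only the column names mirror the ones used later in this assignment.

```r
# Write a tiny stand-in CSV to a temporary file (made-up rows, not the Kaggle data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Survived,Pclass",
             "1,0,3",
             "2,1,1",
             "3,1,3"), csv_path)

# header = TRUE tells read.csv that the first line holds the column names
toy <- read.csv(csv_path, header = TRUE)

dim(toy)    # number of rows and columns
names(toy)  # column names taken from the header line
head(toy)   # first few rows
```

Note that R's logical constants are TRUE and FALSE in capitals; True is not recognized.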
- Adding a variable (Survived) to a data frame: After reading the CSV files you will see that test.csv has one variable fewer than train.csv (10 and 11 variables respectively), because it lacks the survival column. In this exercise you create a new data frame called test.survived by adding that variable to test, bringing it to 11 variables. The new column must have the same name as the one in train (Survived) so that the two data frames can be combined later. Use the following code:
test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,])
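The same step can be tried on a small made-up data frame. This is only a sketch: toy_test, placeholder and toy_test_survived are hypothetical names, and the rows are invented.

```r
# Toy stand-in for the test set: note there is no Survived column
toy_test <- data.frame(Pclass = c(3, 1, 2),
                       Sex = c("male", "female", "female"))

# rep() repeats "None" once per row, so the new column has the right length
placeholder <- rep("None", nrow(toy_test))

# data.frame() puts the new column in front of the existing ones
toy_test_survived <- data.frame(Survived = placeholder, toy_test)

names(toy_test_survived)
ncol(toy_test_survived)
```

Using nrow(toy_test) rather than a hard-coded number keeps the placeholder column the right length whatever the size of the data.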
- Combining data frames: Combine the train dataset and the test.survived dataset using the following code:
data.combined <- rbind(train, test.survived)
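As a small illustration of how rbind stacks data frames, the sketch below uses two made-up data frames whose columns are in a different order. The names toy_train, toy_test_survived and toy_combined are hypothetical. For data frames, rbind matches columns by name, which is why the column names have to agree.

```r
# Two toy data frames with the same column names in a different order
toy_train <- data.frame(Pclass = c(3, 1),
                        Survived = c(0L, 1L))
toy_test_survived <- data.frame(Survived = c("None", "None"),
                                Pclass = c(2, 3),
                                stringsAsFactors = FALSE)

# rbind stacks the rows; columns are matched by name, not by position
toy_combined <- rbind(toy_train, toy_test_survived)

nrow(toy_combined)           # rows from both data frames
names(toy_combined)          # column order follows the first data frame
class(toy_combined$Survived) # inspect the type after mixing numbers and "None"
```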
- View the structure of your data frame to check its data types using the following code:
str(data.combined)
- Changing a variable's data type: Change the data types of the Pclass and Survived variables to factors with the following code:
data.combined$Survived <- as.factor(data.combined$Survived)
data.combined$Pclass <- as.factor(data.combined$Pclass)
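To see what the conversion changes, here is a minimal sketch on a made-up data frame (toy is a hypothetical name). After as.factor, each column stores a fixed set of levels, which str and table then report.

```r
# Toy data with the same kind of values as the combined data set
toy <- data.frame(Survived = c("0", "1", "None", "1"),
                  Pclass = c(3, 1, 2, 3),
                  stringsAsFactors = FALSE)

# Convert both columns to factors (categorical variables)
toy$Survived <- as.factor(toy$Survived)
toy$Pclass <- as.factor(toy$Pclass)

levels(toy$Survived)  # the distinct categories found in the column
levels(toy$Pclass)
table(toy$Pclass)     # count of rows per category
str(toy)              # both columns are now listed as Factor
```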
Part 2: For the second part of this assignment we will use ggplot2, an R package for creating graphs.
- Installing ggplot2: In RStudio, go to the Packages tab and click Install. Enter ggplot2 in the Packages field, then click Install.
- In this part we will create a histogram-style bar chart with ggplot from the train dataset that shows how many passengers survived or perished on the Titanic in each ticket class:
library(ggplot2)
ggplot(train, aes(x = Pclass, fill = factor(Survived))) + geom_bar() + xlab("Pclass") + ylab("Total Count") + labs(fill = "Survived")
You should expect a stacked bar chart with one bar per ticket class, each bar split by survival outcome.
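Because geom_bar() plots counts by default, the survival percentages per ticket class can also be computed directly as a cross-check. The sketch below uses base R on a made-up data frame; toy_train, counts and pct are hypothetical names and the rows are invented, not taken from the Kaggle data.

```r
# Toy stand-in for the train set (invented rows)
toy_train <- data.frame(Pclass = c(1, 1, 2, 3, 3, 3),
                        Survived = c(1, 1, 0, 0, 0, 1))

# Cross-tabulate ticket class against survival outcome
counts <- table(Pclass = toy_train$Pclass, Survived = toy_train$Survived)

# margin = 1 turns each row (ticket class) into proportions that sum to 1
pct <- round(100 * prop.table(counts, margin = 1), 1)

counts
pct
```

On the real data, replacing toy_train with train should give the share of survivors within each ticket class.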
Useful functions and concept definitions for this assignment
1. data.frame(): Creates a data frame, which is a list of variables with the same number of rows and unique row names, given class "data.frame".
2. rep(x, times): Repeats the values in x; times sets how many times they are repeated.
3. rbind(): Combines data frames by rows, stacking one on top of the other. Its counterpart cbind() combines by columns.
4. str(): Compactly displays the internal structure of an R object.
5. factor: Conceptually, factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.
Part B.
Give a two- to three-line definition of each of the following concepts:
- data.frame
- rep
- rbind
- factor
- str