Supplementary: R Basics Tutorial
- R is one of the most widely used programming languages for statistical computing, data analysis, and visualization
- R is free and open-source
- Runs on all major platforms
- Can call Python modules and scripts (for example, through the reticulate package)
- Pros and Cons for R:
- Pros:
- Ideal for data visualization
- Huge community for developers
- Cons:
- Can take more time to get output
- Memory-intensive
- Not ideal for big data
Options for running R:
- R console
- RStudio IDE
Some basic commands in R:
?ls #Get help for a specific function
help.search('regression') #Search R help files for a word or phrase ('regression' is only a sample search term)
help(package = 'tidyverse') #Find help for a package
getwd() #Get the current working directory
setwd('/path/to/desired/directory') #Set a new working directory as desired
install.packages('tidyverse') #Download and install packages from CRAN
library(tidyverse) #Load packages
ggplot2::geom_density #Call a specific function from a specific package
data(iris) #Load a pre-built dataset
Rules for making variables in R:
- It must not start with a number or underscore(_).
- The name can be a combination of letters, numbers, period(.) and underscore(_). (No hyphens(-))
- Variable names are case-sensitive.
- If the first letter of the object name is a period, the second letter cannot be a number.
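- Reserved words (such as TRUE, FALSE, NULL, if, for, function) cannot be used as variable names. Names of existing functions such as print are not reserved and can technically be reused, but doing so is confusing and best avoided.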
NumA = 24 # Make a numeric variable NumA, and give it a value 24
NumB = 37 # Make a numeric variable NumB, and give it a value 37
NumA <- 24 # Assign value with arrow sign
24 -> NumA # Forward assignment
NumA + NumB # Sum two variables
NumA - NumB # Subtract two variables
NumA * NumB # Multiply two variables
NumA / NumB # Divide two variables
NumA ^ 2 # Exponentiation of a variable
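The naming rules above can be checked programmatically. The sketch below is one way to do it, assuming a small helper called is_valid_name (a hypothetical name, not part of R). It compares each candidate with the output of base R's make.names(), which returns a name unchanged only when it is already valid.

```R
# is_valid_name() is a hypothetical helper written for this illustration.
# make.names() rewrites invalid or reserved names, so an unchanged result
# means the candidate was already a valid variable name.
is_valid_name <- function(name) {
  name == make.names(name)
}

candidates <- c("NumA", "my_var.2", ".x", "1abc", "_x", "my-var", ".2way", "TRUE")
validity <- is_valid_name(candidates)
names(validity) <- candidates
print(validity)

# Names are case-sensitive: numa and NumA are two different variables
numa <- 1
NumA <- 24
print(numa == NumA)   # FALSE
```

Only the first three candidates pass. The others break the rules on leading digits, leading underscores, hyphens, a period followed by a digit, and reserved words.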
- Arrays
- Vectors (uni-dimensional array)
- Matrices (two-dimensional array)
- Lists
- Dataframes
- Tibbles
- Factors
Vectors are one-dimensional arrays of a single data type.
Creating Vectors: (Try print() to see the results)
VecA = c(10,20,30,40,50) # Make a numeric vector VecA
VecB = c("R2-D2","C-3PO","BB-8","IG-11","K-2SO") # Make a string vector VecB
VecC = 1:5 # Make an integer vector of consecutive numbers from 1 to 5
VecD = 1.1:5.5 # Steps by 1 starting at 1.1, giving 1.1 2.1 3.1 4.1 5.1 (5.5 itself is never reached)
VecE = seq(2,8, by=2) # Make a vector between 2 and 8, with interval 2
VecF = rep(2,3) # Make a vector with 2 repeated for 3 times
VecG = rep(c(2,3),3) # Make a vector with a combination repeated for 3 times
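Because a vector holds a single data type, c() converts mixed inputs to one common type. The following is a minimal sketch of that behavior; the variable names are made up for illustration.

```R
# Mixed inputs are converted to the most general type present
# (logical < integer < double < character).
nums_only  <- c(1L, 2.5)        # integer + double
with_text  <- c(10, "R2-D2")    # double + character
with_logic <- c(TRUE, 2L)       # logical + integer

print(class(nums_only))    # "numeric"
print(class(with_text))    # "character"
print(class(with_logic))   # "integer"
print(with_text)           # the number 10 is now the string "10"
```

This is worth keeping in mind when numbers turn into text unexpectedly: a single string in the input is enough to change the whole vector.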
Commands to modify vectors:
VecB = c("R2-D2","C-3PO","BB-8","IG-11","K-2SO")
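VecB[1] # Access the first element
VecB[-4] # Show all elements except the 4th
VecB[1:3] # Access the first 3 elements
sort(VecB) # Return a sorted copy; VecB itself is unchanged
rev(VecB) # Return a reversed copy; VecB itself is unchanged
VecB<-sort(VecB) # Overwrite VecB with the sorted result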
VecA = c(10,20,30,40,50)
table(VecA)
VecG = rep(c(2,3),3)
table(VecG)
unique(VecG)
length(VecG) # Show the number of elements in vector VecG
append(VecG, 8) # Return a copy with 8 added at the end; VecG itself is unchanged
VecG = append(VecG, 8) # Save the result back to VecG
print(VecG)
Matrices are 2-dimensional arrays of a single data type: a matrix can hold all numbers or all character strings, but not a mix.
Commands for creating matrices:
matrix(1:9) # Make a matrix; with no dimensions given, the result has 9 rows and 1 column
matrix(1:9,nrow=3,ncol=3) # How does it differ from the last command?
matrix(1:9, 3, 3) # This will deliver the same outcome
matrix(1:9,nrow=3,ncol=3,byrow=TRUE) # How does it differ from the last command?
diag(c(6, 1, 6), 3, 3) # Make a diagonal matrix with 3 rows and 3 columns, filled by elements (6, 1, 6)
diag(1, 3, 3) # Make a diagonal matrix with 3 rows and 3 columns, filled by element 1
MatA <- matrix(1:9,nrow=3,ncol=3,byrow=TRUE) # Make a matrix MatA with 3 rows and 3 columns
dim(MatA) # Show the dimensions of matrix MatA
length(MatA) # Show the number of elements in matrix MatA
rownames(MatA) = c("a", "b", "c") # Assign row names to matrix MatA
colnames(MatA) = c("d", "e", "f") # Assign column names to matrix MatA
print(MatA)
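The questions in the comments above come down to fill order. As a small sketch (using a 2 x 3 matrix so the difference is easy to see), matrix() fills column by column unless byrow = TRUE is given:

```R
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: fill column by column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fill row by row

print(by_col)
print(by_row)

print(by_col[1, 2])   # 3: the second column starts with the third value
print(by_row[1, 2])   # 2: the first row holds 1, 2, 3

# Filling a 3 x 2 matrix by column and transposing gives the row-wise layout
print(all(t(matrix(1:6, nrow = 3, ncol = 2)) == by_row))
```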
Commands for accessing elements in a matrix:
MatA[2,] # Access the 2nd row of matrix MatA
MatA[,2] # Access the 2nd column of matrix MatA
MatA[2,2] # Access the element of matrix MatA in the 2nd row, 2nd column
MatA["b",3] # Find the row named "b" and print its element in the 3rd column
MatA[1:2,1:3] # Show the subset of matrix MatA made of the first 2 rows and first 3 columns
Commands for modifying a matrix:
MatA <- matrix(1:9, 3, 3, byrow=TRUE)
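MatB <- matrix(10:12, 1, 3, byrow=TRUE) # A 1 x 3 matrix to add as a new row
MatC <- rbind(MatA, MatB) # Bind by rows: MatC has 4 rows and 3 columns
print(MatC)
MatD <- matrix(10:12, 3, 1, byrow=TRUE) # A 3 x 1 matrix to add as a new column
MatE <- cbind(MatA, MatD) # Bind by columns: MatE has 3 rows and 4 columns
print(MatE)
summary(MatE) # Summary statistics for each column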
t(MatA) # Transpose MatA
An R list is a one-dimensional, heterogeneous data structure.
It can hold vectors, matrices, character strings, functions, and even other lists.
Commands for creating an R list (and comparison to vectors):
Count=1:6
Friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
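ListA = c(Count, Friends) # c() builds a vector, so the numbers are converted to character strings
print(ListA)
ListB = list(Count, Friends) # list() keeps each component and its type intact
print(ListB)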
ListC = list("number" = Count, "name" = Friends)
str(ListC)
print(ListC)
Commands for accessing components in an R list:
print(ListC$name)
print(ListC$number)
print(ListC[[1]])
print(ListC[[2]])
print(ListC[[1]][c(1,6)])
Commands for modifying components in an R list:
print(ListC$number)
ListC[[1]][7]=7
print(ListC)
append(ListC[[1]], 8) # Returns a new vector; ListC itself is not changed
print(ListC)
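A list component only changes when the result is assigned back. The sketch below illustrates this with a small made-up list (scores is a hypothetical name), and also shows adding and removing whole components.

```R
scores <- list(number = 1:3, name = c("a", "b", "c"))

append(scores$number, 4L)                    # result is returned but not stored
print(length(scores$number))                 # still 3

scores$number <- append(scores$number, 4L)   # store the result in the list
print(length(scores$number))                 # now 4

scores$extra <- TRUE                         # assigning to a new name adds a component
scores$name <- NULL                          # assigning NULL removes a component
print(names(scores))                         # "number" "extra"
```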
Commands for concatenating / merging R lists
VecA = c(1,2,3,4,5)
VecB = c("R2-D2","C-3PO","BB-8","IG-11","K-2SO")
SW <- list(index = VecA, Android = VecB)
print(SW)
Count = 1:6
Names = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
Friends = list("index" = Count, "Names" = Names)
print(Friends)
ListD = c(SW, Friends)
str(ListD)
ListE = list(SW, Friends)
str(ListE)
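The difference between the two results is that c() merges the components of both lists into one flat list, while list() keeps each input as a separate nested list. A minimal sketch with shortened, made-up data:

```R
droids  <- list(index = 1:2, droid = c("R2-D2", "BB-8"))
friends <- list(index = 1:2, name = c("Rachel", "Ross"))

flat   <- c(droids, friends)      # one list with four components
nested <- list(droids, friends)   # a list that contains two lists

print(length(flat))               # 4
print(names(flat))                # "index" appears twice
print(length(nested))             # 2

print(flat$index)                 # $ returns only the first component named "index"
print(nested[[2]]$name)           # reach into the second nested list
```

When merged lists share component names, as with index here, access by position or renaming may be safer than access by name.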
A data frame is a collection of vectors of the same length, one per column. It resembles a matrix, but each column can have a different data type.
Comparison of matrices and data frames:
MatA <- matrix(1:9,nrow=3,ncol=3,byrow=TRUE)
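colnames(MatA) <- c("d", "e", "f") # Name the columns so they can be referenced
# MatA[which(MatA$d %in% 5),] # Fails: the $ operator does not work on a matrix
MatA[which(as.data.frame(MatA)$d %in% 5),] # Works after conversion to a data frame (no rows match here, because column d holds 1, 4, 7)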
dfA <- as.data.frame(MatA)
print(dfA)
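The practical consequence of the type difference can be seen by building both structures from the same mixed inputs. This is a small sketch with made-up variable names:

```R
id   <- 1:3
name <- c("Rachel", "Monica", "Phoebe")

as_matrix <- cbind(id, name)   # matrix: every value becomes a character string
as_frame  <- data.frame(id = id, name = name, stringsAsFactors = FALSE)  # each column keeps its type

print(class(as_matrix[, "id"]))   # "character"
print(sapply(as_frame, class))    # id is integer, name is character

# Arithmetic works directly on the data frame column
print(as_frame$id * 2)
```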
Commands for creating R data frames:
dfA <- data.frame(
id = c(1:6),
name = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
)
summary(dfA)
print(dfA$name)
dfA$LastName<-c("Green", "Geller", "Buffay", "Tribbiani", "Bing", "Geller")
print(dfA)
Commands for accessing items in data frames:
print(dfA)
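dfA[1] # Select the first column; the result is still a data frame
dfA[[1]] # Extract the first column as a vector
dfA[['name']] # Extract a column as a vector by name
dfA$name # Same extraction with the $ shorthand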
dim(dfA)
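The single-bracket and double-bracket forms return different kinds of objects, which matters when the result is passed to another function. A short sketch on a made-up data frame (people is a hypothetical name):

```R
people <- data.frame(id = 1:3, name = c("Rachel", "Monica", "Phoebe"),
                     stringsAsFactors = FALSE)

one_col_frame <- people["name"]     # single brackets: still a data frame
one_col_vec   <- people[["name"]]   # double brackets: the column itself

print(class(one_col_frame))         # "data.frame"
print(class(one_col_vec))           # "character"

print(people[2, "name"])            # row 2 of the name column
print(people[people$id > 1, ])      # rows selected by a logical condition
```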
Commands for adding / removing data in a data frame:
# Adding a new row (built as a one-row data frame so each column keeps its type)
NewRow <- data.frame(id = 7, name = "Jack", LastName = "Geller")
dfA <- rbind(dfA, NewRow)
dfA
# Adding a new column
NewColumn <- c(234, 234, 234, 234, 234, 234, 22)
dfA <- cbind(dfA, NewColumn)
dfA
colnames(dfA)[ncol(dfA)] <- "Episodes" # Rename the last column
dfA
# making subsets from data frames
subset(dfA, id != 7)
dfA[,-ncol(dfA)]
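One thing to watch when adding rows: c() can hold only one type, so a row that mixes numbers and text becomes all text, and rbind() then converts the numeric column too. The sketch below compares the two approaches on a small made-up data frame (cast is a hypothetical name).

```R
cast <- data.frame(id = 1:2, name = c("Rachel", "Monica"), stringsAsFactors = FALSE)

# Row built with c(): 3 is converted to "3", and the id column follows
coerced <- rbind(cast, c(3, "Phoebe"))
print(class(coerced$id))   # "character"

# Row built as a one-row data frame: id stays numeric
typed <- rbind(cast, data.frame(id = 3, name = "Phoebe", stringsAsFactors = FALSE))
print(class(typed$id))     # "numeric"
print(mean(typed$id))      # arithmetic still works
```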
A tibble is an enhanced version of a data frame in R:
- A tibble never alters the input type.
- With a tibble, there is no need to worry about character strings being converted to factors automatically.
- Tibbles can also contain columns that are lists.
- Tibbles allow non-standard column names.
- A column name in a tibble can start with a number and can contain spaces.
- To use such names, wrap them in backticks.
- A tibble only recycles vectors of length 1.
- A tibble has no row names.
- When a tibble is printed, only the first 10 rows are shown by default.
Commands for tibbles:
# To use tibble, we will need to load tibble package
library(tibble)
mtcars
as_tibble(mtcars)
class(mtcars)
dfA <- data.frame(
id = c(1:6),
name = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross"),
LastName= c("Green", "Geller", "Buffay", "Tribbiani", "Bing", "Geller")
)
row.names(dfA)<-c("one","two","three","four","five","six")
dfA
as_tibble(dfA) # Row names are dropped during the conversion
tblA <- as_tibble(dfA) # as_tibble() replaces the deprecated as.tibble()
tblA
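Several of the tibble properties listed earlier can be observed directly. The following is a sketch, assuming the tibble package is installed; the data and the column name label are made up for illustration.

```R
library(tibble)

# A length-1 value is recycled; non-standard names need backticks
sales <- tibble(`2024 sales` = c(10, 20, 30), region = "north")
print(sales$`2024 sales`)

# Vectors of other lengths are not recycled: tibble() raises an error,
# which is caught here and replaced with a short message
recycled <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "not recycled"
)
print(recycled)

# Row names are dropped by default; the rownames argument keeps them as a column
df <- data.frame(id = 1:2, row.names = c("one", "two"))
kept <- as_tibble(df, rownames = "label")
print(kept)
```

By contrast, data.frame(x = 1:4, y = 1:2) would silently repeat y to fill four rows.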
Factors are a data structure for categorical data: each value is one of a fixed set of levels.
For more information, please check https://www.geeksforgeeks.org/r-factors/?ref=lbp
Commands for creating R factors
VecA <- c("female", "male", "male", "female")
VecA
## [1] "female" "male" "male" "female"
FacA <- factor(VecA)
FacA
## [1] female male male female
## Levels: female male
FacB <- factor(c("female", "male", "male", "female"), levels = c("female", "transgender", "male"))
FacB
## [1] female male male female
## Levels: female transgender male
levels(FacB)
Commands for accessing elements of an R factor:
FacB <- factor(c("female", "male", "male", "female"))
FacB[3]
## [1] male
## Levels: female male
Factors in data frames:
age <- c(40, 49, 48, 40, 67, 52, 53)
salary <- c(103200, 106200, 150200, 10606, 10390, 14070, 10220)
gender <- factor(c("male", "male", "transgender", "female", "male", "female", "transgender"))
employee<- data.frame(age, salary, gender)
print(employee)
print(is.factor(employee$gender))
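Internally, a factor stores integer codes that point into its levels. The sketch below shows what that implies for counting, for values outside the declared levels, and for unused levels; the variable names are illustrative.

```R
gender <- factor(c("female", "male", "male", "female"),
                 levels = c("female", "transgender", "male"))

print(as.integer(gender))   # 1 3 3 1: the position of each value in levels(gender)
print(table(gender))        # unused levels are counted as 0
print(nlevels(gender))      # 3

# A value outside the declared levels becomes NA
out_of_set <- factor(c("female", "other"), levels = c("female", "male"))
print(is.na(out_of_set))    # FALSE TRUE

# droplevels() removes levels that never occur
trimmed <- droplevels(gender)
print(levels(trimmed))      # "female" "male"
```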
What is a loop? The terms looping, cycling, and iterating all refer to repeating a set of steps. There are 3 ways to run such automated multi-step processes in R:
- For loop
- While loop
- Repeat loop
For loop in R:
for (var in vector)
{
statement(s)
}
# For example
for(i in 1:5){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
# example 2
for (i in 1: 4)
{
print(i ^ 2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
# example 3
for (i in c("John", "Jeff", "Joseph", "Mary")){
print(i)
}
# example 4
for (i in c("John", "Jeff", "Joseph", "Mary")){
print(paste(i, "has a little lamb")) # paste() already inserts a space between its arguments
}
## [1] "John has a little lamb"
## [1] "Jeff has a little lamb"
## [1] "Joseph has a little lamb"
## [1] "Mary has a little lamb"
# example 5
VecX <- c(-8, 9, 11, 45)
for (i in VecX)
{
print(i)
}
## [1] -8
## [1] 9
## [1] 11
## [1] 45
For loops can be nested:
for (i in 1:3)
{
for (j in 1:i)
{
print(i * j)
}
}
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 6
## [1] 9
Break statements in R for loops:
for (i in c(3, 6, 23, 19, 0, 21))
{
if (i == 0)
{
break
}
print(i)
}
print("Outside Loop")
## [1] 3
## [1] 6
## [1] 23
## [1] 19
## [1] "Outside Loop"
Next statements in R for loops:
for (i in c(3, 6, 23, 19, 0, 21))
{
if (i == 0)
{
next
}
print(i)
}
print('Outside Loop')
## [1] 3
## [1] 6
## [1] 23
## [1] 19
## [1] 21
## [1] "Outside Loop"
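The while and repeat loops listed at the start of this section follow the same ideas as the for loop. As a minimal sketch with made-up counters: a while loop checks its condition before every pass, and a repeat loop runs until a break is reached.

```R
# while: keep adding 1, 2, 3, ... until the running total reaches 10
count <- 0
total <- 0
while (total < 10) {
  count <- count + 1
  total <- total + count
}
print(c(count, total))   # 4 10

# repeat: no condition in the header, so break is the only way out
n <- 1
repeat {
  n <- n * 2
  if (n > 50) {
    break
  }
}
print(n)                 # 64
```

A repeat loop without a reachable break never ends, so the exit condition deserves a careful check.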
Creating multiple plots within a for loop
```R
# create a matrix of data
mat <- matrix(rnorm(100), ncol = 5)
# set up the plot layout
par(mfrow = c(2,3))
# loop over columns of the matrix
for (i in 1:5) { # create a histogram for each column
  hist(mat[, i], main = paste("Column", i), xlab = "Values", col = "orange")
}
```
Typical structure of an R function:
function_name <- function(parameters){
function body
}
Example 1
hello_world <- function(){
'Hello, World!'
}
print(hello_world())
Example 2
circumference <- function(r=1){
2*pi*r
}
print(circumference())
print(circumference(2))
Example 3
sum_two_nums <- function(x, y){
x + y
}
print(sum_two_nums(1, 2))
Example 4
mean_median <- function(values){
avg <- mean(values) # Local names that do not shadow the mean() and median() functions
med <- median(values)
return(c(avg, med))
}
print(mean_median(c(1, 1, 1, 2, 3)))
Example 5
subtract_two_nums <- function(x, y){
x - y
}
print(subtract_two_nums(x=3, y=1))
print(subtract_two_nums(y=1, x=3))
Example 6
calculate_calories_women <- function(weight, height, age){
(10 * weight) + (6.25 * height) - (5 * age) - 161
}
print(calculate_calories_women(age=30, 60, 165))
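Example 6 mixes a named argument with positional ones. The sketch below repeats the same function to show how R matches arguments: named arguments are matched first, and the remaining values fill the unmatched parameters in order.

```R
calculate_calories_women <- function(weight, height, age){
  (10 * weight) + (6.25 * height) - (5 * age) - 161
}

all_positional <- calculate_calories_women(60, 165, 30)
all_named      <- calculate_calories_women(age = 30, height = 165, weight = 60)
mixed          <- calculate_calories_women(age = 30, 60, 165)  # 60 fills weight, 165 fills height

print(c(all_positional, all_named, mixed))   # all three are 1320.25

# Positional values in the wrong order are not detected by R
swapped <- calculate_calories_women(165, 60, 30)
print(swapped)                               # 1714
```

Naming the arguments makes calls with several numeric parameters easier to read and harder to get wrong.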
An S3 object is a base type with at least a class attribute (other attributes may be used to store other data). In practice, a list whose class attribute is set to some class name is an S3 object.
- S3 is relatively simple but informal: it has no formal definitions for classes and methods, whereas S4 defines both rigorously.
- S4 classes are defined with the setClass() function, which specifies the 'slots' of the class and the data type of each slot. S3, on the other hand, is essentially a list with a class attribute and no formal rules.
- Because S4 formally defines classes and slots, every object of a class is guaranteed to have the same structure, which gives a clear and consistent framework. This makes S4 better suited than S3 to object-oriented programming in more complex and structured applications, where the informal structure of S3 may lead to inconsistency.
Commands for creating S3 and S4 objects:
S3A <- list(name = "apple", color = "red", flavor = "sweet and sour")
class(S3A) <- "Fruit"
S3A
## $name
## [1] "apple"
##
## $color
## [1] "red"
##
## $flavor
## [1] "sweet and sour"
##
## attr(,"class")
## [1] "Fruit"
setClass("fruit", slots = list(name = "character", color = "character", flavor = "character"))
S4A <- new("fruit", name = "apple", color = "red", flavor = "sweet and sour")
S4A
## An object of class "fruit"
## Slot "name":
## [1] "apple"
##
## Slot "color":
## [1] "red"
##
## Slot "flavor":
## [1] "sweet and sour"
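The difference in strictness can be seen by giving both systems a value of the wrong type. This is a sketch only: the generic describe and the class name FruitS4 are hypothetical names chosen for illustration.

```R
library(methods)  # usually attached already; loaded here so the example is self-contained

# S3: the class attribute is just a label, so any content is accepted
fruit_s3 <- structure(list(name = "apple", color = "red"), class = "Fruit")
odd_s3   <- structure(list(name = 42), class = "Fruit")   # no error, nothing is checked

# S3 methods are ordinary functions named generic.class
describe <- function(x, ...) UseMethod("describe")
describe.Fruit <- function(x, ...) paste(x$name, "is", x$color)
print(describe(fruit_s3))     # "apple is red"

# S4: slots are declared with types and checked when the object is created
setClass("FruitS4", slots = c(name = "character", color = "character"))
fruit_s4 <- new("FruitS4", name = "apple", color = "red")
print(fruit_s4@name)          # slots are accessed with @

# A number in a character slot is refused; the error is caught here
s4_check <- tryCatch(
  new("FruitS4", name = 42, color = "red"),
  error = function(e) "rejected"
)
print(s4_check)               # "rejected"
```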
Tidyverse is a collection of R packages designed for data science. It includes tools for data manipulation, visualization, and modeling, all sharing an underlying philosophy and common APIs. The core packages of Tidyverse are:
- ggplot2
- dplyr
- tidyr
- readr
- purrr
- tibble
- stringr
- forcats
For details, please check https://www.tidyverse.org/
Tidyverse (specifically dplyr) provides functions for manipulating data. Each of these functions takes a tibble as its first argument and returns a tibble as output. Selecting columns and filtering rows by logical conditions are central to most bioinformatics data analysis.
TibA <- tibble(
name = c("Timothy", "Ricky", "Bob", "Shawn", "Eric", "Tat", "Max"),
age = c(35, 30, 34, 42, 45, 43, 12),
city = c("Pittsburgh", "ShangHai", "Sacramento", "Dallas", "Irvine", "Austin", "Houston"),
weight = c(160, 168, 172, 195, 180, 175, 140),
height = c(183, 180, 181, 168, 178, 175, 132)
)
select(TibA, name, city, height)
## # A tibble: 7 × 3
## name city height
## <chr> <chr> <dbl>
## 1 Timothy Pittsburgh 183
## 2 Ricky ShangHai 180
## 3 Bob Sacramento 181
## 4 Shawn Dallas 168
## 5 Eric Irvine 178
## 6 Tat Austin 175
## 7 Max Houston 132
select(TibA, 2,4)
## # A tibble: 7 × 2
## age weight
## <dbl> <dbl>
## 1 35 160
## 2 30 168
## 3 34 172
## 4 42 195
## 5 45 180
## 6 43 175
## 7 12 140
select(TibA, -city)
## # A tibble: 7 × 4
## name age weight height
## <chr> <dbl> <dbl> <dbl>
## 1 Timothy 35 160 183
## 2 Ricky 30 168 180
## 3 Bob 34 172 181
## 4 Shawn 42 195 168
## 5 Eric 45 180 178
## 6 Tat 43 175 175
## 7 Max 12 140 132
filter(TibA, height>=170)
## # A tibble: 5 × 5
## name age city weight height
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 Timothy 35 Pittsburgh 160 183
## 2 Ricky 30 ShangHai 168 180
## 3 Bob 34 Sacramento 172 181
## 4 Eric 45 Irvine 180 178
## 5 Tat 43 Austin 175 175
filter(TibA, name == "Shawn")
## # A tibble: 1 × 5
## name age city weight height
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 Shawn 42 Dallas 195 168
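Because every dplyr verb takes a tibble first and returns a tibble, the verbs can be chained with the pipe operator %>%. The following is a sketch, assuming the dplyr and tibble packages are installed; it uses a shortened copy of the data so that it runs on its own.

```R
library(dplyr)
library(tibble)

people <- tibble(
  name   = c("Timothy", "Ricky", "Bob", "Shawn"),
  age    = c(35, 30, 34, 42),
  height = c(183, 180, 181, 168)
)

# Keep rows that meet both conditions, then keep two columns
tall_over_30 <- people %>%
  filter(height >= 170, age > 30) %>%
  select(name, age)
print(tall_over_30)

# The same steps written as nested calls give the same result
nested <- select(filter(people, height >= 170, age > 30), name, age)
print(all.equal(tall_over_30, nested))
```

Conditions separated by commas inside filter() must all be true for a row to be kept. The piped form reads top to bottom in the order the steps are applied.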
https://www.r-project.org/about.html
https://www.geeksforgeeks.org/r-tutorial/?ref=lbp
https://www.codecademy.com/resources/docs/r/variables
https://www.starwars.com/databank/
https://www.educative.io/answers/what-is-tibble-versus-data-frame-in-r
https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf