Supplementary: R Basics Tutorial

masashi0924 edited this page Jul 2, 2024 · 31 revisions

Introduction to R

  • R is one of the most widely used programming languages for statistical computing, data analysis, and visualization
  • R is free and open-source
  • Runs on all major platforms
  • Can call Python modules and scripts (for example, through the reticulate package)
  • Pros and Cons for R:
    • Pros:
      • Ideal for data visualization
      • Large developer community
    • Cons:
      • Can be slower to produce output than some other languages
      • Memory-intensive
      • Not ideal for big data

Options for running R:

  • R console
  • RStudio IDE

Some basic commands in R:

?ls                                             #Get help on a specific function
help.search('regression')                       #Search R help files for a word or phrase
help(package = 'tidyverse')                     #Find help for a package
getwd()                                         #Get the current working directory
setwd('/path/to/desired/directory')             #Set a new working directory as desired
install.packages('tidyverse')                   #Download and install packages from CRAN
library(tidyverse)                              #Load packages
ggplot2::geom_density                           #Call a specific function from a specific package
data(iris)                                      #Load a pre-built dataset

Basic Operators in R

Rules for making variables in R:

  • It must not start with a number or underscore(_).
  • The name can be a combination of letters, numbers, period(.) and underscore(_). (No hyphens(-))
  • Variable names are case-sensitive.
  • If the first character of the name is a period, the second character cannot be a number.
  • Reserved words (such as TRUE / FALSE / NULL / if / for) cannot be used as variable names. Names of existing functions (such as print) are allowed but are best avoided, because they make code confusing to read.

NumA = 24                # Make a numeric variable NumA, and give it a value 24
NumB = 37                # Make a numeric variable NumB, and give it a value 37
NumA <- 24               # Assign value with the arrow operator (the usual R style)
24   -> NumA             # Rightward assignment
NumA + NumB              # Sum two variables
NumA - NumB              # Subtract NumB from NumA
NumA * NumB              # Multiply two variables
NumA / NumB              # Divide NumA by NumB
NumA ^ 2                 # Raise NumA to the power of 2
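
The naming rules above can also be checked programmatically. The sketch below is one possible approach, assuming base R's make.names(), which rewrites a string into a syntactically valid name; a candidate is treated as valid when make.names() leaves it unchanged. The helper is_valid_name and the sample names are made up for illustration.

```r
# Hypothetical helper: a name is syntactically valid if make.names() leaves it unchanged
is_valid_name <- function(x) {
    make.names(x) == x
}

# Sample names chosen to exercise each rule (illustrative only)
candidates <- c("NumA", "my_var.1", "2fast", "_hidden", "my-var", ".2way", "TRUE")
result <- is_valid_name(candidates)
names(result) <- candidates
print(result)
```

Only the first two candidates pass: the others start with a number or underscore, contain a hyphen, start with a period followed by a number, or are a reserved word.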

Data Structures in R

The most essential data structures used in R include: 

  • Arrays
    • Vectors (one-dimensional array)
    • Matrices (two-dimensional array)
  • Lists
  • Data frames
  • Tibbles
  • Factors

Vectors

A vector is a one-dimensional array whose elements all have the same data type.

Creating Vectors: (Try print() to see the results)

VecA = c(10,20,30,40,50)                               # Make a numeric vector VecA
VecB = c("R2-D2","C-3PO","BB-8","IG-11","K-2SO")       # Make a character vector VecB
VecC = 1:5                                             # Make a vector of the integers from 1 to 5
VecD = 1.1:5.5                                         # Starts at 1.1 and increases in steps of 1 (1.1, 2.1, ..., 5.1)
VecE = seq(2,8, by=2)                                  # Make a vector between 2 and 8, with interval 2
VecF = rep(2,3)                                        # Make a vector with 2 repeated 3 times
VecG = rep(c(2,3),3)                                   # Make a vector with the combination (2, 3) repeated 3 times
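
Because a vector holds a single data type, mixing types inside c() converts every element to the most general type present. The short sketch below illustrates this; the variable names are made up for the example.

```r
# Mixing types in c() coerces every element to the most general type
mixed_num_chr <- c(1, "two", 3)        # numbers become character strings
mixed_lgl_num <- c(TRUE, FALSE, 10)    # logicals become numbers (TRUE -> 1, FALSE -> 0)

print(class(mixed_num_chr))   # "character"
print(mixed_num_chr)
print(class(mixed_lgl_num))   # "numeric"
print(mixed_lgl_num)
```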

Commands to modify vectors:

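VecB = c("R2-D2","C-3PO","BB-8","IG-11","K-2SO")
VecB[1]                                        # Access the 1st element
VecB[-4]                                       # Show all elements except the 4th
VecB[1:3]                                      # Access elements 1 to 3
sort(VecB)                                     # Return a sorted copy (VecB itself is unchanged)
rev(VecB)                                      # Return a reversed copy
VecB<-sort(VecB)                               # Overwrite VecB with the sorted copy
VecA = c(10,20,30,40,50)
table(VecA)                                    # Count how often each value occurs
VecG = rep(c(2,3),3)
table(VecG)
unique(VecG)                                   # Show the distinct values
length(VecG)                                   # Show the number of elements in vector VecG
append(VecG, 8)                                # Return a copy with 8 added (VecG itself is unchanged)
VecG = append(VecG, 8)                         # Save the result to actually change VecG
print(VecG)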

Matrices

Matrices are 2-dimensional arrays of a fixed data type. They can be all numbers or all character strings, but not a mix of both.

Commands for creating matrices:

matrix(1:9)                                    # Make a matrix with a single column
matrix(1:9,nrow=3,ncol=3)                      # How does it differ from the last command?
matrix(1:9, 3, 3)                              # This will deliver the same outcome
matrix(1:9,nrow=3,ncol=3,byrow=TRUE)           # How does it differ from the last command?
diag(c(6, 1, 6), 3, 3)                         # Make a diagonal matrix with 3 rows and 3 columns, with (6, 1, 6) on the diagonal
diag(1, 3, 3)                                  # Make a diagonal matrix with 3 rows and 3 columns, with 1 on the diagonal
MatA <- matrix(1:9,nrow=3,ncol=3,byrow=TRUE)   # Make a matrix MatA with 3 rows and 3 columns
dim(MatA)                                      # Show the dimensions of matrix MatA
length(MatA)                                   # Show the number of elements in matrix MatA
rownames(MatA) = c("a", "b", "c")              # Assign row names to matrix MatA
colnames(MatA) = c("d", "e", "f")              # Assign column names to matrix MatA
print(MatA)
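
The effect of byrow is easier to see on a non-square matrix. The sketch below is a small illustration with made-up variable names: by default matrix() fills down each column, while byrow=TRUE fills across each row.

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: fill down each column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fill across each row

print(by_col)
print(by_row)

# Filling by row gives the same result as filling a 3 x 2 matrix by column and transposing it
print(identical(by_row, t(matrix(1:6, nrow = 3, ncol = 2))))
```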

Commands for accessing elements in a matrix:

MatA[2,]                                       # Access the 2nd row of matrix MatA
MatA[,2]                                       # Access the 2nd column of matrix MatA
MatA[2,2]                                      # Access the element of matrix MatA on 2nd row, 2nd column
MatA["b",3]                                    # Search for row named "b", and print the element in the 3rd column
MatA[1:2,1:3]                                  # Show a subset of matrix MatA extracted from the first 2 rows, first 3 columns
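
One detail worth knowing when extracting a single row or column: R simplifies the result to a plain vector unless drop = FALSE is given. A minimal sketch, with row_vec and row_mat as illustrative names:

```r
MatA <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

row_vec <- MatA[2, ]                 # a single row is simplified to a plain vector
row_mat <- MatA[2, , drop = FALSE]   # drop = FALSE keeps it as a 1 x 3 matrix

print(dim(row_vec))   # NULL, because a vector has no dimensions
print(dim(row_mat))
```

Keeping the matrix shape can matter when the result is passed on to functions such as rbind() or cbind() that depend on dimensions.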

Commands for modifying a matrix:

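MatA <- matrix(1:9, 3, 3, byrow=TRUE)
MatB <- matrix(10:12, 1, 3, byrow=TRUE)        # A matrix with 1 row and 3 columns
MatC <- rbind(MatA, MatB)                      # Add MatB below MatA as a new row
print(MatC)
MatD <- matrix(10:12, 3, 1, byrow=TRUE)        # A matrix with 3 rows and 1 column
MatE <- cbind(MatA, MatD)                      # Add MatD to the right of MatA as a new column
print(MatE)
summary(MatE)                                  # Summary statistics for each column
t(MatA)                                        # Transpose MatA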

Lists

An R list is a one-dimensional, heterogeneous data structure: each component can have a different type and length.

It can be a list of vectors, a list of matrices, a list of characters, a list of functions, etc.

Commands for creating an R list (and comparison to a vector made with c())

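Count=1:6
Friends = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
ListA = c(Count, Friends)                         # c() makes a vector, so the numbers are converted to strings
print(ListA)
ListB = list(Count, Friends)                      # list() keeps each component and its type intact
print(ListB)
ListC = list("number" = Count, "name" = Friends)  # Give each component a name
str(ListC)                                        # Show the structure of the list
print(ListC)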

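Commands for accessing components in an R list:

print(ListC$name)               # Access a component by name
print(ListC$number)
print(ListC[[1]])               # Access a component by position
print(ListC[[2]])
print(ListC[[1]][c(1,6)])       # Access the 1st and 6th elements of the 1st component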
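
Single and double brackets behave differently on lists, which is a common source of confusion. The sketch below rebuilds ListC so that it runs on its own; sub_list and element are illustrative names.

```r
ListC <- list(number = 1:6, name = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross"))

sub_list <- ListC["name"]     # single brackets: a list of length 1
element  <- ListC[["name"]]   # double brackets: the character vector itself

print(class(sub_list))
print(class(element))
print(identical(element, ListC$name))
```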

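Commands for modifying components in an R list:

print(ListC$number)
ListC[[1]][7]=7                            # Add a 7th element to the 1st component
print(ListC)
append(ListC[[1]], 8)                      # Returns a new vector; ListC itself is unchanged
ListC[[1]] <- append(ListC[[1]], 8)        # Assign the result back to change ListC
print(ListC)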

Commands for concatenating / merging R lists

VecA = c(1,2,3,4,5)
VecB = c("R2-D2","C-3PO","BB-8","IG-11","K-2SO")
SW <- list(index = VecA, Android = VecB)
print(SW)
Count = 1:6
Names = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
Friends = list("index" = Count, "Names" = Names)
print(Friends)
ListD = c(SW, Friends)                     # Concatenate: one flat list with 4 components
str(ListD)
ListE = list(SW, Friends)                  # Nest: a list that contains the 2 original lists
str(ListE)
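
The difference between c() and list() when combining lists can be confirmed by checking lengths and names. This is a shortened sketch of the example above, with flat and nested as illustrative names.

```r
SW <- list(index = 1:2, Android = c("R2-D2", "C-3PO"))
Friends <- list(index = 1:2, Names = c("Rachel", "Monica"))

flat   <- c(SW, Friends)      # one flat list with four components
nested <- list(SW, Friends)   # a list of two lists

print(length(flat))
print(length(nested))
print(names(flat))
print(nested[[2]]$Names)      # reach into the second inner list
```

Note that the flat list ends up with two components called index, so access by position is safer than access by name in that case.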

Data frames

A data frame is a collection of vectors of the same length, one per column. It looks similar to a matrix, but each column can have a different data type.

Comparison of matrices and data frames:

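MatA <- matrix(1:9,nrow=3,ncol=3,byrow=TRUE)
colnames(MatA) <- c("d", "e", "f")             # Name the columns so they can be referenced
# MatA[which(MatA$e %in% 5),]                  # Error: the $ operator does not work on a matrix
MatA[which(MatA[,"e"] %in% 5),]                # In a matrix, select a column by name inside [ , ]
MatA[which(as.data.frame(MatA)$e %in% 5),]     # After conversion to a data frame, $ works
dfA <- as.data.frame(MatA)
print(dfA)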

Commands for creating R data frames:

dfA <- data.frame(
	id = c(1:6), 
	name = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross")
)
summary(dfA)
print(dfA$name)
dfA$LastName<-c("Green", "Geller", "Buffay", "Tribbiani", "Bing", "Geller")
print(dfA)
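
To see that each column keeps its own type, the class of every column can be inspected with sapply(). The sketch below uses a shortened, hypothetical version of the data frame above and contrasts it with as.matrix(), which has to settle on one type for all cells.

```r
df <- data.frame(
    id = 1:3,
    name = c("Rachel", "Monica", "Phoebe"),
    stringsAsFactors = FALSE   # explicit, so the result is the same on R versions before 4.0.0
)
df$LastName <- c("Green", "Geller", "Buffay")

col_types <- sapply(df, class)
print(col_types)

# as.matrix() must pick one type for everything, so the ids become character strings
print(class(as.matrix(df)[, "id"]))
```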

Commands for accessing items in data frames:

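print(dfA)
dfA[1]                  # The 1st column, returned as a data frame
dfA[[1]]                # The 1st column, returned as a vector
dfA[['name']]           # The column called name, returned as a vector
dfA$name                # Same as above
dim(dfA)                # Number of rows and columns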

Commands for adding / removing data in a data frame

# Adding a new row (a one-row data frame keeps the type of each column)
NewRow <- data.frame(id = 7, name = "Jack", LastName = "Geller")
dfA <- rbind(dfA, NewRow)
dfA
# Adding a new column
NewColumn <- c(234, 234, 234, 234, 234, 234, 22)
dfA <- cbind(dfA, NewColumn)
dfA
colnames(dfA)[ncol(dfA)] <- "Episodes"     # Rename the last column
dfA
# Making subsets from data frames
subset(dfA, id != 7)                       # Keep rows where id is not 7
dfA[,-ncol(dfA)]                           # Drop the last column
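
What gets passed to rbind() affects the column types of the result. The sketch below, using a small hypothetical data frame, compares adding a row as a character vector with adding it as a one-row data frame.

```r
df <- data.frame(id = 1:2, name = c("Rachel", "Monica"), stringsAsFactors = FALSE)

# A character vector row forces every value, including the id, to character
as_vector <- rbind(df, c(3, "Phoebe"))
print(class(as_vector$id))

# A one-row data frame with matching column names keeps each column's type
as_frame <- rbind(df, data.frame(id = 3L, name = "Phoebe", stringsAsFactors = FALSE))
print(class(as_frame$id))
```

The vector c(3, "Phoebe") is itself coerced to character before rbind() sees it, which is why the id column changes type in the first case.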

Tibbles

A tibble is an enhanced version of a data frame in R

  • A tibble never alters the input type.
  • With a tibble, there is no need to worry about character vectors being converted to factors automatically (base data frames did this by default before R 4.0.0).
  • Tibbles can also contain columns that are lists.
  • We can use non-standard variable names in a tibble.
  • Column names in a tibble can start with a number and can also contain spaces.
  • To use such names, we must wrap them in backticks.
  • A tibble only recycles vectors with a length of 1.
  • A tibble has no row names.
  • When printing a tibble, only the first 10 rows are shown by default.

Commands for tibbles:

# To use tibbles, we need to load the tibble package
library(tibble)
mtcars
as_tibble(mtcars)                # Note that the row names (car models) are dropped
class(mtcars)

dfA <- data.frame(
	id = c(1:6), 
	name = c("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross"),
	LastName = c("Green", "Geller", "Buffay", "Tribbiani", "Bing", "Geller")
)
row.names(dfA)<-c("one","two","three","four","five","six")
dfA
as_tibble(dfA)                   # as.tibble() is deprecated; use as_tibble()
tblA <- as_tibble(dfA)
tblA

Factors

Factors are a data structure for categorical data: each value must be one of a fixed set of levels.

For more information, please check https://www.geeksforgeeks.org/r-factors/?ref=lbp

Commands for creating R factors

VecA <- c("female", "male", "male", "female")
VecA
## [1] "female" "male"   "male"   "female"
FacA <- factor(VecA)
FacA
## [1] female male   male   female
## Levels: female male

FacB <- factor(c("female", "male", "male", "female"), levels = c("female", "transgender", "male"))
FacB
## [1] female male   male   female
## Levels: female transgender male
levels(FacB)
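
Internally a factor stores each value as an integer code that points into its levels. The sketch below shows this for the FacB example, including how a level with no observations still appears in counts.

```r
FacB <- factor(c("female", "male", "male", "female"),
               levels = c("female", "transgender", "male"))

print(as.integer(FacB))   # position of each value in levels(FacB)
print(nlevels(FacB))      # number of levels, used or not
print(table(FacB))        # the unused level is still counted, with a count of 0
```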

Commands for accessing elements of an R factor

FacB <- factor(c("female", "male", "male", "female"))
FacB[3]
## [1] male
## Levels: female male

Factors in data frames:

age <- c(40, 49, 48, 40, 67, 52, 53) 
salary <- c(103200, 106200, 150200, 10606, 10390, 14070, 10220) 
gender <- factor(c("male", "male", "transgender", "female", "male", "female", "transgender")) 
employee<- data.frame(age, salary, gender) 
print(employee) 
print(is.factor(employee$gender))
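
One reason to store a column as a factor is that it can then be used to group the other columns. A minimal sketch using the same employee data and base R's tapply(); mean_salary is an illustrative name.

```r
age <- c(40, 49, 48, 40, 67, 52, 53)
salary <- c(103200, 106200, 150200, 10606, 10390, 14070, 10220)
gender <- factor(c("male", "male", "transgender", "female", "male", "female", "transgender"))
employee <- data.frame(age, salary, gender)

# Mean salary within each level of the gender factor
mean_salary <- tapply(employee$salary, employee$gender, mean)
print(mean_salary)

# Number of employees in each group
print(table(employee$gender))
```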

How to use loops in R

What is a loop? The terms looping, cycling, and iterating all refer to repeating a set of steps. There are 3 ways to execute automated multi-step processes in R:

  1. For loop
  2. While loop
  3. Repeat loop

For loop in R:

# General form (var takes each value of vector in turn):
# for (var in vector) {
#     statement(s)
# }
# For example
for(i in 1:5){
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

# example 2 
for (i in 1: 4)
{
    print(i ^ 2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16

# example 3
for (i in c("John", "Jeff", "Joseph", "Mary")){
    print(i)
}

# example 4
for (i in c("John", "Jeff", "Joseph", "Mary")){
    print(paste(i," has a little lamb"))
}
## [1] "John  has a little lamb"
## [1] "Jeff  has a little lamb"
## [1] "Joseph  has a little lamb"
## [1] "Mary  has a little lamb"
# example 5
VecX <- c(-8, 9, 11, 45)
for (i in VecX)
{
    print(i)
}
## [1] -8
## [1] 9
## [1] 11
## [1] 45

For loops can be nested:

for (i in 1:3)
{
    for (j in 1:i)
    {
        print(i * j)
    }
}
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 6
## [1] 9

Break statements in R for loops:

for (i in c(3, 6, 23, 19, 0, 21))
{
    if (i == 0)
    {
        break
    }
   print(i)
}
print("Outside Loop")
## [1] 3
## [1] 6
## [1] 23
## [1] 19
## [1] "Outside Loop"

Next statements in R for loops:

for (i in c(3, 6, 23, 19, 0, 21))
{
    if (i == 0)
    {
        next
    }
    print(i)
}
print('Outside Loop')
## [1] 3
## [1] 6
## [1] 23
## [1] 19
## [1] 21
## [1] "Outside Loop"
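
The while and repeat loops listed at the start of this section work along the same lines as the for loop. The sketch below is a minimal illustration with made-up counters: while checks its condition before every pass, and repeat keeps going until a break is reached.

```r
# while loop: the condition is checked before every pass
i <- 1
while (i <= 3) {
    print(i)
    i <- i + 1
}

# repeat loop: runs until break is reached, so the exit test lives inside the body
total <- 0
n <- 0
repeat {
    n <- n + 1
    total <- total + n
    if (total >= 10) {
        break
    }
}
print(n)       # how many numbers were added
print(total)   # the running sum when the loop stopped
```

A repeat loop without a reachable break never ends, so the exit condition deserves extra care.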

Creating multiple plots within a for loop

# create a matrix of data
mat <- matrix(rnorm(100), ncol = 5)
# set up the plot layout (2 rows x 3 columns of panels)
par(mfrow = c(2,3))

# loop over columns of the matrix
for (i in 1:5) {
    # create a histogram for each column ("lightorange" is not a valid R color name)
    hist(mat[, i], main = paste("Column", i), xlab = "Values", col = "orange")
}

Functions in R

Typical structure for an R function

function_name <- function(parameters){
  # function body
}

Example 1

hello_world <- function(){
    'Hello, World!'
}
print(hello_world())

Example 2

circumference <- function(r=1){
    2*pi*r
}
print(circumference())
print(circumference(2))

Example 3

sum_two_nums <- function(x, y){
    x + y
}
print(sum_two_nums(1, 2))

Example 4

mean_median <- function(x){
    avg <- mean(x)             # Use names that do not mask the mean() and median() functions
    med <- median(x)
    return(c(avg, med))        # Return both values as one vector
}
print(mean_median(c(1, 1, 1, 2, 3)))

Example 5

subtract_two_nums <- function(x, y){
    x - y
}
print(subtract_two_nums(x=3, y=1))
print(subtract_two_nums(y=1, x=3))

Example 6

calculate_calories_women <- function(weight, height, age){
    (10 * weight) + (6.25 * height) - (5 * age) - 161
}
# age is matched by name; 60 and 165 then fill weight and height in order
print(calculate_calories_women(age=30, 60, 165))
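
Examples 5 and 6 rely on how R matches arguments: named arguments are matched first, and the remaining values fill the unmatched parameters from left to right. The sketch below makes this visible with a hypothetical function, describe, that simply echoes what each parameter received.

```r
# Illustrative function (not from a package) that reports the value of each parameter
describe <- function(weight, height, age = 30) {
    paste("weight:", weight, "height:", height, "age:", age)
}

by_position <- describe(60, 165)                              # age falls back to its default
mixed       <- describe(age = 25, 60, 165)                    # age by name, the rest by position
by_name     <- describe(height = 165, age = 25, weight = 60)  # order does not matter

print(by_position)
print(mixed)
print(by_name)
```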

S3 and S4 objects

What is an S3 object?

An S3 object is a base type with at least a class attribute (other attributes may be used to store other data). For example, a list whose class attribute is set to some class name is an S3 object.

What's the difference between S3 and S4 objects?

  • S3 is relatively simple but also informal. It has no formal definitions for classes and methods, while S4 has rigorous definitions for them.
  • S4 classes are defined using the setClass() function, which specifies the 'slots' and the data type of each slot. An S3 object, on the other hand, is basically a list with a class attribute and no formal rules.

What's the benefit of using S4?

  • S4 allows for formal definition of classes and slots, which ensures that every object has a specific structure and thus provides a clear and consistent framework. This makes S4 better suited than S3 to object-oriented programming in R, especially for more complex and structured applications. In contrast, the informal structure of S3 may lead to inconsistency.

Commands for creating S3 and S4 objects

S3A <- list(name = "apple", color = "red", flavor = "sweet and sour")
class(S3A) <- "Fruit"
S3A
## $name
## [1] "apple"
## 
## $color
## [1] "red"
## 
## $flavor
## [1] "sweet and sour"
## 
## attr(,"class")
## [1] "Fruit"

setClass("fruit", slots = list(name = "character", color = "character", flavor = "character"))
S4A <- new("fruit", name = "apple", color = "red", flavor = "sweet and sour")
S4A
## An object of class "fruit"
## Slot "name":
## [1] "apple"
## 
## Slot "color":
## [1] "red"
## 
## Slot "flavor":
## [1] "sweet and sour"
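
The practical difference between the two systems shows up when an object is used or built incorrectly. The sketch below is one way to illustrate it, under the assumption that a format method is a reasonable stand-in for S3 dispatch; format.Fruit, not_a_fruit and bad are names made up for this example.

```r
library(methods)   # attached by default in interactive R; explicit here for scripts

# S3: a method is an ordinary function named generic.class
S3A <- list(name = "apple", color = "red")
class(S3A) <- "Fruit"
format.Fruit <- function(x, ...) {
    paste("A", x$color, x$name)
}
print(format(S3A))

# S3 does no checking: any object can be given the class, whatever it contains
not_a_fruit <- structure(list(wheels = 4), class = "Fruit")

# S4: slots are accessed with @ and their types are checked by new()
setClass("fruit", slots = list(name = "character", color = "character"))
S4A <- new("fruit", name = "apple", color = "red")
print(S4A@name)

# Passing a number where a character slot is expected raises an error
bad <- tryCatch(
    new("fruit", name = 42, color = "red"),
    error = function(e) "rejected"
)
print(bad)
```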

Practice with Tidyverse: Data selection and filtering in tibbles

[Figure: the core packages of Tidyverse]

Tidyverse is a collection of R packages designed for data science. It includes tools for data manipulation, visualization, and modeling, all sharing an underlying philosophy and common APIs. The core packages of Tidyverse, as shown in the graph, are:

  • ggplot2
  • dplyr
  • tidyr
  • readr
  • purrr
  • tibble
  • stringr
  • forcats

For details, please check https://www.tidyverse.org/

Filtering and subsetting your data

Tidyverse (specifically dplyr) comes with functions to manipulate your data. These functions take a tibble as their first argument and return a tibble as output. Selecting columns and subsetting rows by logical conditions are common steps in most bioinformatic data analyses.

library(tidyverse)               # Loads dplyr, tibble and the other core packages
TibA <- tibble(
  name = c("Timothy", "Ricky", "Bob", "Shawn", "Eric", "Tat", "Max"),
  age = c(35, 30, 34, 42, 45, 43, 12),
  city = c("Pittsburgh", "ShangHai", "Sacramento", "Dallas", "Irvine", "Austin", "Houston"),
  weight = c(160, 168, 172, 195, 180, 175, 140),
  height = c(183, 180, 181, 168, 178, 175, 132)
)
select(TibA, name, city, height)
## # A tibble: 7 × 3
##   name    city       height
##   <chr>   <chr>       <dbl>
## 1 Timothy Pittsburgh    183
## 2 Ricky   ShangHai      180
## 3 Bob     Sacramento    181
## 4 Shawn   Dallas        168
## 5 Eric    Irvine        178
## 6 Tat     Austin        175
## 7 Max     Houston       132

select(TibA, 2,4)
## # A tibble: 7 × 2
##     age weight
##   <dbl>  <dbl>
## 1    35    160
## 2    30    168
## 3    34    172
## 4    42    195
## 5    45    180
## 6    43    175
## 7    12    140

select(TibA, -city)

## # A tibble: 7 × 4
##   name      age weight height
##   <chr>   <dbl>  <dbl>  <dbl>
## 1 Timothy    35    160    183
## 2 Ricky      30    168    180
## 3 Bob        34    172    181
## 4 Shawn      42    195    168
## 5 Eric       45    180    178
## 6 Tat        43    175    175
## 7 Max        12    140    132

filter(TibA, height>=170)
## # A tibble: 5 × 5
##   name      age city       weight height
##   <chr>   <dbl> <chr>       <dbl>  <dbl>
## 1 Timothy    35 Pittsburgh    160    183
## 2 Ricky      30 ShangHai      168    180
## 3 Bob        34 Sacramento    172    181
## 4 Eric       45 Irvine        180    178
## 5 Tat        43 Austin        175    175

filter(TibA, name == "Shawn")
## # A tibble: 1 × 5
##   name    age city   weight height
##   <chr> <dbl> <chr>   <dbl>  <dbl>
## 1 Shawn    42 Dallas    195    168
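
For comparison, roughly the same selections can be sketched in base R without loading dplyr. The example below assumes a plain data frame, people, that holds four of the rows from TibA (without the weight column); it is meant as a side-by-side illustration rather than a replacement for the dplyr functions.

```r
people <- data.frame(
    name = c("Timothy", "Ricky", "Shawn", "Max"),
    age = c(35, 30, 42, 12),
    city = c("Pittsburgh", "ShangHai", "Dallas", "Houston"),
    height = c(183, 180, 168, 132),
    stringsAsFactors = FALSE
)

# Like select(people, name, city): pick columns by name
print(people[, c("name", "city")])

# Like select(people, -city): drop one column
no_city <- people[, names(people) != "city"]
print(no_city)

# Like filter(people, height >= 170): keep rows where the condition is TRUE
tall <- people[people$height >= 170, ]
print(tall)

# subset() combines row filtering and column selection in one call
shawn <- subset(people, name == "Shawn", select = c(name, age))
print(shawn)
```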

References

https://www.r-project.org/about.html

https://www.geeksforgeeks.org/r-tutorial/?ref=lbp

https://www.codecademy.com/resources/docs/r/variables

https://www.starwars.com/databank/

https://www.educative.io/answers/what-is-tibble-versus-data-frame-in-r

https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf

https://www.programiz.com/r/s4-class

https://www.tidyverse.org/