forked from jmledford3115/datascibiol
-
Notifications
You must be signed in to change notification settings - Fork 0
/
lab2_1.Rmd
197 lines (165 loc) · 6.64 KB
/
lab2_1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
title: "Data Matrices"
author: "Joel Ledford"
date: "Winter 2019"
output:
html_document:
theme: spacelab
toc: yes
toc_float: yes
pdf_document:
toc: yes
---
## Setup
At the beginning of each class, please make sure that you navigate to the class [GitHub](https://github.com/FRS417-DataScienceBiologists) and download the `class_files` folder. In our class GitHub, you will see a directory called `class_files`. Click on the `Clone or download` button and download the files as a `.zip`. You should now have a folder called `class_files` on your computer inside of which are all of the files that we will use for today's class.
## Learning Goals
*At the end of this exercise, you will be able to:*
1. Combine a series of vectors into a data matrix.
2. Name columns and rows in a data matrix.
3. Select values and use summary functions in a data matrix.
4. Explain the difference between a data matrix and a data frame.
## Data Matrix
Last time, you learned how to work with vectors. Today we will organize the vectors into a new type of data structure called a data matrix. Like vectors, data matrices are restricted to *data of the same type*. In the short example (from DataCamp) below, we will build a new data matrix using the matrix command.
Box office earnings for Star Wars movies (in millions!). Notice that these are separate vectors.
```{r}
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
```
Create a new object called `box_office`. Here we are using the `c` command to combine the three vectors into one.
```{r}
box_office <- c(new_hope, empire_strikes, return_jedi)
box_office
```
Create `star_wars_matrix` using the `matrix()` command. We need to tell R how to organize the `box_office` vector using the `nrow` and `byrow` commands.
```{r}
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = T)
star_wars_matrix
```
## Name the rows and columns
Vectors `region` and `titles`, used for naming.
```{r}
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
```
Name the columns using `colnames()` with the vector region.
```{r}
colnames(star_wars_matrix) <- region
```
Name the rows using `rownames()` with the vector titles.
```{r}
rownames(star_wars_matrix) <- titles
```
Print star_wars_matrix.
```{r}
star_wars_matrix
```
## Using a data matrix
Once you have a data matrix, you can perform lots of different analyses. For example, you can calculate the total earnings of each movie.
```{r}
worldwide_vector <- rowSums(star_wars_matrix)
worldwide_vector
```
And even add a new column to reflect this calculation. `cbind()` adds columns.
```{r}
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)
all_wars_matrix
```
We can combine additional data. `rbind()` adds rows. Create three new vectors for the new movies.
```{r}
the_phantom_menace <- c(474.5, 552.5)
attack_of_the_clones <- c(310.7, 338.7)
revenge_of_the_sith <- c(380.3, 468.5)
```
Combine the new movies into a single vector as before.
```{r}
box_office2 <- c(the_phantom_menace, attack_of_the_clones, revenge_of_the_sith)
```
Use the `matrix()` command to organize the `box_office2` vector.
```{r}
star_wars_matrix2 <- matrix(box_office2, nrow = 3, byrow = T)
```
Create a new vector that will be used for naming. We can re-use region.
```{r}
titles2 <- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith")
```
Use `colnames()` and `rownames()` to name the data matrix.
```{r}
colnames(star_wars_matrix2) <- region
rownames(star_wars_matrix2) <- titles2
```
Combine the two data matrices into one using `rbind()`.
```{r}
all_wars_matrix2 <- rbind(star_wars_matrix, star_wars_matrix2)
all_wars_matrix2
```
## Select elements in a data matrix
The same methods of selecting elements in a vector apply to data matrices. We use `[]`.
```{r}
all_wars_matrix2[1,2]
```
We can also select values in an entire row or column. This can be useful for calculations. Here we calculate the mean of the entire second column.
```{r}
non_us_earnings <- all_wars_matrix2[ ,2]
mean(non_us_earnings)
```
## Practice
1. Below are data collected by three scientists (Jill, Steve, Susan in order) measuring temperatures of eight hot springs.
```{r}
spring_1 <- c(36.25, 35.40, 35.30)
spring_2 <- c(35.15, 35.35, 33.35)
spring_3 <- c(30.70, 29.65, 29.20)
spring_4 <- c(39.70, 40.05, 38.65)
spring_5 <- c(31.85, 31.40, 29.30)
spring_6 <- c(30.20, 30.65, 29.75)
spring_7 <- c(32.90, 32.50, 32.80)
spring_8 <- c(36.80, 36.45, 33.15)
```
2. Build a data matrix that has the springs as rows and the columns as scientists.
```{r}
springs <- c(spring_1, spring_2, spring_3, spring_4, spring_5, spring_6, spring_7, spring_8)
springs_matrix <- matrix(springs, nrow=8, byrow = T)
springs_matrix
```
3. Let's name the rows and columns. Start by making two new vectors with the names.
```{r}
scientists <- c("Jill", "Steve", "Susan")
springs <- c("Bluebell Spring", "Opal Spring", "Riverside Spring", "Too Hot Spring", "Mystery Spring", "Emerald Spring", "Black Spring", "Pearl Spring")
```
4. Use `colnames()` and `rownames()` to name the columns and rows.
```{r}
colnames(springs_matrix) <- scientists
rownames(springs_matrix) <- springs
springs_matrix
```
5. Calculate the mean temperature of all three springs.
```{r}
mean_vector <- rowMeans(springs_matrix)
mean_vector
```
6. Add `mean_vector` as a new column.
```{r}
springs_matrix2 <- cbind(springs_matrix, mean_vector)
scientists2 <- c("Jill", "Steve", "Susan", "Mean Temp")
colnames(springs_matrix2) <- scientists2
springs_matrix2
```
## Data Frames
A data frame is a fancy way of saying data table with one caveat; the data in a data frame can be of multiple classes. This should make good sense as we often want to analyze relationships between numerical values, logical values, integers, etc. and this isn't possible in a data matrix.
Here is a vector of organisms.
```{r}
organism <- c("Human","Mouse","Fruit Fly", "Roundworm","Yeast")
```
Here are the data.
```{r}
genomeSizeBP <- c(3000000000,3000000000,135600000,97000000,12100000)
estGeneCount <- c(30000,30000,13061,19099,6034)
```
Instead of using the matrix command, we will combine the data using `data.frame()`.
```{r}
comparativeGenomeSize <- data.frame(organism = organism, genomeSizeBP = genomeSizeBP, estGeneCount = estGeneCount)
comparativeGenomeSize
```
Notice that not only are the data neater and cleaner looking, there is also information provided about the type of data in the frame. `dbl` means that the value is a type of numeric [double precision floating point](http://uc-r.github.io/integer_double/).
## That's it, let's take a break!
--> On to [part 2](https://jmledford3115.github.io/datascibiol/lab2_2.html)