-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathcoding_03_matrices_dataframes.Rmd
220 lines (145 loc) · 11.3 KB
/
coding_03_matrices_dataframes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
---
title: 'Matrices and Data Frames'
output:
html_notebook: default
html_document:
df_print: paged
pdf_document: default
author: Marius 't Hart
---
```{r setup, cache=FALSE, include=FALSE}
library(knitr)
opts_chunk$set(comment='', eval=FALSE)
```
In this third tutorial, you'll learn about matrices and data frames. Each of these have rows and columns, like a spreadsheet, where each cell contains some variable. They are useful to store and handle larger amounts of data. Of course, each piece of data in a matrix or data frame can still be used and changed as a kind of variable, but they can be treated as variable themselves as well. For example, you can make a list of named matrices or data frames to store the data from different groups of participants or different conditions in an experiment.
## Matrices
Here you will get to know matrices. They are the matrices from matrix algebra, so they are kind of rectangular tables that hold numbers (not strictly true in R). They can be created with the `matrix()` function:
```{r make_matrix}
my_matrix <- matrix(data=c(1:9),nrow=3)
my_matrix
```
By giving it a vector of 9 numbers and specifying that there are 3 rows, the function figures out that there should also be 3 columns in order to fit all the numbers. You can see that the matrix is filled by column, that is: the first column is filled first, and then the second column, and so on. You can change all this behavior:
```{r make_another_matrix}
matrix(data=c(1:9), nrow=2, ncol=6, byrow=TRUE)
```
This code complains, but it _does_ create a matrix of the specified shape, filling in the first row first, then moving on to the second row. When it runs out of numbers it just starts over again at the beginning of the input vector. If you specify a matrix that is too small to contain all the numbers in the input, you will get the same warning, but a matrix of the specified shape is returned and the excess numbers are discarded.
Try using 3 rows and 6 columns. What happened? Why is there no warning? What exactly did that warning say?
## Matrix indexing
We can go back to the first matrix, and use indexing to retrieve part of the content of the matrix:
```{r get_number_from_matrix}
my_matrix[3,3]
```
You can see that we now need two indices, one for the row, and one for the column. We can also retrieve an entire row at once, by not specifying the column index:
```{r retrieve_second_row}
my_matrix[2, ]
```
Change this code chunk to retrieve only the second column:
```{r retrieve_second_column}
my_matrix[,]
```
We can even get a sub-matrix from a matrix, by specifying a range of numbers for both the row and column indices:
```{r}
my_matrix[2:3, 2:3]
```
You can do simple math with matrices, just like with vectors:
```{r}
my_matrix + 2
```
Most importantly, matrices can be used to do matrix algebra, which will probably pop up in your analyses when you grow more advanced in using R, but we won't go into that here. However, understanding the row-by-column indexing will come in useful when dealing with data frames, the next topic.
# Data Frames
Run this chunk to see what a data frame usually looks like:
```{r show_simple_df}
col1 <- c('[1,1]','[2,1]','[3,1]')
col2 <- c('[1,2]','[2,2]','[3,2]')
col3 <- c('[1,3]','[2,3]','[3,3]')
my_df <- data.frame(col1,col2,col3,row.names=c('row1','row2','row3'),stringsAsFactors=FALSE)
my_df
```
You see that a data frame has rows and columns, like a matrix. Like a matrix, rows and columns can have names -- although some people say that rows should not have names. Also like a matrix, you can use indices to retrieve specific values. Correct this code to get the value from the cell that is in the first row and in the first column. The value should be: `[1,1]`.
```{r get_df_cell}
my_df[row_index,column_index]
```
You can also get a whole row, or a whole column from a data frame, by only specifying thw row or column index you're interested in. Correct this code to display the third row, and the second column:
```{r}
my_df[ x ]
my_df[ y ]
```
This chunk has two output windows, that you can select by clicking on the two rectangles at the top. By default, the last generated output will be selected. If you managed to complete the code as specified you will notice that the output is different. When you ask for a row of the data frame, you get a data frame with a single row, but there are still column names. When you ask for a column of the data frame, you get a character vector (or a list). Why is this?
## Cases and variables
This comes from the convention that a row represents a case, for example a participant in one of our studies, and a column represents a variable, for example their age in years at the time they participated, their sex, or the condition they were assigned to. Of course, we can also represent data collected in the experimental tasks that way, but for now, notice that age in years would be a numerical variable, while condition would be categorical and could be a character variable or an integer. That meas that if you want all variables for a specific case (a row of the data frame), you could end up with a series of different variables, and it would probably make sense to keep the variable names. On the other hand, if you'd only want one column, they're all in the same type of variable. Of course, when you ask for two columns from a data frame, that is no longer guaranteed.
Let's see if you get a data frame when you index two columns. Complete this code chunk, to get the last two columns of `my_df`:
```{r}
output <- dataframe[,2:3]
class(output)
```
So that seems to work as expected.
One way to think of data frames is as a list of vectors of equal length, where each of the vectors has variables of only one kind.
## Looking at data frames
Now you will explore a data frame that comes with R. It is called trees, and has some data on 31 black cherry trees' height, girth and volume -- for some odd, historic reason in non-metric units, so you can safely ignore the units. Let's see what the `str()` function can do for you:
```{r}
str(trees)
```
You can see that `trees` is of the class 'data.frame' and that it contains 31 observations of 3 variables. Then each of the following lines that starts with a `$` (dollar sign) describes one of those three variables: Girth, Height and Volume. Each of those are further described as `num` or numeric, and the first few values are listed.
The function `str()` is supposed to show you the _structure_ of a variable, and it can be used on other variables than data frames.
If you just want to see the first few rows in a data frame, you can use the function `head()`:
```{r}
head(trees)
```
This will also give you some idea of what is in a data frame. In order to look at the whole data frame, you can use `View()`. In R Studio, this will open a tab in the source code pane, and in a simple R shell this will open a new window to show you the data frame.
```{r}
View(trees)
```
Before starting to work with data, you should always look at it.
## Columns and dollars
The dollar signs at the start of the lines about the variables Girth, Height and Volume are there on purpose. You can type the name of a data frame, followed by dollar sign, followed by the column name (no spaces) to access the column.
Change the folowing chunk of code so that R shows the height of all trees:
```{r}
trees$Girth
```
We can also use the column name in square brackets for indexing, while leaving the row indices unspecified:
```{r}
trees[,'Girth']
```
While you can use the row names in indexing similarly, you can't use the earlier dollar notation for rows, only columns. Since columns are supposed to have one variable, and each row has a case or observation, you can use column names in formula's to express relationships between variables.
For example, for pine trees you might suspect that the height of the tree is related to age. Let's plot that first:
```{r}
plot(trees$Height,trees$Volume)
```
But you could have done that with simple indexing as well (`plot(trees[,2], trees[,3])`). So here's an example that only works only for column names: what if the Volume of the trees depends on both Girth and Height? You can create a simple linear model for that with the `lm()` function (don't worry about the details of that for now):
```{r}
lm(Volume ~ Girth + Height, data=trees)
```
You get a linear model that predicts the volume of a pine tree as: `(Girth * 4.7082) + (Height * 0.3393) - 57.9877`. Whether or not this model makes sense is another matter, but the point is that you have to put the variables that matter to you in separate columns. Well, it _can_ be done the other way around -- I tried -- but it takes way more effort. So let's just make data frames work for you.
## Factors
In the trees data frame, each column had a numeric variable. However, in experimental design, you will usually use conditions. These can be represented as categorical variables that are used to group variables in other columns. Such variables are called factors in R. We'll briefly look at them and may return to them later. To do this we will use the `sleep` data frame. Of course, you will first look at the structure of the data frame:
```{r}
str(sleep)
```
This looks somewhat similar to the trees data frame. Now there are 20 observations in 3 variables. There are two factors: group and ID. Group is a bit misleading, as it actually indicates the kind of sleeping pill taken, where ID indicates the participant. Each participant took both drugs, so there are not two separate groups of participants. The column extra indicates the number of extra hours of sleep the participant got through using the drug.
We'll have a look at the first few rows:
```{r}
head(sleep)
```
Below the column names, you can see the kind of variables: `extra` is a double, a kind of numeric variable type, and `group` and `ID` are factors.
And we'll look at the whole data frame:
```{r}
View(sleep)
```
You can see that each of the ten IDs appears once in each "group".
The datasets that come with R also have a help page (this is not case for your own datasets, unless you wrote one). In order to understand the data, we should also read the help page:
```{r}
help("sleep")
```
You can see that similar information is provided in the help page, so this verifies that you can learn a lot about a dataset by looking at it with the three tools you've learned. At the bottom of the help page you can also see some R code that supposedly shows you the data and some interesting bits of information about it. You can copy-paste that code into the Console to see what happens.
The use of factors will become more interesting when you actually do some statistics, but you should use variables like them to organize data as well. For example, you can take the same measures after training with aligned feedback or after training with rotated feedback.
You could represent a series of reaches with aligned or rotated feedback in a data frame with the following columns:
- aligned or rotated feedback
- trial number
- target angle
- time after trial onset
- hand X position
- hand Y position
- cursor X position
- cursor Y position
And we actually do something very close to this - as you will see in the next tutorial.
While the first three variables or columns would represent ways to group the data, or independent variables of a sort as the experiment determines them, for each separate trial, the last five columns describe a reach trajectory through time that we can process.