forked from bcb420-2022/R_basics
-
Notifications
You must be signed in to change notification settings - Fork 1
/
07-R-Data-Frames.Rmd
161 lines (112 loc) · 5.06 KB
/
07-R-Data-Frames.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
# Data frames {#r-data-frames}
(R data frames)
## Overview
### Abstract:
Introduction to data frames: how to create, and modify them and how to retrieve data.
### Objectives:
This unit will:
* introduce R data frames;
* cover a number of basic operations.
### Outcomes:
After working through this unit you:
* know how to create and manipulate data frames;
* can extract rows, columns, and append new data rows;
### Deliverables:
**Time management**: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
**Journal**: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
**Insights**: If you find something particularly noteworthy about this unit, make a note in your insights! page.
### Prerequisites
[RPR-Objects-Vectors (R scalars and vectors)](#r-scalars)
## Data frames
`r task_counter <- task_counter + 1`
## Task `r task_counter` - Basic operations
```{block, type="rmd-task"}
Load the R-Exercise_BasicSetup project in RStudio if you don't already have it open.
Type init() as instructed after the project has loaded.
Continue below.
```
Data frames are probably the most important type of data object for bioinformatics in **R**; they emulate our mental model of data in a spreadsheet and can be used to implement entity-relationship datamodels.
Usually the result of reading external data from an input file is a data frame. The file below is included with the R-Exercise-BasicSetup project files - it is called plasmidData.tsv, and you can click on it in the Files Pane to open and inspect it.
<p align="center"><img src="./images/plasmid_file.png" alt="plasmid" width="75%" align="center" /></p>
This data set uses tabs as column separators and it has a header line. Similar files can be exported from Excel or other spreadsheet programs.
Read this as a data frame as follows:
```{r}
( plasmidData <- read.table(
file.path("data_files","plasmidData.tsv"),
sep="\t",
header=TRUE,
stringsAsFactors = FALSE) )
objectInfo(plasmidData)
```
Note the argument **stringsAsFactors = FALSE**. If this is TRUE instead, **R** will convert all strings in the input to factors and this may lead to problems. Make it a habit to turn this behaviour off, you can always turn a column of strings into factors when you actually mean to have factors.
You can view the data frame contents by clicking on the spreadsheet icon behind its name in the Environment Pane.
<p align="center"><img src="./images/plasmid_tsv.png" alt="plasmid" width="75%" align="center" /></p>
## Basic operations
Here are some basic operations with the data frame. Try them and experiment. If you break it by mistake, you can just recreate it by reading the source file again:
use column 1 as rownames
```{r}
rownames(plasmidData) <- plasmidData[ , 1]
```
```{r}
nrow(plasmidData)
ncol(plasmidData)
objectInfo(plasmidData)
```
assign one row to a variable. This is also a data frame! One row. It has to be, because it contains elements of type chr and of type int!
```{r}
x <- plasmidData[2, ]
objectInfo(x)
```
retrieve one row: different syntax, same thing
```{r}
plasmidData["pBR322", ]
```
retrieve one column - two different methods
```{r}
plasmidData[ , 2]
plasmidData[ , "Size"]
```
remove one row
```{r}
plasmidData <- plasmidData[-2, ]
objectInfo(plasmidData)
```
add it back at the end
```{r}
plasmidData <- rbind(plasmidData, x)
objectInfo(plasmidData)
```
add a new row from scratch:
```{r}
plasmidData <- rbind(plasmidData,
data.frame(Name = "pMAL-p5x", Size = 5752,
Marker = "Amp",Ori = "pMB1",
Sites = "SacI, AvaI, ..., HindIII",
stringsAsFactors = FALSE))
objectInfo(plasmidData)
```
`r task_counter <- task_counter + 1`
## Task `r task_counter` - modify data frame
```{block, type="rmd-task"}
The rowname of the new row of plasmidData is now "1". It should be "pMAL-p5x". Fix this.
```
## Self-evaluation
## Further reading, links and resources
**If in doubt, ask!**<br>
If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
```{block2, type="rmd-original-history"}
<br>**Author**: Boris Steipe <[email protected]> <br>
**Created**: 2017-08-05<br>
**Modified**: 2018-05-05<br>
Version: 1.0.1<br>
**Version history**:<br>
1.0.1 Maintenance<br>
1.0 Completed to first live version<br>
0.1 Material collected from previous tutorial<br>
```
### Updated Revision history
```{r echo=FALSE}
source("./bcb420_books_helper_functions.R")
knitr::kable(githistory2table(git2r::commits(repo=".",path=knitr::current_input())))
```
### Footnotes: