-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTeaching_Notes.Rmd
1184 lines (861 loc) · 49.7 KB
/
Teaching_Notes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Teaching Notes for R-Genomics Data Carpentry Workshop"
subtitle: "Software Carpentry Workshop in the Dept. of Human Genetics at the University of Michigan"
author: "Marian L. Schmidt (Using the R-Genomics lessons written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams)"
date: "October 19 - 20, 2015"
output:
html_document:
theme: united
toc: yes
---
***
> **Discuss:**
> What's your current data analysis workflow?
> Chat with your neighbors!
***
# Before We Start
Lesson Can be found at <a href="http://tracykteal.github.io/R-genomics/00-before-we-start.html" target="_blank">http://tracykteal.github.io/R-genomics/00-before-we-start.html</a>
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
##Learning Objectives
- Introduce participants to the RStudio interface
- Set up participants to have a working directory with a data/ folder inside
- Introduce R syntax
- Point to relevant information on how to get help, and understand how to ask well formulated questions
##RStudio Interface
Go over the following:
- Console
- Script File
- Environment, History
- Files, Plots, Packages, Help
Code and workflow are more reproducible if we can document **EVERYTHING** that we do!
##Our Goal
Other people can easily and exactly replicate our workflow and results
## Getting Started with Our Working Directory
1. Under `File`, click on `New Project`, choose `New Directory` then `Empty project`.
2. Enter a name for this new folder called `SWC_HumanGenetics` (known as a **working directory**). This is where we will work for the rest of the workshop `(~/SWC_HumanGenetics)`.
3. Click on "Create Project".
4. Under the `Files` Tab on the right of the screen, click on `New Folder` and create a folder named `data` within `SWC_HumanGenetics` (i.e. `~/SWC_HumanGenetics/data`).
5. Create a New R Script called `Ecoli_Metagenomes.R` (`File` -> `New File` -> `R Script`) and save it within Your working directory.
6. Does everyone's working directory look like this?
## Interacting with R
Two Main ways with Interacting with R:
1. Using the Console - this is not recommended.
+ Bottom left panel
+ Will show commands run and results
+ Can type commands directly into the console. **However, they will not be saved when you close the session!**
2. Using Script Files (Plain text files that contain code). This is recommended.
+ Type the code/commands into a script file, send it to the console to check it, and then save the script file.
`">"` means that R is ready to receive commands. You can send code from a script file to the console in 2 ways.
1. Click the `Run` button in the top right of the script file.
2. Highlight the desired code and click `Ctrl-Enter`.
A stop sign might show up for longer code. This shows that R is working to execute the code.
`"+"` means that R is waiting for you to complete the code. R has not yet run the code. Usually this is because you forgot to'close' a parenthesis or quotation. Press `Esc` to get out of trouble.
## Basics of R
R is a versatile, open source programming/scripting language that's useful for data analysis/science. It is:
- Based on the language S
- Open Source Software under GPL (a specific type of license).
- Superior to most alternatives
- Has over 7,000 contributed packages
+ Widely used in academia and industry
- Available on all platforms
- Not just for statistics, but also for general purpose programming
- Large and growing community of peers
## Organizing Your Working Directory
You should always separate the original, raw data from intermediate datasets you create! For example your working directory might have folders such as:
- `raw_data`
- `intermediate_data`
- `figures`
- `original_analyses`
- and so on...
## Three ways to Seek Help
1. You know what function you would like to use but need details about it. You can type a single question mark:
```{r help with a function, eval=FALSE, echo=TRUE}
?mean # Reminding myself of the detail and structure of calling the mean function.
```
2. You cannot remember the function that you need to use, but you remember that it has to deal with calculating means. To search all functions within your R, you can use a double question mark. This can also help you remind yourself of which package the function is in.
```{r help search, eval=FALSE, echo=TRUE}
## First way
??mean # Searching for a function that has to do with calculting means of some sort.
## Second way
help.search("mean")
```
Another way to do this is to
3. If you need to remind yourself of the names of the arguments:
```{r arguments, eval=FALSE, echo=TRUE}
args(hist) # "hist" is a function that makes a histogram.
```
Another way is to hit tab.
### You Get an error message that you don't understand.
Google it! Be sure to use key words like **R** and **Error**. You can copy and paste the error into it.
### Asking For Help
- Be sure to make it possible for someone to grasp your problem rapidly. Make it easy to pinpoint the problem.
- Try to reproduce the problem with different very simplified data. You can create a simplified data set by using `dput()`.
```{r asking for help, eval = TRUE, echo = TRUE}
dput(head(iris)) #Iris is an example data.frame that comes with R. Head only shows the first 6 rows.
```
- Always include `sessionInfo()` so people can replicate your example.
```{r SessionInfo, eval=TRUE, echo=TRUE}
sessionInfo()
```
***
# Introduction to R
Lesson can be found at <a href="http://tracykteal.github.io/R-genomics/01-intro-to-R.html" target="_blank">http://tracykteal.github.io/R-genomics/01-intro-to-R.html</a>
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
##Learning Objectives
- Be familiar with R syntax
- Understand the concepts of objects and assignment
- Understand the concepts of vector and data types
- Get exposed to a few functions
## R Syntax
*Instructor will open up `Final_Analysis.R`. Let's look at the following:*
- The assignment operator `<-`
- The `=` for arguments within functions.
- `#` for comments
+ This is where you can tell your reader important "behind the scenes" information.
+ **IMPORTANT: You should ALWAYS heavily comment your code**
- The `$` operator for working with columns inside of a data frame.
- Indentation and spacing for easy reading - **Code is made for humans**.
### A principle: "Write data for computers and code for humans."
## Creating objects
We can do simple math with R.
```{r simple math uncommented, eval=FALSE, echo=TRUE}
3+5
12/7
```
**What is missing??** COMMENTS! ALWAYS comment code
```{r simple math commented, eval=FALSE, echo=TRUE}
# Adding 3 and 5. R is fun!
3+5
```
If we forget the `#`, **what happens?**
- *R is trying to run that sentence as a command*
- Now R has the `+`
+ How do we get out of this? *`Esc`*
To do more interesting things, we now need to learn about **variables and assignment operators. **
## Objects and Assignment Operators
For instance instead of adding `3 + 5` we can create objects, which are called **objects** We assign an object with the `<-` operator, which assigns the value on the right to the object on the left The `<-` can be read as **"goes into"**
> **Tip:** If you use `Alt -` (push `Alt` and then `-`), you can create the `<-` in a single keystroke.
```{r assign simple variables, eval = FALSE, echo=TRUE}
# assign 3 to a
a <- 3
# assign 5 to b
b <- 5
# what now is a
a
# what now is b
b
#Add a and b
a + b
#Add a and b and create c
c <- a + b
# what now is c
c
```
> **Tip:** The name of your object is *important*! It should be meaningful and not too long. The name of an object can never start with a number. The following are unacceptable names for objects:
> -`if`, `else`, `for`: These represent fundamental functions within R.
> -`c`, `mean`, `df`, `data`: These are also functions
> -You can use `_` and `-` but not `.`
> -Use nouns for objects and verbs for functions that you write.
> -Keep a consistent style of your coding. Remember, code is for humans! (Hadley Wickham and Google each have published style guides.)
##Functions and their arguments
**Functions** are "canned scripts" that automate a task. They:
- Requires one or more *arguments*, which are the inputs to the function.
+ Arguments can be anything, numbers, columns of a data.frame, a whole data file. It differs by the function and can be looked up in the help page.
+ Many functions have arguments that have *defaults*. This is a standard value that the author of the function thought would be "good enough in standard cases".
+ For example, what symbol to use in a plot.
+ You can always change the default to what you prefer.
- Often (but not always) return a *value*
- Executing a function is known as *calling* a function
For example, let's take the function `sqrt()`:
- **Argument/Input:** The input, a number.
- **Output/Return value:** The value input number squared.
Another example, let's take the function `round()`:
> **Exercise:**
> Round Pi: 3.14159
> 1. How many arguments does round have?
> 2. What is the default of round?
> 3. Change the default to a different value. What happens?
> 4. What happens when we round 2.5? 3.5? What does this tell you about the round function?
```{r round example, eval=FALSE, echo=TRUE}
# Round pi
round(3.14159)
# What are the arguments and default of round?
args(round)
# Changing the default number of digits to 3
round(3.14159, digits = 3) #Recommended!
round(3.14159, 3) ## NOT recommended for functions with more than one argument!!
```
We get 3, that's because the default is to round to the nearest whole number.
> **Exercise:**
> Create an object named `ecoli_genome_length_mb` with a value of 4.6.
> How many picograms does the genome weigh? (978 Mbp = 1 pg)
#####Solution
```{r genome length exercise, eval=FALSE, echo=TRUE}
ecoli_genome_length_mb <- 4.6 # This is for an e-coli genome
ecoli_genome_length_mb
ecoli_genome_weight_pg <- ecoli_genome_length_mb / 978.0
ecoli_genome_weight_pg # Ecoli genomes don't weigh very much.
```
#####Human Genome
```{r Human genome, eval=FALSE, echo=TRUE}
human_genome_length_mb <- 30000 # This is for an e-coli genome
human_genome_length_mb
human_genome_weight_pg <- human_genome_length_mb / 978.0 # To calculate the number of picograms that human genomes weight
human_genome_weight_pg # Human genomes don't weigh very much.
```
#####Comparing things
- `==` denotes equality. Reads as "is equal to"
- `!=` denotes inequality. Reads as "is not equal to"
- `<` "less than"
- `<=` "less than or equal to"
- `>` "greater than"
- `>=` "greater than or equal to"
```{r comparing things, eval=FALSE, echo=TRUE}
1 ==1 # is 1 equal to 1?
1 != 1 # is 1 inequal to 1?
1 < 2 # Is 1 less than 2?
```
> **Exercise:**
> Compare the lengths of `ecoli_genome_length_mb` and `human_genome_length_mb`.
```{r compare human and ecoli, eval=FALSE, echo=TRUE}
human_genome_length_mb == ecoli_genome_length_mb # Are the genome lengths equal?
human_genome_length_mb != ecoli_genome_length_mb # Are the genome lengths equal?
human_genome_length_mb > ecoli_genome_length_mb # Are human genome lengths greater than ecoli genome lengths?
```
## Vectors and Data Types
A **vector** is the most common and basic data structure in R. It's R's workhorse data type. A **vector** is a list of values, mainly numbers or characters. You can:
- Do math with vectors
- Can assign a list of values to a variable
Let's create a vector of genome lengths:
```{r genome lengths vector, echo=TRUE, eval=FALSE}
glengths <- c(4.6, 3000, 50000) # Create a vector of 3 genome lengths
glengths
glengths[1] # Return the first value
glengths[3] # Return the 3rd value
```
A vector can also contain characters:
```{r species character vector, echo=TRUE, eval=FALSE}
species <- c("ecoli", "human", "corn")
species
species[2] #What's the 2nd value in the species vector?
```
There are many functions to help us explore the vector:
```{r vector exploration, echo=TRUE, eval=FALSE}
length(glengths) # How many vlaues are in the genome lengths vector?
length(species) # how many values in the species vector?
length(glengths) == length(species) # Are they the same length?
```
Some math with the vectors:
```{r vector math, echo=TRUE, eval=FALSE}
5 * glengths
new_lengths <- 2 * glengths # multiply lengths by 2
new_lengths
```
This makes it easy to combine or work with data!
Let's take a look at our history in the top right of our console. To do so we can click on the history tab or type
```{r history, echo=TRUE, eval=FALSE}
history() #Opens up the history pane in the top right of RStudio
```
What type of data is in our vector?
```{r vector type, echo=TRUE, eval=FALSE}
class(glengths) # What type of data is in glengths?
class(species)
str(glengths) # What's the structure of glengths?
str(species)
```
> **Tip:** `str()` provides an overview of the object and the elements it contains. It is a really useful function when working with large and complex objects:
```{r add values to vector, echo=TRUE, eval=FALSE}
lengths <- c(glengths, 90) # adding at the end
lengths
lengths <- c(30, glengths) # adding at the beginning
lengths
```
We just saw 2 of the 6 data types that R uses: `character` and `numeric`. The other 4 are:
- `logical` for TRUE and FALSE (the boolean data type)
- `integer` for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
- `complex` to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them.
- "raw" that we won’t discuss further.
Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame) and factors (factor).
> Exercise:
> 1. What's the difference between a *data structure* and *data type*?
> 2. Discuss the definitions of *vector*, *object*, *function*, *arguments*.
***
# Starting with Data
Lesson can be found at <a href="http://tracykteal.github.io/R-genomics/02-starting-with-data.html" target="_blank">http://tracykteal.github.io/R-genomics/02-starting-with-data.html</a>
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
##Learning Objectives
- Load external data (CSV files) in memory using the survey table (Ecoli_metadata.csv) as an example
- Explore the structure and the content of the data in R
- Understand what are factors and how to manipulate them
## Looking at Metadata
We are studying a population of E. coli (designated Ara-3):
- Propagated for more than 40,000 generations in a glucose-limited minimal medium.
- This medium was supplemented with citrate which E. coli cannot metabolize in the aerobic conditions of the experiment.
- Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared at around **31,000** generations.
- This metadata describes information on the Ara-3 clones and the columns represent:
Column | Description
------------- | -------------
Sample | Clone Name
generation | generation when sample frozen
clade | based on parsimony tree
strain | ancestral strain
cit | Citrate-using mutant status
run | Sequence read archive sample ID
genome_size | size in Mbp (Made up for this lesson)
1. Make sure you are in the correct working directory by typing `getwd()`.
2. Create a new directory within this working directory called data.
3. Move the downloaded file into this directory. Remember some of our unix commands:
```{r move files to directory, eval=FALSE, echo=TRUE}
cd # Go to home directory
cd Downloads/ # Go to downloads
ls *metadata.csv # list all files that end with "metadata.csv"
mv Ecoli_metadata.csv ../SWC_HumanGenetics/data/ # Move the file to our working directory!
```
4. Check the status of your working directory. Is the `Ecoli_metadata.csv` file within the `data` folder in the `SWC_HumanGenetics` working directory.
5. Time to load the data into R!
```{r load data1, eval=TRUE, echo=TRUE}
getwd() # R's pwd - where are we?!
# reads the Ecoli_metadata.csv file from the data sub-folder
metadata <- read.csv('data/Ecoli_metadata.csv') #provides no output
```
```{r load data2, eval=FALSE, echo = TRUE}
metadata # This will provide too much output
head(metadata) # first 6 rows
tail(metadata) # last 6 rows
View(metadata) # View it!
```
## What is a `data.frame`?
`data.frame` is the de facto data structure for most tabular data and what we use for statistics and plotting. It is:
- A collection of vectors of identical lengths.
- Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors).
- The `str()` function is useful to inspect the data types of the columns.
- Spreadsheets can be imported with the `read.csv()` or the `read.table` functions.
- By default, data.frame converts (= coerces) columns that contain characters (i.e., text) into the **factor** data type.
+ Depending on what you want to do, you might want to keep these columns as characters. If so, change `read.csv('data/Ecoli_metadata.csv', stringsAsFactors = FALSE)`
## Inspecting `data.frame` objects
- Size:
+ `dim()` - returns a vector with the number of rows in the first element, and the number of columns as the second element (the __dim__ensions of the object)
+ `nrow()` - returns the number of rows
+ `ncol()` - returns the number of columns
- Content:
+ `head()` - shows the first 6 rows
+ `tail()` - shows the last 6 rows
+ `View()` - view the data within RStudio
- Names:
+ `names()` or `colnames()` - returns the column names
+ `rownames()` - returns the row names
- Summary:
+ `str()` - structure of the object and information about the class, length and content of each column
+ `summary()` - summary statistics for each column
> **Exercises**
> What is the class of the object metadata?
> How many rows and how many columns are in this object?
> How many citrate+ mutants have been recorded in this population?
##Factors
**Factors** are used to represent categorical data.
- Can be ordered or unordered.
- Are an important class for statistical analysis and for plotting.
- Are stored as integers, and have labels associated with these unique integers.
- While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
- Once created, factors can only contain a pre-defined set values, known as **levels**.
+ By default, R always sorts levels in alphabetical order
Let’s now check the __str__ucture of our data.frame in more details with the function `str()`:
```{r structure of metadata, eval=TRUE, echo=TRUE}
str(metadata)
```
- Some columns say `Factor w/ 30 levels`, which means that each row has a unique value.
- `cit` is a `Factor w/ 3 levels`: `minus`, `plus` and `unknown`.
> **Exercise**
> By looking at the str(metadata) output:
> 1. What does the `$` operator mean?
> 2. What do the integers mean at the end of the description for the columns that are `factors`?
***
# Introducing `data.frame` - more on data frames
Lesson can be found at <a href="http://tracykteal.github.io/R-genomics/03-data-frames.html" target="_blank">http://tracykteal.github.io/R-genomics/03-data-frames.html</a>
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
##Learning Objectives
- Understand the concept of a `data.frame`
- Using sequences to index data
- Know how to access any element of a `data.frame`
##Indexing and sequences (within a vector)
If we want to extract one or several values from a vector, we must provide one or several indices in **square brackets**, just as we did with the vectors - this time with 2 numbers:
- R indexes start at 1.
- `data.frame` objects have 2 dimensions - rows and columns
+ Therefore, we need to specify the “coordinates” we want from it.
+ Row numbers come first, followed by column numbers (i.e. [row, column]).
```{r metadata indexing, eval=FALSE, echo=TRUE}
metadata[1, 2] # first element in the 2nd column of the data frame
metadata[1, 6] # first element in the 6th column
metadata[1:3, 7] # first three elements in the 7th column
metadata[3, ] # the 3rd element for all columns
metadata[, 7] # the entire 7th column
head_meta <- metadata[1:6, ] # metadata[1:6, ] is equivalent to head(metadata)
```
### However, this method is not recommended!
##Indexing and sequences (within a data.frame)
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. **(Are species names in column 5 or 7? oh, right… they are in column 6)**. In some cases, in which column the variable will be can change if the script you are using adds or removes columns.
It’s therefore often **better to use column names** to refer to a particular variable, and it makes your code easier to read and your intentions clearer.
You can do operations on a particular column, by selecting it using the `$` sign.
- In this case, the entire column is a vector.
- You can use `names(metadata)` or `colnames(metadata)` to remind yourself of the column names.
For instance, to extract all the strain information from our datasets:
```{r metadata strains, eval=FALSE, echo=TRUE}
metadata$strain # Return all of the strains in the data frame
```
In some cases, you may want to select more than one column. You can do this using the square brackets. Suppose we wanted strain and clade information:
```{r metadata column subset, eval=FALSE, echo=TRUE}
metadata[, c("strain", "clade")] # Only select the strain and clade columns from metadata
```
You can even access columns by column name and select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:
```{r metadata column and row subset, eval=FALSE, echo=TRUE}
metadata[4:7, c("strain", "clade")]
```
> **Exercise**
> Create a new object with a new name **without** the `strain` and `run` columns.
***
# Aggregating and analyzing data with dplyr
Lesson can be found at <a href="http://tracykteal.github.io/R-genomics/04-dplyr.html" target="_blank">http://tracykteal.github.io/R-genomics/04-dplyr.html</a>
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
## Data manipulation using dplyr
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations.
Enter `dplyr`.
`dplyr` is a package for making data manipulation easier.
**Packages** in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str(), come built into R; packages give you access to more functions. You need to install a package and then load it to be able to use it.
```{r install dplyr, eval=FALSE, echo=TRUE}
install.packages("dplyr") ## install dplyr
```
You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; I’d recommend choosing the RStudio mirror.
```{r load dplyr, eval=FALSE, echo=TRUE}
library("dplyr") ## load
```
You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.
## What is dplyr?
`dplyr` is:
- a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks.
- Built to work with data frames.
- Inspired by the `plyr` package, which has been in use for some time but suffered from being slow in some cases.
+ `dplyr` addresses this by porting much of the computation to C++.
- An additional feature is the ability to work with data stored directly in an external database.
- The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.
This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.
###Selecting columns and filtering rows
We’re going to learn some of the most common dplyr functions:
- `select()`
- `filter()`
- `arrange()`
- `mutate()`
- `group_by()`
- `summarize()`
To select columns of a data frame, use `select().`
The first argument to this function is the data frame (`metaadata`), and the subsequent arguments are the columns to keep.
```{r select dplyr, eval=FALSE, echo=TRUE}
select(metadata, sample, clade, cit, genome_size) #select 4 columns
```
To choose rows, use `filter()`:
```{r filter dplyr, eval=FALSE, echo=TRUE}
filter(metadata, cit == "plus") # select only rows that are the citrate plus mutant
```
> **Challenge:**
>
> 1. Filter out samples that have a genome size that is equal or greater than 4.74.
> 2. Filter out mutants that are "unknown" **and** are before generations 25,000.
```{r solution filter, echo=FALSE, eval= FALSE}
# 1. Filter out samples that have a genome size that is equal or greater than 4.74
filter(metadata, genome_size >= 4.74)
# 2. Filter out mutants that are "unknown" and are before generations 25,000.
filter(metadata, cit == "unknown", generation < 25000)
```
To re-order or arrange the output, use `arrange()`:
```{r arrange dplyr, eval = FALSE, echo = TRUE}
arrange(metadata, genome_size) # arrange by ascending genome_size
arrange(metadata, -genome_size) # arrange by -ascending genome_size
arrange(metadata, -generation) # arrange by descending generation
```
> **Challenge:**
>
> Create a new object named `arranged_metadata` where `metadata` is arranged first by descending generation and second by ascending clade
```{r solution arrange dplyr, eval = FALSE, echo = FALSE}
arranged_metadata <- arrange(metadata, -generation, clade)
```
Summary so far:
- `select()`: To subset certain **columns** out of the data.
- `filter()`: To subset specific **rows** out of the data.
- `arrange()`: To sort the data frame by columns.
### Pipes
But what if you wanted to select and filter?
There are three ways to do this:
1. **Use intermediate steps:** Create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects.
2. **Nested functions:** You can also nest functions by putting one function inside of another. This is handy, but can be difficult to read if too many functions are nested as the process from inside out.
3. **Pipes:** A fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set.
+ Pipes in R look like `%>%` and are made available via the `magrittr` package installed as part of `dplyr`.
Let's say we would like to do the following:
- Pick only the citrate plus mutants AND pick the sample, generation, and clade columns. All in one set of function calls. Enter pipes.
```{r pipes dplyr, eval=FALSE, echo=TRUE}
metadata %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
```
In the above we use the pipe to send the metadata data set first through filter, to keep rows where cit was equal to ‘plus’, and then through select to keep the sample and generation and clade columns.
When the data frame is being passed to the `filter()` and `select()` functions through a pipe, we don’t need to include it as an argument to these functions anymore.
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
```{r pipes dplyr variable, eval=FALSE, echo=TRUE}
meta_citplus <- metadata %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
meta_citplus
```
> **Challenge**
> Using pipes, subset the data to include rows where the clade is ‘unknown+’. Retain columns `sample`, `cit`, and `genome_size`. Finally arrange the output by descending genome_size.
```{r solution pipes dplyr, eval = FALSE, echo = FALSE}
meta_citplus_arranged <- metadata %>%
filter(cit == "unknown") %>%
select(sample, cit, genome_size) %>%
arrange(-genome_size)
meta_citplus_arranged
```
### Mutate
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use `mutate()`.
To create a new column of genome size in bp:
```{r mutate dplyr, eval=FALSE, echo=TRUE}
metadata %>%
mutate(genome_bp = genome_size *1e6) # 1 Mbp = 10^6 bp
```
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the `head()` of the data (pipes work with non-dplyr functions too, as long as the `dplyr` or `magrittr` packages are loaded).
```{r mutate head dplyr, eval=FALSE, echo=TRUE}
metadata %>%
mutate(genome_bp = genome_size *1e6) %>%
head
```
The row has a `NA` value for clade, so if we wanted to remove those we could insert a `filter()` in this chain:
We can do this by using the `is.na()` command.
`is.na()` is a function that determines whether something is or is not an `NA`. The `!` symbol negates it, so we’re asking for everything that is not an `NA`.
Remember the `!=` inequality operations, we can use this with `is.na()` by placing it before the command like this `!is.na()`.
```{r mutate & filter dplyr, eval=FALSE, echo=TRUE}
metadata %>%
mutate(genome_bp = genome_size *1e6) %>%
filter(!is.na(clade)) %>% # select all rows where clade is NOT an NA
head
```
### Split-apply-combine data analysis and the `summarize()` function
Many data analysis tasks can be approached using the “split-apply-combine” paradigm:
1. Split the data into groups.
2. Apply some analysis to each group, and
3. Combine the results.
dplyr makes this very easy through the use of the `group_by()` function. `group_by()` splits the data into groups upon which some operations can be run.
For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:
```{r group_by tally dplyr, eval=FALSE, echo=TRUE}
metadata %>%
group_by(cit) %>%
tally() # tallies each group.
```
`count()` is the same as tally but alrea will also do the same thing:
```{r group_by count dplyr, eval=FALSE, echo=TRUE}
metadata %>%
count(cit) # tallies each group but has the group.
```
There are several different summary statistics that can be generated from our data. The R base package provides many built-in functions such as:
- `mean()`
- `median()`
- `min()` & `max()`
- `range()`
By default, all R functions operating on vectors that contains missing data will return `NA`. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it.
When dealing with simple statistics like the `mean()`, the easiest way to ignore `NA` (the missing data) is to use `na.rm=TRUE` (rm stands for remove).
`group_by()` is often used together with `summarize()` which collapses each group into a single-row summary of that group. `summarize()` can then be used to apply a function such as those that compute simple stats to give an overview for the group. So to view mean genome_size by mutant status:
```{r summarize mean genome size, eval=FALSE, echo=TRUE}
metadata %>%
group_by(cit) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
```
You can group by multiple columns too:
```{r group_by multiple columns, eval=FALSE, echo=TRUE}
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
```
Looks like for one of these clones, the clade is missing. We could then discard those rows using `filter()`:
```{r dplyr2, eval=FALSE, echo=TRUE}
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>%
filter(!is.na(clade))
```
> **Challenge:**
> Let's calculate summary statistics! All in one go, let's do the following:
>
> 1. Group by `cit`
> 2. Calculate the `mean()` generation for each mutant.
> 3. Calculate the `median()` generation for each mutant.
> 4. Calculate the `min()` generation for each mutant.
> 5. calculate the `max()` generation for each mutant.
```{r dplyr capstone, eval = FALSE, echo = FALSE}
metadata %>%
group_by(cit) %>%
summarize(mean_gen = mean(generation, na.rm = TRUE),
median_gen = median(generation, na.rm = TRUE),
min_gen = min(generation, na.rm = TRUE),
max_gen = max(generation, na.rm = TRUE))
```
What do we notice about the output??
`dplyr` has changed our `data.frame` to a `tbl_df`. This is a data structure that’s very similar to a data frame; for our purposes the only difference is that it won’t automatically show tons of data going off the screen.
```{r glimpse, eval=FALSE, echo=TRUE}
glimpse(metadata) # dplyr version, very convenient for the screen
str(metadta)
```
Link to the RStudio <a href="https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf" target="_blank"> Data Wrangling Cheat Sheet</a>
***
# Data Visualization
Lesson can be found at <a href="http://tracykteal.github.io/R-genomics/05-data-visualization.html" target="_blank">http://tracykteal.github.io/R-genomics/05-data-visualization.html</a>
This lesson was written by Kate Hertweck, Susan McClatchey, Tracy Teal, Ryan Williams and is maintained by Tracy Teal.
##Learning Objectives
- Basic plots
- Advanced plots (introducing ggplot)
- Writing images (and other things) to file
## Basic Plots in R
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers”, and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R’s plotting packages.
When we are working with large sets of numbers it can be useful to display that information graphically. R has a number of built-in tools for basic graph types such as hisotgrams with `hist()`, scatter plots with `plot()`, bar charts with `barplot()`, boxplots with `boxplot()` and much,<a href="http://www.statmethods.net/graphs/" target="_blank">much more</a>
We’ll test a few of these out here on the genome_size vector from our metadata.
```{r genome size for plotting, echo=TRUE, eval = FALSE}
genome_size <- metadata$genome_size
```
### Scatterplot
Let’s start with a scatterplot. A scatter plot provides a graphical view of the relationship between two sets of numbers.
We don’t have a variable in our metadata that is a continous variable, so there is nothing to plot it against but we can plot the values against their **index** values just to demonstrate the function.
```{r genome size base scatterplot, echo=TRUE, eval = FALSE}
plot(genome_size) # Scatter plot
```
Each point represents a clone and the value on the x-axis is the clone index in the file, where the values on the y-axis correspond to the genome size for the clone. For any plot you can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.
For example, we can change the shape and add a title of the data point using pch.
```{r pch scatterplot, echo=TRUE, eval = FALSE}
plot(genome_size, pch=8) # Change the symbol
plot(genome_size, pch=8, main="Scatter plot of genome sizes") # Add a title
```
### Histogram
Another way to visualize the distribution of genome sizes is to use a histogram, we can do this buy using the `hist()` function:
```{r base histogram, echo=TRUE, eval = FALSE}
#histograms
hist(genome_size) # Create histogram of genome sizes
hist(genome_size, breaks = 10)
```
### Boxplot
Using additional information from our metadata, we can use plots to compare values between the different citrate mutant status using a boxplot. A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set.
```{r base boxplot, echo=TRUE, eval = FALSE}
# Boxplot
boxplot(metadata$genome_size ~ metadata$cit) # Creates a box and whisker plot
```
To the **left** of the `~` is the *response variable* and to the **right** of the tilde is the *explanatory variable.*
```{r base colored boxplot, echo=TRUE, eval = FALSE}
boxplot(genome_size ~ cit, metadata, col=c("pink","purple", "darkgrey"),
main="Average expression differences between celltypes", ylab="Expression")
```
## Advanced figures with `ggplot2`
More recently, R users have moved away from base graphic options and towards a plotting package called `ggplot2` that adds a lot of functionality to the basic plots seen above.
An amazing resource for `ggplot2` is the <a href="http://www.cookbook-r.com/Graphs/" target="_blank">R Cookbook written by Winston Chang</a>
The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using `ggplot2` functions to get a feel for the syntax.
``ggplot()`` is best used on data in the `data.frame` form, so we will work with metadata for the following figures. Let’s start by loading the ggplot2 library.
```{r install and load ggplot, echo=TRUE, eval = FALSE}
install.packages("ggplot2")
library(ggplot2)
```
The `ggplot()` function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot, and add them together using the + operator.
We will start with a blank plot and will find that you will get an error, because you need to add layers.
```{r ggplot error, echo=TRUE, eval = FALSE}
ggplot(metadata) # note the error
```
Geometric objects are the actual marks we put on a plot. Examples include:
- points (geom_point, for scatter plots, dot plots, etc)
- lines (geom_line, for time series, trend lines, etc)
- boxplot (geom_boxplot, for, well, boxplots!)
A plot must have at least one `geom` (short for geometric objects); there is no upper limit. You can add a geom to a plot using the + operator
```{r ggplot error2, echo=TRUE, eval = FALSE}
ggplot(metadata) +
geom_point() # note what happens here
```
Each type of `geom` usually has a required set of aesthetics to be set, and usually accepts only a subset of all aesthetics –refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the `aes()` function. Examples include:
- position (i.e., on the x and y axes)
- color (“outside” color)
- fill (“inside” color) shape (of points)
- linetype
- size
<a href="https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf">ggplot Cheat Sheet</a>
```{r ggplots, echo=TRUE, eval = FALSE}
ggplot(metadata, aes(x = sample, y= genome_size)) +
geom_point()
# above is the same as below
ggplot(metadata) +
+ geom_point(aes(x = sample, y= genome_size))
```
The labels on the x-axis are quite hard to read. To do this we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:
- Axis labels
- Plot background
- Facet label backround
- Legend appearance
There are built-in themes we can use, or we can adjust specific elements.
For our figure we will change:
- The x-axis labels to be plotted on a 45 degree angle with a small horizontal shift to avoid overlap.
- Add some additional aesthetics by mapping them to other variables in our dataframe.
+ For example, the color of the points will reflect the number of generations and the shape will reflect citrate mutant status.
- The size of the points can be adjusted within the geom_point but does not need to be included in `aes()` since the value is not mapping to a variable.
```{r ggplot color symbol scatterplot, echo=TRUE, eval = FALSE}
# First let's change the shape of the symbols
ggplot(metadata, aes(x = sample, y= genome_size, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) # put x-axis on 45 degree angle
# Then add the change in color due to generations
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
## Let's look at genome_size by generation
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
# Zoom in on a certain part of the plot with setting the xlim
summary(metadata$generation) # let's plot the mean to max
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
xlim(mean(metadata$generation), max(metadata$generation))
## Flip the plot with coord_flip!
ggplot(metadata, aes(x = generation, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
coord_flip()
```
### Writing figures to file
There are two ways in which figures and plots can be output to a file (rather than simply displaying on screen).
1. **Easier way:** Export directly from the RStudio ‘Plots’ panel, by clicking on Export when the image is plotted. This will give you the option of `png` or `pdf` and selecting the directory to which you wish to save it to.
2. **More reproducible way:** Use R functions that allow you the flexibility to specify parameters to dictate the size and resolution of the output image. Some of the more popular formats include `pdf()`, `png()`, `jpeg()`, `tiff()`etc.
Initialize a plot that will be written directly to a file using `pdf()`, `png()`, `jpeg()`, `tiff()` etc.
Within the function you will need to: