forked from broadinstitute/2020_scWorkshop
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-Understanding-Sequencing-Raw-Data.Rmd
264 lines (178 loc) · 7.04 KB
/
02-Understanding-Sequencing-Raw-Data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
---
title: "02-Understanding-Sequencing-Raw-Data"
output: html_document
---
# Understanding Sequencing Raw Data
## Class Environment
### Getting into AWS Instance
There is a nice breakdown from another Physalia course on instructions for different operating systems and accessing AWS. It is called [Connection to the Amazon EC2 service](https://gitlab.com/bfosso/metabarcoding_physalia/-/blob/master/unix_short_tutorial/how_to_connect.md). This will help with connecting to the AWS instance to run docker.
```{bash ssh_aws, eval = FALSE}
## Example
ssh -i berlin.pem ubuntu@<PUBLIC IP ADDRESS> (e.g.34.219.254.245)
## Actual Command Example
ssh -i berlin.pem [email protected]
```
## Shell and Unix commands
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSavPAPY3WXe2EiqaRXJmAajFTeGcRbD1qJq4Sp4YlcF1_m7I1X89D1uC-8jDDGinqFLNO2oJNALbNx/embed?start=false&loop=false&delayms=3000" frameborder="0" width="760" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
### Common Linux Commands
#### Lab 1a
- check the your present directory
```{bash lab1a_pwd, eval = FALSE}
pwd
```
- check history
```{bash lab1a_hist, eval = FALSE}
history
```
- pipe history to grep to search for the cd command
```{bash, eval = FALSE}
history | grep cd
```
- put history into a history.txt file
```{bash lab1a_hist_txt, eval = FALSE}
history > history.txt
```
- make a directory called data
```{bash lab1a_mkdir, eval = FALSE}
mkdir data
```
- change into data directory
```{bash lab1a_cd, eval = FALSE}
cd data
```
- move history.txt file into data directory
```{bash lab1a_mv, eval = FALSE}
mv ../history.txt ./
```
- check manual page of wget command
```{bash lab1a_man, eval = FALSE}
man wget
```
- redirect wget maunual page output into a file called wget.txt
```{bash lab1a_man_wget, eval = FALSE}
man wget > wget.txt
```
- return the lines that contain output in the wget.txt file
```{bash lab1a_cat, eval = FALSE}
cat wget.txt | grep output
```
```{bash lab1a_grep, eval = FALSE}
grep -i output wget.txt
```
- Compress wget.txt file
```{bash lab1a_gzip, eval = FALSE}
gzip wget.txt
```
- View Compressed file
```{bash lab1a_catgz, eval = FALSE}
cat wget.txt.qz
```
```{bash lab1a_zcat, eval = FALSE}
zcat wget.txt.qz
```
```{bash lab1a_less, eval = FALSE}
zcat wget.txt.qz | less
```
#### Git Commands
Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows.
Go to your user directory and run the following command from git. This will create a directory of all the course material inside your user directory. After it is done cloning change directory into the 2020_scWorkshop directory where the course material is. The commands are below.
```{bash clone_repo, eval = FALSE}
## clone repository
git clone https://github.com/broadinstitute/2020_scWorkshop.git
cd 2020_scWorkshop
```
## File formats
- bcl
- fastq
- bam
- mtx, tsv
- hdf5 (.h5, .h5ad)
### View FASTQ Files
#### Viewing entire file
```{bash cat_fastq, eval = FALSE}
cat data/example_fastq.fastq
```
#### Viewing first 10 lines
```{bash head_fastq, eval = FALSE}
head data/example_fastq.fastq
```
#### Stream Viewing with less command
```{bash less_fastq, eval = FALSE}
less data/example_fastq.fastq
```
### View BAM Files
#### Viewing first 10 lines
```{bash sam_view_less, eval = FALSE}
samtools view data/ex1.bam | head
```
#### Stream Viewing with less command
```{bash sam_view_head, eval = FALSE}
samtools view data/ex1.bam | less
```
## Public data repositories
### Cellranger/10x
#### Lab 1b
10x PBMC data are hosted in https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
- get 10x PBMC data
- unzip data
- explore directory
- explore files
```{bash wget_pbmc, eval = FALSE}
wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
cd ..
```
### GEO
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81905
#### Lab 1c
**Get GEO Data**
- ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81905/matrix/GSE81905-GPL19057_series_matrix.txt.gz
- ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81905/matrix/GSE81905-GPL17021_series_matrix.txt.gz
- go into that directory
- get files and place them in the directory
- View files (try keeping in compressed format and view that way)
```{bash wget_geo, eval = FALSE}
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81905/matrix/GSE81905-GPL19057_series_matrix.txt.gz
cd data; tar -xzf GSE81905-GPL19057_series_matrix.txt.gz
cd ..
```
### Single Cell Portal
- https://portals.broadinstitute.org/single_cell
- Study: Salk Institute - Single-cell Methylome Sequencing Identifies Distinct Neuronal Populations in Mouse Frontal Cortex
#### Lab 1d
- Get R2 fastq file from the Salk Institute study
- Look at files
#### Lab 1e
- Get Docker on your local computer for you to have
- Explore Single Cell Portal
- Explore GEO
## Docker Commands
Docker provides a consistent compute enviornment to ensure all software that you need is on the machine and able to be used. It will give you the version you need and help reduce software conflicts that may arise.
- make sure you are in the directory from the cloned repository directory
- run following command to start docker script
```{bash run_script_bash, eval = FALSE}
# chmod 700 ./docker/run_docker_bash.sh
./docker/run_docker_bash.sh
```
The full command inside the script is below. There is also an explaination of each part for your reference.
```{bash full_bash_cmd, eval = FALSE}
## if you are the super user on your computer
docker run --rm -it -v $PWD:/home/rstudio/materials kdgosik/2020scworkshop bash
## if you need to access permission to run the command
sudo docker run --rm -it -v $PWD:/home/rstudio/materials kdgosik/2020scworkshop bash
```
**Explaination of commands**
```{bash explain_cmd, eval = FALSE}
- docker: command to run docker
- run: asking docker to run a container
- --rm: flag to remove the container when you exit from it
- nothing will be saved from your session to access again later
- this flag can be removed to keep container
- -it: flag to run the container interactively
- this will keep all session output displaying on the terminal
- to stop container go to terminal and press Crtl+c
-v $PWD:/home/rstudio/materials: map your home directory to a directory inside docker container called home
- kdgosik/2020scworkshop: the image to run. It will be the image into a container if not already built on your computer
- [image link](https://hub.docker.com/r/kdgosik/2020scworkshop)
- bash: The entry point into the container. Start on the bash command line
```