How to process RNA-seq data.
Before processing, �make sure you have a correct understanding of what data you are analysing.
Basic information
- Sequencing date
- Contributor: department or labratory
- Accession: GEO/SRA/PRJNA numbers in ncbi or other dataset ID(clear and brief)
- Data source: download path or home-made
sample information
- Organism
- Disease state: cancer type or healthy
- Tissue or other source (cell)
- Molecular: DNA or RNA, total RNA , polyA RNA, or small RNA ?
Sequencing strategy
- Sequencing platform
- Library layout: Single-end V.S. Paired-end ?
- Insert length: 50/100/150 ?
- Strand specific ?
- Size selection ?
- Enrichment: Poly-A enriched or total RNA(ribosomal RNA removed) ?
- Cellular localization: whole cell or intranuclear of cytoplasmic ?
When analyses published datasets, you can get all these information from ncbi website.
Basic information
- Sequencing date: 2017.11-
- Contributor: Lulab Tsinghua University
- Accession: NA
- Data source: home-made
sample information
- Organism: Homo sapiens
- Disease state: hepatocellular carcinoma(HCC)
- Tissue or other source: plasma
- Molecular: total RNA(fragment)
Sequencing strategy
- Sequencing platform: Illunima HiSeq X
- Library layout: Paired-end 150
- Insert length(RNA length): ~30 nt
- Strand specific: Yes
- Size selection: none
- Enrichment: total RNA without ribosomal RNA removed
- Cellular localization: whole extracellular circulating RNA
Flow chart of exRNA-seq processing method
Input/output of each procedure
Step | Input | Tool/script | Output | Note |
---|---|---|---|---|
1.Preprocessing | 00.rawdata/*.fastq | - | 02.rRNA/*/*.no_rRNA.fastq | ~/projects/exRNA/ hcc_examples/ |
1.1 index | ~/genome/human_hg38/gtf/*.gtf | - | ~/genome/human_hg38/index/bowtie2_index/*.bt2 | - |
1.1.1 .gtf to .fa | ~/genome/human_hg38/sequence/GRCh38.p12.genome.fa ~/genome/human_hg38/gtf/*.gtf | bedtools | *.fa | - |
1.1.2 .fa to .bt2 | *.fa | bowtie2-build | *.bt2 | - |
1.2 fastqc | 00.rawdata/*.fastq | fastqc | 00.rawdata/*_fastqc.html | check raw reads' quality |
1.3 Remove 3' adapter and trim reads | 00.rawdata/*.fastq | cutadapt | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq | trim low quality ends (plus a hard cutoff: 50nt) |
1.4 2nd fastqc | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq | fastqc | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3_fastqc.html | make sure the low quality reads have been removed and/or trimmed |
1.5 Remove ribosomal RNA | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq + index file | bowtie2 | 02.rRNA/sampleID/sampleID.rRNA.sam 02.rRNA/sampleID/sampleID.no_rRNA.fastq | - |
2.Mapping | 02.rRNA/sampleID/sampleID.no_rRNA.fastq + index files | bowtie2 | 03.mapping/sampleID/01.miRNA/sampleID.miRNA.sam 03.mapping/sampleID/01.miRNA/sampleID.no_miRNA.fastq -> 03.mapping/sampleID/02.piRNA/sampleID.piRNA.sam 03.mapping/sampleID/02.piRNA/sampleID.no_piRNA.fastq -> ... | map to different RNA annotations step by step |
Quality control of raw reads, and extract the clean RNA sequence.
The Gene transfer format (GTF) is a tab-delimited file format used to hold information about gene annotation.
Example
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
- You can get Gencode27 annotation files at: /BioII/lulab_b/shared/genomes/human_hg38/gtf/
- Annotation file(.gtf/.gff) download source or prossed method, please refer to README file under this directory.
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.
Example
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
Example file location
/BioII/lulab_b/shared/projects/exRNA/hcc_examples/
Present working directory:
your own project directory: ~/projects/exRNA/hcc_examples/
Create symbolic links:
mkdir 00.rawdata
cd 00.rawdata
ln -s /BioII/lulab_b/shared/projects/exRNA/hcc_examples/fastq/* .
Create a file under ~/projects/exRNA/hcc_examples/ to record name of these files, named as "sample_name", like:
AfterSurgery_1
AfterSurgery_2
AfterSurgery_3
BeforeSurgery_1
BeforeSurgery_2
BeforeSurgery_3
NC_1
NC_2
NC_3
1.1.1 Link genomic annotation file to your home directory
Present working directory: your own home.
mkdir -p genome/human_hg38/index/bowtie2_index
mkdir genome/human_hg38/gtf
mkdir genome/human_hg38/sequence
ln -s /BioII/lulab_b/shared/genomes/human_hg38/gtf/* ./genome/human_hg38/gtf/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/sequence/GRCh38.p12.genome.fa ./genome/human_hg38/sequence/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/index/bowtie2_index/bowtie2_hg38_index/ genome/human_hg38/index/bowtie2_index/
1.1.2 Get RNA sequence(.fa) from annotation file(.gtf)
Use rRNA as example.
- Tool: bedtools - getfasta
- Annotation file: rRNA_exon.gtf
Present working directory: ~/genome/human_hg38/index/bowtie2_index/
mkdir 00.bowtie2_rRNA_index
cd 00.bowtie2_rRNA_index/
bedtools getfasta -fi ../../../sequence/GRCh38.p12.genome.fa -bed ../../../gtf/rRNA_exon.gff -fo rRNA.fa
1.1.3 Build bowtie2 index
Use rRNA as example.
Present working directory: ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index
bowtie2-build rRNA.fa rRNA
Output files:
rRNA.1.bt2
rRNA.2.bt2
rRNA.3.bt2
rRNA.4.bt2
rRNA.rev.1.bt2
rRNA.rev.2.bt2
NOTE: Use exon sequences to build index.
Use tool fastqc to test the quality of raw reads.
mkdir ../fastqc_output
fastqc -q -o ../fastqc_output AfterSurgery_1.fastq
fefe
Processing multiple files.
Write the following command to "fastqc.sh"
#!/bin/bash
for i in `cat ../sample_name`; do fastqc ${i}.fastq; done
Run "fastqc.sh"
bash fastqc.sh &
Check the fastqc results
- Confirm the fastqc results files *_fastqc.html and *_fastqc.zip;
- Copy *_fastqc.html to your computer and open in a web browser;
- Check the quality of raw reads;
- Filter the low quality samples, which median of "per base sequence quality" less than 28(median line under green area).
The raw reads always with adapters, which should be removed to get clean reads.
Use cutadapt toolkit to remove adapters and trim some low quality reads. _Present working directory: ~/projects/exRNA/hcc_examples/
mkdir -p 01.cutAdapter/NC_1
cutadapt -u -100 -q 30,30 --trim-n -m 15 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --too-short-output=./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.TooShort.fastq -o ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq ./00.rawdata/NC_1/NC_1.fastq >./01.cutAdapter/NC_1/NC_1.cutAdapt3.log
Above is for one sample, you can use bash script to process multiple samples.
You should check the cutadapt log file (sample.cutAdapt3.log) to get information about adapter removed reads. cutadapt log file.
Get more information about cutadapt toolkit, click here: [cutadapt.] (http://cutadapt.readthedocs.io/en/stable/index.html)
A second FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent.
Run fastqc of the adapter removed reads
fastqc ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq
Differences in results with 1st fastQC:
- The per-base quality scores should be different.
- The Sequencing adapters are no longer identified as over-represented.
Why remove ribosomal RNA ?
- Ribosomal RNA(rRNA) hold more than 80-90% of total RNA.
- Large proportions of rRNA will have an effect on the usable number of effective reads obtained from the samples.
- rRNA over-expressed samples should be filtered.
Toolkit: Bowtie2 Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning short reads.
Mapping reads to rRNA
Present working directory: ~/projects/exRNA/hcc_examples
mkdir -p 02.rRNA/NC_1
bowtie2 --sensitive-local --norc --no-unal -x ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index/rRNA -S 02.rRNA/NC_1/NC_1.rRNA.sam --un ./02.rRNA/NC_1/NC_1.no_rRNA.fastq ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq > 02.rRNA/NC_1/NC_1.rRNA.log
If want get a .bam file output, use samtools:
samtools view -bS in.sam >out.bam
Check the log file (NC_1.rRNA.log) to calculate the rRNA mapping ratio. bowtie2 log file. Above is for one sample, you can use bash script to process multiple samples.
Mapping reads to human different RNA types.
- Input: 02.rRNA/*/*.no_rRNA.fastq
- Tool: bowtie2.
- Working directory:~/projects/exRNA/hcc_examples/
mkdir -p 03.mapping/NC_1
- Ordered mapping to avoid multiple assign reads to different RNA types:
- 01.miRNA
- 02.piRNA
- 03.Y RNA
- 04.snRNA
- 05.srpRNA
- 06.tRNA
- 07.lncRNA
- 08.mRNA
- 09.other human genome
- 10.non-human genome
- Use the unmapped .fastq file from last step as input to the ordered mapping.
- Methods: you can follow "1.4 mapping reads to rRNA" above.
01. Use rRNA removed reads as input to mapping miRNA
mkdir 03.mapping/NC_1/01.miRNA
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/01.bowtie2_miRNA_index/miRNA -S 03.mapping/NC_1/01.miRNA/NC_1.miRNA.sam --un ./03.mapping/NC_1/01.miRNA/NC_1.no_miRNA.fastq ./02.rRNA/NC_1/NC_1.no_rRNA.fastq > 03.mapping/NC_1/01.miRNA/NC_1.miRNA.log
02. Use miRNA removed reads as input to mapping piRNA
mkdir 03.mapping/NC_1/02.piRNA
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/02.bowtie2_piRNA_index/piRNA -S 03.mapping/NC_1/02.piRNA/NC_1.piRNA.sam --un ./03.mapping/NC_1/02.piRNA/NC_1.no_miRNA.fastq ./02.rRNA/NC_1/02.piRNA/NC_1.no_rRNA.fastq > 03.mapping/NC_1/02.piRNA/NC_1.miRNA.log
03.Y_RNA, 04.snRNA, 05.srpRNA, 06.tRNA, 07.lncRNA, 08.mRNA, 09.hg38
- Follow the methods above.
- Reads can't mapped to these 8 types RNA would be assigned as "other human genome".
- Reads can't mapped to human genome would be assigned as "non-human".
When processing the real data from some experiments, there are many things can affect the RNA-seq results. So it is important to take all possibilities into consideration, and filter the low quality sample in downstream analyses.
After preprocessing and mapping, you can do some basic statistics of the prossesed datasets.
3.1.1 Summarize the pre-processing and mapping results before downstream analyses.
The summary should contain the following part at least :
Sample_info
- sample_id
- sequenced date or batch number
- Raw reads number (raw fastq file or cutadapt log file)
Preprocessing
- Clean reads number and percentage of raw reads (cutadapt log file)
- rRNA number and percentage of clean reads (rRNA mapping log file of bowtie2)
- kept reads (rRNA mapping log file of bowtie2)
Mapping ratio
- human genome (human genome mapping log file of bowtie2)
- miRNA
- piRNA
- Y RNA
- snRNA
- srpRNA
- tRNA
- lncRNA
- mRNA
- Other human genome region (difference between whole genome and mapped RNA)
- Non-human genome (human genome mapping log file of bowtie2)
Reference.! You can combine some bash command or use your own code to summarize the data processing information. Here we shared a simple bash script for your reference. Github link.
3.1.2 Visualize the reads length distribution by scatter plot/bar chart.
3.1.3 Check whether the mapping ratio are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.
You can combine some bash command or use your own code to summarise reads length. Here we shared a simple bash script for your reference.
3.2.1 Calculate the reads length :
-
raw reads (raw fastq file).
-
clean reads (rRNA removed and low-quality trimmed reads).
-
mapped reads of miRNA, piRNA, Y RNA, snRNA, srpRNA, tRNA, lncRNA, mRNA (mapped result sam file).
NOTE: The sam/bam files record mapping information of all reads, include mapped and unmapped. Take care of the "flag" in 2nd column of sam/bam file. file format.
You can combine some bash command or use your own code to summarize reads length. Here we shared a simple bash script for your reference. Github link.
3.2.2 Visualize the reads length distribution by scatter plot/bar chart.
3.2.3 Check whether the reads length distribution are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.
You can define a set of criteria to QC and filter your data. Here we shared our QC criteria for your reference.
Check point | Threshold | Input |
---|---|---|
Raw reads quality | reads quality >28 (median lines in green area) | Check fastqc results(*.html) |
Clean reads number | > 10 million | cutadapt log file |
rRNA ratio | < 10% | bowtie2 rRNA mapping log file |
human genome mapped ratio | > 65% | bowtie2 human genome mapping log file |
other genome region ratio | < 10% | difference between whole genome and mapped RNA |
-
Snakemake is a good workflow management system to create reproducible and scalable data analyses. Snakemake.
-
There is a Snakemake example to identify circRNA,location: /BioII/lulab_b/shared/shared_scripts/Snakemake/examples.Identify_circRNA/
-
We shared our snakemake package used to exRNA-seq basic analyses.Github link.
*
sharing programming tips
-
terminal: grey background, Monaco 18 (aequilate font, suitable for programming);
-
take care of file path and name when using shared scripts.
-
5) Homework
Level 1
-
Learn more about Illunima RNA-seq types, library preparation and analysis methods.
-
hcc_example samples:
-
Follow this tutorial step by step, and check your results by comparing with Siqi's results(location:wangsiqi/project/hcc_example)
-
hcc_lulab samples, summarize 61 samples(shared in github) and show your team's results:
-
Data summary table (ref: 3.1.1).
-
Mapping ratio plot (ref: 3.1.2).
-
Reads length distribution plot (ref: 3.2.2).
-
Filter low quality samples (ref: 3.3).
-
(Advanced) More plots to show you results. Upload your scripts if using R package.
-
(Advanced) Remove 5` adapter from sample_R2.fastq file, compare with R1 clean reads. Think the potentaial reason if they are inconsistent.
-
(Advanced) comparing gft generated fasta file and miRbase 22 fasta file.download
grep hsa ~/Downloads/hairpin.fa |wc -l
1917
grep -c ">" RNAs_fasta/miRNA.fa
1869
Level 2:
- Write and upload a bash script to complete above analysis.
Level 3:
- Learn snakemake usage and create your own snakefile.