RNAIndel calls coding indels and classifies them into somatic, germline, and artifact from tumor RNA-Seq data. Users can also classify indels called by their own callers by supplying a VCF file. RNAIndel supports GRCh38 as well as GRCh37.
Please make sure that the dependencies are satisfied before installing RNAIndel.
Install RNAIndel.
pip install rnaindel
Download the data package: data_dir_37.tar.gz for GRCh37 or data_dir_38.tar.gz for GRCh38. Place the gzipped archive in a directory of your choice and unpack it.
tar xzvf data_dir_37.tar.gz # for GRCh37
tar xzvf data_dir_38.tar.gz # for GRCh38
Usage (demo)
Indels are called by the built-in caller Bambino, which is optimized for RNA-Seq indel calling, and classified into somatic, germline, and artifact.
rnaindel -b BAM -o OUTPUT_VCF -f FASTA -d DATA_DIR [other options]
Users can also classify indel entries in a VCF file generated by their callers (indel calling by the built-in caller will not be performed).
Specify the input VCF file with -c.
rnaindel -b BAM -c INPUT_VCF -o OUTPUT_VCF -f FASTA -d DATA_DIR [other options]
- -b: input STAR-mapped BAM file (required)
- -c: VCF file from another caller (required when using other callers, e.g., GATK)
- -o: output VCF file (required)
- -f: reference genome (GRCh37 or 38) FASTA file (required)
- -d: data directory containing trained models and databases (required)
- -q: STAR mapping quality MAPQ for unique mappers (default=255)
- -p: number of cores (default=1)
- -m: maximum heap space (default=6000m)
- -n: user-defined panel of non-somatic indels in VCF format
- -l: directory to store log files
- -h: print usage message
- --version: print version
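As an illustration of how the required and optional flags combine, here is a minimal sketch of a wrapper that checks the required inputs before launching RNAIndel. The function name `run_rnaindel`, the argument order, and the values passed to -p and -m are assumptions made for this example; only the flags themselves come from the list above.

```shell
# Sketch: check the required inputs, then run RNAIndel with optional settings.
# The function name, argument order, and the -p / -m values are illustrative.
run_rnaindel() {
    bam=$1 fasta=$2 data_dir=$3 out_vcf=$4

    # -b, -f and -d are required, so fail early if any of them is missing.
    [ -f "$bam" ]      || { echo "missing BAM: $bam" >&2; return 1; }
    [ -f "$fasta" ]    || { echo "missing FASTA: $fasta" >&2; return 1; }
    [ -d "$data_dir" ] || { echo "missing data directory: $data_dir" >&2; return 1; }

    # -p: number of cores, -m: maximum heap space
    rnaindel -b "$bam" -o "$out_vcf" -f "$fasta" -d "$data_dir" -p 8 -m 8000m
}
```

It could be called as `run_rnaindel sample.bam GRCh38.fa path/to/data_dir sample.rnaindel.vcf`, where all four paths are placeholders.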
cwl-runner rnaindel.cwl INPUT_YML
A sample input YAML file is here.
Please prepare your BAM file as follows:
- Step 1: Map your reads with the STAR 2-pass mode to GRCh37 or 38.
- Step 2: Add read groups, sort, mark duplicates, and index the BAM file with Picard.
Please input the BAM file from Step 2 without caller-specific preprocessing such as indel realignment.
Additional processing steps may prevent desired behavior.
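The two preparation steps above might look roughly like the following sketch. It assumes paired-end gzipped FASTQ files, a prebuilt STAR genome index, and a `picard` wrapper command on the PATH (some installations use `java -jar picard.jar` instead). The function names, thread count, read-group values, and intermediate file names are placeholders, not settings recommended by RNAIndel.

```shell
# Step 1: map reads with STAR in 2-pass mode.
# $1: STAR genome index directory, $2/$3: paired FASTQ files, $4: output prefix
star_two_pass() {
    STAR --runThreadN 8 \
         --genomeDir "$1" \
         --readFilesIn "$2" "$3" \
         --readFilesCommand zcat \
         --twopassMode Basic \
         --outSAMtype BAM Unsorted \
         --outFileNamePrefix "$4"
}

# Step 2: add read groups, sort, mark duplicates, and index with Picard.
# $1: BAM from STAR, $2: sample name (reused for all read-group fields), $3: final BAM
picard_prepare() {
    picard AddOrReplaceReadGroups I="$1" O="$2.rg.bam" \
        RGID="$2" RGLB="$2" RGPL=ILLUMINA RGPU="$2" RGSM="$2" \
        SORT_ORDER=coordinate &&
    picard MarkDuplicates I="$2.rg.bam" O="$3" M="$2.dup_metrics.txt" \
        CREATE_INDEX=true
}
```

With the prefix `s1_`, STAR writes `s1_Aligned.out.bam`, so the second step could be run as `picard_prepare s1_Aligned.out.bam s1 s1.bam`.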
Somatic prediction can be refined by applying a user-defined panel of non-somatic indels (PONS). Putative somatic indels found in the panel are reclassified as germline or artifact, whichever has the higher probability. Indels predicted as germline or artifact are not subject to reclassification by PONS. Such panels can be compiled by either of the two approaches below.
In the first approach, a panel is compiled directly from RNA-Seq data, which may be a single (ideally matched) dataset or a pooled dataset.
- Perform variant calling on the RNA-Seq data and generate a VCF file.
- Index the VCF with Tabix.
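Tabix requires a coordinate-sorted, bgzip-compressed file, so the indexing step is usually preceded by compression. A minimal sketch, assuming a sorted plain-text VCF and the `bgzip` and `tabix` tools from HTSlib on the PATH (the function name `index_vcf` is made up for this example):

```shell
# Compress a coordinate-sorted VCF with bgzip, then build a Tabix index.
# $1: plain-text VCF; produces $1.gz and $1.gz.tbi
index_vcf() {
    bgzip -f "$1" && tabix -p vcf "$1.gz"
}
```

After `index_vcf panel.vcf`, the resulting `panel.vcf.gz` can be passed to RNAIndel with -n.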
In the second approach, non-somatic indels recurrently misclassified as somatic are collected from a large cohort.
- Apply RNAIndel on the RNA-Seq data.
- Validate indels predicted as somatic (putative somatic indels) with the DNA-Seq data.
- Collect putative somatic indels which are validated as germline or artifact in N samples or more (recurrent non-somatic indels).
- Format the recurrent non-somatic indels in a VCF file and index with Tabix.
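The collection and formatting steps above can be sketched with standard Unix tools. The input format is an assumption of this example: one tab-separated file per sample, listing the putative somatic indels that validation showed to be germline or artifact, with columns CHROM, POS, REF, ALT. The function name `recurrent_non_somatic` is likewise hypothetical.

```shell
# Usage: recurrent_non_somatic N sample1.tsv sample2.tsv ... > panel.vcf
# Writes a minimal VCF of indels seen in N or more samples.
recurrent_non_somatic() {
    min_samples=$1
    shift
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    for f in "$@"; do
        # Count each indel at most once per sample.
        cut -f 1-4 "$f" | LC_ALL=C sort -u
    done |
        LC_ALL=C sort | uniq -c |
        # After uniq -c, $1 is the number of samples carrying the indel.
        awk -v n="$min_samples" '$1 >= n {
            printf "%s\t%s\t.\t%s\t%s\t.\t.\t.\n", $2, $3, $4, $5
        }' |
        # Sort by chromosome, then numerically by position, as Tabix expects.
        LC_ALL=C sort -k1,1 -k2,2n
}
```

The output still needs to be bgzip-compressed and indexed with Tabix before it is used as a panel.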
A sample panel compiled by the second approach is included in the data package. It was compiled from a cohort of 330 samples with RNA-Seq and tumor/normal-paired WES and PCR-free WGS. When no custom panel is available, apply this panel by appending the following option:
-n path/to/data_dir/non_somatic/non_somatic.vcf.gz
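To make the reclassification rule described at the start of this section concrete, the sketch below applies it to a hypothetical tab-separated table read from standard input. This is not RNAIndel's implementation: the column layout (indel ID, predicted class, germline probability, artifact probability, panel membership as yes/no), the function name, and the tie-breaking toward germline are all assumptions of the example.

```shell
# Input columns (hypothetical, tab-separated):
#   indel_id  predicted_class  p_germline  p_artifact  in_panel(yes/no)
# Output columns: indel_id  final_class
reclassify_with_panel() {
    awk -F '\t' -v OFS='\t' '
        # Only putative somatic indels found in the panel are reclassified.
        $2 == "somatic" && $5 == "yes" {
            # Ties go to germline; this choice is an assumption of the sketch.
            $2 = ($3 >= $4) ? "germline" : "artifact"
        }
        { print $1, $2 }
    '
}
```

Indels predicted as germline or artifact pass through unchanged, as do somatic calls that are absent from the panel.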
- Hagiwara, K., Ding, L., Edmonson, M.N., Rice, S.V., Newman, S., Meshinchi, S., Ries, R.E., Rusch, M., Zhang, J. RNAIndel: a machine-learning framework for discovery of somatic coding indels using tumor RNA-Seq data. (preprint)