Home
- [GRC-docs](GRC programs documentation)
This project contains a set of programs that inspect BAM files and derive genotypes relevant to the Genetic Report Cards (GRC). The tools were initially developed for the analysis of aligned P. falciparum Illumina paired-end short-read files, but will be extended to other organisms in the future. Most of the organism-specific and locus-specific parameters are encoded in configuration files for flexibility.
There are four sets of tools at present:
- The GRC tools produce genotypes for the drug-resistance loci that are included in the GRC. These genotypes are (currently) expressed in terms of amino acid alleles.
- The Sample Classification tools search for reads that can be used to classify a sample, for example to detect co-infection with species other than P. falciparum by identifying reads that carry alleles found only in those species.
- The Heteroallelic Genotyping tools search across a coding region for any nonsynonymous mutation, classifying a sample as "wild type" unless it contains at least one such mutation. These genotypes are expressed in terms of amino acid alleles, and the tools handle heterozygous samples.
- The Barcoding tools produce genetic barcodes based on the concatenation of nucleotide alleles genotyped at multiple sites, specified as a list. These tools emulate the Sequenom genetic barcoding assays currently used by SpotMalaria.
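The barcoding idea in the last item can be illustrated with a minimal sketch. It only shows the concatenation step: one nucleotide call per site, joined in list order. The class name `BarcodeSketch`, the method name, and the `-` placeholder for a site without a call are assumptions made for this illustration; they are not taken from the Barcoding tools.

```java
import java.util.Arrays;
import java.util.List;

class BarcodeSketch {
    // Hypothetical placeholder for a site with no call; not the tools' own convention.
    static final char MISSING = '-';

    /** Concatenates one nucleotide call per site, in site-list order, into a barcode string. */
    static String barcode(List<Character> callsInSiteOrder) {
        StringBuilder sb = new StringBuilder(callsInSiteOrder.size());
        for (Character call : callsInSiteOrder) {
            // A null entry stands for a site where no allele could be called.
            sb.append(call == null ? MISSING : call.charValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Character> calls = Arrays.asList('A', 'T', null, 'G');
        System.out.println(barcode(calls));
    }
}
```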
These tools take as primary input BAM files from alignments against the V3 P. falciparum 3D7 reference genome. At a future stage, the choice of reference genome may be configurable. The tools can be invoked either on an individual sample, or on a set of samples. Genotyping tools are provided to produce one file per sample containing the genotyping data. There is also an aggregation tool that merges all results, producing a genotype file for a whole sample set.
Except for the Barcoding tools, the tools share a common model for how they determine results. Each process is controlled by a Task Configuration File, which specifies a set of loci where genotyping is to be performed. A locus is a region of the reference genome to which reads covering the positions to be genotyped are likely to be mapped. Typically, for genotyping a single codon, we may specify a locus that contains all positions within 200 bp on either side of the codon.
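As a rough illustration of how such a locus could be derived, the sketch below computes a window that extends a fixed flank on either side of a codon. The class and method names, the 1-based inclusive coordinates, and the example positions are assumptions for this example only; the actual locus definitions live in the Task Configuration File.

```java
class LocusWindowSketch {
    /**
     * Returns {start, end} (1-based, inclusive) of a window extending
     * `flank` bases on either side of a three-base codon.
     * The end is not clamped to the chromosome length, which this sketch does not know.
     */
    static int[] locusAroundCodon(int codonStart, int flank) {
        int codonEnd = codonStart + 2;                  // a codon spans three positions
        int start = Math.max(1, codonStart - flank);    // do not run off the start of the sequence
        int end = codonEnd + flank;
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int[] locus = locusAroundCodon(1000, 200);      // hypothetical codon position
        System.out.println(locus[0] + "-" + locus[1]);
    }
}
```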
For each locus we specify one or more anchors, i.e. nucleotide sequence patterns that a read must match in order to be used for genotyping. An anchor is specified as a regular expression (regex), together with the position of the first nucleotide in the expression. The tools attempt to match the anchor regex both in reads that are mapped to the locus and in unmapped reads. If an anchor is matched, the read is aligned at the anchor's starting position for further processing. No gaps are introduced in this alignment step.
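The anchoring step can be sketched as follows. This is a simplified illustration, not the tools' implementation: it assumes the read is already on the forward strand, takes the first regex match, and uses made-up names (`AnchorSketch`, `alignByAnchor`) and a made-up anchor. Because the alignment is ungapped, knowing where the anchor starts in the read fixes the reference position of every other base in the read.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorSketch {
    /**
     * Returns the reference position of the first base of the read implied by
     * the anchor match, or null if the anchor is not found in the read.
     *
     * @param anchorRefPos reference position of the first nucleotide of the anchor
     */
    static Integer alignByAnchor(String read, Pattern anchor, int anchorRefPos) {
        Matcher m = anchor.matcher(read);
        if (!m.find()) {
            return null;                     // read cannot be used for this locus
        }
        // Ungapped alignment: read index i corresponds to reference position (result + i).
        return anchorRefPos - m.start();
    }

    public static void main(String[] args) {
        Pattern anchor = Pattern.compile("TTAC[CT]GT");   // hypothetical anchor regex
        System.out.println(alignByAnchor("GGATTACCGTAAA", anchor, 1000));
    }
}
```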
If genotypes are to be produced at some position in the locus (this is the case for the GRC tools, but not for the sample classification tools, which simply match alleles), then one or more targets must be specified for the locus. A target is specified as a named interval of nucleotide positions, currently assumed to be codons (hence the target length must be a multiple of 3). The tools will establish the codon(s) at the target location for all reads aligned against the anchor(s) in order to determine the genotype for the sample.
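A minimal sketch of the target step for a single-codon target is shown below. It extracts the codon from each anchored read, translates it with the standard genetic code, and tallies amino acids across reads. How the tools turn such counts into a genotype call (thresholds, handling of mixed samples) is not described here, so the sketch stops at the tally. All names, and the use of `X` for codons containing a non-ACGT base, are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class TargetCodonSketch {
    // Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with bases ordered T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Translates a three-base codon; returns 'X' if it contains a non-ACGT base. */
    static char translate(String codon) {
        int i = BASES.indexOf(codon.charAt(0));
        int j = BASES.indexOf(codon.charAt(1));
        int k = BASES.indexOf(codon.charAt(2));
        if (i < 0 || j < 0 || k < 0) {
            return 'X';
        }
        return AMINO_ACIDS.charAt(16 * i + 4 * j + k);
    }

    /** Returns the codon starting at targetStartRefPos, or null if the read does not cover it. */
    static String codonAt(String read, int readStartRefPos, int targetStartRefPos) {
        int offset = targetStartRefPos - readStartRefPos;
        if (offset < 0 || offset + 3 > read.length()) {
            return null;
        }
        return read.substring(offset, offset + 3);
    }

    /** Counts the amino acids observed at the target across all reads that cover it. */
    static Map<Character, Integer> tally(String[] reads, int[] readStarts, int targetStart) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int r = 0; r < reads.length; r++) {
            String codon = codonAt(reads[r], readStarts[r], targetStart);
            if (codon != null) {
                counts.merge(translate(codon), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] reads = { "CCAAAGG", "CCACAGG", "CCAAAGG" };   // made-up reads
        int[] starts = { 100, 100, 100 };                       // from the anchoring step
        System.out.println(tally(reads, starts, 102));
    }
}
```

The `readStarts` values are the per-read reference positions produced by the anchor alignment; a read that does not span the whole codon is simply skipped.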
In the sample classification tools, targets are used somewhat differently: for each target (which need not contain an exact number of codons), a set of alleles is associated with a class (e.g. the name of a species) which is assigned to the reads. In other words, here the targets are used for exact sequence matching against a known set of alleles.
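The exact-matching use of targets can be sketched as a lookup from allele sequence to class label, followed by a per-class read count. The allele sequences and labels below are invented for the example, and the class and method names are hypothetical; the real alleles and classes come from the task configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReadClassifierSketch {
    /**
     * Counts reads per class label. Each input string is the sequence a read
     * shows at the target; sequences matching no listed allele are ignored.
     */
    static Map<String, Integer> countClasses(String[] targetSequences,
                                             Map<String, String> alleleToClass) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String seq : targetSequences) {
            String label = alleleToClass.get(seq);     // exact match only, no mismatches allowed
            if (label != null) {
                counts.merge(label, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> alleleToClass = new LinkedHashMap<>();
        alleleToClass.put("ACGT", "speciesA");         // made-up allele and label
        alleleToClass.put("ACTT", "speciesB");         // made-up allele and label
        String[] observed = { "ACGT", "ACTT", "ACGT", "GGGG" };
        System.out.println(countClasses(observed, alleleToClass));
    }
}
```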
Clone the GeneticReportCard repository. This contains two projects:
- AnalysisCommon contains libraries that are shared by multiple projects (and which will probably be packaged separately eventually).
- SequencingReadsAnalysis contains the sources for the reads analysis tools (GRC and sample classification).
Ensure that Java JRE 1.8 or higher is installed; test using
java -version
Set the working directory to the root folder of the SequencingReadsAnalysis project and type
ant build
The Ant task should build both projects.
Running the tools requires Java JRE 1.8 or higher (see above). The tools use several third-party libraries, which are stored in the /lib folders of each project. These must be added to the Java class path, as must the /bin folders of both projects. The following set of commands sets the classpath:
GRCC=/path/to/clone/of/GeneticReportCard/AnalysisCommon
GRCA=/path/to/clone/of/GeneticReportCard/SequencingReadsAnalysis
export CLASSPATH=$GRCA/bin:$GRCC/bin:\
$GRCC/lib/commons-logging-1.1.1.jar:\
$GRCA/lib/apache-ant-1.8.2-bzip2.jar:\
$GRCA/lib/commons-compress-1.4.1.jar:\
$GRCA/lib/commons-jexl-2.1.1.jar:\
$GRCA/lib/htsjdk-2.1.0.jar:\
$GRCA/lib/ngs-java-1.2.2.jar:\
$GRCA/lib/snappy-java-1.0.3-rc3.jar:\
$GRCA/lib/xz-1.5.jar
You need to ensure that the Java tools can allocate sufficient memory. Unfortunately I have not been able to do rigorous testing, and memory usage seems to be largely determined by the Broad BAM processing libraries. An allocation of 2GB seems to work; set this with
java -Xms512m -Xmx2000m <class> <params...>
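To confirm that the heap setting has taken effect, a small check such as the following can be run with the same `-Xmx` option. It is a generic JVM sketch, not part of the GRC tools, and the class name is made up. The value reported by `Runtime.maxMemory()` may be somewhat lower than the `-Xmx` figure, depending on the JVM and garbage collector.

```java
class HeapCheckSketch {
    /** Maximum heap the running JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

For example, after compiling, run it as `java -Xms512m -Xmx2000m HeapCheckSketch` from the directory containing the class file.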
The GRC tools and Sample Classification tools are run separately. Please refer to the documentation at their respective pages: