Genome build
We include a suite of tools for checking the quality of a genome build: a genome size survey, genetic map concordance, and Hi-C heatmap concordance.
Tip
Download the test dataset here.
The raw sequencing data provides a way to estimate the size, ploidy, heterozygosity and repeat content of a genome, similar to GenomeScope. Suppose you have a kmer count histogram (commonly generated by Jellyfish or another kmer counter) in a file reads.histo:
1 1281576854
2 89292133
3 21588481
4 9347716
5 5569400
6 4705214
The 1st column is the frequency of a kmer in the sequencing data, and the 2nd column is the number of distinct kmers observed at that frequency. From this histogram, the genome statistics can be inferred and annotated directly on the plot:
python -m jcvi.assembly.kmer histogram reads.histo "*S. species* ‘Variety 1’" 21
This takes the kmer counts, the species name that goes in the title, and finally the kmer size K that was used to generate the histogram. Behind the scenes, a negative binomial mixture model is applied to approximate the various genome statistics, including the ploidy of the genome.
You can then read the various genome statistics directly from the plot. In this example, the plot also shows that the genome is a tetraploid.
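As a rough illustration of the two-column histogram format described above, the sketch below parses the histogram and computes a naive size estimate (total kmers divided by the depth of the tallest peak). This is not the negative binomial mixture model that jcvi applies; the function names and the `min_freq` cutoff are hypothetical, and for a polyploid genome the tallest peak may not correspond to the haploid depth.

```python
def read_histogram(lines):
    """Parse 'frequency count' pairs into a list of (int, int) tuples."""
    hist = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        hist.append((int(fields[0]), int(fields[1])))
    return hist


def naive_genome_size(hist, min_freq=5):
    """Return (estimated size, peak frequency) from a kmer histogram.

    min_freq is an assumed cutoff that drops low-frequency kmers,
    which are dominated by sequencing errors.
    """
    kept = [(freq, count) for freq, count in hist if freq >= min_freq]
    if not kept:
        raise ValueError("no histogram bins at or above min_freq")
    # The bin with the most distinct kmers approximates the sequencing depth
    peak_freq, _ = max(kept, key=lambda fc: fc[1])
    # Total kmer occurrences, excluding the likely-error bins
    total_kmers = sum(freq * count for freq, count in kept)
    return total_kmers // peak_freq, peak_freq
```

To try it on a real file, pass an open file handle for reads.histo to `read_histogram`, making sure the histogram covers enough frequencies to include the main peak.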
After genome assembly, we often want to perform quality control. One such check is to compare the assembly against a genetic map of the organism. Assume that you have the genetic map input matrix (MSTMap format) in the file geneticmap.matrix.
The first column gives the marker position in the current genome assembly, in the format chr1.12345, and the following columns give the genotypes of each mapping individual.
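The marker naming convention above (sequence name, a dot, then the position) can be split with a small helper. This is a minimal sketch and `parse_marker` is a hypothetical name, not part of jcvi; it splits on the last dot so that sequence names that themselves contain dots still parse.

```python
def parse_marker(name):
    """Split a marker name such as 'chr1.12345' into ('chr1', 12345)."""
    # rsplit on the last dot so that seqids containing dots are preserved
    seqid, pos = name.rsplit(".", 1)
    return seqid, int(pos)
```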
The genetic map can then be visualized as a quality control heatmap with one command:
python -m jcvi.assembly.geneticmap heatmap geneticmap.matrix
Entries in the heatmap correspond to the linkage disequilibrium between pairs of markers. In this example, there is signal between chr4 and chr6, suggesting a potential mis-assembly (or possibly a rearrangement between the mapping parents).
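To make the idea of linkage disequilibrium between two markers concrete, here is a minimal sketch that computes r² as the squared Pearson correlation between the genotype calls of two markers. It assumes genotypes are coded as A and B, with any other symbol treated as missing. Both the coding and the function name are assumptions for illustration, and the source does not confirm that jcvi computes the heatmap values this way.

```python
def r_squared(genotypes_a, genotypes_b, codes=("A", "B")):
    """Squared Pearson correlation between two markers' genotype calls."""
    # Keep only individuals with a recognized call at both markers
    pairs = [
        (codes.index(a), codes.index(b))
        for a, b in zip(genotypes_a, genotypes_b)
        if a in codes and b in codes
    ]
    n = len(pairs)
    if n == 0:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic marker carries no linkage information
    return cov * cov / (var_x * var_y)
```

Markers that are tightly linked give values near 1 and unlinked markers give values near 0, so strong signal between markers assigned to different chromosomes is what stands out as suspicious.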
Similarly, genome quality can be assessed using a Hi-C heatmap, which is now more common than using a genetic map.
Assume that you have the Hi-C reads mapped to the genome assembly, in hic.bam.
python -m jcvi.assembly.hic bam2mat hic.bam
This will generate two files, hic.resolution_500000.npy and hic.resolution_500000.json, which can then be visualized:
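If you want to inspect the two output files before plotting, they can be opened with NumPy and the standard json module. This is a sketch that assumes the .npy file holds a plain NumPy array and the .json file holds ordinary JSON; the layout of the metadata is not described here, so the helper (a hypothetical name) returns it unmodified.

```python
import json

import numpy as np


def load_contact_matrix(npy_path, json_path):
    """Load the contact matrix and its accompanying metadata."""
    matrix = np.load(npy_path)
    with open(json_path) as fp:
        meta = json.load(fp)
    return matrix, meta
```

Calling `load_contact_matrix("hic.resolution_500000.npy", "hic.resolution_500000.json")` and checking `matrix.shape` is a quick sanity check that the conversion finished.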
python -m jcvi.assembly.hic heatmap \
hic.resolution_500000.npy \
hic.resolution_500000.json \
--title="*S. species* Hi-C contact map" \
--groups=groups
Aside from the configurable title, the groups file controls whether certain chromosomes are highlighted together with a specific color. For example:
Chr01_A,Chr01_B g
Chr02_A,Chr02_B g
Chr03_A,Chr03_B g
This allows Chr01_A and Chr01_B to be plotted together with a green (g) highlight.
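The groups file format shown above is simple enough to generate or validate with a short script. The sketch below assumes each line holds a comma-separated list of chromosome names, whitespace, then a color code; `parse_groups` is a hypothetical helper, not a jcvi function.

```python
def parse_groups(lines):
    """Parse groups lines into a list of (chromosome names, color) tuples."""
    groups = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        seqids, color = fields
        groups.append((seqids.split(","), color))
    return groups
```

Checking the parsed chromosome names against the sequence names in the assembly catches typos before the heatmap is drawn.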
© Haibao Tang, 2010-2024