diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100755 index 0000000..d2f8b9a --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,34 @@ +# v0.3.0: 03/21/24 +## Changes +- New tool `call-reads` to call from HiFi reads (limited to class I) +- Extend reports for read-based calls +- Update IPD-IMGT/HLA to version 3.55 (2024-01) +- Add option to limit calls to genomic (full-length) records +- Update README + +# v0.2.3: 12/22/23 +## Changes +- Fix reporting bug for cdna calls +- Add option to require exon 2 to make a call +- Improved thread control and error reporting + +# v0.2.2: 11/17/23 +## Changes +- Update database to IPD-IMGT/HLA Version: 3.54 (2023-10) + +# v0.2.1: 11/15/23 +## Changes +- Call-contigs hap2 change to optional +- Improve output determinism +- Update README +- Fix validation bugs + +# v0.2.0: 9/25/23 +## Changes +- Update README for initial release +- Latest database version v3.53 +- Bug fixes + +# v0.1.0: 5/31/23 +## Changes +- Report results in tsv and json format diff --git a/README.md b/README.md index bdf253d..6845a7a 100755 --- a/README.md +++ b/README.md @@ -2,482 +2,22 @@

HiFiHLA

-An HLA star-calling tool for PacBio HiFi data types. +*** +An HLA star-calling tool for PacBio HiFi data types. Authors: [John Harting](https://github.com/jrharting), [Zev Kronenberg](https://github.com/zeeev), [Daniel Baker](https://github.com/dnbaker), [Matt Holt](https://github.com/holtjma) -## Table of Contents ## -1. [Install](#install) -2. [Genes](#genes) -3. [Usage](#usage) -4. [Output](#output) -5. [Examples](#examples) -6. [Changelog](#changelog) -7. [References](#references) - -## Install +## Availability +* [Latest release with binary](https://github.com/PacificBiosciences/HiFiHLA/releases/latest) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/hifihla/README.html) -Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda) for information on Installation, Support, License, Copyright, and Disclaimer. - -`hifihla` can be installed from bioconda -``` -conda install -c bioconda hifihla -``` -Binaries are also availible in the github releases. - -## Genes - -Genes called by `hifihla`: -``` -HLA-A HLA-DQA1 HLA-DRB7 HLA-P -HLA-B HLA-DQA2 HLA-DRB8 HLA-S -HLA-C HLA-DQB1 HLA-E HLA-T -HLA-DMA HLA-DQB2 HLA-F HLA-U -HLA-DMB HLA-DRA HLA-G HLA-V -HLA-DOA HLA-DRB1 HLA-H HLA-W -HLA-DOB HLA-DRB2 HLA-HFE HLA-Y -HLA-DPA1 HLA-DRB3 HLA-J MICA -HLA-DPA2 HLA-DRB4 HLA-K MICB -HLA-DPB1 HLA-DRB5 HLA-L TAP1 -HLA-DPB2 HLA-DRB6 HLA-N -``` - -## Usage - -`hifihla` has three subcommands: -* `call-contigs` -* `call-consensus` -* `align-imgt` - -``` -Usage: hifihla - -Commands: - call-contigs Extract HLA loci from assembled MHC contigs & call star alleles on extracted sequences - call-consensus Call HLA Star (*) alleles from consensus sequences - align-imgt Align queries to IMGT/HLA genomic accession sequences - help Print this message or the help of the given subcommand(s) - -Options: - -h, --help Print help - -V, --version Print version -``` - -### Type consensus sequences -`call-consensus` accepts a FASTA file of sequences and attempts to find the closest matching allele in the [IMGT/HLA database](https://www.ebi.ac.uk/ipd/imgt/hla/). It is assumed that each FASTA record is a single locus. -Optionally call star alleles using exon sequence only. -``` -Call HLA Star (*) alleles from consensus sequences - -Usage: hifihla call-consensus [OPTIONS] --fasta --outdir - -Options: - -f, --fasta Input fasta file of consensus sequences - -o, --outdir Output directory - -c, --cdna Enable cDNA-only calling - -e, --exon2 Require Exon2 in query sequence - -j, --threads Analysis threads [default: 1] - -x, --max_matches Maximum equivalent matches per query in report [default: 10] - -v, --verbose... Enable verbose output - --log-level Alternative to repeated -v/--verbose: set log level via key. - Equivalence to -v/--verbose: - => "Warn" - -v => "Info" - -vv => "Debug" - -vvv => "Trace" [default: Warn] - -h, --help Print help - -V, --version Print version -``` - - -### Type HLA from MHC assembled contigs -`call-contigs` requires 2 or 3 input files and an output directory: - * Assembled contigs must be aligned to [GRCH38 no alts](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz) - * Assembled haplotype fasta[.gz] files (1 or 2) - -The following command using [minimap2](https://github.com/lh3/minimap2/) is recommended for assembly alignment: -``` -minimap2 -t 12 \ - -L --secondary=no --eqx -ax asm5\ - -R '@RG\\tID:{sample}_hifiasm\\tSM:{sample}' \ - GRCh38_no_alts.fasta \ - assembly.hap1.fasta [assembly.hap2.fasta] | \ - samtools sort -@ 3 \ - -T {temp_dir} \ - -m 100_000 \ - -O BAM > {output} -samtools index {output} -``` - -Options: - * `hap2` is optional -- the fasta argument to `hap1` may contain one or two MHC haplotig sets. If `hap2` is not set, alleles will be phased by contig name. - * Define the loci to be extracted and called (see [IMGT genes](https://hla.alleles.org/genes/index.html)) [default all loci] - * Limit equivalent matches in report -- This is useful to limit reporting of common alleles with shared cDNA. - * Require assembly alleles to be of a minimum length -``` -Extract HLA loci from assembled MHC contigs & call star alleles on extracted sequences - -Usage: hifihla call-contigs [OPTIONS] --abam --hap1 --outdir - -Options: - -a, --abam Input assembly aligned to GRCh38 - -p, --hap1 Input hap1 assembly fa(.gz) - -m, --hap2 Input hap2 assembly fa(.gz) (optional) - -o, --outdir Output directory - -l, --loci [...] Input comma-sep loci to extract [default: all] - -s, --min_length Minimum length of extracted targets [default: 1000] - -e, --exon2 Require Exon2 in query sequence - -x, --max_matches Maximum equivalent matches per query in report [default: 10] - -j, --threads Analysis threads [default: 1] - -v, --verbose... Enable verbose output - --log-level Alternative to repeated -v/--verbose: set log level via key. - Equivalence to -v/--verbose: - => "Warn" - -v => "Info" - -vv => "Debug" - -vvv => "Trace" [default: Warn] - -h, --help Print help - -V, --version Print version -``` - -### Align queries to specific alleles in IPD-IMGT/HLA -`align-imgt` is meant to check differences between alleles in the database and any query sequence (including other database accessions). The output is json format. -``` -Align queries to IMGT/HLA genomic accession sequences - -Usage: hifihla align-imgt [OPTIONS] - -Options: - -f, --fasta Fasta with query sequence(s) - -q, --qids [...] Comma-sep query IDs - -t, --targets [...] Comma-sep target IDs (map refs) - -n, --tnames [...] Comma-sep target Names (map refs) - -e, --exact Exact target name matches only (default starts-with) - -j, --threads Analysis threads [default: 1] - -v, --verbose... Enable verbose output - --log-level Alternative to repeated -v/--verbose: set log level via key. - Equivalence to -v/--verbose: - => "Warn" - -v => "Info" - -vv => "Debug" - -vvv => "Trace" [default: Warn] - -h, --help Print help - -V, --version Print version -``` - -## Output -`call-consensus` and `call-contigs` both generate three reports containing HLA star-allele type calls. Additionally, `call-contigs` produces fasta files of extracted sequences from the assembly. - -| File | Description | -| -------------------------------------------- | ----------- | -| {output_dir}/hifihla_summary.tsv | Detailed file listing best call for each locus | -| {output_dir}/hifihla_report.tsv | Simple tsv file listing calls for each locus | -| {output_dir}/hifihla_report.json | Detailed results file, see below for example | -| {output_dir}/asm.contigs.h[12].fasta | Extracted (full) assembly contigs aligning to MHC | -| {output_dir}/asm.contigs.h[12].fasta.fai | FASTA index for contigs | -| {output_dir}/extracted.targets.h[12].fasta | Extracted targets used for star-typing | - -### Detailed summary tsv -This file reports the single best call and statistics for each query sequence in the sample. - -| Column Name | Description | -|--------------|------------------------------------------------------| -| queryId | Query sequence name | -| qLen | Length of query sequence | -| nMatches | Number of equivalent matches with same edit distance | -| gType | Closest matching genomic accession | -| gPctId | Percent matching bases of closest genomic match | -| gPctCov | Percent coverage of closest genomic match | -| gEdit | Edit distance to closest genomic match | -| cdnaType | Closest matching coding sequence | -| exCovered | Exons included in alignment | -| exEdit | Edit distance to cDNA match | -| Type | Final type call | - -Final type call is determined by edit distance and exact matches to the query (nMatches). The final call is four field if the query sequences matches exactly and is the only exact match returned. - -``` -queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit Type -HLA-A_01-01-01-01 3140 3 HLA-A*01:01:01:01 100.0 89.60 0 HLA-A*01:01:01 1,2,3,4,5,6,7,8 0 HLA-A*01:01:01 -HLA-A_26-01-01 3154 1 HLA-A*26:01:01:01 100.0 89.65 0 HLA-A*26:01:01 1,2,3,4,5,6,7,8 0 HLA-A*26:01:01:01 -HLA-B_35-08-01 3401 1 HLA-B*35:08:01:01 99.97 87.10 1 HLA-B*35:08:01 1,2,3,4,5,6,7 0 HLA-B*35:08:01 -HLA-B_38-01-01 3386 2 HLA-B*38:01:01:01 100.0 82.13 0 HLA-B*38:01:01 1,2,3,4,5,6,7 0 HLA-B*38:01:01 -HLA-C_04-01-01 3427 1 HLA-C*04:01:01:06 100.0 80.30 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 HLA-C*04:01:01:06 -HLA-C_12-03-01-01 3427 1 HLA-C*12:03:01:01 100.0 79.57 0 HLA-C*12:03:01 1,2,3,4,5,6,7,8 0 HLA-C*12:03:01:01 -HLA-DPB1_04-01-01 5850 7 HLA-DPB1*04:01:01:03 100.0 50.80 0 HLA-DPB1*04:01:01 2,3,4,5 0 HLA-DPB1*04:01:01 -HLA-DPB1_04-01-01-01 5866 31 HLA-DPB1*04:01:01:01 100.0 50.87 0 HLA-DPB1*04:01:01 2,3,4,5 0 HLA-DPB1*04:01:01 -HLA-DQB1_05-01-01 3745 1 HLA-DQB1*05:01:01:05 99.97 52.79 1 HLA-DQB1*05:01:01 2,3,4 0 HLA-DQB1*05:01:01 -HLA-DQB1_03-02-01 3770 11 HLA-DQB1*03:02:01:01 99.97 52.87 1 HLA-DQB1*03:02:01 2,3,4 0 HLA-DQB1*03:02:01 -HLA-DRB1_04-02-01 4819 1 HLA-DRB1*04:02:01 99.91 33.16 4 HLA-DRB1*04:02:01 2,3,4 0 HLA-DRB1*04:02:01 -HLA-DRB1_10-01-01 3659 2 HLA-DRB1*10:01:01:03 100.0 26.50 0 HLA-DRB1*10:01:01 2,3,4 0 HLA-DRB1*10:01:01 -``` - - -### Simple tsv format -This file is formatted for a quick view of top-level results. It is appropriate as input for downstream interpretation tools like [pharmCAT](https://pharmcat.org/). - -``` -$ column -t hifihla_report.tsv -HLA-A *01:01:01:01/*26:01:01:01 -HLA-B *38:01:01:01/*35:08:01:01 -HLA-C *12:03:01:01/*04:01:01:06 -``` - -### Detailed results json -Detailed results contain information on calls for each sequence typed. The json is structured in the following way: - -``` - { sample_id: , - hla_calls: { - : { - : [ - { - allele_id: , - star_name: , - length: , - match_name: , - query_start: , - query_end: , - covered_feat: , - not_covered: , - coding_diffs: , - noncode_eddist: , - differences: [ - { - kind: , - pos: , - size: , - feat: - }, - ... - ] - } - ], - ... - } - ... - } - } -``` - -Extracting detailed results for a given locus can be done with the [jq](https://jqlang.github.io/jq/) tool: -``` -$ jq '.hla_calls | with_entries(select(.key == "HLA-DRB1"))' hifihla_report.json -{ - "HLA-DRB1": { - "h1tg000036l_26660055_26675279": [ - { - "allele_id": "HLA00687", - "star_name": "HLA-DRB1*04:02:01", - "length": 14499, - "match_name": "*04:02:01", - "query_start": 0, - "query_end": 14499, - "covered_feat": [ - "UTR_5", - "Exon_1", - "Intron_1", - "Exon_2", - "Intron_2", - "Exon_3", - "Intron_3", - "Exon_4", - "Intron_4", - "Exon_5", - "Intron_5", - "Exon_6", - "UTR_3" - ], - "not_covered": [], - "coding_diffs": 0, - "noncode_eddist": 4, - "differences": [ - { - "kind": "Insertion", - "pos": 5394, - "size": 1, - "feat": "Intron_1" - }, - { - "kind": "Insertion", - "pos": 6607, - "size": 1, - "feat": "Intron_1" - }, - { - "kind": "Insertion", - "pos": 8819, - "size": 2, - "feat": "Intron_2" - } - ] - } - ], - "h2tg000067l_26524014_26537902": [ - { - "allele_id": "HLA17236", - "star_name": "HLA-DRB1*10:01:01:03", - "length": 13800, - "match_name": "*10:01:01:03", - "query_start": 0, - "query_end": 13800, - "covered_feat": [ - "UTR_5", - "Exon_1", - "Intron_1", - "Exon_2", - "Intron_2", - "Exon_3", - "Intron_3", - "Exon_4", - "Intron_4", - "Exon_5", - "Intron_5", - "Exon_6", - "UTR_3" - ], - "not_covered": [], - "coding_diffs": 0, - "noncode_eddist": 0, - "differences": [] - } - ] - } -} -``` - -## Examples -Type [HLA consensus sequences](https://downloads.pacbcloud.com/public/dataset/pbAmpliconAnalysis_HLA/pbaa/12878-HG001/pbaa_12878-HG001_passed_cluster_sequences.fasta) and print info to stdout: -``` -$ hifihla call-consensus \ - --fasta pbaa_12878-HG001_passed_cluster_sequences.fasta \ - --outdir my_output_dir/ \ - -v -``` -Type PacBio [HiFi WGS dataset](https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working/HPRC_PLUS/HG002/) and print details including alignment information to stdout: -``` -$ hifihla call-contigs \ - --abam HG002.asm.GRCh38.bam \ - --hap1 HG002.asm.bp.hap1.p_ctg.fasta.gz \ - --hap2 HG002.asm.bp.hap2.p_ctg.fasta.gz \ - --outdir my_output_dir \ - -vv -``` -Compare query to specific alleles by name or accession number -``` -$ hifihla align-imgt \ - --fasta HG001_HLA_A_11_01_01_01.fasta \ - -n HLA-A*11:01:01:03 \ - -t HLA00043 -{ - "id": "Query HLA-A_11-01-01-01; Targets HLA00043,HLA15502", - "hla_alignments": { - "HLA-A_11-01-01-01": { - "HLA00043": { - "allele_id": "HLA00043", - "star_name": "HLA-A*11:01:01:01", - "length": 3503, - "match_name": "*11:01:01:01", - "query_start": 162, - "query_end": 3301, - "covered_feat": [ - "UTR_5", - "Exon_1", - "Intron_1", - "Exon_2", - "Intron_2", - "Exon_3", - "Intron_3", - "Exon_4", - "Intron_4", - "Exon_5", - "Intron_5", - "Exon_6", - "Intron_6", - "Exon_7", - "Intron_7", - "Exon_8", - "UTR_3" - ], - "not_covered": [], - "coding_diffs": 0, - "noncode_eddist": 0, - "differences": [] - }, - "HLA15502": { - "allele_id": "HLA15502", - "star_name": "HLA-A*11:01:01:03", - "length": 3503, - "match_name": "*11:01:01", - "query_start": 162, - "query_end": 3301, - "covered_feat": [ - "UTR_5", - "Exon_1", - "Intron_1", - "Exon_2", - "Intron_2", - "Exon_3", - "Intron_3", - "Exon_4", - "Intron_4", - "Exon_5", - "Intron_5", - "Exon_6", - "Intron_6", - "Exon_7", - "Intron_7", - "Exon_8", - "UTR_3" - ], - "not_covered": [], - "coding_diffs": 0, - "noncode_eddist": 1, - "differences": [ - { - "kind": "Mismatch", - "pos": 849, - "size": 1, - "feat": "Intron_2" - } - ] - } - } - } -} -``` - -## Changelog -Changelog - PacBio HiFi HLA Typing - hifihla - -## v0.1.0: 5/31/23 -### Changes -- Report results in tsv and json format - -## v0.2.0: 9/25/23 -### Changes -- Update README for initial release -- Latest database version v3.53 -- Bug fixes - -## v0.2.1: 11/15/23 -### Changes -- Call-contigs hap2 change to optional -- Improve output determinism -- Update README -- Fix validation bugs - -## v0.2.2: 11/17/23 -### Changes -- Update database to IPD-IMGT/HLA Version: 3.54 (2023-10) - -## v0.2.3: 12/22/23 -### Changes -- Fix reporting bug for cdna calls -- Add option to require exon 2 to make a call -- Improved thread control and error reporting +## Documentation +1. [Installation](docs/install.md) +2. [Genes](docs/genes.md) +3. [Usage and Examples](docs/usage.md) +4. [Output](docs/output.md) +6. [Changelog](CHANGELOG.md) ## References Barker DJ, Maccari G, Georgiou X, Cooper MA, Flicek P, Robinson J, Marsh SGE. _The IPD-IMGT/HLA Database_. Nucleic Acids Research (2023) 51:D1053-60. diff --git a/docs/genes.md b/docs/genes.md new file mode 100755 index 0000000..f4d2ece --- /dev/null +++ b/docs/genes.md @@ -0,0 +1,16 @@ +## Genes + +Genes called by `hifihla`: +``` +HLA-A HLA-DQA1 HLA-DRB7 HLA-P +HLA-B HLA-DQA2 HLA-DRB8 HLA-S +HLA-C HLA-DQB1 HLA-E HLA-T +HLA-DMA HLA-DQB2 HLA-F HLA-U +HLA-DMB HLA-DRA HLA-G HLA-V +HLA-DOA HLA-DRB1 HLA-H HLA-W +HLA-DOB HLA-DRB2 HLA-HFE HLA-Y +HLA-DPA1 HLA-DRB3 HLA-J MICA +HLA-DPA2 HLA-DRB4 HLA-K MICB +HLA-DPB1 HLA-DRB5 HLA-L TAP1 +HLA-DPB2 HLA-DRB6 HLA-N +``` diff --git a/docs/install.md b/docs/install.md new file mode 100755 index 0000000..07ae595 --- /dev/null +++ b/docs/install.md @@ -0,0 +1,11 @@ +## Install + +[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/hifihla/README.html) + +Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda) for information on Installation, Support, License, Copyright, and Disclaimer. + +`hifihla` can be installed from bioconda +``` +conda install -c bioconda hifihla +``` +Binaries are also availible in the github [releases](https://github.com/PacificBiosciences/HiFiHLA/releases/latest). diff --git a/docs/output.md b/docs/output.md new file mode 100755 index 0000000..796aebc --- /dev/null +++ b/docs/output.md @@ -0,0 +1,198 @@ +## Output +`call-reads`, `call-consensus` and `call-contigs` all generate three reports containing HLA star-allele type calls. Additionally, `call-contigs` produces fasta files of extracted sequences from the assembly. + +| File | Description | +| -------------------------------------------- | ----------- | +| {output_dir}/hifihla_summary.tsv | Detailed file listing best call for each locus | +| {output_dir}/hifihla_report.tsv | Simple tsv file listing calls for each locus | +| {output_dir}/hifihla_report.json | Detailed results file, see below for example | +| {output_dir}/asm.contigs.h[12].fasta | Extracted (full) assembly contigs aligning to MHC | +| {output_dir}/asm.contigs.h[12].fasta.fai | FASTA index for contigs | +| {output_dir}/extracted.targets.h[12].fasta | Extracted targets used for star-typing | + +### Detailed summary tsv +This file reports the single best call and statistics for each query sequence in the sample. + +| Column Name | Description | +|--------------|------------------------------------------------------| +| queryId | Query sequence name | +| qLen | Length of query sequence | +| nMatches | Number of equivalent matches with same edit distance | +| gType | Closest matching genomic accession | +| gPctId | Percent matching bases of closest genomic match | +| gPctCov | Percent coverage of closest genomic match | +| gEdit | Edit distance to closest genomic match | +| cdnaType | Closest matching coding sequence | +| exCovered | Exons included in alignment | +| exEdit | Edit distance to cDNA match | +| coverage | HiFi read count of allele (call-reads only) | +| errRate | Total error rate of HiFi reads (call-reads only) | +| Type | Final type call | + +Final type call is determined by edit distance and exact matches to the query (nMatches). The final call is four field if the query sequences matches exactly and is the only exact match returned. + +`call-reads` summary for HG002 (Class I): +``` +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +HLA-A_1 3503 1 HLA-A*01:01:01:01 100.0 100.0 0 HLA-A*01:01:01 1,2,3,4,5,6,7,8 0 15 0.00276 HLA-A*01:01:01:01 +HLA-A_0 3517 1 HLA-A*26:01:01:01 100.0 100.0 0 HLA-A*26:01:01 1,2,3,4,5,6,7,8 0 13 0.00249 HLA-A*26:01:01:01 +HLA-B_0 3855 1 HLA-B*35:08:01:01 100.0 100.0 0 HLA-B*35:08:01 1,2,3,4,5,6,7 0 18 0.00231 HLA-B*35:08:01:01 +HLA-B_1 4070 1 HLA-B*38:01:01:01 100.0 100.0 0 HLA-B*38:01:01 1,2,3,4,5,6,7 0 10 0.00204 HLA-B*38:01:01:01 +HLA-C_0 4264 1 HLA-C*04:01:01:06 100.0 100.0 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 12 0.00135 HLA-C*04:01:01:06 +HLA-C_1 4303 1 HLA-C*12:03:01:01 100.0 100.0 0 HLA-C*12:03:01 1,2,3,4,5,6,7,8 0 8 0.00215 HLA-C*12:03:01:01 +``` +`call-contigs` summary for HG00733: +``` +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +HG00733#1#h1tg000070l_29911131_29915604 4474 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*24:02:01:01 +HG00733#2#h2tg000008l_1854867_1859341 4475 1 HLA-A*30:02:01:01 100.0 100.0 0 HLA-A*30:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*30:02:01:01 +HG00733#1#h1tg000070l_31324234_31329311 5078 1 HLA-B*35:02:01:02 100.0 100.0 0 HLA-B*35:02:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*35:02:01:02 +HG00733#2#h2tg000008l_440231_445304 5074 1 HLA-B*18:01:01:01 100.0 100.0 0 HLA-B*18:01:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*18:01:01:01 +HG00733#1#h1tg000070l_31239869_31245142 5274 1 HLA-C*04:01:01:06 100.0 100.0 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*04:01:01:06 +HG00733#2#h2tg000008l_525397_530669 5273 1 HLA-C*05:01:01:01 100.0 100.0 0 HLA-C*05:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*05:01:01:01 +HG00733#1#h1tg000070l_33003286_33014048 10763 1 HLA-DPA1*01:03:01:02 100.0 100.0 0 HLA-DPA1*01:03:01 1,2,3,4 0 1 N/A HLA-DPA1*01:03:01:02 +HG00733#2#h2tg000060l_26807522_26818289 10768 1 HLA-DPA1*02:01:01:01 100.0 100.0 0 HLA-DPA1*02:01:01 1,2,3,4 0 1 N/A HLA-DPA1*02:01:01:01 +HG00733#1#h1tg000070l_33014638_33027155 12518 1 HLA-DPB1*04:01:01:01 100.0 100.0 0 HLA-DPB1*04:01:01 1,2,3,4,5 0 1 N/A HLA-DPB1*04:01:01:01 +HG00733#2#h2tg000060l_26794480_26806932 12453 1 HLA-DPB1*11:01:01:01 100.0 100.0 0 HLA-DPB1*11:01:01 1,2,3,4,5 0 1 N/A HLA-DPB1*11:01:01:01 +HG00733#1#h1tg000070l_32583472_32591044 7573 1 HLA-DQA1*05:05:01:01 100.0 100.0 0 HLA-DQA1*05:05:01 1,2,3,4 0 1 N/A HLA-DQA1*05:05:01:01 +HG00733#2#h2tg000060l_27233443_27240904 7462 1 HLA-DQA1*05:01:01:01 100.0 100.0 0 HLA-DQA1*05:01:01 1,2,3,4 0 1 N/A HLA-DQA1*05:01:01:01 +HG00733#1#h1tg000070l_32600790_32608991 8202 1 HLA-DQB1*03:01:01:02 100.0 100.0 0 HLA-DQB1*03:01:01 1,2,3,4,5,6 0 1 N/A HLA-DQB1*03:01:01:02 +HG00733#2#h2tg000060l_27214264_27222723 8460 1 HLA-DQB1*02:01:01:01 100.0 100.0 0 HLA-DQB1*02:01:01 1,2,3,4,5,6 0 1 N/A HLA-DQB1*02:01:01:01 +HG00733#1#h1tg000070l_32526722_32541633 14912 1 HLA-DRB1*11:04:01:01 100.0 100.0 0 HLA-DRB1*11:04:01 1,2,3,4,5,6 0 1 N/A HLA-DRB1*11:04:01:01 +HG00733#2#h2tg000060l_27282742_27297652 14911 1 HLA-DRB1*03:01:01:02 100.0 100.0 0 HLA-DRB1*03:01:01 1,2,3,4,5,6 0 1 N/A HLA-DRB1*03:01:01:02 +HG00733#1#h1tg000070l_32462971_32477576 14606 1 HLA-DRB3*02:02:01:04 99.99 100.0 1 HLA-DRB3*02:02:01 1,2,3,4,5,6 0 1 N/A HLA-DRB3*02:02:01 +HG00733#2#h2tg000060l_27346699_27361303 14605 1 HLA-DRB3*02:02:01:01 100.0 100.0 0 HLA-DRB3*02:02:01 1,2,3,4,5,6 0 1 N/A HLA-DRB3*02:02:01:01 +``` + +### Simple tsv format +This file is formatted for a quick view of top-level results. It is appropriate as input for downstream interpretation tools like [pharmCAT](https://pharmcat.org/). + +``` +$ column -t hifihla_report.tsv +HLA-A *01:01:01:01/*26:01:01:01 +HLA-B *38:01:01:01/*35:08:01:01 +HLA-C *12:03:01:01/*04:01:01:06 +``` + +### Detailed results json +Detailed results contain information on calls for each sequence typed. The json is structured in the following way: + +``` + { sample_id: , + hla_calls: { + : { + : [ + { + allele_id: , + star_name: , + length: , + match_name: , + query_start: , + query_end: , + covered_feat: , + not_covered: , + coding_diffs: , + noncode_eddist: , + error_rate: , + coverage: , + reads: , + differences: [ + { + kind: , + pos: , + size: , + feat: + }, + ... + ] + } + ], + ... + } + ... + } + } +``` + +Extracting detailed results for a given locus can be done with the [jq](https://jqlang.github.io/jq/) tool: +``` +jq '.hla_calls | with_entries(select(.key == "HLA-DRB3"))' tmp2/hifihla_report.json + +{ + "HLA-DRB3": [ + { + "HG00733#1#h1tg000070l_32462971_32477576": [ + { + "allele_id": "HLA22697", + "star_name": "HLA-DRB3*02:02:01:04", + "length": 12905, + "match_name": "*02:02:01", + "query_start": 0, + "query_end": 12905, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 1, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [ + { + "kind": "Mismatch", + "pos": 8158, + "size": 1, + "feat": "Intron_2" + } + ] + } + ], + "HG00733#2#h2tg000060l_27346699_27361303": [ + { + "allele_id": "HLA00895", + "star_name": "HLA-DRB3*02:02:01:01", + "length": 13588, + "match_name": "*02:02:01:01", + "query_start": 0, + "query_end": 13588, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 0, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [] + } + ] + } + ] +} +``` diff --git a/docs/usage.md b/docs/usage.md new file mode 100755 index 0000000..84f8732 --- /dev/null +++ b/docs/usage.md @@ -0,0 +1,30 @@ +## Usage + +`hifihla` has four subcommands: +* [call-reads](usage_call-reads.md) +* [call-consensus](usage_call-consensus.md) +* [call-contigs](usage_call-contigs.md) +* [align-imgt](usage_align-imgt.md) + +``` +Usage: hifihla + +Commands: + call-reads Call HLA loci from an aligned BAM of HiFi reads + call-contigs Extract HLA loci from assembled MHC contigs & call star alleles on extracted sequences + call-consensus Call HLA Star (*) alleles from consensus sequences + align-imgt Align queries to IMGT/HLA genomic accession sequences + help Print this message or the help of the given subcommand(s) + +Options: + -h, --help Print help + -V, --version Print version +``` + +## Subcommand Inputs +| Subcommand | Input Type | File types |Description | +|----------------|-------------------------------------|-----------------|------------| +| call-reads | Aligned HiFi reads | BAM | Call Class I (ABC) from HiFi reads aligned to GRCh38 (no alts) | +| call-contigs | Aligned assembly; unaligned contigs | BAM, FASTA(.gz) | Extract and Call HLA loci from assembled MHC contigs | +| call-consensus | Amplicon/Isoseq consensus | FASTA | Call HLA alleles from consensus sequences (e.g. amplicon assays) | +| align-imgt | Sequence/IMGT accessions | FASTA | Compare sequences in fasta format or database sequences to specific IMGT/HLA genomic alleles | diff --git a/docs/usage_align-imgt.md b/docs/usage_align-imgt.md new file mode 100755 index 0000000..9cfff99 --- /dev/null +++ b/docs/usage_align-imgt.md @@ -0,0 +1,168 @@ +### Align queries to specific alleles in IPD-IMGT/HLA +`align-imgt` is meant to check differences between alleles in the database and any query sequence (including other database accessions). The output is json format. +``` +Align queries to IMGT/HLA genomic accession sequences + +Usage: hifihla align-imgt [OPTIONS] + +Options: + -f, --fasta Fasta with query sequence(s) + -q, --qids [...] Comma-sep query IDs + -t, --targets [...] Comma-sep target IDs (map refs) + -n, --tnames [...] Comma-sep target Names (map refs) + -e, --exact Exact target name matches only (default starts-with) + -j, --threads Analysis threads [default: 1] + -v, --verbose... Enable verbose output + --log-level Alternative to repeated -v/--verbose: set log level via key. + Equivalence to -v/--verbose: + => "Warn" + -v => "Info" + -vv => "Debug" + -vvv => "Trace" [default: Warn] + -h, --help Print help + -V, --version Print version +``` + +### Examples +Compare query to specific alleles by name or accession number +``` +hifihla align-imgt \ + --fasta HG001_HLA_A_11_01_01_01.fasta \ + -n HLA-A*11:01:01:03 \ + -t HLA00043 +{ + "id": "Query HLA-A_11-01-01-01; Targets HLA00043,HLA15502", + "hla_alignments": { + "HLA-A_11-01-01-01": { + "HLA00043": { + "allele_id": "HLA00043", + "star_name": "HLA-A*11:01:01:01", + "length": 3503, + "match_name": "*11:01:01:01", + "query_start": 162, + "query_end": 3301, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "Intron_6", + "Exon_7", + "Intron_7", + "Exon_8", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 0, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [] + }, + "HLA15502": { + "allele_id": "HLA15502", + "star_name": "HLA-A*11:01:01:03", + "length": 3503, + "match_name": "*11:01:01", + "query_start": 162, + "query_end": 3301, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "Intron_6", + "Exon_7", + "Intron_7", + "Exon_8", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 1, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [ + { + "kind": "Mismatch", + "pos": 849, + "size": 1, + "feat": "Intron_2" + } + ] + } + } + } +} +``` +Compare two IMGT alleles by accession ID: +``` +hifihla align-imgt -q HLA15502 -t HLA00043 + +{ + "id": "Query HLA15502; Targets HLA00043", + "hla_alignments": { + "HLA15502": { + "HLA00043": { + "allele_id": "HLA00043", + "star_name": "HLA-A*11:01:01:01", + "length": 3503, + "match_name": "*11:01:01", + "query_start": 0, + "query_end": 3503, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "Intron_6", + "Exon_7", + "Intron_7", + "Exon_8", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 1, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [ + { + "kind": "Mismatch", + "pos": 849, + "size": 1, + "feat": "Intron_2" + } + ] + } + } + } +} +``` diff --git a/docs/usage_call-consensus.md b/docs/usage_call-consensus.md new file mode 100755 index 0000000..43fb3dc --- /dev/null +++ b/docs/usage_call-consensus.md @@ -0,0 +1,169 @@ +### Type consensus sequences +`call-consensus` accepts a FASTA file of sequences and attempts to find the closest matching allele in the [IMGT/HLA database](https://www.ebi.ac.uk/ipd/imgt/hla/). It is assumed that each FASTA record is a single locus. +Optionally call star alleles using exon sequence only. +``` +Call HLA Star (*) alleles from consensus sequences + +Usage: hifihla call-consensus [OPTIONS] --fasta --outdir + +Options: + -f, --fasta Input fasta file of consensus sequences + -o, --outdir Output directory + -c, --cdna Enable cDNA-only calling + -e, --exon2 Require Exon2 in query sequence + -j, --threads Analysis threads [default: 1] + -x, --max_matches Maximum equivalent matches per query in report [default: 10] + -v, --verbose... Enable verbose output + --log-level Alternative to repeated -v/--verbose: set log level via key. + Equivalence to -v/--verbose: + => "Warn" + -v => "Info" + -vv => "Debug" + -vvv => "Trace" [default: Warn] + -h, --help Print help + -V, --version Print version +``` +#### Options Description +* `--fasta` Fasta file of consensus query sequences. Only one allele per query sequence. +* `--outdir` Output directory. +* `--cdna` Call and report only coding regions (cdna). Can be used for either DNA or RNA sequences. +* `--exon2` Require exon 2 in query. This may reduce search space. +* `--max_matches` Only report up to this number of matches in the json report. + +### Example +Type HLA consensus sequences, for example from [HiFi amplicon consensus with pbaa](https://downloads.pacbcloud.com/public/dataset/pbAmpliconAnalysis_HLA/pbaa/12878-HG001): +``` +hifihla call-consensus \ + --fasta pbaa_12878-HG001_passed_cluster_sequences.fasta \ + --outdir my_output_dir/ + +column -t my_output_dir/hifihla_summary.tsv + +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +sample-12878-HG001_guide-HLA-A_cluster-0_ReadCount-1023 3098 5 HLA-A*01:01:01:01 100.0 88.43 0 HLA-A*01:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*01:01:01 +sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999 3098 3 HLA-A*11:01:01:01 100.0 88.43 0 HLA-A*11:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*11:01:01 +sample-12878-HG001_guide-HLA-B_cluster-1_ReadCount-399 3354 7 HLA-B*08:01:01:01 100.0 81.69 0 HLA-B*08:01:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*08:01:01 +sample-12878-HG001_guide-HLA-B_cluster-0_ReadCount-481 3363 1 HLA-B*56:01:01:04 100.0 86.53 0 HLA-B*56:01:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*56:01:01:04 +sample-12878-HG001_guide-HLA-C_cluster-0_ReadCount-541 3384 6 HLA-C*01:02:01:01 100.0 78.62 0 HLA-C*01:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*01:02:01 +sample-12878-HG001_guide-HLA-C_cluster-1_ReadCount-484 3389 6 HLA-C*07:01:01:01 100.0 78.48 0 HLA-C*07:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*07:01:01 +sample-12878-HG001_guide-HLA-DPB1_cluster-1_ReadCount-348 5824 33 HLA-DPB1*04:01:01:01 100.0 50.52 0 HLA-DPB1*04:01:01 2,3,4,5 0 1 N/A HLA-DPB1*04:01:01 +sample-12878-HG001_guide-HLA-DPB1_cluster-0_ReadCount-475 5760 7 HLA-DPB1*14:01:01:01 100.0 50.23 0 HLA-DPB1*14:01:01 2,3,4,5 0 1 N/A HLA-DPB1*14:01:01 +sample-12878-HG001_guide-HLA-DQB1_cluster-1_ReadCount-2032 4076 12 HLA-DQB1*02:01:01:01 100.0 54.49 0 HLA-DQB1*02:01:01 2,3,4 0 1 N/A HLA-DQB1*02:01:01 +sample-12878-HG001_guide-HLA-DQB1_cluster-0_ReadCount-3740 3703 14 HLA-DQB1*05:01:01:02 100.0 52.22 0 HLA-DQB1*05:01:01 2,3,4 0 1 N/A HLA-DQB1*05:01:01 +sample-12878-HG001_guide-HLA-DRB1_cluster-1_ReadCount-1011 3900 15 HLA-DRB1*01:01:01:01 100.0 35.13 0 HLA-DRB1*01:01:01 2,3,4 0 1 N/A HLA-DRB1*01:01:01 +sample-12878-HG001_guide-HLA-DRB1_cluster-0_ReadCount-2022 3646 14 HLA-DRB1*03:01:01:01 100.0 26.21 0 HLA-DRB1*03:01:01 2,3,4 0 1 N/A HLA-DRB1*03:01:01 +``` +Note that query sequences that match >1 allele are not labeled as four-field matches because the actual call is ambiguous. This can happen when differences between alleles are outside the amplified region. +For example, HLA-A_cluster-1 above matches three alleles perfectly over the amplified range. All three matches are listed in the json. + +``` +jq '.. | objects | to_entries | .[] | select(.key == "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999")' hifihla_report.json + +{ + "key": "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999", + "value": [ + { + "allele_id": "HLA00043", + "star_name": "HLA-A*11:01:01:01", + "length": 3503, + "match_name": "*11:01:01", + "query_start": 182, + "query_end": 3280, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "Intron_6", + "Exon_7", + "Intron_7", + "Exon_8", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 0, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [] + }, + { + "allele_id": "HLA32930", + "star_name": "HLA-A*11:01:01:75", + "length": 3503, + "match_name": "*11:01:01", + "query_start": 182, + "query_end": 3280, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "Intron_6", + "Exon_7", + "Intron_7", + "Exon_8", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 0, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [] + }, + { + "allele_id": "HLA38435", + "star_name": "HLA-A*11:01:01:88", + "length": 3503, + "match_name": "*11:01:01", + "query_start": 182, + "query_end": 3280, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "Intron_6", + "Exon_7", + "Intron_7", + "Exon_8", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 0, + "error_rate": null, + "coverage": 1, + "reads": null, + "differences": [] + } + ] +} +``` diff --git a/docs/usage_call-contigs.md b/docs/usage_call-contigs.md new file mode 100755 index 0000000..d0743a3 --- /dev/null +++ b/docs/usage_call-contigs.md @@ -0,0 +1,109 @@ +### Type HLA from MHC assembled contigs +`call-contigs` requires 2 or 3 input files and an output directory: + * Assembled contigs must be aligned to [GRCH38 no alts](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz) + * Assembled haplotype fasta[.gz] files (1 or 2) + +Options: + * `hap2` is optional -- the fasta argument to `hap1` may contain one or two MHC haplotig sets. If `hap2` is not set, alleles will be phased by contig name. + * Define the `loci` to be extracted and called (see [IMGT genes](https://hla.alleles.org/genes/index.html)) [default all loci] + * Limit equivalent matches in report with `--max_matches` -- This is useful to limit reporting of common alleles with shared cDNA. + * Require assembly alleles to be of a `--min_length`. + * Limit output matches to full genomic IMGT accessions with `--full_length`. +``` +Extract HLA loci from assembled MHC contigs & call star alleles on extracted sequences + +Usage: hifihla call-contigs [OPTIONS] --abam --hap1 --outdir + +Options: + -a, --abam Input assembly aligned to GRCh38 + -p, --hap1 Input hap1 assembly fa(.gz) + -m, --hap2 Input hap2 assembly fa(.gz) (optional) + -o, --outdir Output directory + -l, --loci [...] Input comma-sep loci to extract [default: all] + -s, --min_length Minimum length of extracted targets [default: 1000] + -f, --full_length Full length IMGT records only + -x, --max_matches Maximum equivalent matches per query in report [default: 10] + -j, --threads Analysis threads [default: 1] + -v, --verbose... Enable verbose output + --log-level Alternative to repeated -v/--verbose: set log level via key. + Equivalence to -v/--verbose: + => "Warn" + -v => "Info" + -vv => "Debug" + -vvv => "Trace" [default: Warn] + -h, --help Print help + -V, --version Print version +``` + +### Examples +Type HG00733 HPRC [assembly](https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG00733/assemblies/year1_freeze_assembly_v2/): + +#### Align contigs to reference +The following command using [minimap2](https://github.com/lh3/minimap2/) is recommended for assembly alignment: +``` +minimap2 -t 12 \ + -L --secondary=no --eqx -ax asm5\ + -R '@RG\tID:HG00733\tSM:HG00733' \ + GRCh38_no_alts.fasta \ + HG00733.paternal.f1_assembly_v2.fa.gz HG00733.maternal.f1_assembly_v2.fa.gz | \ + samtools sort -@ 3 \ + -T /scratch \ + -m 100M \ + -O BAM > HG00733.asm.GRCh38_no_alts.bam +samtools index HG00733.asm.GRCH38_no_alts.bam +``` +#### Call HLA alleles +Call all available loci from separated haplotigs: +``` +hifihla call-contigs \ + --abam HG00733.asm.GRCh38_no_alts.bam \ + --hap1 HG00733.paternal.f1_assembly_v2.fa.gz \ + --hap2 HG00733.maternal.f1_assembly_v2.fa.gz \ + --outdir my_output_dir + +head -7 my_output_dir/hifihla_summary.tsv | column -t + +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +HG00733#1#h1tg000070l_29911131_29915604 4474 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*24:02:01:01 +HG00733#2#h2tg000008l_1854867_1859341 4475 1 HLA-A*30:02:01:01 100.0 100.0 0 HLA-A*30:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*30:02:01:01 +HG00733#1#h1tg000070l_31324234_31329311 5078 1 HLA-B*35:02:01:02 100.0 100.0 0 HLA-B*35:02:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*35:02:01:02 +HG00733#2#h2tg000008l_440231_445304 5074 1 HLA-B*18:01:01:01 100.0 100.0 0 HLA-B*18:01:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*18:01:01:01 +HG00733#1#h1tg000070l_31239869_31245142 5274 1 HLA-C*04:01:01:06 100.0 100.0 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*04:01:01:06 +HG00733#2#h2tg000008l_525397_530669 5273 1 HLA-C*05:01:01:01 100.0 100.0 0 HLA-C*05:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*05:01:01:01 +``` +Call subset of loci from contigs of just one haplotype: +``` +hifihla call-contigs \ + --abam HG00733.asm.GRCh38_no_alts.bam \ + --hap1 HG00733.paternal.f1_assembly_v2.fa.gz \ + --loci HLA-DQA1,HLA-DPA1,HLA-DRB1 \ + --outdir my_output_dir + +column -t my_output_dir/hifihla_summary.tsv + +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +HG00733#1#h1tg000070l_33003286_33014048 10763 1 HLA-DPA1*01:03:01:02 100.0 100.0 0 HLA-DPA1*01:03:01 1,2,3,4 0 1 N/A HLA-DPA1*01:03:01:02 +HG00733#1#h1tg000070l_32583472_32591044 7573 1 HLA-DQA1*05:05:01:01 100.0 100.0 0 HLA-DQA1*05:05:01 1,2,3,4 0 1 N/A HLA-DQA1*05:05:01:01 +HG00733#1#h1tg000070l_32526722_32541633 14912 1 HLA-DRB1*11:04:01:01 100.0 100.0 0 HLA-DRB1*11:04:01 1,2,3,4,5,6 0 1 N/A HLA-DRB1*11:04:01:01 +HG00733#1#h1tg000070l_32462971_32477576 14606 1 HLA-DRB3*02:02:01:04 99.99 100.0 1 HLA-DRB3*02:02:01 1,2,3,4,5,6 0 1 N/A HLA-DRB3*02:02:01 +``` +Call class I from un-separated contigs: +``` +cat HG00733.paternal.f1_assembly_v2.fa.gz HG00733.maternal.f1_assembly_v2.fa.gz > HG00733.both.fasta.gz + +hifihla call-contigs \ + --abam HG00733.asm.GRCh38_no_alts.bam \ + --hap1 HG00733.both.fasta.gz \ + --loci HLA-A,HLA-B,HLA-C \ + --outdir my_output_dir + +column -t my_output_dir/hifihla_summary.tsv + +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +HG00733#1#h1tg000070l_29911131_29915604 4474 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*24:02:01:01 +HG00733#2#h2tg000008l_1854867_1859341 4475 1 HLA-A*30:02:01:01 100.0 100.0 0 HLA-A*30:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*30:02:01:01 +HG00733#1#h1tg000070l_31324234_31329311 5078 1 HLA-B*35:02:01:02 100.0 100.0 0 HLA-B*35:02:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*35:02:01:02 +HG00733#2#h2tg000008l_440231_445304 5074 1 HLA-B*18:01:01:01 100.0 100.0 0 HLA-B*18:01:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*18:01:01:01 +HG00733#1#h1tg000070l_31239869_31245142 5274 1 HLA-C*04:01:01:06 100.0 100.0 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*04:01:01:06 +HG00733#2#h2tg000008l_525397_530669 5273 1 HLA-C*05:01:01:01 100.0 100.0 0 HLA-C*05:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*05:01:01:01 +``` diff --git a/docs/usage_call-reads.md b/docs/usage_call-reads.md new file mode 100755 index 0000000..6bb5d2d --- /dev/null +++ b/docs/usage_call-reads.md @@ -0,0 +1,161 @@ +### Type from aligned HiFi reads +`call-reads` accepts aligned HiFi reads in BAM format and calls HLA alleles directly from reads, without assembly. +``` +Call HLA loci from an aligned BAM of HiFi reads + +Usage: hifihla call-reads [OPTIONS] --abam --outdir + +Options: + -j, --threads Analysis threads [default: 1] + -v, --verbose... Enable verbose output + --log-level Alternative to repeated -v/--verbose: set log level via key. [default: Warn] + -h, --help Print help + -V, --version Print version + +Input Options: + -a, --abam Input assembly aligned to GRCh38 + -l, --loci [...] Input comma-sep loci to extract [default: all] + -d, --max_depth Maximum reads per locus [default: 50] + -p, --partial Include partially-spanning reads + -t, --haplotypes Haplotypes in sample [default: 2] [possible values: 1, 2] + -s, --seed Random number seed for downsampling to max_depth [default: 42] + +Output Options: + -o, --outdir Output directory + -f, --full_length Full length IMGT records only (exclude exon-only records) + -x, --max_matches Maximum matches in output report [default: 10] + -m, --min_allele_freq Minimum allele frequency [default: 0.1] + -b, --min_cdf Minimum binomial CDF to call het/hom [default: 0.001] + +Presets: + --preset Sequence type presets [possible values: te, wgs] +``` +#### Input Options Description +* `--abam` HiFi reads aligned to GRCh38 (no alts). +* `--loci` HLA loci to call. Currently limited to HLA-A,HLA-B,HLA-C. +* `--max_depth` Maximim reads to use per locus. Reads are randomly downsampled if coverage > d. +* `--partial` Include HiFi reads that do not fully span locus, but still span exon 2 (minimum requirement). +* `--haplotypes` Expected number of haploytypes in sample. +* `--seed` Random number seed for downsampling and clustering. + +### Output Options Description +* `--outdir` Output directory. +* `--full_length` Restrict allele matches to full length IMGT records (exclude exon-only accessions). +* `--max_matches` Maximum number of equivalent matches to list per query sequence. +* `--min_allele_freq` Minimum fraction of reads for minor allele. Clusters with lower frequency will be ignored. +* `--min_cdf` Minimum binomial CDF for minor allele. Clusters with lower CDF will be ignored. + +Presets are convenience options for whole genome sequencing with long (>10kb) HiFi Reads`wgs` or shorter (~5kb) Target Enrichment HiFi Reads `te`. +The only preset option at this time sets the `-p` flag for wgs, and no `-p` for te. Target Enrichment datasets tend to have higher coverage, so we filter for fully spanning reads. + +### Examples +Type Class I (HLA-A/-B/-C) alleles from [HPRC](https://github.com/human-pangenomics/HPP_Year1_Data_Freeze_v1.0) HG00733 aligned WGS HiFi reads using 8 threads: +``` +hifihla call-reads \ + --preset wgs \ + -j 8 \ + -a HG00733.GRCh38_no_alts.bam \ + -o my_output_dir + +column -t my_output_dir/hifihla_summary.tsv + +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +HLA-A_1 3502 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 9 0.00346 HLA-A*24:02:01:01 +HLA-A_0 3503 1 HLA-A*30:02:01:01 100.0 100.0 0 HLA-A*30:02:01 1,2,3,4,5,6,7,8 0 25 0.00293 HLA-A*30:02:01:01 +HLA-B_1 4081 1 HLA-B*18:01:01:01 100.0 100.0 0 HLA-B*18:01:01 1,2,3,4,5,6,7 0 16 0.00281 HLA-B*18:01:01:01 +HLA-B_0 4085 1 HLA-B*35:02:01:02 100.0 100.0 0 HLA-B*35:02:01 1,2,3,4,5,6,7 0 18 0.00250 HLA-B*35:02:01:02 +HLA-C_0 4264 1 HLA-C*04:01:01:06 100.0 100.0 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 20 0.00280 HLA-C*04:01:01:06 +HLA-C_1 4303 1 HLA-C*05:01:01:01 100.0 100.0 0 HLA-C*05:01:01 1,2,3,4,5,6,7,8 0 12 0.00335 HLA-C*05:01:01:01 +``` +Type Class I alleles from targeted [Twist Alliance Panels](https://www.pacb.com/wp-content/uploads/Application-Brief-HiFi-Target-Enrichment-Best-Practices.pdf) sequenced with [PacBio HiFi](https://downloads.pacbcloud.com/public/dataset/HiFiTE_SqIIe/Oct_2022/TwistAllianceLongReadPGx/): +``` +hifihla call-reads \ + --preset te \ + -j 8 \ + -a HG01190.GRCh38_noalt.deepvariant.haplotagged..bam \ + -o my_output_dir + +column -t my_output_dir/hifihla_summary.tsv + +queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type +HLA-A_1 3517 1 HLA-A*02:01:01:01 100.0 100.0 0 HLA-A*02:01:01 1,2,3,4,5,6,7,8 0 19 0.00108 HLA-A*02:01:01:01 +HLA-A_0 3502 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 28 0.00213 HLA-A*24:02:01:01 +HLA-B_1 3327 1 HLA-B*15:20 100.0 100.0 0 HLA-B*15:20 1,2,3,4,5,6,7 0 18 0.00107 HLA-B*15:20 +HLA-B_0 4081 1 HLA-B*18:01:01:83 100.0 100.0 0 HLA-B*18:01:01 1,2,3,4,5,6,7 0 10 0.00059 HLA-B*18:01:01:83 +HLA-C_0 4304 1 HLA-C*01:02:01:01 100.0 100.0 0 HLA-C*01:02:01 1,2,3,4,5,6,7,8 0 24 0.00121 HLA-C*01:02:01:01 +HLA-C_1 4318 1 HLA-C*07:01:01:16 100.0 100.0 0 HLA-C*07:01:01 1,2,3,4,5,6,7,8 0 26 0.00205 HLA-C*07:01:01:16 +``` +Force call only a single, full-length HLA-A haplotype using at most 15 reads: +``` +hifihla call-reads \ + -d 15 \ + -t 1 \ + -f \ + -l HLA-A \ + -a NA12889.GRCH38.haplotagged.bam \ + -o my_output_dir + +cat hifihla_report.json +{ + "sample_id": "NA12889.GRCh38.haplotagged", + "version": "hifihla 0.3.0;IPD-IMGT/HLA 3.55 (2024-01)", + "hla_calls": { + "HLA-A": [ + { + "HLA-A_0": [ + { + "allele_id": "HLA00037", + "star_name": "HLA-A*03:01:01:01", + "length": 3502, + "match_name": "*03:01:01:01", + "query_start": 0, + "query_end": 3502, + "covered_feat": [ + "UTR_5", + "Exon_1", + "Intron_1", + "Exon_2", + "Intron_2", + "Exon_3", + "Intron_3", + "Exon_4", + "Intron_4", + "Exon_5", + "Intron_5", + "Exon_6", + "Intron_6", + "Exon_7", + "Intron_7", + "Exon_8", + "UTR_3" + ], + "not_covered": [], + "coding_diffs": 0, + "noncode_eddist": 0, + "error_rate": 0.0023986294, + "coverage": 15, + "reads": [ + "m54329U_230312_191124/80677293/ccs", + "m84046_230327_223715_s3/52499911/ccs", + "m54329U_230314_044350/92145583/ccs", + "m54329U_230311_094504/134349275/ccs", + "m84046_230327_223715_s3/41027584/ccs", + "m84046_230327_223715_s3/80090755/ccs", + "m84046_230327_223715_s3/132322013/ccs", + "m84046_230327_223715_s3/80871474/ccs", + "m84046_230327_223715_s3/41226855/ccs", + "m54329U_230312_191124/63046676/ccs", + "m54329U_230311_094504/31064736/ccs", + "m54329U_230312_191124/56625320/ccs", + "m84046_230327_223715_s3/38012954/ccs", + "m84046_230327_223715_s3/23856426/ccs", + "m54329U_230314_044350/129892658/ccs" + ], + "differences": [] + } + ] + } + ] + } +} + ```