diff --git a/README.md b/README.md index 6845a7a..691f1f9 100755 --- a/README.md +++ b/README.md @@ -3,7 +3,9 @@

HiFiHLA

*** -An HLA star-calling tool for PacBio HiFi data types. +**An HLA star-calling tool for PacBio HiFi data types** + +HiFiHLA generates high resolution (4-field) HLA allele calls from PacBio HiFi data. HiFiHLA identifies the closest matching allele(s) and any differences between a sample and the IPD-IMGT/HLA database. Acceptable data types include aligned HiFi reads, assembly contigs, and amplicon consensus. Authors: [John Harting](https://github.com/jrharting), [Zev Kronenberg](https://github.com/zeeev), [Daniel Baker](https://github.com/dnbaker), [Matt Holt](https://github.com/holtjma) @@ -17,7 +19,19 @@ Authors: [John Harting](https://github.com/jrharting), [Zev Kronenberg](https:// 2. [Genes](docs/genes.md) 3. [Usage and Examples](docs/usage.md) 4. [Output](docs/output.md) -6. [Changelog](CHANGELOG.md) +6. [Changelog](docs/changelog.md) + +## Need help? +If you notice any missing features, bugs, or need assistance with analyzing the output of HiFiHLA, +please don't hesitate to open a GitHub issue. + +## Support information +HiFiHLA is a pre-release software intended for research use only and not for use in diagnostic procedures. +While efforts have been made to ensure that HiFiHLA lives up to the quality that PacBio strives for, we make no warranty regarding this software. + +As HiFiHLA is not covered by any service level agreement or the like, please do not contact a PacBio Field Applications Scientists or PacBio Customer Service for assistance with any HiFiHLA release. +Please report all issues through GitHub instead. +We make no warranty that any such issue will be addressed, to any extent or within any time frame. ## References Barker DJ, Maccari G, Georgiou X, Cooper MA, Flicek P, Robinson J, Marsh SGE. _The IPD-IMGT/HLA Database_. Nucleic Acids Research (2023) 51:D1053-60. diff --git a/CHANGELOG.md b/docs/changelog.md similarity index 71% rename from CHANGELOG.md rename to docs/changelog.md index d2f8b9a..b077ee6 100755 --- a/CHANGELOG.md +++ b/docs/changelog.md @@ -1,3 +1,10 @@ +# v0.3.1: 04/05/24 +## Changes +- Add output prefix option (takes directory or directory + prefix name) +- Deprecate `outdir` (maintain backwards compatibility until v1.0) +- Fix bug in call-reads where a read with partial exon2 (only) coverage blows up candidate pool +- Catch error from aligned inputs with wrong reference + # v0.3.0: 03/21/24 ## Changes - New tool `call-reads` to call from HiFi reads (limited to class I) diff --git a/docs/output.md b/docs/output.md index 796aebc..b56aa35 100755 --- a/docs/output.md +++ b/docs/output.md @@ -1,14 +1,14 @@ ## Output -`call-reads`, `call-consensus` and `call-contigs` all generate three reports containing HLA star-allele type calls. Additionally, `call-contigs` produces fasta files of extracted sequences from the assembly. +`call-reads`, `call-consensus` and `call-contigs` all generate three reports containing HLA star-allele type calls. Additionally, `call-contigs` produces fasta files of extracted sequences from the assembly. If `out_prefix` is given as a directory _+ prefix/samplename_, output files will be joined to the prefix with underscore `_`. | File | Description | | -------------------------------------------- | ----------- | -| {output_dir}/hifihla_summary.tsv | Detailed file listing best call for each locus | -| {output_dir}/hifihla_report.tsv | Simple tsv file listing calls for each locus | -| {output_dir}/hifihla_report.json | Detailed results file, see below for example | -| {output_dir}/asm.contigs.h[12].fasta | Extracted (full) assembly contigs aligning to MHC | -| {output_dir}/asm.contigs.h[12].fasta.fai | FASTA index for contigs | -| {output_dir}/extracted.targets.h[12].fasta | Extracted targets used for star-typing | +| {out_prefix}\[_/\]hifihla_summary.tsv | Detailed file listing best call for each locus | +| {out_prefix}\[_/\]hifihla_report.tsv | Simple tsv file listing calls for each locus | +| {out_prefix}\[_/\]hifihla_report.json | Detailed results file, see below for example | +| {out_prefix}\[_/\]asm.contigs.h[12].fasta | Extracted (full) assembly contigs aligning to MHC | +| {out_prefix}\[_/\]asm.contigs.h[12].fasta.fai | FASTA index for contigs | +| {out_prefix}\[_/\]extracted.targets.h[12].fasta | Extracted targets used for star-typing | ### Detailed summary tsv This file reports the single best call and statistics for each query sequence in the sample. diff --git a/docs/usage.md b/docs/usage.md index 84f8732..e8a1cd5 100755 --- a/docs/usage.md +++ b/docs/usage.md @@ -24,7 +24,7 @@ Options: ## Subcommand Inputs | Subcommand | Input Type | File types |Description | |----------------|-------------------------------------|-----------------|------------| -| call-reads | Aligned HiFi reads | BAM | Call Class I (ABC) from HiFi reads aligned to GRCh38 (no alts) | +| call-reads | Aligned HiFi reads | BAM | Call Class I (ABC) from HiFi reads aligned to [GRCH38 no alts](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz) | | call-contigs | Aligned assembly; unaligned contigs | BAM, FASTA(.gz) | Extract and Call HLA loci from assembled MHC contigs | | call-consensus | Amplicon/Isoseq consensus | FASTA | Call HLA alleles from consensus sequences (e.g. amplicon assays) | | align-imgt | Sequence/IMGT accessions | FASTA | Compare sequences in fasta format or database sequences to specific IMGT/HLA genomic alleles | diff --git a/docs/usage_call-consensus.md b/docs/usage_call-consensus.md index 43fb3dc..85f04c7 100755 --- a/docs/usage_call-consensus.md +++ b/docs/usage_call-consensus.md @@ -4,28 +4,26 @@ Optionally call star alleles using exon sequence only. ``` Call HLA Star (*) alleles from consensus sequences -Usage: hifihla call-consensus [OPTIONS] --fasta --outdir +Usage: hifihla call-consensus [OPTIONS] --fasta Options: - -f, --fasta Input fasta file of consensus sequences - -o, --outdir Output directory - -c, --cdna Enable cDNA-only calling - -e, --exon2 Require Exon2 in query sequence - -j, --threads Analysis threads [default: 1] - -x, --max_matches Maximum equivalent matches per query in report [default: 10] - -v, --verbose... Enable verbose output - --log-level Alternative to repeated -v/--verbose: set log level via key. - Equivalence to -v/--verbose: - => "Warn" - -v => "Info" - -vv => "Debug" - -vvv => "Trace" [default: Warn] - -h, --help Print help - -V, --version Print version + -f, --fasta Input fasta file of consensus sequences + -o, --out_prefix Output prefix + --outdir Output directory [deprecated] + -c, --cdna Enable cDNA-only calling + -e, --exon2 Require Exon2 in query sequence + -l, --full_length Full length IMGT records only + -j, --threads Analysis threads [default: 1] + -x, --max_matches Maximum equivalent matches per query in report [default: 10] + -v, --verbose... Enable verbose output + --log-level Alternative to repeated -v/--verbose: set log level via key. + -h, --help Print help + -V, --version Print version ``` #### Options Description * `--fasta` Fasta file of consensus query sequences. Only one allele per query sequence. -* `--outdir` Output directory. +* `--out_prefix` Output prefix, accepts a directory or a directory + prefix. +* `--outdir` Output directory \[deprecated\]. * `--cdna` Call and report only coding regions (cdna). Can be used for either DNA or RNA sequences. * `--exon2` Require exon 2 in query. This may reduce search space. * `--max_matches` Only report up to this number of matches in the json report. @@ -35,9 +33,9 @@ Type HLA consensus sequences, for example from [HiFi amplicon consensus with pba ``` hifihla call-consensus \ --fasta pbaa_12878-HG001_passed_cluster_sequences.fasta \ - --outdir my_output_dir/ + --out_prefix out_dir/my_sample -column -t my_output_dir/hifihla_summary.tsv +column -t out_dir/my_sample_hifihla_summary.tsv queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type sample-12878-HG001_guide-HLA-A_cluster-0_ReadCount-1023 3098 5 HLA-A*01:01:01:01 100.0 88.43 0 HLA-A*01:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*01:01:01 @@ -57,7 +55,7 @@ Note that query sequences that match >1 allele are not labeled as four-field mat For example, HLA-A_cluster-1 above matches three alleles perfectly over the amplified range. All three matches are listed in the json. ``` -jq '.. | objects | to_entries | .[] | select(.key == "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999")' hifihla_report.json +jq '.. | objects | to_entries | .[] | select(.key == "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999")' my_sample_hifihla_report.json { "key": "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999", diff --git a/docs/usage_call-contigs.md b/docs/usage_call-contigs.md index d0743a3..e041bfa 100755 --- a/docs/usage_call-contigs.md +++ b/docs/usage_call-contigs.md @@ -11,14 +11,15 @@ Options: * Limit output matches to full genomic IMGT accessions with `--full_length`. ``` Extract HLA loci from assembled MHC contigs & call star alleles on extracted sequences - -Usage: hifihla call-contigs [OPTIONS] --abam --hap1 --outdir + +Usage: hifihla call-contigs [OPTIONS] --abam --hap1 Options: -a, --abam Input assembly aligned to GRCh38 -p, --hap1 Input hap1 assembly fa(.gz) -m, --hap2 Input hap2 assembly fa(.gz) (optional) - -o, --outdir Output directory + -o, --out_prefix Output prefix + --outdir Output directory [deprecated] -l, --loci [...] Input comma-sep loci to extract [default: all] -s, --min_length Minimum length of extracted targets [default: 1000] -f, --full_length Full length IMGT records only @@ -26,11 +27,6 @@ Options: -j, --threads Analysis threads [default: 1] -v, --verbose... Enable verbose output --log-level Alternative to repeated -v/--verbose: set log level via key. - Equivalence to -v/--verbose: - => "Warn" - -v => "Info" - -vv => "Debug" - -vvv => "Trace" [default: Warn] -h, --help Print help -V, --version Print version ``` @@ -59,9 +55,9 @@ hifihla call-contigs \ --abam HG00733.asm.GRCh38_no_alts.bam \ --hap1 HG00733.paternal.f1_assembly_v2.fa.gz \ --hap2 HG00733.maternal.f1_assembly_v2.fa.gz \ - --outdir my_output_dir + --out_prefix out_dir/my_sample -head -7 my_output_dir/hifihla_summary.tsv | column -t +head -7 out_dir/my_sample_hifihla_summary.tsv | column -t queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type HG00733#1#h1tg000070l_29911131_29915604 4474 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*24:02:01:01 @@ -77,9 +73,9 @@ hifihla call-contigs \ --abam HG00733.asm.GRCh38_no_alts.bam \ --hap1 HG00733.paternal.f1_assembly_v2.fa.gz \ --loci HLA-DQA1,HLA-DPA1,HLA-DRB1 \ - --outdir my_output_dir + -o out_dir/my_sample -column -t my_output_dir/hifihla_summary.tsv +column -t out_dir/my_sample_hifihla_summary.tsv queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type HG00733#1#h1tg000070l_33003286_33014048 10763 1 HLA-DPA1*01:03:01:02 100.0 100.0 0 HLA-DPA1*01:03:01 1,2,3,4 0 1 N/A HLA-DPA1*01:03:01:02 diff --git a/docs/usage_call-reads.md b/docs/usage_call-reads.md index 6bb5d2d..0ae2cdb 100755 --- a/docs/usage_call-reads.md +++ b/docs/usage_call-reads.md @@ -3,7 +3,7 @@ ``` Call HLA loci from an aligned BAM of HiFi reads -Usage: hifihla call-reads [OPTIONS] --abam --outdir +Usage: hifihla call-reads [OPTIONS] --abam Options: -j, --threads Analysis threads [default: 1] @@ -13,7 +13,7 @@ Options: -V, --version Print version Input Options: - -a, --abam Input assembly aligned to GRCh38 + -a, --abam Input assembly aligned to GRCh38 (no alts) -l, --loci [...] Input comma-sep loci to extract [default: all] -d, --max_depth Maximum reads per locus [default: 50] -p, --partial Include partially-spanning reads @@ -21,17 +21,18 @@ Input Options: -s, --seed Random number seed for downsampling to max_depth [default: 42] Output Options: - -o, --outdir Output directory - -f, --full_length Full length IMGT records only (exclude exon-only records) - -x, --max_matches Maximum matches in output report [default: 10] - -m, --min_allele_freq Minimum allele frequency [default: 0.1] - -b, --min_cdf Minimum binomial CDF to call het/hom [default: 0.001] + -o, --out_prefix Output prefix + --outdir Output directory [deprecated] + -f, --full_length Full length IMGT records only (exclude exon-only records) + -x, --max_matches Maximum matches in output report [default: 10] + -m, --min_allele_freq Minimum allele frequency [default: 0.1] + -b, --min_cdf Minimum binomial CDF to call het/hom [default: 0.001] Presets: --preset Sequence type presets [possible values: te, wgs] ``` #### Input Options Description -* `--abam` HiFi reads aligned to GRCh38 (no alts). +* `--abam` HiFi reads aligned to [GRCH38 no alts](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). * `--loci` HLA loci to call. Currently limited to HLA-A,HLA-B,HLA-C. * `--max_depth` Maximim reads to use per locus. Reads are randomly downsampled if coverage > d. * `--partial` Include HiFi reads that do not fully span locus, but still span exon 2 (minimum requirement). @@ -39,7 +40,8 @@ Presets: * `--seed` Random number seed for downsampling and clustering. ### Output Options Description -* `--outdir` Output directory. +* `--out_prefix` Output prefix, accepts a directory or a directory + prefix. +* `--outdir` Output directory \[deprecated\]. * `--full_length` Restrict allele matches to full length IMGT records (exclude exon-only accessions). * `--max_matches` Maximum number of equivalent matches to list per query sequence. * `--min_allele_freq` Minimum fraction of reads for minor allele. Clusters with lower frequency will be ignored. @@ -55,9 +57,9 @@ hifihla call-reads \ --preset wgs \ -j 8 \ -a HG00733.GRCh38_no_alts.bam \ - -o my_output_dir + -o out_dir/my_sample -column -t my_output_dir/hifihla_summary.tsv +column -t out_dir/my_sample_hifihla_summary.tsv queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type HLA-A_1 3502 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 9 0.00346 HLA-A*24:02:01:01 @@ -93,9 +95,9 @@ hifihla call-reads \ -f \ -l HLA-A \ -a NA12889.GRCH38.haplotagged.bam \ - -o my_output_dir + -o out_dir/my_sample -cat hifihla_report.json +cat out_dir/my_sample_hifihla_report.json { "sample_id": "NA12889.GRCh38.haplotagged", "version": "hifihla 0.3.0;IPD-IMGT/HLA 3.55 (2024-01)",