Merge pull request #2 from PacificBiosciences/update

update docs for version bump
PacificBiosciences · Apr 5, 2024 · cf67053
2 parents d7b358f + 28090f1
commit cf67053

Showing 7 changed files with 72 additions and 55 deletions.
diff --git a/README.md b/README.md
@@ -3,7 +3,9 @@
 <h1 align="center">HiFiHLA</h1>
 
 ***
-An HLA star-calling tool for PacBio HiFi data types.   
+**An HLA star-calling tool for PacBio HiFi data types** 
+
+HiFiHLA generates high-resolution (4-field) HLA allele calls from PacBio HiFi data. It identifies the closest matching allele(s) and any differences between a sample and the IPD-IMGT/HLA database. Acceptable data types include aligned HiFi reads, assembly contigs, and amplicon consensus.
 
 Authors: [John Harting](https://github.com/jrharting), [Zev Kronenberg](https://github.com/zeeev), [Daniel Baker](https://github.com/dnbaker), [Matt Holt](https://github.com/holtjma)
 
@@ -17,7 +19,19 @@ Authors: [John Harting](https://github.com/jrharting), [Zev Kronenberg](https://
 2. [Genes](docs/genes.md)
 3. [Usage and Examples](docs/usage.md)
 4. [Output](docs/output.md)
-6. [Changelog](CHANGELOG.md)
+6. [Changelog](docs/changelog.md)
+
+## Need help?
+If you notice missing features or bugs, or need assistance with analyzing the output of HiFiHLA, 
+please don't hesitate to open a GitHub issue.
+
+## Support information
+HiFiHLA is pre-release software intended for research use only and not for use in diagnostic procedures. 
+While efforts have been made to ensure that HiFiHLA lives up to the quality that PacBio strives for, we make no warranty regarding this software.
+
+As HiFiHLA is not covered by any service level agreement or the like, please do not contact PacBio Field Applications Scientists or PacBio Customer Service for assistance with any HiFiHLA release. 
+Please report all issues through GitHub instead. 
+We make no warranty that any such issue will be addressed, to any extent or within any time frame.
 
 ## References <a name="references"></a>
 Barker DJ, Maccari G, Georgiou X, Cooper MA, Flicek P, Robinson J, Marsh SGE. _The IPD-IMGT/HLA Database_. Nucleic Acids Research (2023) 51:D1053-60.

diff --git a/CHANGELOG.md → docs/changelog.md b/CHANGELOG.md → docs/changelog.md
@@ -1,3 +1,10 @@
+# v0.3.1: 04/05/24
+## Changes
+- Add output prefix option (takes directory or directory + prefix name)
+- Deprecate `outdir` (maintain backwards compatibility until v1.0)
+- Fix bug in call-reads where a read with partial exon2 (only) coverage blows up candidate pool  
+- Catch error from aligned inputs with wrong reference
+
 # v0.3.0: 03/21/24 
 ## Changes
 - New tool `call-reads` to call from HiFi reads (limited to class I)
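
The v0.3.1 entry above deprecates `outdir` while keeping it usable until v1.0. A minimal Python sketch of one way such a fallback could behave is shown here. It is not HiFiHLA's implementation: the parser name `hifihla-sketch`, the helper `effective_prefix`, the warning text, and the behavior when neither option is given are all assumptions made for illustration.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors only the two output options in the docs.
    parser = argparse.ArgumentParser(prog="hifihla-sketch")
    parser.add_argument("-o", "--out_prefix", help="Output prefix")
    parser.add_argument("--outdir", help="Output directory [deprecated]")
    return parser


def effective_prefix(args: argparse.Namespace) -> str:
    """Return the prefix to use, preferring --out_prefix over --outdir."""
    if args.out_prefix is not None:
        return args.out_prefix
    if args.outdir is not None:
        warnings.warn("--outdir is deprecated; use --out_prefix", DeprecationWarning)
        # Assumption: the old directory maps to a prefix ending in a separator.
        return args.outdir.rstrip("/") + "/"
    # Assumption: the docs do not say what the real tool does in this case.
    raise ValueError("no output location given")
```

With this sketch, `--outdir my_output_dir` yields the prefix `my_output_dir/` plus a deprecation warning, while `-o out_dir/my_sample` is passed through unchanged.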

diff --git a/docs/output.md b/docs/output.md
@@ -1,14 +1,14 @@
 ## Output <a name="output"></a>
-`call-reads`, `call-consensus` and `call-contigs` all generate three reports containing HLA star-allele type calls.  Additionally, `call-contigs` produces fasta files of extracted sequences from the assembly.
+`call-reads`, `call-consensus` and `call-contigs` all generate three reports containing HLA star-allele type calls.  Additionally, `call-contigs` produces fasta files of extracted sequences from the assembly.  If `out_prefix` is given as a directory plus a prefix or sample name (e.g. `out_dir/my_sample`), output file names are joined to the prefix with an underscore (`_`). 
 
 | File                                         | Description |
 | -------------------------------------------- | ----------- |
-| {output_dir}/hifihla_summary.tsv             | Detailed file listing best call for each locus    |
-| {output_dir}/hifihla_report.tsv              | Simple tsv file listing calls for each locus      |
-| {output_dir}/hifihla_report.json             | Detailed results file, see below for example      |
-| {output_dir}/asm.contigs.h[12].fasta         | Extracted (full) assembly contigs aligning to MHC |
-| {output_dir}/asm.contigs.h[12].fasta.fai     | FASTA index for contigs                           |
-| {output_dir}/extracted.targets.h[12].fasta   | Extracted targets used for star-typing            |
+| {out_prefix}\[_/\]hifihla_summary.tsv             | Detailed file listing best call for each locus    |
+| {out_prefix}\[_/\]hifihla_report.tsv              | Simple tsv file listing calls for each locus      |
+| {out_prefix}\[_/\]hifihla_report.json             | Detailed results file, see below for example      |
+| {out_prefix}\[_/\]asm.contigs.h[12].fasta         | Extracted (full) assembly contigs aligning to MHC |
+| {out_prefix}\[_/\]asm.contigs.h[12].fasta.fai     | FASTA index for contigs                           |
+| {out_prefix}\[_/\]extracted.targets.h[12].fasta   | Extracted targets used for star-typing            |
 
 ### Detailed summary tsv
 This file reports the single best call and statistics for each query sequence in the sample.
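
The `{out_prefix}[_/]` naming in the table above can be sketched in Python as follows. This is an illustration, not HiFiHLA's code: the helper `output_path` is hypothetical, and the rule used to tell a directory from a directory + prefix (trailing `/` or an existing directory) is an assumption, since the docs do not state how the tool decides.

```python
import os


def output_path(out_prefix: str, filename: str) -> str:
    """Join a report file name to out_prefix with '/' or '_'."""
    # Assumption: a trailing slash or an existing directory means "directory".
    if out_prefix.endswith("/") or os.path.isdir(out_prefix):
        return os.path.join(out_prefix, filename)
    # Otherwise treat the last path component as a prefix or sample name.
    return f"{out_prefix}_{filename}"


print(output_path("out_dir/my_sample", "hifihla_summary.tsv"))
print(output_path("out_dir/", "hifihla_report.json"))
```

Under these assumptions, `out_dir/my_sample` produces `out_dir/my_sample_hifihla_summary.tsv`, which matches the file names used in the updated usage examples.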

diff --git a/docs/usage.md b/docs/usage.md
@@ -24,7 +24,7 @@ Options:
 ## Subcommand Inputs
 | Subcommand     | Input Type                          | File types      |Description |
 |----------------|-------------------------------------|-----------------|------------|
-| call-reads     | Aligned HiFi reads                  | BAM             | Call Class I (ABC) from HiFi reads aligned to GRCh38 (no alts)                               | 
+| call-reads     | Aligned HiFi reads                  | BAM             | Call Class I (ABC) from HiFi reads aligned to [GRCh38 no alts](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz) | 
 | call-contigs   | Aligned assembly; unaligned contigs | BAM, FASTA(.gz) | Extract and Call HLA loci from assembled MHC contigs                                         |
 | call-consensus | Amplicon/Isoseq consensus           | FASTA           | Call HLA alleles from consensus sequences (e.g. amplicon assays)                             |
 | align-imgt     | Sequence/IMGT accessions            | FASTA           | Compare sequences in fasta format or database sequences to specific IMGT/HLA genomic alleles |
diff --git a/docs/usage_call-consensus.md b/docs/usage_call-consensus.md
@@ -4,28 +4,26 @@ Optionally call star alleles using exon sequence only.
 ```
 Call HLA Star (*) alleles from consensus sequences
 
-Usage: hifihla call-consensus [OPTIONS] --fasta <FASTA> --outdir <OUTDIR>
+Usage: hifihla call-consensus [OPTIONS] --fasta <FASTA>
 
 Options:
-  -f, --fasta <FASTA>          Input fasta file of consensus sequences
-  -o, --outdir <OUTDIR>        Output directory
-  -c, --cdna                   Enable cDNA-only calling
-  -e, --exon2                  Require Exon2 in query sequence
-  -j, --threads <THREADS>      Analysis threads [default: 1]
-  -x, --max_matches <MATCHES>  Maximum equivalent matches per query in report [default: 10]
-  -v, --verbose...             Enable verbose output
-      --log-level <LOG_LEVEL>  Alternative to repeated -v/--verbose: set log level via key.
-                               Equivalence to -v/--verbose:
-                                     => "Warn"
-                                  -v => "Info"
-                                 -vv => "Debug"
-                                -vvv => "Trace" [default: Warn]
-  -h, --help                   Print help
-  -V, --version                Print version
+  -f, --fasta <FASTA>            Input fasta file of consensus sequences
+  -o, --out_prefix <OUT_PREFIX>  Output prefix
+      --outdir <OUTDIR>          Output directory [deprecated]
+  -c, --cdna                     Enable cDNA-only calling
+  -e, --exon2                    Require Exon2 in query sequence
+  -l, --full_length              Full length IMGT records only
+  -j, --threads <THREADS>        Analysis threads [default: 1]
+  -x, --max_matches <MATCHES>    Maximum equivalent matches per query in report [default: 10]
+  -v, --verbose...               Enable verbose output
+      --log-level <LOG_LEVEL>    Alternative to repeated -v/--verbose: set log level via key.
+  -h, --help                     Print help
+  -V, --version                  Print version
 ```
 #### Options Description
 * `--fasta` Fasta file of consensus query sequences. Only one allele per query sequence.
-* `--outdir` Output directory.
+* `--out_prefix` Output prefix, accepts a directory or a directory + prefix.
+* `--outdir` Output directory \[deprecated\].
 * `--cdna` Call and report only coding regions (cdna).  Can be used for either DNA or RNA sequences.
 * `--exon2` Require exon 2 in query. This may reduce search space.
 * `--max_matches` Only report up to this number of matches in the json report.
@@ -35,9 +33,9 @@ Type HLA consensus sequences, for example from [HiFi amplicon consensus with pba
 ```
 hifihla call-consensus \
         --fasta pbaa_12878-HG001_passed_cluster_sequences.fasta \
-        --outdir my_output_dir/
+        --out_prefix out_dir/my_sample
 
-column -t my_output_dir/hifihla_summary.tsv
+column -t out_dir/my_sample_hifihla_summary.tsv
 
 queryId                                                     qLen  nMatches  gType                 gPctId  gPctCov  gEdit  cdnaType           exCovered        exEdit  coverage  errRate  Type
 sample-12878-HG001_guide-HLA-A_cluster-0_ReadCount-1023     3098  5         HLA-A*01:01:01:01     100.0   88.43    0      HLA-A*01:01:01     1,2,3,4,5,6,7,8  0       1         N/A      HLA-A*01:01:01
@@ -57,7 +55,7 @@ Note that query sequences that match >1 allele are not labeled as four-field mat
 For example, HLA-A_cluster-1 above matches three alleles perfectly over the amplified range.  All three matches are listed in the json.
 
 ```
-jq '.. | objects | to_entries | .[] | select(.key == "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999")' hifihla_report.json
+jq '.. | objects | to_entries | .[] | select(.key == "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999")' my_sample_hifihla_report.json
 
 {
   "key": "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999",

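For readers without `jq`, the recursive lookup in the `call-consensus` example above can be approximated in Python. This is a sketch: the helper `find_entries` is hypothetical, and the toy document below is invented only to exercise the function; it does not reflect the real layout of `hifihla_report.json`.

```python
import json
from typing import Any, Dict, Iterator

QUERY_ID = "sample-12878-HG001_guide-HLA-A_cluster-1_ReadCount-999"


def find_entries(node: Any, key: str) -> Iterator[Dict[str, Any]]:
    """Yield {"key", "value"} for every object at any depth that has `key`."""
    if isinstance(node, dict):
        if key in node:
            yield {"key": key, "value": node[key]}
        for child in node.values():
            yield from find_entries(child, key)
    elif isinstance(node, list):
        for child in node:
            yield from find_entries(child, key)


# Toy document for illustration only; NOT the real report structure.
toy_report = {"outer": [{"inner": {QUERY_ID: {"placeholder": True}}}]}

for entry in find_entries(toy_report, QUERY_ID):
    print(json.dumps(entry, indent=2))
```

To run it on a real report, replace `toy_report` with the result of `json.load` on the report file, for example `my_sample_hifihla_report.json`.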
diff --git a/docs/usage_call-contigs.md b/docs/usage_call-contigs.md
@@ -11,26 +11,22 @@ Options:
  * Limit output matches to full genomic IMGT accessions with `--full_length`.
 ```
 Extract HLA loci from assembled MHC contigs & call star alleles on extracted sequences
- 
-Usage: hifihla call-contigs [OPTIONS] --abam <ALIGNED_ASSEMBLY> --hap1 <HAP1_FA> --outdir <OUTDIR>
+
+Usage: hifihla call-contigs [OPTIONS] --abam <ALIGNED_ASSEMBLY> --hap1 <HAP1_FA>
 
 Options:
   -a, --abam <ALIGNED_ASSEMBLY>  Input assembly aligned to GRCh38
   -p, --hap1 <HAP1_FA>           Input hap1 assembly fa(.gz)
   -m, --hap2 <HAP2_FA>           Input hap2 assembly fa(.gz) (optional)
-  -o, --outdir <OUTDIR>          Output directory
+  -o, --out_prefix <OUT_PREFIX>  Output prefix
+      --outdir <OUTDIR>          Output directory [deprecated]
   -l, --loci [<LOCI>...]         Input comma-sep loci to extract [default: all]
   -s, --min_length <MINLENGTH>   Minimum length of extracted targets [default: 1000]
   -f, --full_length              Full length IMGT records only
   -x, --max_matches <MATCHES>    Maximum equivalent matches per query in report [default: 10]
   -j, --threads <THREADS>        Analysis threads [default: 1]
   -v, --verbose...               Enable verbose output
       --log-level <LOG_LEVEL>    Alternative to repeated -v/--verbose: set log level via key.
-                                 Equivalence to -v/--verbose:
-                                       => "Warn"
-                                    -v => "Info"
-                                   -vv => "Debug"
-                                  -vvv => "Trace" [default: Warn]
   -h, --help                     Print help
   -V, --version                  Print version
 ```
@@ -59,9 +55,9 @@ hifihla call-contigs \
         --abam HG00733.asm.GRCh38_no_alts.bam \
         --hap1 HG00733.paternal.f1_assembly_v2.fa.gz \
         --hap2 HG00733.maternal.f1_assembly_v2.fa.gz \
-        --outdir my_output_dir
+        --out_prefix out_dir/my_sample
 
-head -7 my_output_dir/hifihla_summary.tsv | column -t
+head -7 out_dir/my_sample_hifihla_summary.tsv | column -t
  
 queryId                                  qLen  nMatches  gType              gPctId  gPctCov  gEdit  cdnaType        exCovered        exEdit  coverage  errRate  Type
 HG00733#1#h1tg000070l_29911131_29915604  4474  1         HLA-A*24:02:01:01  100.0   100.0    0      HLA-A*24:02:01  1,2,3,4,5,6,7,8  0       1         N/A      HLA-A*24:02:01:01
@@ -77,9 +73,9 @@ hifihla call-contigs \
         --abam HG00733.asm.GRCh38_no_alts.bam \
         --hap1 HG00733.paternal.f1_assembly_v2.fa.gz \
         --loci HLA-DQA1,HLA-DPA1,HLA-DRB1 \
-        --outdir my_output_dir
+        -o out_dir/my_sample
 
-column -t my_output_dir/hifihla_summary.tsv
+column -t out_dir/my_sample_hifihla_summary.tsv
 
 queryId                                  qLen   nMatches  gType                 gPctId  gPctCov  gEdit  cdnaType           exCovered    exEdit  coverage  errRate  Type
 HG00733#1#h1tg000070l_33003286_33014048  10763  1         HLA-DPA1*01:03:01:02  100.0   100.0    0      HLA-DPA1*01:03:01  1,2,3,4      0       1         N/A      HLA-DPA1*01:03:01:02
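
The summary file shown above can also be read programmatically. The following Python sketch assumes the file is tab-separated with the header row shown in the example; the helper `calls_by_query` is hypothetical, and the sample row is copied from the example output above rather than from a real run.

```python
import csv
import io
from typing import Dict, Iterable

HEADER = ["queryId", "qLen", "nMatches", "gType", "gPctId", "gPctCov", "gEdit",
          "cdnaType", "exCovered", "exEdit", "coverage", "errRate", "Type"]
ROW = ["HG00733#1#h1tg000070l_33003286_33014048", "10763", "1",
       "HLA-DPA1*01:03:01:02", "100.0", "100.0", "0", "HLA-DPA1*01:03:01",
       "1,2,3,4", "0", "1", "N/A", "HLA-DPA1*01:03:01:02"]


def calls_by_query(lines: Iterable[str]) -> Dict[str, str]:
    """Map each queryId to its final call in the Type column."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["queryId"]: row["Type"] for row in reader}


# In-memory stand-in for out_dir/my_sample_hifihla_summary.tsv.
sample = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"
for query_id, call in calls_by_query(io.StringIO(sample)).items():
    print(query_id, call)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.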

diff --git a/docs/usage_call-reads.md b/docs/usage_call-reads.md
@@ -3,7 +3,7 @@
 ```
 Call HLA loci from an aligned BAM of HiFi reads
 
-Usage: hifihla call-reads [OPTIONS] --abam <ALIGNED_READS> --outdir <OUTDIR>
+Usage: hifihla call-reads [OPTIONS] --abam <ALIGNED_READS>
 
 Options:
   -j, --threads <THREADS>      Analysis threads [default: 1]
@@ -13,33 +13,35 @@ Options:
   -V, --version                Print version
 
 Input Options:
-  -a, --abam <ALIGNED_READS>     Input assembly aligned to GRCh38
+  -a, --abam <ALIGNED_READS>     Input assembly aligned to GRCh38 (no alts)
   -l, --loci [<LOCI>...]         Input comma-sep loci to extract [default: all]
   -d, --max_depth <MAX_DEPTH>    Maximum reads per locus [default: 50]
   -p, --partial                  Include partially-spanning reads
   -t, --haplotypes <HAPLOTYPES>  Haplotypes in sample [default: 2] [possible values: 1, 2]
   -s, --seed <SEED>              Random number seed for downsampling to max_depth [default: 42]
 
 Output Options:
-  -o, --outdir <OUTDIR>        Output directory
-  -f, --full_length            Full length IMGT records only (exclude exon-only records)
-  -x, --max_matches <MATCHES>  Maximum matches in output report [default: 10]
-  -m, --min_allele_freq <MAF>  Minimum allele frequency [default: 0.1]
-  -b, --min_cdf <MINCDF>       Minimum binomial CDF to call het/hom [default: 0.001]
+  -o, --out_prefix <OUT_PREFIX>  Output prefix
+      --outdir <OUTDIR>          Output directory [deprecated]
+  -f, --full_length              Full length IMGT records only (exclude exon-only records)
+  -x, --max_matches <MATCHES>    Maximum matches in output report [default: 10]
+  -m, --min_allele_freq <MAF>    Minimum allele frequency [default: 0.1]
+  -b, --min_cdf <MINCDF>         Minimum binomial CDF to call het/hom [default: 0.001]
 
 Presets:
   --preset <PRESET>  Sequence type presets [possible values: te, wgs]
 ```
 #### Input Options Description
-* `--abam` HiFi reads aligned to GRCh38 (no alts).
+* `--abam` HiFi reads aligned to [GRCh38 no alts](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz).
 * `--loci` HLA loci to call. Currently limited to HLA-A,HLA-B,HLA-C.
 * `--max_depth` Maximum reads to use per locus. Reads are randomly downsampled if coverage > `--max_depth`.
 * `--partial` Include HiFi reads that do not fully span locus, but still span exon 2 (minimum requirement).
 * `--haplotypes` Expected number of haplotypes in sample.  
 * `--seed` Random number seed for downsampling and clustering.
 
 ### Output Options Description
-* `--outdir` Output directory.
+* `--out_prefix` Output prefix, accepts a directory or a directory + prefix. 
+* `--outdir` Output directory \[deprecated\].
 * `--full_length` Restrict allele matches to full length IMGT records (exclude exon-only accessions).
 * `--max_matches` Maximum number of equivalent matches to list per query sequence.
 * `--min_allele_freq` Minimum fraction of reads for minor allele.  Clusters with lower frequency will be ignored.
@@ -55,9 +57,9 @@ hifihla call-reads \
         --preset wgs \
         -j 8 \
         -a HG00733.GRCh38_no_alts.bam \
-        -o my_output_dir
+        -o out_dir/my_sample
 
-column -t my_output_dir/hifihla_summary.tsv
+column -t out_dir/my_sample_hifihla_summary.tsv
 
 queryId  qLen  nMatches  gType              gPctId  gPctCov  gEdit  cdnaType        exCovered        exEdit  coverage  errRate  Type
 HLA-A_1  3502  1         HLA-A*24:02:01:01  100.0   100.0    0      HLA-A*24:02:01  1,2,3,4,5,6,7,8  0       9         0.00346  HLA-A*24:02:01:01
@@ -93,9 +95,9 @@ hifihla call-reads \
         -f \
         -l HLA-A \
         -a NA12889.GRCH38.haplotagged.bam \
-        -o my_output_dir
+        -o out_dir/my_sample
 
-cat hifihla_report.json
+cat out_dir/my_sample_hifihla_report.json
 {
   "sample_id": "NA12889.GRCh38.haplotagged",
   "version": "hifihla 0.3.0;IPD-IMGT/HLA 3.55 (2024-01)",