Skip to content

Commit

Permalink
Merge pull request #1 from PacificBiosciences/read_calling_update
Browse files Browse the repository at this point in the history
update README for v0.3.0 release
  • Loading branch information
jrharting authored Mar 22, 2024
2 parents 70a36a8 + b08f894 commit d7b358f
Show file tree
Hide file tree
Showing 10 changed files with 906 additions and 470 deletions.
34 changes: 34 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# v0.3.0: 03/21/24
## Changes
- New tool `call-reads` to call from HiFi reads (limited to class I)
- Extend reports for read-based calls
- Update IPD-IMGT/HLA to version 3.55 (2024-01)
- Add option to limit calls to genomic (full-length) records
- Update README

# v0.2.3: 12/22/23
## Changes
- Fix reporting bug for cdna calls
- Add option to require exon 2 to make a call
- Improved thread control and error reporting

# v0.2.2: 11/17/23
## Changes
- Update database to IPD-IMGT/HLA Version: 3.54 (2023-10)

# v0.2.1: 11/15/23
## Changes
- Call-contigs hap2 change to optional
- Improve output determinism
- Update README
- Fix validation bugs

# v0.2.0: 9/25/23
## Changes
- Update README for initial release
- Latest database version v3.53
- Bug fixes

# v0.1.0: 5/31/23
## Changes
- Report results in tsv and json format
480 changes: 10 additions & 470 deletions README.md

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions docs/genes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
## Genes <a name="genes"></a>

Genes called by `hifihla`:
```
HLA-A HLA-DQA1 HLA-DRB7 HLA-P
HLA-B HLA-DQA2 HLA-DRB8 HLA-S
HLA-C HLA-DQB1 HLA-E HLA-T
HLA-DMA HLA-DQB2 HLA-F HLA-U
HLA-DMB HLA-DRA HLA-G HLA-V
HLA-DOA HLA-DRB1 HLA-H HLA-W
HLA-DOB HLA-DRB2 HLA-HFE HLA-Y
HLA-DPA1 HLA-DRB3 HLA-J MICA
HLA-DPA2 HLA-DRB4 HLA-K MICB
HLA-DPB1 HLA-DRB5 HLA-L TAP1
HLA-DPB2 HLA-DRB6 HLA-N
```
11 changes: 11 additions & 0 deletions docs/install.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
## Install <a name="install"></a>

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/hifihla/README.html)

Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda) for information on Installation, Support, License, Copyright, and Disclaimer.

`hifihla` can be installed from bioconda
```
conda install -c bioconda hifihla
```
Binaries are also availible in the github [releases](https://github.com/PacificBiosciences/HiFiHLA/releases/latest).
198 changes: 198 additions & 0 deletions docs/output.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
## Output <a name="output"></a>
`call-reads`, `call-consensus` and `call-contigs` all generate three reports containing HLA star-allele type calls. Additionally, `call-contigs` produces fasta files of extracted sequences from the assembly.

| File | Description |
| -------------------------------------------- | ----------- |
| {output_dir}/hifihla_summary.tsv | Detailed file listing best call for each locus |
| {output_dir}/hifihla_report.tsv | Simple tsv file listing calls for each locus |
| {output_dir}/hifihla_report.json | Detailed results file, see below for example |
| {output_dir}/asm.contigs.h[12].fasta | Extracted (full) assembly contigs aligning to MHC |
| {output_dir}/asm.contigs.h[12].fasta.fai | FASTA index for contigs |
| {output_dir}/extracted.targets.h[12].fasta | Extracted targets used for star-typing |

### Detailed summary tsv
This file reports the single best call and statistics for each query sequence in the sample.

| Column Name | Description |
|--------------|------------------------------------------------------|
| queryId | Query sequence name |
| qLen | Length of query sequence |
| nMatches | Number of equivalent matches with same edit distance |
| gType | Closest matching genomic accession |
| gPctId | Percent matching bases of closest genomic match |
| gPctCov | Percent coverage of closest genomic match |
| gEdit | Edit distance to closest genomic match |
| cdnaType | Closest matching coding sequence |
| exCovered | Exons included in alignment |
| exEdit | Edit distance to cDNA match |
| coverage | HiFi read count of allele (call-reads only) |
| errRate | Total error rate of HiFi reads (call-reads only) |
| Type | Final type call |

Final type call is determined by edit distance and exact matches to the query (nMatches). The final call is four field if the query sequences matches exactly and is the only exact match returned.

`call-reads` summary for HG002 (Class I):
```
queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type
HLA-A_1 3503 1 HLA-A*01:01:01:01 100.0 100.0 0 HLA-A*01:01:01 1,2,3,4,5,6,7,8 0 15 0.00276 HLA-A*01:01:01:01
HLA-A_0 3517 1 HLA-A*26:01:01:01 100.0 100.0 0 HLA-A*26:01:01 1,2,3,4,5,6,7,8 0 13 0.00249 HLA-A*26:01:01:01
HLA-B_0 3855 1 HLA-B*35:08:01:01 100.0 100.0 0 HLA-B*35:08:01 1,2,3,4,5,6,7 0 18 0.00231 HLA-B*35:08:01:01
HLA-B_1 4070 1 HLA-B*38:01:01:01 100.0 100.0 0 HLA-B*38:01:01 1,2,3,4,5,6,7 0 10 0.00204 HLA-B*38:01:01:01
HLA-C_0 4264 1 HLA-C*04:01:01:06 100.0 100.0 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 12 0.00135 HLA-C*04:01:01:06
HLA-C_1 4303 1 HLA-C*12:03:01:01 100.0 100.0 0 HLA-C*12:03:01 1,2,3,4,5,6,7,8 0 8 0.00215 HLA-C*12:03:01:01
```
`call-contigs` summary for HG00733:
```
queryId qLen nMatches gType gPctId gPctCov gEdit cdnaType exCovered exEdit coverage errRate Type
HG00733#1#h1tg000070l_29911131_29915604 4474 1 HLA-A*24:02:01:01 100.0 100.0 0 HLA-A*24:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*24:02:01:01
HG00733#2#h2tg000008l_1854867_1859341 4475 1 HLA-A*30:02:01:01 100.0 100.0 0 HLA-A*30:02:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-A*30:02:01:01
HG00733#1#h1tg000070l_31324234_31329311 5078 1 HLA-B*35:02:01:02 100.0 100.0 0 HLA-B*35:02:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*35:02:01:02
HG00733#2#h2tg000008l_440231_445304 5074 1 HLA-B*18:01:01:01 100.0 100.0 0 HLA-B*18:01:01 1,2,3,4,5,6,7 0 1 N/A HLA-B*18:01:01:01
HG00733#1#h1tg000070l_31239869_31245142 5274 1 HLA-C*04:01:01:06 100.0 100.0 0 HLA-C*04:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*04:01:01:06
HG00733#2#h2tg000008l_525397_530669 5273 1 HLA-C*05:01:01:01 100.0 100.0 0 HLA-C*05:01:01 1,2,3,4,5,6,7,8 0 1 N/A HLA-C*05:01:01:01
HG00733#1#h1tg000070l_33003286_33014048 10763 1 HLA-DPA1*01:03:01:02 100.0 100.0 0 HLA-DPA1*01:03:01 1,2,3,4 0 1 N/A HLA-DPA1*01:03:01:02
HG00733#2#h2tg000060l_26807522_26818289 10768 1 HLA-DPA1*02:01:01:01 100.0 100.0 0 HLA-DPA1*02:01:01 1,2,3,4 0 1 N/A HLA-DPA1*02:01:01:01
HG00733#1#h1tg000070l_33014638_33027155 12518 1 HLA-DPB1*04:01:01:01 100.0 100.0 0 HLA-DPB1*04:01:01 1,2,3,4,5 0 1 N/A HLA-DPB1*04:01:01:01
HG00733#2#h2tg000060l_26794480_26806932 12453 1 HLA-DPB1*11:01:01:01 100.0 100.0 0 HLA-DPB1*11:01:01 1,2,3,4,5 0 1 N/A HLA-DPB1*11:01:01:01
HG00733#1#h1tg000070l_32583472_32591044 7573 1 HLA-DQA1*05:05:01:01 100.0 100.0 0 HLA-DQA1*05:05:01 1,2,3,4 0 1 N/A HLA-DQA1*05:05:01:01
HG00733#2#h2tg000060l_27233443_27240904 7462 1 HLA-DQA1*05:01:01:01 100.0 100.0 0 HLA-DQA1*05:01:01 1,2,3,4 0 1 N/A HLA-DQA1*05:01:01:01
HG00733#1#h1tg000070l_32600790_32608991 8202 1 HLA-DQB1*03:01:01:02 100.0 100.0 0 HLA-DQB1*03:01:01 1,2,3,4,5,6 0 1 N/A HLA-DQB1*03:01:01:02
HG00733#2#h2tg000060l_27214264_27222723 8460 1 HLA-DQB1*02:01:01:01 100.0 100.0 0 HLA-DQB1*02:01:01 1,2,3,4,5,6 0 1 N/A HLA-DQB1*02:01:01:01
HG00733#1#h1tg000070l_32526722_32541633 14912 1 HLA-DRB1*11:04:01:01 100.0 100.0 0 HLA-DRB1*11:04:01 1,2,3,4,5,6 0 1 N/A HLA-DRB1*11:04:01:01
HG00733#2#h2tg000060l_27282742_27297652 14911 1 HLA-DRB1*03:01:01:02 100.0 100.0 0 HLA-DRB1*03:01:01 1,2,3,4,5,6 0 1 N/A HLA-DRB1*03:01:01:02
HG00733#1#h1tg000070l_32462971_32477576 14606 1 HLA-DRB3*02:02:01:04 99.99 100.0 1 HLA-DRB3*02:02:01 1,2,3,4,5,6 0 1 N/A HLA-DRB3*02:02:01
HG00733#2#h2tg000060l_27346699_27361303 14605 1 HLA-DRB3*02:02:01:01 100.0 100.0 0 HLA-DRB3*02:02:01 1,2,3,4,5,6 0 1 N/A HLA-DRB3*02:02:01:01
```

### Simple tsv format
This file is formatted for a quick view of top-level results. It is appropriate as input for downstream interpretation tools like [pharmCAT](https://pharmcat.org/).

```
$ column -t hifihla_report.tsv
HLA-A *01:01:01:01/*26:01:01:01
HLA-B *38:01:01:01/*35:08:01:01
HLA-C *12:03:01:01/*04:01:01:06
```

### Detailed results json
Detailed results contain information on calls for each sequence typed. The json is structured in the following way:

```
{ sample_id: <sample_id>,
hla_calls: {
<locus>: {
<Record_id>: [
{
allele_id: <imgt_accession>,
star_name: <full_star_name>,
length: <accession_seq_length>,
match_name: <match_up_to_fields>,
query_start: <query_aln_start>,
query_end: <query_aln_end>,
covered_feat: <aligned_features>,
not_covered: <features_no_coverage>,
coding_diffs: <total_exon_differences>,
noncode_eddist: <edit_dist_introns_utr>,
error_rate: <total HiFi error rate (call-reads only)>,
coverage: <Number of reads covering allele (call-reads only)>,
reads: <HiFi read names used in analysis of allele (call-reads only)>,
differences: [
{
kind: <mismatch/insertion/deletion>,
pos: <position>,
size: <variant_length>,
feat: <feature_containing_variant>
},
...
]
}
],
...
}
...
}
}
```

Extracting detailed results for a given locus can be done with the [jq](https://jqlang.github.io/jq/) tool:
```
jq '.hla_calls | with_entries(select(.key == "HLA-DRB3"))' tmp2/hifihla_report.json
{
"HLA-DRB3": [
{
"HG00733#1#h1tg000070l_32462971_32477576": [
{
"allele_id": "HLA22697",
"star_name": "HLA-DRB3*02:02:01:04",
"length": 12905,
"match_name": "*02:02:01",
"query_start": 0,
"query_end": 12905,
"covered_feat": [
"UTR_5",
"Exon_1",
"Intron_1",
"Exon_2",
"Intron_2",
"Exon_3",
"Intron_3",
"Exon_4",
"Intron_4",
"Exon_5",
"Intron_5",
"Exon_6",
"UTR_3"
],
"not_covered": [],
"coding_diffs": 0,
"noncode_eddist": 1,
"error_rate": null,
"coverage": 1,
"reads": null,
"differences": [
{
"kind": "Mismatch",
"pos": 8158,
"size": 1,
"feat": "Intron_2"
}
]
}
],
"HG00733#2#h2tg000060l_27346699_27361303": [
{
"allele_id": "HLA00895",
"star_name": "HLA-DRB3*02:02:01:01",
"length": 13588,
"match_name": "*02:02:01:01",
"query_start": 0,
"query_end": 13588,
"covered_feat": [
"UTR_5",
"Exon_1",
"Intron_1",
"Exon_2",
"Intron_2",
"Exon_3",
"Intron_3",
"Exon_4",
"Intron_4",
"Exon_5",
"Intron_5",
"Exon_6",
"UTR_3"
],
"not_covered": [],
"coding_diffs": 0,
"noncode_eddist": 0,
"error_rate": null,
"coverage": 1,
"reads": null,
"differences": []
}
]
}
]
}
```
30 changes: 30 additions & 0 deletions docs/usage.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
## Usage <a name="usage"></a>

`hifihla` has four subcommands:
* [call-reads](usage_call-reads.md)
* [call-consensus](usage_call-consensus.md)
* [call-contigs](usage_call-contigs.md)
* [align-imgt](usage_align-imgt.md)

```
Usage: hifihla <COMMAND>
Commands:
call-reads Call HLA loci from an aligned BAM of HiFi reads
call-contigs Extract HLA loci from assembled MHC contigs & call star alleles on extracted sequences
call-consensus Call HLA Star (*) alleles from consensus sequences
align-imgt Align queries to IMGT/HLA genomic accession sequences
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
```

## Subcommand Inputs
| Subcommand | Input Type | File types |Description |
|----------------|-------------------------------------|-----------------|------------|
| call-reads | Aligned HiFi reads | BAM | Call Class I (ABC) from HiFi reads aligned to GRCh38 (no alts) |
| call-contigs | Aligned assembly; unaligned contigs | BAM, FASTA(.gz) | Extract and Call HLA loci from assembled MHC contigs |
| call-consensus | Amplicon/Isoseq consensus | FASTA | Call HLA alleles from consensus sequences (e.g. amplicon assays) |
| align-imgt | Sequence/IMGT accessions | FASTA | Compare sequences in fasta format or database sequences to specific IMGT/HLA genomic alleles |
Loading

0 comments on commit d7b358f

Please sign in to comment.