refine
Documentation In Progress
As described in the `phab` documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented. To help automate the process of running Truvari `phab` on a benchmarking result, we present the tool `refine`.
After making a `bench` result:
truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
Use `refine` on the `result/` directory:
truvari refine -r subset.bed -f ref.fa result/
- `refine.variant_summary.json` - Result of re-evaluating calls within the specified regions. Same structure as `summary.json`
- `refine.regions.txt` - Tab-delimited file with per-region variant counts
- `refine.region_summary.json` - Per-region performance metrics
- `phab/` - Per-region results from variant re-evaluation (optional)
- `phab_bench/` - Bench results on the harmonized variants
To see an example output, look at test data
Column | Description |
---|---|
chrom | Region's chromosome |
start | Region's start |
end | Region's end |
in_tpbase | Input's True Positive base count |
in_tp | Input's True Positive comparison count |
in_fp | Input's false positive count |
in_fn | Input's false negative count |
refined | Boolean for if region was re-evaluated |
out_tpbase | Output's true positive base count |
out_tp | Output's true positive comparison count |
out_fn | Output's false negative count |
out_fp | Output's false positive count |
state | Region's classification (TP, TN, FP, FN, "FP,FN", or UNK; see the rules below) |
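A minimal sketch of loading a table with the columns above, using only the standard library. It assumes the file starts with a header row carrying exactly these column names and that the count columns hold integers; `read_regions` and the single example row are made up for illustration and are not part of Truvari.

```python
import csv
import io

# Columns expected to hold integers (assumption based on the table above).
INT_COLUMNS = (
    "start", "end",
    "in_tpbase", "in_tp", "in_fp", "in_fn",
    "out_tpbase", "out_tp", "out_fn", "out_fp",
)

def read_regions(handle):
    """Parse a tab-delimited regions table into a list of dicts (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        rows.append(row)
    return rows

# Toy input standing in for refine.regions.txt; the values are invented.
example = io.StringIO(
    "chrom\tstart\tend\tin_tpbase\tin_tp\tin_fp\tin_fn\trefined\t"
    "out_tpbase\tout_tp\tout_fn\tout_fp\tstate\n"
    "chr1\t100\t200\t1\t1\t0\t1\tTrue\t2\t2\t0\t0\tTP\n"
)
rows = read_regions(example)
```

With a real result, the same function could be called on `open("result/refine.regions.txt")` instead of the in-memory example.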
Because `truvari phab` can alter variant counts during harmonization, one may wish to assess performance on a per-region basis rather than a per-variant basis. In `refine.regions.txt`, the `state` column will have a TP/FN/FP value as defined by the following rules:
false_pos = (data['out_fp'] != 0)
false_neg = (data['out_fn'] != 0)
any_false = false_pos | false_neg
true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false
true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1)
This logic has two edge cases to consider. 1) A region with at least one false positive and one false negative will be counted as both a false positive and a false negative. 2) Regions within `--refdist` of a neighboring region may experience 'variant bleed', where a region has e.g. an `out_tp` but no other variants because the neighboring region actually contains the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore the region when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions.
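The rules and the two edge cases above can be sketched per region in plain Python. This is an illustration under stated assumptions, not Truvari's implementation: `region_state` and `tally_states` are hypothetical helpers, and treating every region that matches none of the rules as 'UNK' is an assumption drawn from the description of variant bleed.

```python
from collections import Counter

OUT_COLUMNS = ("out_tpbase", "out_tp", "out_fn", "out_fp")

def region_state(row):
    """Classify one region from its out_* counts (hypothetical helper)."""
    false_pos = row["out_fp"] != 0
    false_neg = row["out_fn"] != 0
    any_false = false_pos or false_neg
    true_pos = row["out_tp"] != 0 and row["out_tpbase"] != 0 and not any_false
    true_neg = all(row[col] == 0 for col in OUT_COLUMNS)
    if false_pos and false_neg:
        return "FP,FN"  # edge case 1: the region counts as both
    if false_pos:
        return "FP"
    if false_neg:
        return "FN"
    if true_pos:
        return "TP"
    if true_neg:
        return "TN"
    return "UNK"  # edge case 2 (assumed): e.g. out_tp without out_tpbase

def tally_states(states):
    """Count states; 'FP,FN' adds to both FP and FN, and 'UNK' is ignored."""
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for state in states:
        if state == "UNK":
            continue
        for label in state.split(","):
            counts[label] += 1
    return dict(counts)
```

For example, a region with `out_tp` of 1 and every other `out_*` count at 0 falls through to 'UNK' and is left out of the tally.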
These by-region state counts are summarized and written to `refine.region_summary.json`. The metrics inside this JSON are defined as:
Key | Definition | Formula |
---|---|---|
TP | True Positive region count | |
TN | True Negative region count | |
FP | False Positive region count | |
FN | False Negative region count | |
condition P | Regions with base variant(s) | TP + FN |
condition N | Regions without base variant(s) | TN + FP |
test P | Regions with comparison variant(s) | TP + FP |
test N | Regions without comparison variant(s) | TN + FN |
PPV | Positive Predictive Value (a.k.a. precision) | TP / test P |
TPR | True Positive Rate (a.k.a. recall) | TP / condition P |
TNR | True Negative Rate (a.k.a. specificity) | TN / condition N |
NPV | Negative Predictive Value | TN / test N |
ACC | Accuracy | (TP + TN) / (condition P + condition N) |
BA | Balanced Accuracy | (TPR + TNR) / 2 |
F1 | f1 score | 2 * ((PPV * TPR) / (PPV + TPR)) |
Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate the by-variant performance reports from the by-region performance reports.
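The formulas in the table above can be sketched as a small function over the four region counts. The function name `region_metrics` is hypothetical, and returning `None` when a denominator is zero is an assumption made for this sketch; how `refine` itself reports undefined ratios is not stated here.

```python
def region_metrics(tp, tn, fp, fn):
    """Compute the per-region metrics from TP/TN/FP/FN counts (hypothetical helper)."""
    def ratio(numerator, denominator):
        # Assumption: undefined ratios are reported as None.
        return numerator / denominator if denominator else None

    condition_p = tp + fn  # regions with base variant(s)
    condition_n = tn + fp  # regions without base variant(s)
    test_p = tp + fp       # regions with comparison variant(s)
    test_n = tn + fn       # regions without comparison variant(s)

    ppv = ratio(tp, test_p)
    tpr = ratio(tp, condition_p)
    tnr = ratio(tn, condition_n)
    npv = ratio(tn, test_n)
    acc = ratio(tp + tn, condition_p + condition_n)
    ba = (tpr + tnr) / 2 if tpr is not None and tnr is not None else None
    if ppv is None or tpr is None or ppv + tpr == 0:
        f1 = None
    else:
        f1 = 2 * ((ppv * tpr) / (ppv + tpr))

    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "condition P": condition_p, "condition N": condition_n,
        "test P": test_p, "test N": test_n,
        "PPV": ppv, "TPR": tpr, "TNR": tnr, "NPV": npv,
        "ACC": acc, "BA": ba, "F1": f1,
    }
```

Because a region in the "FP,FN" state is counted twice, the FP and FN inputs here can together exceed the number of distinct regions that contain an error.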
By default, `refine` will use the base/comparison variants from the `bench` results `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, these files typically contain only a filtered subset of the variants originally provided to `bench`, since `bench` applies filters such as `--sizemin` and `--passonly`. With the `--use-original` parameter, all of the original calls from the input VCFs are fetched.
The `--regions` parameter specifies which regions to re-evaluate. If it is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--regions` are subset to only those intersecting `--includebed`.
Note that the larger these regions are, the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`. For example, if a `--regions` entry only partially overlaps an `--includebed` region, the calls being analyzed may not be a strict subset of the calls looked at during the original `bench` run. Therefore, the `*summary.json` files should be compared with caution.
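The subsetting and the partial-overlap edge effect described above can be illustrated with a short sketch. It assumes BED-style half-open `(chrom, start, end)` intervals and keeps each intersecting region whole; `overlaps`, `subset_regions`, and the coordinates are invented for illustration and do not reflect how `refine` is implemented.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def subset_regions(regions, includebed):
    """Keep only the regions that intersect at least one includebed interval."""
    return [r for r in regions if any(overlaps(r, inc) for inc in includebed)]

# Toy coordinates, not taken from any real dataset.
regions = [("chr1", 100, 200), ("chr1", 450, 600), ("chr2", 100, 200)]
includebed = [("chr1", 0, 500)]
kept = subset_regions(regions, includebed)
```

In this toy case the region `chr1:450-600` is kept even though it extends past the end of the includebed interval at 500, so variants between 500 and 600 would be re-evaluated although the original `bench` run did not consider them.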
When intersecting includebed with regions, use includebed coordinates. This is helpful when the original `bench` result's `--includebed` boundaries should be used instead of the `--regions` boundaries.
By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine`, as it is used by `phab` to realign variants.