Note
This repository contains code that runs inside a VarFish Server. If you are looking to run your own VarFish Server, look here at bihealth/varfish-server.
This repository contains the worker used by VarFish Server to execute certain background task. They are written in the Rust programming language to speed up the execution of certain tasks. At the moment, the following sub commands exist:
db
-- subcommands to build binary (protobuf) database filesseqvars
-- subcommands for processing sequence (aka small/SNV/indel) variantsseqvars ingest
-- convert single VCF file into internal format for use withseqvars query
seqvars query
-- perform sequence variant filtration and on-the-fly annotationseqvars prefilter
-- limit the result ofseqvars prefilter
by population frequency and/or distance to exonseqvars aggregate
-- read through multiple VCF files written byseqvars ingest
and computes a carrier counts table.
strucvars
-- subcommands for processing structural (aka large variants, CNVs, etc.) variantsstrucvars ingest
-- convert one or more structural variant files for use withstrucvars query
strucvars aggregate
-- compile per-case structural variant into an in-house database, to be converted to.bin
withstrucvars txt-to-bin
.strucvars txt-to-bin
-- convert text files downloaded by varfish-db-downloader to binary for fast use instrucvars query
commandsstrucvars query
-- perform structural variant filtration and on-the-fly annotation
For running queries, the worker tool is installed into the VarFish Server image and are run as executables. Internally, VarFish server works on VCF files stored in an S3 storage.
For import, the user gives the server access to the VCF files to import.
The server will then use the worker executable to ingest the data into the internal format using {seqvars,strucvars} ingest
.
These files are then stored in the internal S3 storage.
For queries, the server will create a query JSON file and then pass this query JSON file together with the internal file to the worker executable. The worker will create a result file that can be directly imported by the server to be displayed to the user.
Future versions may provide persistently running HTTP/REST servers that provide functionality without startup cost.
This command takes as the input a single VCF file from a (supported) variant caller and converts it into a file for further querying. The command interprets the following fields which are written out by the commonly used variant callers such as GATK UnifiedGenotyper, GATK HaplotypeCaller, and Illumina Dragen.
FORMAT/GT
-- genotype- the following
GT
values are written out as0/0
,0/1
,1/0
,1/1
,0|0
,0|1
,1|0
,1|1
,./.
,.|.
,.
- no combination of no-call (
.
) and called allele is written out
- the following
FORMAT/GQ
-- genotype qualityFORMAT/DP
-- total read coverageFORMAT/AD
-- allelic depth, one value per allele (including reference0)FORMAT/PS
-- physical phasing information as written out by GATK HaplotypeCaller in GVCF workflow and Dragen variant callerFORMAT/SQ
-- "somatic quality" for each alternate allele, as written out by Illumina Dragen variant caller- this field will be written as
FORMAT/GQ
- this field will be written as
The seqvars ingest
command will annotate the variants with the following information:
- gnomAD genomes and exomes allele frequencies
- gnomAD-mtDNA and HelixMtDb allele frequencies
- functional annotation following the VCF ANN field standard
Gene_Name
is writen as HGNC symbolGene_ID
is written as HGNC ID
The command will emit one output line for each variant allele from the input and each affected gene. That is, if two variant alleles affect two genes, four records will be written to the output file. The annotation will be written out for one highest impact.
Overall, the command will emit the following header rows in addition to the ##contig=<ID=.,length=.>
lines.
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=gnomad_exomes_an,Number=1,Type=Integer,Description="Number of alleles in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_genomes_an,Number=1,Type=Integer,Description="Number of alleles in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD genomes">
##INFO=<ID=helix_an,Number=1,Type=Integer,Description="Number of alleles in HelixMtDb">
##INFO=<ID=helix_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in HelixMtDb">
##INFO=<ID=helix_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in HelixMtDb">
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=<ID=varfish-server-worker,Version=x.y.z>
##x-varfish-version=<ID=orig-caller,Name=Dragen,Version=SW: 07.021.624.3.10.9, HW: 07.021.624>
##x-varfish-version=<ID=orig-caller,Name=GatkHaplotypeCaller,Version=4.4.0.0>
Note
The gnomad-mtDNA information is written to the INFO/gnomdad_genome_*
fields.
Note
Future versions of the worker will annotate the worst effect on a MANE select or MANE Clinical transcript.
This file takes as the input a file created by seqvars ingest
and filters the variants by population frequency and/or distance to exon.
You can pass the prefilter criteria as JSON on the command line corresponding to the following Rust structs:
struct PrefilterParams {
/// Path to output file.
pub path_out: String,
/// Maximal allele population frequency.
pub max_freq: f64,
/// Maximal distance to exon.
pub max_dist: i32,
}
You can either specify the parameters on the command line directly or pass a path to a JSONL file starting with @
.
You can mix both ways.
$ varfish-server-worker strucvars prefilter \
--path-input INPUT.vcf \
--params '{"path_out": "out.vcf", "max_freq": 0.01, "max_dist": 100}' \
[--params ...] \
# OR
$ varfish-server-worker strucvars prefilter \
--path-input INPUT.vcf \
--params @path/to/params.json \
[--params ...] \
## The `seqvars aggregate` Command
This command reads through multiple files written by `seqvars ingest` and computes a in-house carrier counts table.
You can specify the VCF files individually on the command line or pass in files that have paths to the VCF files line by line.
The resulting table is a folder to a RocksDB database.
```shell session
varfish-server-worker seqvars aggregate \
--genome-build {grch37,grch38} \
--path-out-rocksdb rocksdb/folder \
--path-in-vcf path/to/vcf.gz \
--path-in-vcf @path/to/file/list.txt
This command perform the querying of sequence variants and further annotation using annonars databases.
This command takes as the input one or more VCF files from structural variant callers and converts it into a file for further querying. The command supports the following variant callers and can guess the caller from the VCF header and first record.
- Delly2
- Dragen-SV (equivalent to Manta)
- Dragen-CNV
- GATK gCNV
- Manta
- MELT
- PopDel
- Sniffles2
One record will be written out for each variant, each with a single alternate allele.
The following symbolic ALT
alleles are used:
<DEL>
<DUP>
<INS>
<INV>
- VCF break-end syntax, e.g.,
T[chr1:5[
The following INFO
fields are written:
IMPRECISE
-- flag that specifies that this is an imprecise variantEND
-- end position of the variantsSVTYPE
-- type of the variant, one of<DEL>
,<DUP>
,<INS>
,<INV>
,BND
SVLEN
-- absolute length of the SV for linear variants,.
for non-linear variantsSVCLAIM
-- specificaton ofD
(change in abundance),J
(novel junction), orDJ
(both change in abundance and novel junction)callers
-- (non-standard field), list of callers that called the variantchr2
-- (non-standard field), second chromosome for BND variantsannsv
-- (non-standard field), annotation of the variant effect on each affected gene
The annsv
field is a pipe-character (|
) separated list of the following fields:
- symbolic alternate alele, e.g.,
<DEL>
- effects on the gene's transcript, separated by
&
transcript_variant
-- variant affects the whole transcriptexon_variant
-- variant affects exonsplice_region_variant
-- variant affects splice regionintron_variant
-- variant affects only intronupstream_variant
-- variant upsream of genedownstream_variant
-- variant downstream of geneintergenic_variant
-- default for "no gene affected", but never written
- HGNC gene symbol, e.g.,
BRCA1
- HGNC gene ID, e.g.,
HGNC:1100
The following FORMAT
fields are written:
GT
-- (standard field) genotype, if applicableGQ
-- (standard field) genotype quality, if applicablepec
-- total coverage with paired-end readspev
-- paired-end reads supporting the variantsrc
-- total coverage with split readssrv
-- split reads supporting the variantamq
-- average mapping quality over the variantcn
-- copy number of the variant in the sampleanc
-- average normalized coverage over the variant in the samplepc
-- point count (windows/targets/probes)
Overall, the command will emit the following header rows in addition to the ##contig=<ID=.,length=.>
lines.
##fileformat=VCFv4.4
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively">
##INFO=<ID=callers,Number=.,Type=String,Description="Callers that called the variant">
##INFO=<ID=chr2,Number=1,Type=String,Description="Second chromosome, if not equal to CHROM">
##INFO=<ID=annsv,Number=1,Type=String,Description="Effect annotations: 'Allele | Annotation | Gene_Name | Gene_ID'">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=pec,Number=1,Type=Integer,Description="Total coverage with paired-end reads">
##FORMAT=<ID=pev,Number=1,Type=Integer,Description="Paired-end reads supporting the variant">
##FORMAT=<ID=src,Number=1,Type=Integer,Description="Total coverage with split reads">
##FORMAT=<ID=srv,Number=1,Type=Integer,Description="Split reads supporting the variant">
##FORMAT=<ID=amq,Number=1,Type=Float,Description="Average mapping quality over the variant">
##FORMAT=<ID=cn,Number=1,Type=Integer,Description="Copy number of the variant in the sample">
##FORMAT=<ID=anc,Number=1,Type=Float,Description="Average normalized coverage over the variant in the sample">
##FORMAT=<ID=pc,Number=1,Type=Integer,Description="Point count (windows/targets/probes)">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=CNV,Description="Copy Number Variation">
##ALT=<ID=INV,Description="Inversion">
##fileDate=20230421
##x-varfish-genome-build=GRCh37
##SAMPLE=<ID=index,Sex="Male",Disease="Affected">
##SAMPLE=<ID=father,Sex="Male",Disease="Unaffected">
##SAMPLE=<ID=mother,Sex="Female",Disease="Unaffected">
##PEDIGREE=<ID=index,Father="father",Mother="mother">
##PEDIGREE=<ID=father>
##PEDIGREE=<ID=mother>
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=<ID=varfish-server-worker,Version="x.y.z">
##x-varfish-version=<ID=Delly,Name="Delly",Version="1.1.3">
Note
The strucvars ingest
step does not perform any annotation.
It only merges the input VCF files from multiple callers (all files must have the same samples) and converts them into the internal format.
The INFO/annsv
field is filled by strucvars query
.
Import multiple files created by strucvars ingest
into a database that can be convered to .bin
with strucvars txt-to-bin
and then used by strucvars query
.
You can specify the files individually.
Paths starting with an at (@
) character are interpreted as files with lists of paths.
You can mix paths with @
and without.
$ varfish-server-worker strucvars aggregate \
--genome-release {Grch37,Grch38} \
--path-output OUT.tsv \
--path-input IN/file1.vcf.gz \
[--path-input IN/file1.vcf.gz] \
# OR:
$ varfish-server-worker db mk-inhouse \
--genome-release {Grch37,Grch38} \
--path-output OUT.tsv \
--path-input @IN/path-list.txt \
[--path-input @IN/path-list2.txt]
Convert output of varfish-db-downloader to a directory with databases to be used by query commands such as strucvars query
.
$ varfish-server-worker strucvars txt-to-bin \
--input-type {ClinvarSv,StrucvarInhouse,...} \
--path-input IN.txt \
--path-output DST.bin
Run a query on a VCF file with structural variants as created by strucvars ingest
using a varfish worker database.
$ varfish-server-worker strucvars query \
--genome-release grch37 \
--path-db path/to/worker-db \
--path-input IN.vcf.gz \
--path-output OUT.jsonl
The worker database has the following structure.
Note that also mehari transcripts are read, thus the mehari/
directory is included.
mehari/
{genome_release}/
txs.bin.zst
worker/
noref/
genes/
acmg.tsv -- ACMG SF list genes
mim2gene.tsv -- OMIM to NCBI mapping from clingen
xlink.bin -- gene crosslinks
{genome_release}/ -- one per genome release
features/ -- features important for annotation
masked_repeat.bin -- masked repeats
masked_seqdup.bin -- masked segmental duplications
strucvars/ -- structural variant specific
bgdbs/ -- background databases
dbvar.bin -- dbVar
dgv.bin -- DGV
dgv_gs.bin -- DGV gold standard
g1k.bin -- 1000 genomes CNVs
gnomad_exomes.bin -- gnomAD-exomes/ExAC SVs
gnomad_genomes.bin -- gnomAD-genomes SVs
clinvar.bin -- ClinVar SVs
inhouse.bin -- inhouse SV database
patho_mms.bed -- well-known pathogenic DELs/DUPs
tads/
hesc.bed -- hESC TAD definitions
This section is only relevant for developers of varfish-server-worker
.
You will also need to have git LFS installed to get the test databases.
You will need a recent version of protocolbuffers, e.g.:
# bash utils/install-protoc.sh
# export PATH=$PATH:$HOME/.local/share/protoc/bin
For running protolint, install it as python package protolint-bin
:
# virtualenv /tmp/varfish-server-worker
# source /tmp/varfish-server-worker/bin/activate
# pip install protolint-bin
To reduce compile times, we recommend using a pre-built version of rocksdb
, either from the system package manager or e.g. via conda
:
# Ubuntu
sudo apt-get install librocksdb-dev
# Conda
conda install -c conda-forge rocksdb
In either case, either add
[env]
ROCKSDB_LIB_DIR = "/usr/lib/" # in case of the system package manager, adjust the path accordingly for conda
SNAPPY_LIB_DIR = "/usr/lib/" # same as above
to .cargo/config.toml
or set the environment variables ROCKSDB_LIB_DIR
and SNAPPY_LIB_DIR
to the appropriate paths:
export ROCKSDB_LIB_DIR=/usr/lib/
export SNAPPY_LIB_DIR=/usr/lib/
By default, the environment variables are defined in the .cargo/config.toml
as described above, i.e. may need adjustments if not using the system package manager.
To build the project, run:
cargo build --release
To install the project locally, run:
cargo install --path .