Drop Genotypes #61

JakeHagen · 2024-08-19T18:56:05Z

Hello

I am opening this draft pull request to see if there is interest in adding not only a drop_genotypes option to view but also a INFO field ingestion tool. I think this would be a great way to speed up annotation when using vcf-zarr with existing annotation tools. The workflow might look like this.

vcztools view -G 100k_sample.vcz | VEP | vcztools ingest --infos CSQ 100k_sample.vcz

Ive tested this workflow on a toy annotation example, it works well. Is there any reason this would not work more generally? The only I can think of is the number and ordering of the variants must be exactly the same coming back.

I am happy to work on both of these if there is interest.

tomwhite · 2024-08-20T10:17:21Z

Thanks for this @JakeHagen! I think -G/--drop-genotypes would be a nice addition, so we'd be happy to accept a PR for that.

I'm not sure that the VEP pipeline fits with the processing model we are trying to encourage here. This tool (vcztools) is for streaming small parts of the VCF Zarr store, it's not really for processing the full set of variants in the store. For that we have sgkit which can do out-of-core parallel processing.

Having said that, I think we'd be keen to understand better what full pipeline you are proposing here is used for. What would you do with the 100k_sample.vcz file in your example? Would it be used to select a subset of variants from the original store and then apply some analysis to them?

We have had various discussions about how to integrate VEP in the past (see sgkit-dev/sgkit#1083, and https://github.com/search?q=repo%3Asgkit-dev%2Fsgkit+vep&type=issues more generally), so I think we'd like to accommodate it somehow in the sgkit-dev ecosystem.

Tagging @jeromekelleher and @benjeffery who may have more to say about this (although Jerome is away at the moment).

JakeHagen · 2024-08-20T15:31:54Z

Hi Tom

Thanks for the reply.

How I envision this working in our lab is the data would be joint called -> ingested into Zarr -> annotated -> subset by lab members for different analysis.

Since Im guessing very few annotation tools will operate directly on the Zarr format, annotation would need to be done before ingestion.

Having an info ingestion feature seems like it would:

allow using traditional annotations tools on Zarr format
greatly improve speed of traditional tools by not supplying format fields
make updating fields and adding new fields not require a duplicated dataset

These features would solve a lot of the pain points our lab has around annotation.

I am still learning about the Zarr format and sgkit ecosystem so maybe this is not practical or made pointless by a different tool but it seems like it could be a valuable feature of Zarr vcf to me. I have attached two simple scripts as an example of how I see this working. They would be called like so:

time vcztools view -G $LARGE_ZARR | python test_annotate.py | python test_ingest.py -z $LARGE_ZARR

1.71s

instead of

time vcztools view $LARGE_ZARR | python test_annotate.py > /dev/null

567.55s

zarr_annotate_ingest_sample_scripts.zip

tomwhite

This is looking good!

It needs a couple of tests - in test_vcf_writer.py , similar to test_write_vcf__samples, and one in test_bcftools_validation.py.

tomwhite · 2024-08-21T10:00:43Z

vcztools/cli.py

@@ -53,6 +53,13 @@ def index(path):
    default=None,
    help="Samples to include.",
 )
+@click.option(
+    "-G",
+    "--drop_genotypes",


Suggested change

"--drop_genotypes",

"--drop-genotypes",

Nit: I think Click can use either underscore or hyphen, but this is consistent with the other commands.

tomwhite · 2024-08-21T10:07:08Z

vcztools/vcf_writer.py

@@ -352,7 +361,7 @@ def c_chunk_to_vcf(
        encoder.add_info_field(name, array)

    for name, array in format_fields.items():
-        assert num_samples > 0
+        #assert num_samples > 0


If num_samples is 0 then format_fields will be empty so this line will never be executed - so the assertion can be left in (unless I'm missing something).

tomwhite · 2024-08-21T10:12:53Z

Hi Jake!

Thanks for the explanation of your use case. I think I'd missed that the existing Zarr store is being updated - by adding a new field - and not being completely rewritten. For INFO field annotations (which are usually modest in size compared to FORMAT fields with many samples) this could be a good workflow. The -G option certainly speeds up the time taken to do annotation (with VEP or whatever), so it would be good to add that as I mentioned before.

The ingest command would be non-standard (compared to bcftools), so it could just be a Python script, or perhaps we could explore adding a plugin at some point if it's generally useful.

…for missing formats, make drop-genotypes click arg consistent with others

JakeHagen · 2024-08-21T15:35:23Z

Hi Tom thanks for the comments, I will add the tests next.

I added a new commit that removes the format header lines when using -G and removes the . after the INFO fields.
This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

tomwhite · 2024-08-22T09:28:51Z

This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

I think it is. Hopefully @jeromekelleher can take a look too.

jeromekelleher · 2024-08-27T10:46:48Z

This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

This looks great - I don't really use .gitignore files, so they tend to be updated haphazardly, and can certainly be wrong in places.

jeromekelleher · 2024-08-27T10:54:14Z

This is an excellent update, thanks @JakeHagen!

The only thing I'm unsure about is whether having num_samples=0 and drop-genotypes are identical semantically. If bcftools is given a VCF file with zero samples, but has a FORMAT header (which is "." on every variant) what does it output? Does the VCF spec have anything to say about whether FORMAT should be included when there are zero samples?

Is it possible to provide zero samples (i.e. an empty file) to bcftools? What does it output? Making sure these awkward corner cases are handled in the same way as bcftools is an important goal for compatibility.

JakeHagen · 2024-08-27T20:13:55Z

Hi @jeromekelleher thanks for your comments.
I could not find anything in the VCF spec about including the FORMAT when there are no samples, but bcftools removes all format related things (header lines, FORMAT label columns, sample columns) from the output when -G is used or an empty file is provided to -S
Bcftools will also error out when given a vcf with a FORMAT column and a . for every variant.
I gave it a file with this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
chr1    69081   chr1_69081_G_C  G       C       36      .       AF=0.000142;AQ=36;AC=0;AN=2     .

and it errored with this:

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:
        #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT

Failed to read from t.vcf: could not parse header

when passed through bcftools view

So far I think this PR handles everything the same as bcftools, but I will see if I can cover these cases in the tests

jeromekelleher · 2024-08-28T08:15:18Z

Excellent, thanks @JakeHagen. We should get some validation tests against bcftools for this, but that could be a follow up. Code LGTM.

JakeHagen · 2024-08-28T17:08:46Z

Ok I added two simple tests, let me know if you would like to see any others. Thanks!

jeromekelleher

LGTM! Thanks again @JakeHagen

tomwhite reviewed Aug 21, 2024

View reviewed changes

rebase

df74734

JakeHagen force-pushed the main branch from 31c539d to df74734 Compare August 21, 2024 14:33

dont add format fields to header with drop-genotypes, dont print '.' …

2f283bc

…for missing formats, make drop-genotypes click arg consistent with others

Merge branch 'sgkit-dev:main' into main

54943e8

ruff format

619e93a

JakeHagen and others added 2 commits August 28, 2024 11:48

Merge branch 'sgkit-dev:main' into main

01b4950

add tests for drop-genotypes option

5e27cfa

JakeHagen marked this pull request as ready for review August 28, 2024 17:07

jeromekelleher mentioned this pull request Aug 29, 2024

Add support for reading samples from a file (-S) #67

Closed

jeromekelleher approved these changes Aug 29, 2024

View reviewed changes

jeromekelleher merged commit efa463c into sgkit-dev:main Aug 29, 2024
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop Genotypes #61

Drop Genotypes #61

JakeHagen commented Aug 19, 2024

tomwhite commented Aug 20, 2024

JakeHagen commented Aug 20, 2024

tomwhite left a comment

tomwhite Aug 21, 2024

tomwhite Aug 21, 2024

tomwhite commented Aug 21, 2024 •

edited

Loading

JakeHagen commented Aug 21, 2024

tomwhite commented Aug 22, 2024

jeromekelleher commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

JakeHagen commented Aug 27, 2024

jeromekelleher commented Aug 28, 2024

JakeHagen commented Aug 28, 2024

jeromekelleher left a comment

Drop Genotypes #61

Drop Genotypes #61

Conversation

JakeHagen commented Aug 19, 2024

tomwhite commented Aug 20, 2024

JakeHagen commented Aug 20, 2024

tomwhite left a comment

Choose a reason for hiding this comment

tomwhite Aug 21, 2024

Choose a reason for hiding this comment

tomwhite Aug 21, 2024

Choose a reason for hiding this comment

tomwhite commented Aug 21, 2024 • edited Loading

JakeHagen commented Aug 21, 2024

tomwhite commented Aug 22, 2024

jeromekelleher commented Aug 27, 2024

jeromekelleher commented Aug 27, 2024

JakeHagen commented Aug 27, 2024

jeromekelleher commented Aug 28, 2024

JakeHagen commented Aug 28, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

tomwhite commented Aug 21, 2024 •

edited

Loading