Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop Genotypes #61

Merged
merged 6 commits into from
Aug 29, 2024
Merged

Drop Genotypes #61

merged 6 commits into from
Aug 29, 2024

Conversation

JakeHagen
Copy link
Contributor

Hello

I am opening this draft pull request to see if there is interest in adding not only a drop_genotypes option to view but also a INFO field ingestion tool. I think this would be a great way to speed up annotation when using vcf-zarr with existing annotation tools. The workflow might look like this.

vcztools view -G 100k_sample.vcz | VEP | vcztools ingest --infos CSQ 100k_sample.vcz

Ive tested this workflow on a toy annotation example, it works well. Is there any reason this would not work more generally? The only I can think of is the number and ordering of the variants must be exactly the same coming back.

I am happy to work on both of these if there is interest.

@tomwhite
Copy link
Contributor

Thanks for this @JakeHagen! I think -G/--drop-genotypes would be a nice addition, so we'd be happy to accept a PR for that.

I'm not sure that the VEP pipeline fits with the processing model we are trying to encourage here. This tool (vcztools) is for streaming small parts of the VCF Zarr store, it's not really for processing the full set of variants in the store. For that we have sgkit which can do out-of-core parallel processing.

Having said that, I think we'd be keen to understand better what full pipeline you are proposing here is used for. What would you do with the 100k_sample.vcz file in your example? Would it be used to select a subset of variants from the original store and then apply some analysis to them?

We have had various discussions about how to integrate VEP in the past (see sgkit-dev/sgkit#1083, and https://github.com/search?q=repo%3Asgkit-dev%2Fsgkit+vep&type=issues more generally), so I think we'd like to accommodate it somehow in the sgkit-dev ecosystem.

Tagging @jeromekelleher and @benjeffery who may have more to say about this (although Jerome is away at the moment).

@JakeHagen
Copy link
Contributor Author

Hi Tom

Thanks for the reply.

How I envision this working in our lab is the data would be joint called -> ingested into Zarr -> annotated -> subset by lab members for different analysis.

Since Im guessing very few annotation tools will operate directly on the Zarr format, annotation would need to be done before ingestion.

Having an info ingestion feature seems like it would:

  • allow using traditional annotations tools on Zarr format
  • greatly improve speed of traditional tools by not supplying format fields
  • make updating fields and adding new fields not require a duplicated dataset

These features would solve a lot of the pain points our lab has around annotation.

I am still learning about the Zarr format and sgkit ecosystem so maybe this is not practical or made pointless by a different tool but it seems like it could be a valuable feature of Zarr vcf to me. I have attached two simple scripts as an example of how I see this working. They would be called like so:

time vcztools view -G $LARGE_ZARR | python test_annotate.py | python test_ingest.py -z $LARGE_ZARR

1.71s

instead of

time vcztools view $LARGE_ZARR | python test_annotate.py > /dev/null

567.55s

zarr_annotate_ingest_sample_scripts.zip

Copy link
Contributor

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good!

It needs a couple of tests - in test_vcf_writer.py , similar to test_write_vcf__samples, and one in test_bcftools_validation.py.

vcztools/cli.py Outdated
@@ -53,6 +53,13 @@ def index(path):
default=None,
help="Samples to include.",
)
@click.option(
"-G",
"--drop_genotypes",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"--drop_genotypes",
"--drop-genotypes",

Nit: I think Click can use either underscore or hyphen, but this is consistent with the other commands.

@@ -352,7 +361,7 @@ def c_chunk_to_vcf(
encoder.add_info_field(name, array)

for name, array in format_fields.items():
assert num_samples > 0
#assert num_samples > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If num_samples is 0 then format_fields will be empty so this line will never be executed - so the assertion can be left in (unless I'm missing something).

@tomwhite
Copy link
Contributor

tomwhite commented Aug 21, 2024

Hi Jake!

Thanks for the explanation of your use case. I think I'd missed that the existing Zarr store is being updated - by adding a new field - and not being completely rewritten. For INFO field annotations (which are usually modest in size compared to FORMAT fields with many samples) this could be a good workflow. The -G option certainly speeds up the time taken to do annotation (with VEP or whatever), so it would be good to add that as I mentioned before.

The ingest command would be non-standard (compared to bcftools), so it could just be a Python script, or perhaps we could explore adding a plugin at some point if it's generally useful.

…for missing formats, make drop-genotypes click arg consistent with others
@JakeHagen
Copy link
Contributor Author

Hi Tom thanks for the comments, I will add the tests next.

I added a new commit that removes the format header lines when using -G and removes the . after the INFO fields.
This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

@tomwhite
Copy link
Contributor

This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

I think it is. Hopefully @jeromekelleher can take a look too.

@jeromekelleher
Copy link
Contributor

This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

This looks great - I don't really use .gitignore files, so they tend to be updated haphazardly, and can certainly be wrong in places.

@jeromekelleher
Copy link
Contributor

This is an excellent update, thanks @JakeHagen!

The only thing I'm unsure about is whether having num_samples=0 and drop-genotypes are identical semantically. If bcftools is given a VCF file with zero samples, but has a FORMAT header (which is "." on every variant) what does it output? Does the VCF spec have anything to say about whether FORMAT should be included when there are zero samples?

Is it possible to provide zero samples (i.e. an empty file) to bcftools? What does it output? Making sure these awkward corner cases are handled in the same way as bcftools is an important goal for compatibility.

@JakeHagen
Copy link
Contributor Author

Hi @jeromekelleher thanks for your comments.
I could not find anything in the VCF spec about including the FORMAT when there are no samples, but bcftools removes all format related things (header lines, FORMAT label columns, sample columns) from the output when -G is used or an empty file is provided to -S
Bcftools will also error out when given a vcf with a FORMAT column and a . for every variant.
I gave it a file with this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
chr1    69081   chr1_69081_G_C  G       C       36      .       AF=0.000142;AQ=36;AC=0;AN=2     .

and it errored with this:

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:
        #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT

Failed to read from t.vcf: could not parse header

when passed through bcftools view

So far I think this PR handles everything the same as bcftools, but I will see if I can cover these cases in the tests

@jeromekelleher
Copy link
Contributor

Excellent, thanks @JakeHagen. We should get some validation tests against bcftools for this, but that could be a follow up. Code LGTM.

@JakeHagen JakeHagen marked this pull request as ready for review August 28, 2024 17:07
@JakeHagen
Copy link
Contributor Author

Ok I added two simple tests, let me know if you would like to see any others. Thanks!

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks again @JakeHagen

@jeromekelleher jeromekelleher merged commit efa463c into sgkit-dev:main Aug 29, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants