-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop Genotypes #61
Drop Genotypes #61
Conversation
Thanks for this @JakeHagen! I think I'm not sure that the VEP pipeline fits with the processing model we are trying to encourage here. This tool (vcztools) is for streaming small parts of the VCF Zarr store, it's not really for processing the full set of variants in the store. For that we have sgkit which can do out-of-core parallel processing. Having said that, I think we'd be keen to understand better what full pipeline you are proposing here is used for. What would you do with the We have had various discussions about how to integrate VEP in the past (see sgkit-dev/sgkit#1083, and https://github.com/search?q=repo%3Asgkit-dev%2Fsgkit+vep&type=issues more generally), so I think we'd like to accommodate it somehow in the sgkit-dev ecosystem. Tagging @jeromekelleher and @benjeffery who may have more to say about this (although Jerome is away at the moment). |
Hi Tom Thanks for the reply. How I envision this working in our lab is the data would be joint called -> ingested into Zarr -> annotated -> subset by lab members for different analysis. Since Im guessing very few annotation tools will operate directly on the Zarr format, annotation would need to be done before ingestion. Having an info ingestion feature seems like it would:
These features would solve a lot of the pain points our lab has around annotation. I am still learning about the Zarr format and sgkit ecosystem so maybe this is not practical or made pointless by a different tool but it seems like it could be a valuable feature of Zarr vcf to me. I have attached two simple scripts as an example of how I see this working. They would be called like so:
instead of
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good!
It needs a couple of tests - in test_vcf_writer.py
, similar to test_write_vcf__samples
, and one in test_bcftools_validation.py
.
vcztools/cli.py
Outdated
@@ -53,6 +53,13 @@ def index(path): | |||
default=None, | |||
help="Samples to include.", | |||
) | |||
@click.option( | |||
"-G", | |||
"--drop_genotypes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"--drop_genotypes", | |
"--drop-genotypes", |
Nit: I think Click can use either underscore or hyphen, but this is consistent with the other commands.
vcztools/vcf_writer.py
Outdated
@@ -352,7 +361,7 @@ def c_chunk_to_vcf( | |||
encoder.add_info_field(name, array) | |||
|
|||
for name, array in format_fields.items(): | |||
assert num_samples > 0 | |||
#assert num_samples > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If num_samples
is 0 then format_fields
will be empty so this line will never be executed - so the assertion can be left in (unless I'm missing something).
Hi Jake! Thanks for the explanation of your use case. I think I'd missed that the existing Zarr store is being updated - by adding a new field - and not being completely rewritten. For INFO field annotations (which are usually modest in size compared to FORMAT fields with many samples) this could be a good workflow. The The |
…for missing formats, make drop-genotypes click arg consistent with others
Hi Tom thanks for the comments, I will add the tests next. I added a new commit that removes the format header lines when using |
I think it is. Hopefully @jeromekelleher can take a look too. |
This looks great - I don't really use .gitignore files, so they tend to be updated haphazardly, and can certainly be wrong in places. |
This is an excellent update, thanks @JakeHagen! The only thing I'm unsure about is whether having Is it possible to provide zero samples (i.e. an empty file) to bcftools? What does it output? Making sure these awkward corner cases are handled in the same way as bcftools is an important goal for compatibility. |
Hi @jeromekelleher thanks for your comments.
and it errored with this:
when passed through So far I think this PR handles everything the same as bcftools, but I will see if I can cover these cases in the tests |
Excellent, thanks @JakeHagen. We should get some validation tests against bcftools for this, but that could be a follow up. Code LGTM. |
Ok I added two simple tests, let me know if you would like to see any others. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks again @JakeHagen
Hello
I am opening this draft pull request to see if there is interest in adding not only a drop_genotypes option to
view
but also a INFO field ingestion tool. I think this would be a great way to speed up annotation when using vcf-zarr with existing annotation tools. The workflow might look like this.Ive tested this workflow on a toy annotation example, it works well. Is there any reason this would not work more generally? The only I can think of is the number and ordering of the variants must be exactly the same coming back.
I am happy to work on both of these if there is interest.