Query format #64

Merged
merged 3 commits into from
Aug 27, 2024

Contributor

@Will-Tyler Will-Tyler commented Aug 21, 2024

Overview

This pull request partially implements the query --format functionality from bcftools.

This pull request closes #50.

Approach

The approach consists of two components: a parser and a generator. The parser processes the query format string and produces a list of format specifiers. The generator is a function that takes the root VCF Zarr group and yields the result of the query one line at a time. The generator's initializer composes the generator according to the structure of the format specifier list.

Parser

I implement the parser using PyParsing. We used PyParsing to implement a parser in #49 as well.
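
To illustrate what "format string in, list of format specifiers out" can look like, here is a minimal sketch. It deliberately uses the standard library `re` module instead of PyParsing so that it runs without extra dependencies, and it is not the code in this pull request. The function name `parse_format` and the tuple shapes `("field", tag, index)` and `("literal", text)` are hypothetical.

```python
import re
from typing import List

# Hypothetical token grammar: %TAG, %TAG{i}, escaped \n and \t, plain text.
TOKEN = re.compile(
    r"%(?P<tag>(?:INFO/)?[A-Za-z_][A-Za-z0-9_]*)(?:\{(?P<index>\d+)\})?"
    r"|(?P<newline>\\n)"
    r"|(?P<tab>\\t)"
    r"|(?P<text>[^%\\]+)"
)


def parse_format(fmt: str) -> List[tuple]:
    """Split a query format string into a flat list of specifiers."""
    specs: List[tuple] = []
    pos = 0
    while pos < len(fmt):
        m = TOKEN.match(fmt, pos)
        if m is None:
            raise ValueError(f"cannot parse format string at {pos}: {fmt[pos:]!r}")
        if m.group("tag") is not None:
            index = m.group("index")
            specs.append(("field", m.group("tag"), None if index is None else int(index)))
        elif m.group("newline") is not None:
            specs.append(("literal", "\n"))
        elif m.group("tab") is not None:
            specs.append(("literal", "\t"))
        else:
            specs.append(("literal", m.group("text")))
        pos = m.end()
    return specs


# The shell passes backslash-t and backslash-n through as two characters each.
specs = parse_format(r"%REF\t%ALT{0}\n")
```

A grammar library such as PyParsing expresses the same alternatives declaratively, which becomes more useful once nested constructs such as sample loops have to be handled.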

Generator

The generator uses Python generators to yield query results one variant position at a time. This approach allows Python to iterate over each Zarr array's chunks independently. The high-level generator zips generators for each of the format specifiers and joins the results to produce a line for each variant position.
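
The zip-and-join idea can be sketched in plain Python, under the assumption that each field is available as a sequence of chunks. Nested lists stand in for Zarr chunks here, and the helper names `iter_chunked`, `literal`, and `query_lines` are hypothetical, not the names used in this pull request.

```python
from typing import Iterable, Iterator, List, Sequence


def iter_chunked(chunks: Iterable[Sequence[str]]) -> Iterator[str]:
    """Yield values one at a time, walking this field's chunks in order."""
    for chunk in chunks:
        yield from chunk


def literal(text: str) -> Iterator[str]:
    """Yield the same literal text (for example a tab) for every variant."""
    while True:
        yield text


def query_lines(field_generators: List[Iterator[str]]) -> Iterator[str]:
    """Zip the per-specifier generators and join each tuple into one line."""
    for parts in zip(*field_generators):
        yield "".join(parts)


# Made-up data: REF and ALT are chunked differently on purpose.
ref_chunks = [["A", "A"], ["G"]]
alt_chunks = [["C"], ["G", "A"]]

lines = list(
    query_lines(
        [iter_chunked(ref_chunks), literal("\t"), iter_chunked(alt_chunks), literal("\n")]
    )
)
```

Because each field has its own generator, the differing chunk boundaries never need to be aligned. In this sketch `zip` stops once a finite field generator is exhausted, so the infinite literal generators are safe as long as at least one real field is present.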

Query format language

This implementation does not support the full query format language that bcftools supports.

Here is what this implementation should support:

  • any variant-site-level field except the full INFO field,
  • newline characters,
  • tab characters,
  • subfield indexing with curly brackets.

This implementation does not support looping over samples at a variant site. Additionally, some format specifiers supported by bcftools are recognized by this implementation's parser but lead to an error in the generator (e.g. %END0).
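
As a rough illustration of subfield indexing with curly brackets (for example `%ALT{0}`), the sketch below formats one multi-valued field for a single variant. It assumes that unused slots are padded with empty strings and that a missing or out-of-range value prints as `.`. Both are assumptions made for this example and have not been checked against bcftools or against this implementation. The function name `format_field` is hypothetical.

```python
from typing import Optional, Sequence


def format_field(values: Sequence[str], index: Optional[int] = None, missing: str = ".") -> str:
    """Format one multi-valued field, optionally selecting a single subfield."""
    # Assumption: empty strings are padding, not real values.
    present = [v for v in values if v != ""]
    if index is not None:
        # %TAG{i}: pick the i-th value, or the missing marker if it is absent.
        return present[index] if index < len(present) else missing
    # %TAG: join all values with commas, or print the missing marker.
    return ",".join(present) if present else missing


all_alts = format_field(["G", "T", ""])       # like %ALT
second_alt = format_field(["G", "T", ""], 1)  # like %ALT{1}
no_alts = format_field(["", "", ""])          # site with no ALT allele
```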

Testing

I add unit tests and validation tests along with my changes. I ran the test suite to check that my changes have good coverage.

Example usage

vcztools query vcz_test_cache/sample.vcf.vcz -f "%REF\t%ALT\n"
A       C
A       G
G       A
T       A
A       G,T
T       .
G       GA,GAC
T       .
AC      A,ATG,C

Contributor

@tomwhite tomwhite left a comment

This looks amazing @Will-Tyler!

Contributor

@jeromekelleher jeromekelleher left a comment

This is amazing @Will-Tyler!

As a follow-up we should do a little bit of benchmarking to see if this has tolerable performance on large queries. We may need to push the actual string generation down to C (but that's no biggie - you've solved the hard problem here.)

@tomwhite
Contributor

I'm going to merge this now. Another follow-up would be to add expression, region and sample filtering to query.

@tomwhite tomwhite merged commit 058ce78 into sgkit-dev:main Aug 27, 2024
10 checks passed
@Will-Tyler Will-Tyler deleted the query-format branch August 27, 2024 14:46
@Will-Tyler
Contributor Author

I'll be interested to see how this approach performs. Thanks all for reviewing!

Development

Successfully merging this pull request may close these issues.

Add query command with basic -f/--format support
3 participants