Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic support for '--targets/-t' #18

Merged
merged 6 commits into from
Jul 18, 2024

Conversation

tomwhite
Copy link
Contributor

Fixes #16

This is just a quick draft, but I wanted to share early for visibility.

  • It uses a variant mask, since we will need one anyway for overlap queries. (Some of the code is inspired by https://github.com/sgkit-dev/vcf-zarr-publication/blob/52428c97822a35a7c29f150f35ec6bd7a29dad37/src/zarr_afdist.py)
  • I haven't thought about efficiency. The most glaring is that chunks that are masked out are not skipped - perhaps that should be done here. The linked issues in Basic regions support #16 could be looked at later.
  • It only handles a very limited part of the -t syntax - still needs to handle missing start/end, multiple targets, and the ^ syntax (for exclude).
  • I only implemented it for the C implementation branch. It would be easy enough to add for Numba, but perhaps we could drop that branch soon if the C branch has feature parity now?
  • It needs tests and perhaps a bit more doc.
  • We should add linting to the repo.

@jeromekelleher
Copy link
Contributor

jeromekelleher commented Jul 11, 2024

This is great, thanks! I think skipping fully masked out chunks is good enough in the first instance perf wise - otherwise I don't think this will be practical for looking at bits of large datasets.

@tomwhite tomwhite marked this pull request as ready for review July 12, 2024 16:13
@tomwhite
Copy link
Contributor Author

This addresses most of the things discussed above (skipping fully masked out chunks, adding numba, region start/end, multiple regions).

I also introduced pyranges and found that it made things simpler even for the --targets case, and will make the --regions option straightforward from here. I had a look at bioframe too, and it is very similar, but I noticed it has a dependency on matplotlib I'd rather avoid if possible.

Still need to add support for the ^ exclude syntax, and more tests for edge cases, but I think these can be added later. (I'm away for the first half of next week BTW.)

import pyranges


def parse_region(region: str) -> tuple[str, Optional[int], Optional[int]]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this would return a Region object, like the one in bio2zarr, but I'm not sure which way the dependency is between the two projects (if indeed there is one). So I left it as a tuple for now.

@jeromekelleher
Copy link
Contributor

Thanks @tomwhite, this is great.

I'd be pretty slow to add external dependencies here, and I'm sceptical about whether we're really getting much value out of pyranges and pandas here for what we're currently doing. But, I'd vote we merge this and use it as a data point, and see if doing things with a simpler approach can achieve as good or better performance.

@jeromekelleher
Copy link
Contributor

So, I'm happy to merge this now, if you are?

@tomwhite
Copy link
Contributor Author

So, I'm happy to merge this now, if you are?

Yes, please go ahead! (Sorry for slow response, I was away for a few days.)

@jeromekelleher jeromekelleher merged commit bfb91f0 into sgkit-dev:main Jul 18, 2024
7 checks passed
@jeromekelleher
Copy link
Contributor

No worries, I knew you were away

@tomwhite tomwhite deleted the targets branch July 18, 2024 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Basic regions support
2 participants