CSIv2 #148

Open
pd3 opened this issue Nov 13, 2014 · 5 comments

@pd3
Member

pd3 commented Nov 13, 2014

I created a new branch (9b11795) which supports reading and writing of the CSIv2 index (samtools/hts-specs@b131ffc). There is currently no API to use it. The motivation for the extension was to allow queries like:

  1. get N-th to M-th record
  2. create a list of regions with N records each (optionally overlapping). This is useful in pipelines which split BCF/BAM files into smaller chunks and process them in parallel

Shane suggested that the first type could be easily integrated with existing tools by using "::" instead of ":". Then chr::N-M would be interpreted as record indexes, while chr:from-to as genomic coordinates.
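
To make the proposed syntax concrete, here is a minimal Python sketch of how a region string might be classified. It is only an illustration, not code from the branch: the names `Region` and `parse_region` are hypothetical, and whether N and M are 0-based or 1-based is left open, as it is in the proposal.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    kind: str  # "records" for chr::N-M, "coords" for chr:from-to
    start: int
    end: int


# The name is matched lazily and "::" is tried before ":", so the shortest
# name that still leaves a valid separator and range is the one chosen.
_REGION_RE = re.compile(r"^(?P<name>.+?)(?P<sep>::|:)(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Classify a region string as a record-index or coordinate query."""
    m = _REGION_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognised region: {text!r}")
    kind = "records" if m.group("sep") == "::" else "coords"
    return Region(m.group("name"), kind, int(m.group("start")), int(m.group("end")))
```

With this rule, a sequence whose name ends in a colon can no longer be queried by coordinates, because `name::from-to` is always read as a record-index query.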

For the second, I was thinking of adding a new switch to tabix, one could do something like:
tabix --new-switch CHUNK_SIZE[,OVERLAP_SIZE] file.bcf [REGION]
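
As a rough sketch of what such a switch might compute, the following Python function splits the records of one sequence into index ranges. Everything here is an assumption made for illustration: the function name `chunk_record_ranges` is hypothetical, the record count is taken as already known, indexes are 1-based and inclusive, and OVERLAP_SIZE is read as the number of records shared by consecutive chunks.

```python
from typing import List, Tuple


def chunk_record_ranges(n_records: int, chunk_size: int,
                        overlap: int = 0) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) record ranges of chunk_size
    records, where consecutive ranges share `overlap` records."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    ranges = []
    start = 1
    while start <= n_records:
        end = min(start + chunk_size - 1, n_records)
        ranges.append((start, end))
        if end == n_records:
            break  # the last record is covered; stop before emitting a stub
        start += step
    return ranges


def as_regions(name: str, ranges: List[Tuple[int, int]]) -> List[str]:
    """Format ranges using the proposed chr::N-M record-index syntax."""
    return [f"{name}::{first}-{last}" for first, last in ranges]
```

For example, `as_regions("chr1", chunk_record_ranges(10, 4, 1))` gives `["chr1::1-4", "chr1::4-7", "chr1::7-10"]`, which a pipeline could hand to parallel workers.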

What do you think? Comments and feedback are welcome.

@jkbonfield
Contributor

On Thu, Nov 13, 2014 at 05:43:53AM -0800, pd3 wrote:

> Shane suggested that the first type could be easily integrated with
> existing tools by using "::" instead of ":". Then chr::N-M would be
> interpreted as record indexes, while chr:from-to as genomic
> coordinates.

I like that syntax. In theory all it can break is chromosome names
ending in a colon, but if anyone is daft enough to create such names
then they probably deserve to have breakage!

> For the second, I was thinking of adding a new switch to tabix, one could do something like:
> tabix --new-switch CHUNK_SIZE[,OVERLAP_SIZE] file.bcf [REGION]

I'm not so sure about the logic behind having tabix understand file
formats other than pure line-oriented text. It feels like anything
more is in the wrong place if it's getting shoehorned into tabix.

However the concept of breaking a file into N portions seems sane.

James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@pd3
Member Author

pd3 commented Nov 13, 2014

Tabix started as an indexer of tab-delimited files, then it became a general indexer of .tbi, .bai, .csi files. Is it just the naming that you dislike or the whole idea of a general indexer?

@jkbonfield
Contributor

On Thu, Nov 13, 2014 at 06:20:05AM -0800, pd3 wrote:

> Tabix started as an indexer of tab-delimited files, then it became a
> general indexer of .tbi, .bai, .csi files. Is it just the naming that
> you dislike or the whole idea of a general indexer?

I probably don't fully understand what it's doing, but it was running
tabix on a .bcf file that concerned me as it implies some format
specific knowledge.

If it's not actually doing anything that knows what a bcf file is,
rather just acting on the general purpose index regardless of format,
then I think it's fine and sorry for the confusion.

James

@pd3
Member Author

pd3 commented Nov 13, 2014

I think your understanding was correct. The indexer always has to know something about the file it is indexing. Even with tab-delimited files, tabix has to know which column holds the sequence name, which columns hold the start and end coordinates, whether they are zero-based or one-based, and so on. Similarly, with CSI indexes the indexer has to know how to parse data records and how to map numeric sequence IDs to strings and back, for which it needs to parse the BCF/BAM header.
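
A small Python sketch may help show what this format knowledge amounts to for a tab-delimited file. It is not htslib code: `ColumnConfig` and `record_interval` are hypothetical names, and the sketch assumes that zero-based input is half-open and one-based input is closed, normalising both to zero-based half-open intervals.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ColumnConfig:
    seq_col: int      # 1-based column number of the sequence name
    begin_col: int    # 1-based column number of the start coordinate
    end_col: int      # 1-based column number of the end coordinate
    zero_based: bool  # True: 0-based half-open input; False: 1-based closed


def record_interval(line: str, conf: ColumnConfig) -> Tuple[str, int, int]:
    """Return (name, begin, end) as a 0-based half-open interval."""
    fields = line.rstrip("\n").split("\t")
    name = fields[conf.seq_col - 1]
    begin = int(fields[conf.begin_col - 1])
    end = int(fields[conf.end_col - 1])
    if not conf.zero_based:
        begin -= 1  # 1-based closed [b, e] covers the same bases as [b-1, e)
    return name, begin, end


def name_to_id(header_names: List[str]) -> Dict[str, int]:
    """Map sequence names to numeric IDs by their order in the header."""
    return {name: i for i, name in enumerate(header_names)}
```

Under these assumptions, a zero-based line `chr1\t99\t200` and a one-based line with start 100 and end 200 yield the same interval, which is why the indexer cannot treat the file as opaque bytes.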

My idea was that tabix, a general indexer of tab-delimited files, would continue to exist as a general indexer of all file formats supported by htslib. The naming is not so important to me; it could be called hts-index, for instance, although I like the existing name tabix best, even though its functionality has grown.

@jmarshall jmarshall added this to the 1.4 milestone Jul 6, 2015
@mcshane mcshane removed this from the 1.4 milestone Feb 6, 2017
@mcshane
Contributor

mcshane commented Mar 20, 2017

Closed PR #167, but leaving this open as the wishlist for CSIv2.

@mcshane mcshane changed the title from "CSIv2 branch" to "CSIv2" Mar 20, 2017
@jenniferliddle jenniferliddle added this to the wishlist milestone Mar 27, 2017