CSIv2 #148

pd3 · 2014-11-13T13:43:51Z

I created a new branch (9b11795) which supports reading and writing of the CSIv2 index (samtools/hts-specs@b131ffc). There is currently no API to use it. The motivation for the extension was to allow queries like:

get N-th to M-th record
create a list of regions with N records each (possibly with optional overlaps). This is useful in pipelines which split BCF/BAM into smaller chunks and process them in parallel.
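
To make the second query type concrete, here is a minimal sketch of how fixed-size record chunks with an optional overlap could be computed. It is only an illustration: the function name `chunk_record_ranges`, the 1-based inclusive record numbering, and the assumption that the total record count is already known are all hypothetical and are not part of the branch or the CSIv2 spec.

```python
def chunk_record_ranges(total_records, chunk_size, overlap=0):
    """Yield (first, last) record-index ranges covering 1..total_records.

    Assumptions (hypothetical, for illustration only): record indexes are
    1-based and inclusive, and consecutive chunks share `overlap` records.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each chunk start advances
    start = 1
    while start <= total_records:
        end = min(start + chunk_size - 1, total_records)
        yield (start, end)
        if end == total_records:
            break  # the last record is covered; stop before emitting a redundant tail
        start += step


if __name__ == "__main__":
    for first, last in chunk_record_ranges(10, 4, overlap=1):
        print(first, last)
```

Each range could then be handed to a separate worker, which is the parallel-pipeline use case described above.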

Shane suggested that the first type could be easily integrated with existing tools by using "::" instead of ":". Then chr::N-M would be interpreted as record indexes, while chr:from-to as genomic coordinates.
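
As a rough sketch of how the two region forms could be told apart, the snippet below checks for the "::" form first and falls back to ordinary genomic coordinates. The function name `parse_region` and the returned tuple layout are hypothetical and do not reflect any existing htslib API; only the `chr::N-M` and `chr:from-to` forms themselves come from the proposal.

```python
import re

# "chr::N-M" is tried first so that the double colon is not mistaken
# for a sequence name ending in ":" followed by a coordinate range.
_RECORD_RE = re.compile(r"^(?P<chrom>.+)::(?P<first>\d+)-(?P<last>\d+)$")
_COORD_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Return ("records", chrom, N, M) or ("coords", chrom, start, end)."""
    m = _RECORD_RE.match(region)
    if m:
        return ("records", m.group("chrom"),
                int(m.group("first")), int(m.group("last")))
    m = _COORD_RE.match(region)
    if m:
        return ("coords", m.group("chrom"),
                int(m.group("start")), int(m.group("end")))
    raise ValueError("unrecognised region: %r" % region)


if __name__ == "__main__":
    print(parse_region("chr1::5-10"))
    print(parse_region("chr1:100-200"))
```

Because the record form is matched first, a sequence name that itself ends in a colon can no longer be addressed by coordinates with this sketch.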

For the second, I was thinking of adding a new switch to tabix, one could do something like:
tabix --new-switch CHUNK_SIZE[,OVERLAP_SIZE] file.bcf [REGION]

What do you think? Comments and feedback are welcome.

jkbonfield · 2014-11-13T14:06:56Z

On Thu, Nov 13, 2014 at 05:43:53AM -0800, pd3 wrote:

Shane suggested that the first type could be easily integrated with
existing tools by using "::" instead of ":". Then chr::N-M would be
interpreted as record indexes, while chr:from-to as genomic
coordinates.

I like that syntax. In theory all it can break is chromosome names
ending in a colon, but if anyone is that daft to create such names
then they probably deserve to have breakage!

For the second, I was thinking of adding a new switch to tabix, one could do something like:
tabix --new-switch CHUNK_SIZE[,OVERLAP_SIZE] file.bcf [REGION]

I'm not so sure on the logic behind having tabix understand file
formats other than pure line oriented text. It feels like anything
more is in the wrong place if it's getting shoehorned into tabix.

However the concept of breaking a file into N portions seems sane.

James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

pd3 · 2014-11-13T14:20:04Z

Tabix started as an indexer of tab-delimited files, then it became a general indexer of .tbi, .bai, .csi files. Is it just the naming that you dislike or the whole idea of a general indexer?

jkbonfield · 2014-11-13T15:01:29Z

On Thu, Nov 13, 2014 at 06:20:05AM -0800, pd3 wrote:

Tabix started as an indexer of tab-delimited files, then it became a
general indexer of .tbi, .bai, .csi files. Is it just the naming what
you dislike or the whole idea of a general indexer?

I probably don't fully understand what it's doing, but it was running
tabix on a .bcf file that concerned me as it implies some format
specific knowledge.

If it's not actually doing anything that knows what a bcf file is,
rather just acting on the general purpose index regardless of format,
then I think it's fine and sorry for the confusion.

James

pd3 · 2014-11-13T16:04:34Z

I think your understanding was correct. The indexer always has to know something about the file it is indexing. Even with tab-delimited files, tabix has to know which column holds the sequence name, which columns hold the start and end coordinates, whether they are zero-based or one-based, etc. Similarly, with CSI indexes the indexer has to know how to parse data records and how to map numeric sequence ids to strings and back, for which it needs to parse the BCF/BAM header.

My idea was that tabix, originally a general indexer of tab-delimited files, will continue to exist as a general indexer of all file formats supported by htslib. The naming is not so important to me; it could be called hts-index, for instance, although I like the existing name tabix best, even though its functionality has grown.

mcshane · 2017-03-20T14:35:44Z

Closed PR #167, but leaving this open as the wishlist for CSIv2.

jmarshall added this to the 1.4 milestone Jul 6, 2015

mcshane removed this from the 1.4 milestone Feb 6, 2017

mcshane changed the title ~~CSIv2 branch~~ CSIv2 Mar 20, 2017

jenniferliddle added this to the wishlist milestone Mar 27, 2017
