CSIv2 #148
Comments
On Thu, Nov 13, 2014 at 05:43:53AM -0800, pd3 wrote:
I like that syntax. In theory all it can break is chromosome names
I'm not so sure on the logic behind having tabix understand file
However the concept of breaking a file into N portions seems sane.
James Bonfield
Tabix started as an indexer of tab-delimited files, then it became a general indexer of .tbi, .bai, and .csi files. Is it just the naming that you dislike, or the whole idea of a general indexer?
On Thu, Nov 13, 2014 at 06:20:05AM -0800, pd3 wrote:
I probably don't fully understand what it's doing, but it was running
If it's not actually doing anything that knows what a bcf file is,
James Bonfield
I think your understanding was correct. The indexer always has to know something about the file it is indexing. Even with tab-delimited files, tabix has to know which column holds the sequence name, which columns hold the start and end coordinates, whether they are zero-based or one-based, etc. Similarly, with CSI indexes the indexer has to know how to parse data records and how to map numeric sequence ids to strings and back, for which it needs to parse the BCF/BAM header. My idea was that tabix, a general indexer of tab-delimited files, will continue to exist as a general indexer of all file formats supported by htslib. The naming is not so important to me; it could be called hts-index, for instance, although I like the existing name tabix best, even though its functionality has grown.
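To make the point about tab-delimited files concrete, here is a minimal sketch of the per-file knowledge an indexer needs before it can place a record. The names ColumnConfig and record_interval are hypothetical and are not htslib's API; the sketch only assumes that coordinates are normalised to a zero-based, half-open interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnConfig:
    """Hypothetical description of a tab-delimited layout (not htslib's API)."""
    seq_col: int      # 1-based column holding the sequence name
    start_col: int    # 1-based column holding the start coordinate
    end_col: int      # 1-based column holding the end (may equal start_col)
    zero_based: bool  # True if the file uses 0-based, half-open coordinates


def record_interval(line: str, conf: ColumnConfig) -> tuple[str, int, int]:
    """Return (name, start, end) as a 0-based, half-open interval."""
    fields = line.rstrip("\n").split("\t")
    name = fields[conf.seq_col - 1]
    start = int(fields[conf.start_col - 1])
    end = int(fields[conf.end_col - 1])
    if not conf.zero_based:
        # 1-based inclusive [s, e] becomes 0-based half-open [s-1, e).
        start -= 1
    return name, start, end
```

The same line gives different intervals depending on the configuration, which is why the indexer cannot be entirely format-agnostic.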
Closed PR #167, but leaving this open as the wishlist for CSIv2.
I created a new branch (9b11795) which supports reading and writing of the CSIv2 index (samtools/hts-specs@b131ffc). There is currently no API to use it. The motivation for the extension was to allow queries like:
Shane suggested that the first type could be easily integrated with existing tools by using "::" instead of ":". Then chr::N-M would be interpreted as record indexes, while chr:from-to would be interpreted as genomic coordinates.
For the second, I was thinking of adding a new switch to tabix; one could do something like:
tabix --new-switch CHUNK_SIZE[,OVERLAP_SIZE] file.bcf [REGION]
What do you think? Comments and feedback are welcome.
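For illustration, the "::" versus ":" distinction could be parsed roughly as follows. This is only a sketch of the proposed syntax, not existing htslib behaviour; Region and parse_region are hypothetical names, and the sketch assumes both forms always carry an explicit N-M or from-to range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str   # sequence (chromosome) name
    kind: str   # "records" for chr::N-M, "coords" for chr:from-to
    first: int  # N or from, as written
    last: int   # M or to, as written


def parse_region(text: str) -> Region:
    """Parse 'chr::N-M' (record indexes) or 'chr:from-to' (coordinates)."""
    # Test for "::" first, because "chr::N-M" also contains a single ":".
    if "::" in text:
        name, _, span = text.rpartition("::")
        kind = "records"
    else:
        name, _, span = text.rpartition(":")
        kind = "coords"
    if not name:
        raise ValueError(f"no sequence name in {text!r}")
    first, sep, last = span.partition("-")
    if not sep:
        raise ValueError(f"expected a range in {text!r}")
    return Region(name, kind, int(first), int(last))
```

Splitting on the last separator keeps most names containing ":" working, but a sequence name that itself contains "::" would still be read as a record-index query, so chromosome names remain the one place where this syntax can clash.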