Flekschas/faster beddb #135

flekschas · 2020-12-10T02:59:29Z

Description

What was changed in this pull request?

I implemented a tile-based indexing strategy for beddb that can speed up queries by up to 20x, at the expense of increasing the file size by a factor of ~2.5.

To spare end users the burden of handling another format, I decided to mark this indexing by appending a t to the version number. That is, version 3 is the normal version, while 3t is the tile-indexed version.

To create a tile-indexed beddb file, use clodius aggregate bedfile with --tile-index.

Why is it necessary?

The range-based R-tree indexing gets slow with more than 5 million intervals (about 0.5 s per query), while the tile-based index remains fast at about 0.025 s per query.
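
To illustrate the trade-off, here is a minimal sketch (not the actual beddb schema) of how a tile-based index can turn a range query into an exact-match lookup. The table names, column names, the `TOTAL_LENGTH` constant, and the helper functions are all hypothetical; only the "zoom.tile" ID format is taken from this PR. Storing one row per tile and interval pair is also what makes the file larger.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE intervals (id INTEGER PRIMARY KEY, start INTEGER, end INTEGER)"
)
conn.execute("CREATE TABLE tiles (tile_id TEXT, interval_id INTEGER)")
conn.execute("CREATE INDEX tiles_idx ON tiles (tile_id)")

TOTAL_LENGTH = 128  # assumed coordinate space, a power of two for simplicity
MAX_ZOOM = 2

intervals = [(1, 0, 10), (2, 40, 60), (3, 90, 100)]
conn.executemany("INSERT INTO intervals VALUES (?, ?, ?)", intervals)


def tiles_covering(zoom, start, end):
    """Tile positions at `zoom` that overlap the half-open range [start, end)."""
    width = TOTAL_LENGTH // 2 ** zoom
    return range(start // width, (end - 1) // width + 1)


# One row per (tile, interval) pair: this duplication costs space
# but lets a tile query skip range comparisons entirely.
for interval_id, start, end in intervals:
    for zoom in range(MAX_ZOOM + 1):
        for tile in tiles_covering(zoom, start, end):
            conn.execute(
                "INSERT INTO tiles VALUES (?, ?)",
                ("{}.{}".format(zoom, tile), interval_id),
            )


def query_tile(zoom, tile):
    """Fetch all intervals of one tile via an equality match on tile_id."""
    rows = conn.execute(
        "SELECT i.id, i.start, i.end FROM tiles t "
        "JOIN intervals i ON i.id = t.interval_id "
        "WHERE t.tile_id = ? ORDER BY i.id",
        ("{}.{}".format(zoom, tile),),
    )
    return rows.fetchall()


print(query_tile(1, 0))
```

In this toy setup every interval is indexed at every zoom level; the real implementation additionally limits the number of intervals per tile.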

Checklist

Unit tests added or updated
Updated CHANGELOG.md
Run black .

Genome wide this can lead to 20x faster queries but it increases the file size by a factor of 2.5

pkerpedjiev

The speedup looks phenomenal. I think the space might be able to be reduced. See the inline comments. And how about some tests?

clodius/cli/aggregate.py

pkerpedjiev · 2020-12-13T22:06:26Z

+                    tile_id = "{}.{}".format(curr_zoom, curr_tile)
+
+                    tile_counts[tile_id] += 1
+


Couldn't you put the tile_inserts.append() call from below here and save yourself a while loop?

I don't think this works. The reason why is the following:

tile_counts only counts the first time an interval is inserted, not the subsequent insertions. Hence, this while loop only acts once per interval, while the one below also runs for the higher zoom levels.

Here's a short example: say we set the maximum number of intervals per tile to 2. The first interval is inserted at tile 0.0 and that tile's counter increases. However, we want to make sure that the most important interval is always shown in tiles at higher zoom levels too, so we also insert tile-id<>interval-id pairs for the higher zoom levels. We do not increase the counters for those tiles, though, as otherwise we would in some cases never see more detail as we zoom in. Does this make sense? :)

This replicates the existing behavior of the beddb format.
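
The two loops described above can be sketched roughly as follows. This is a simplified illustration, not the code in clodius/cli/aggregate.py: the function name, its parameters, and the `tiles_at` helper are hypothetical, and intervals are assumed to arrive sorted by importance.

```python
from collections import defaultdict


def build_tile_index(intervals, total_length, max_zoom, max_per_tile):
    """Sketch of the two-loop idea.

    intervals: list of (interval_id, start, end), most important first.
    Returns (tile_counts, tile_inserts).
    """
    tile_counts = defaultdict(int)
    tile_inserts = []

    def tiles_at(zoom, start, end):
        # Tile positions at `zoom` overlapping the half-open range [start, end)
        width = total_length // 2 ** zoom
        return range(start // width, (end - 1) // width + 1)

    for interval_id, start, end in intervals:
        # First loop: find the lowest zoom level with room in every
        # overlapping tile, and count the interval only there.
        first_zoom = None
        curr_zoom = 0
        while curr_zoom <= max_zoom:
            tile_ids = [
                "{}.{}".format(curr_zoom, curr_tile)
                for curr_tile in tiles_at(curr_zoom, start, end)
            ]
            if all(tile_counts[tile_id] < max_per_tile for tile_id in tile_ids):
                for tile_id in tile_ids:
                    tile_counts[tile_id] += 1
                first_zoom = curr_zoom
                break
            curr_zoom += 1

        if first_zoom is None:
            continue  # no room at any zoom level

        # Second loop: record tile<>interval pairs from that zoom level
        # down to max_zoom, without touching the counters.
        curr_zoom = first_zoom
        while curr_zoom <= max_zoom:
            for curr_tile in tiles_at(curr_zoom, start, end):
                tile_id = "{}.{}".format(curr_zoom, curr_tile)
                tile_inserts.append((tile_id, interval_id))
            curr_zoom += 1

    return tile_counts, tile_inserts


counts, inserts = build_tile_index(
    [(1, 0, 1), (2, 1, 2), (3, 2, 3)],
    total_length=8,
    max_zoom=2,
    max_per_tile=2,
)
print(dict(counts))
print(inserts)
```

With a limit of 2, the first two intervals fill tile 0.0 and the third lands in tile 1.0. Had the counters also been increased in the second loop, tile 1.0 would already be full and the third interval would only appear one zoom level further down.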

clodius/cli/aggregate.py

pkerpedjiev

Cool, thanks for the responses!

flekschas added 4 commits December 9, 2020 21:43

Add a new tile-based indexing strategy for beddb files

c02d4cc

Genome wide this can lead to 20x faster queries but it increases the file size by a factor of 2.5

Merge branch 'develop' into flekschas/faster-beddb

328ce3f

Blackification

eb50e38

Update

a0523f5

flekschas added the feature label Dec 10, 2020

flekschas requested a review from pkerpedjiev December 10, 2020 02:59

pkerpedjiev reviewed Dec 13, 2020

pkerpedjiev approved these changes Dec 14, 2020

flekschas added 2 commits December 14, 2020 11:22

Improve code comments for documentation and remove debug logs

8fc7cad

Added a test for tile index beddb files

74c2104
