Commit
Summary:
Knowing record boundaries is necessary:
- to compute the disk size of streams (by adding up the disk size of each of the stream's records)
- to know how much data we need to cache when streaming and prefetching records

The VRS file index only provides the start offset of each record. When records are stored in the file in the same order as their timestamps, each record ends where the next one starts (or where user records end), which is simple enough and doesn't require additional memory. That's the 99% case.

Sometimes, a few records aren't sorted well, and we actually only need to track the boundaries of these records and of the records around them. But knowing which records need to be tracked is complex, because each record may be followed by any other, and detecting record boundaries becomes difficult: you must effectively compile a sorted list of every boundary, then search for where the next boundary is. So when a file isn't fully sorted but only a relatively small number of boundaries need tracking, we track just those exceptions, because for most records we can still use the offset of the following record as the record's limit.

In some extreme cases however, from what I can tell only with artificially generated files designed to stress-test this situation, many records aren't sorted, and tracking exceptions becomes more memory-consuming than keeping the sorted list of boundaries. In this case, we need a binary search each time we need to find where a record ends.

With these changes, we make sure that:
- we only build the list of boundaries when the file isn't sorted, saving memory and compute 99% of the time.
- we track boundaries using a map containing only unexpected limits around records "out of place", when there are a limited number of exceptions.
- we can track boundaries with the complete list of boundaries when there are many exceptions.
- for testing, we can force using either method, so we can compare the results provided by both methods.

This is a deceptively simple problem, but making an efficient solution that works correctly was surprisingly tricky, hence preserving both methods for unit testing.

Differential Revision: D67614158
fbshipit-source-id: 08b86551ef02e932831c151a3b126295f6816afd
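
The two tracking methods described in the summary can be illustrated with a minimal sketch. This is not the code from this commit and not the VRS API: `isSortedByOffset`, `SortedBoundaries`, and `ExceptionBoundaries` are hypothetical names, record offsets are assumed to be distinct `int64_t` values listed in index (timestamp) order, and `endOffset` is assumed to be where user records end. For brevity, the exception map is built from a temporary full list of boundaries, which a real implementation may well avoid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper: true when records appear in the file in index order,
// in which case each record simply ends where the next one starts.
bool isSortedByOffset(const std::vector<int64_t>& offsets) {
  return std::is_sorted(offsets.begin(), offsets.end());
}

// Method for many exceptions: keep every boundary sorted, then binary search.
struct SortedBoundaries {
  std::vector<int64_t> sorted;

  SortedBoundaries(std::vector<int64_t> offsets, int64_t endOffset)
      : sorted(std::move(offsets)) {
    sorted.push_back(endOffset);
    std::sort(sorted.begin(), sorted.end());
  }

  // A record's limit is the first boundary strictly after its start offset.
  int64_t limitOf(int64_t start) const {
    assert(start < sorted.back());
    return *std::upper_bound(sorted.begin(), sorted.end(), start);
  }
};

// Method for few exceptions: store a limit only for records whose limit is
// not the start offset of the following record in the index.
struct ExceptionBoundaries {
  std::vector<int64_t> offsets;
  int64_t endOffset;
  std::map<std::size_t, int64_t> exceptions;

  ExceptionBoundaries(const std::vector<int64_t>& recordOffsets, int64_t end)
      : offsets(recordOffsets), endOffset(end) {
    SortedBoundaries all(offsets, endOffset); // temporary, discarded after
    for (std::size_t i = 0; i < offsets.size(); ++i) {
      int64_t actual = all.limitOf(offsets[i]);
      if (actual != expectedLimit(i)) {
        exceptions[i] = actual;
      }
    }
  }

  // Limit assumed when nothing is out of place: the next record's offset.
  int64_t expectedLimit(std::size_t i) const {
    return i + 1 < offsets.size() ? offsets[i + 1] : endOffset;
  }

  int64_t limitOf(std::size_t i) const {
    auto it = exceptions.find(i);
    return it != exceptions.end() ? it->second : expectedLimit(i);
  }
};
```

With offsets `{100, 200, 500, 300, 600}` and `endOffset` 700, the record at 500 is out of place, so the map holds entries for it and its two neighbors in the index, while the first and last records still use the next record's offset. Both structures return the same limit for every record, which is the kind of comparison the summary keeps both methods around for.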
1 parent 635ce51, commit c7862b3
Showing 4 changed files with 110 additions and 28 deletions.