Support reading bloom filters from Parquet files and filter row groups using them #17289
Open
mhaseeb123 wants to merge 102 commits into rapidsai:branch-25.02 from mhaseeb123:fea/extract-pq-bloom-filter-data
Commits
95fe8e8
Initial stuff for reading bloom filter from PQ files
mhaseeb123 4f0e7ab
Minor bug fix
mhaseeb123 48a50c4
Apply style fix
mhaseeb123 9a85d08
Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data
mhaseeb123 b71cf9b
Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data
mhaseeb123 68be24f
Some updates
mhaseeb123 f848251
Move contents to a separate file
mhaseeb123 0b65233
Revert erroneous changes
mhaseeb123 cf7d762
Style and doc fix
mhaseeb123 81efad2
Get equality predicate col indices
mhaseeb123 088377b
Enable `arrow_filter_policy` and `span` types in bloom filter.
mhaseeb123 0435bff
Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data
mhaseeb123 3dff590
Successfully search bloom filter
mhaseeb123 71e1d33
style fix
mhaseeb123 aa65a2b
Code cleanup
mhaseeb123 c52821b
add tests
mhaseeb123 3a20a98
Initial stuff for reading bloom filter from PQ files
mhaseeb123 d67e4b5
Minor bug fix
mhaseeb123 10471d4
Apply style fix
mhaseeb123 1e12662
Some updates
mhaseeb123 ee7217c
Move contents to a separate file
mhaseeb123 f8e6159
Revert erroneous changes
mhaseeb123 1886cab
Style and doc fix
mhaseeb123 be228b3
Get equality predicate col indices
mhaseeb123 aaf355e
Enable `arrow_filter_policy` and `span` types in bloom filter.
mhaseeb123 e92324e
Successfully search bloom filter
mhaseeb123 0b1719d
style fix
mhaseeb123 ef3a262
Code cleanup
mhaseeb123 051be2d
add tests
mhaseeb123 a12c90e
Merge branch 'fea/extract-pq-bloom-filter-data' of https://github.com…
mhaseeb123 fb55c3f
Major cleanups
mhaseeb123 b477d2d
Significant code refactoring
mhaseeb123 f9f1746
minor style fix
mhaseeb123 bad484f
refactoring
mhaseeb123 ce09d43
Minor refactoring
mhaseeb123 dddee6c
Minor improvements
mhaseeb123 0cfeb80
Add gtest
mhaseeb123 9137585
Improvements
mhaseeb123 77152b4
Support int96 in bloom filter
mhaseeb123 3984291
Cleanup
mhaseeb123 9a39aa4
Minor improvements
mhaseeb123 1def801
Fix minor bug
mhaseeb123 6edc248
MInor bug fixing
mhaseeb123 2925f1e
Add python tests
mhaseeb123 efc6ec0
Correct parquet files
mhaseeb123 df84aca
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 a2fa784
minor spelling fix
mhaseeb123 1f5da37
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 fa0cec8
Apply suggestions from code review
mhaseeb123 7a309c6
Minor bug fix
mhaseeb123 bcc68c0
Convert to enum class
mhaseeb123 2dce9b1
Apply suggestion from code review
mhaseeb123 e03bea0
Suggestions from code reviews
mhaseeb123 059a9d8
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 4b0b5ed
Apply suggestions from code reviews
mhaseeb123 c1256b1
Refactor into single table for cudf::compute_column
mhaseeb123 88bf491
Minor, add const
mhaseeb123 9ca42c6
Move bloom filter test to parquet test
mhaseeb123 84c24c1
Minor updates
mhaseeb123 0c05031
Minor
mhaseeb123 09560c5
Logical and between bloom filter and stats
mhaseeb123 21f4412
Revert merging converted AST tables.
mhaseeb123 442de80
Revert an extra eol
mhaseeb123 f7952d4
Revert extra eol
mhaseeb123 4d0c570
Read bloom filter data sync
mhaseeb123 67c6247
Update cpp/src/io/parquet/bloom_filter_reader.cu
mhaseeb123 40c80b7
strong type for int96 timestamp
mhaseeb123 690c165
Merge branch 'fea/extract-pq-bloom-filter-data' of https://github.com…
mhaseeb123 c5f8150
Remove unused header
mhaseeb123 7a21a6e
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 4465277
Apply suggestions from code review
mhaseeb123 3888732
Apply suggestions
mhaseeb123 8bc8927
Update cpp/src/io/parquet/reader_impl_helpers.hpp
mhaseeb123 d719e65
Update cpp/src/io/parquet/reader_impl_helpers.hpp
mhaseeb123 03cf07f
Move equality_literals instead of copying
mhaseeb123 de94168
Merge branch 'fea/extract-pq-bloom-filter-data' of https://github.com…
mhaseeb123 c92d326
Minor
mhaseeb123 82083f9
Use spans instead of passing around vectors
mhaseeb123 6918a40
Minor
mhaseeb123 85cdc00
Make `get_equality_literals()` safe again
mhaseeb123 aa1a909
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 fdf8fc8
Update counting_iterator
mhaseeb123 10a8f5a
Minor changes
mhaseeb123 d46504f
Minor
mhaseeb123 c94ce86
Sync arrow filter policy with cuco
mhaseeb123 69aa685
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 d95a178
Address partial reviewer comments and fix new logger header
mhaseeb123 840c6e7
Revert to direct dtype check until I find a way to get scalar from li…
mhaseeb123 9d8c071
Create a dummy scalar of type T and compare with dtype
mhaseeb123 3b8aea0
Use a temporary scalar
mhaseeb123 0c859db
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 c385537
Recalculate `total_row_groups` in apply_bloom_filter
mhaseeb123 3693ad1
Simplify bloom filter expression with ast::tree and handle non-equali…
mhaseeb123 c2de9fb
Apply suggestions from code review
mhaseeb123 344851c
Minor optimization: Set `have_bloom_filters` while populating `bloom_…
mhaseeb123 96fb7c2
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 4522afa
Add pytest to test logical or with non == expr
mhaseeb123 ed66593
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 f509148
Remove temporary arrow_filter_policy.cuh and use cuco directly.
mhaseeb123 c8cd646
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 4194d30
MInor style fix
mhaseeb123 8b7baff
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -382,12 +382,62 @@ struct ColumnChunkMetaData { | |
// Set of all encodings used for pages in this column chunk. This information can be used to | ||
// determine if all data pages are dictionary encoded for example. | ||
std::optional<std::vector<PageEncodingStats>> encoding_stats; | ||
// Byte offset from beginning of file to Bloom filter data. | ||
std::optional<int64_t> bloom_filter_offset; | ||
// Size of Bloom filter data including the serialized header, in bytes. Added in 2.10 so readers | ||
// may not read this field from old files and it can be obtained after the BloomFilterHeader has | ||
// been deserialized. Writers should write this field so readers can read the bloom filter in a | ||
// single I/O. | ||
std::optional<int32_t> bloom_filter_length; | ||
// Optional statistics to help estimate total memory when converted to in-memory representations. | ||
// The histograms contained in these statistics can also be useful in some cases for more | ||
// fine-grained nullability/list length filter pushdown. | ||
std::optional<SizeStatistics> size_statistics; | ||
}; | ||
|
||
/** | ||
* @brief The algorithm used in bloom filter | ||
*/ | ||
struct BloomFilterAlgorithm { | ||
||
// Block-based Bloom filter. | ||
enum class Algorithm { UNDEFINED, SPLIT_BLOCK }; | ||
Algorithm algorithm{Algorithm::SPLIT_BLOCK}; | ||
}; | ||
|
||
/** | ||
* @brief The hash function used in Bloom filter | ||
*/ | ||
struct BloomFilterHash { | ||
// xxHash_64 | ||
enum class Hash { UNDEFINED, XXHASH }; | ||
Hash hash{Hash::XXHASH}; | ||
}; | ||
|
||
/** | ||
* @brief The compression used in the bloom filter | ||
*/ | ||
struct BloomFilterCompression { | ||
enum class Compression { UNDEFINED, UNCOMPRESSED }; | ||
Compression compression{Compression::UNCOMPRESSED}; | ||
}; | ||
|
||
/** | ||
* @brief Bloom filter header struct | ||
* | ||
* The bloom filter data of a column chunk stores this header at the beginning | ||
* followed by the filter bitset. | ||
*/ | ||
struct BloomFilterHeader { | ||
// The size of bitset in bytes | ||
int32_t num_bytes; | ||
// The algorithm for setting bits | ||
BloomFilterAlgorithm algorithm; | ||
// The hash function used for bloom filter | ||
BloomFilterHash hash; | ||
// The compression used in the bloom filter | ||
BloomFilterCompression compression; | ||
}; | ||
|
||
/** | ||
* @brief Thrift-derived struct describing a chunk of data for a particular | ||
* column | ||
|
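For readers of the hunk above, here is one way the comments on `bloom_filter_offset` and `bloom_filter_length` could translate into a read plan. This is a minimal sketch, not code from this PR: `byte_range`, `bloom_filter_read_range`, and the 256-byte fallback size are hypothetical names and values chosen for illustration. The idea is that when the writer recorded the length, the header and bitset can be fetched in a single read. Otherwise a reader would fetch a small prefix, deserialize the `BloomFilterHeader`, and use its `num_bytes` to issue a second read for the bitset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical type for this sketch: a contiguous range of bytes in the file.
struct byte_range {
  std::int64_t offset;
  std::int64_t size;
};

// Assumed upper bound on a serialized BloomFilterHeader. Illustrative only.
constexpr std::int64_t assumed_max_header_bytes = 256;

// Hypothetical helper: choose the first byte range to read for a column chunk's
// bloom filter. Returns nullopt when the chunk has no usable bloom filter offset.
std::optional<byte_range> bloom_filter_read_range(
  std::optional<std::int64_t> bloom_filter_offset,
  std::optional<std::int32_t> bloom_filter_length,
  std::int64_t file_size)
{
  assert(file_size >= 0);
  if (!bloom_filter_offset.has_value()) { return std::nullopt; }

  std::int64_t const offset = *bloom_filter_offset;
  if (offset < 0 || offset >= file_size) { return std::nullopt; }
  std::int64_t const remaining = file_size - offset;

  // Length present (newer writers): header and bitset fit in one read.
  if (bloom_filter_length.has_value() && *bloom_filter_length > 0) {
    std::int64_t const length = *bloom_filter_length;
    return byte_range{offset, std::min(length, remaining)};
  }

  // Length absent (older files): read only enough to parse the header first.
  return byte_range{offset, std::min(assumed_max_header_bytes, remaining)};
}
```

With the length present, a call such as `bloom_filter_read_range(100, 64, 1000)` yields the range starting at byte 100 with size 64. Without it, the returned size is the assumed header bound, clamped to the end of the file.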
Fields from spec: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L843
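For context on what `SPLIT_BLOCK` and `XXHASH` imply for a reader, below is a host-side sketch of the split-block Bloom filter probe as I understand it from the Parquet format specification. It is not the code in this PR, which runs on the GPU. The class name `split_block_bloom_filter` and its methods are hypothetical. The sketch assumes the caller has already computed the 64-bit xxHash (seed 0) of the plain-encoded value, and it stores blocks as host `uint32_t` words rather than parsing the little-endian bitset bytes from a file.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a split-block Bloom filter.
// Each block is 256 bits, stored as eight 32-bit words.
class split_block_bloom_filter {
 public:
  explicit split_block_bloom_filter(std::uint32_t num_blocks) : blocks_(num_blocks)
  {
    assert(num_blocks > 0);
  }

  // Set one bit in each of the eight words of the block selected by the hash.
  void insert(std::uint64_t hash)
  {
    auto& blk       = blocks_[block_index(hash)];
    auto const mask = make_mask(static_cast<std::uint32_t>(hash));
    for (std::size_t i = 0; i < 8; ++i) { blk[i] |= mask[i]; }
  }

  // False means the value is definitely absent. True means it may be present.
  bool might_contain(std::uint64_t hash) const
  {
    auto const& blk = blocks_[block_index(hash)];
    auto const mask = make_mask(static_cast<std::uint32_t>(hash));
    for (std::size_t i = 0; i < 8; ++i) {
      if ((blk[i] & mask[i]) == 0) { return false; }
    }
    return true;
  }

 private:
  using block = std::array<std::uint32_t, 8>;

  // Salt constants listed in the Parquet Bloom filter specification.
  static constexpr std::array<std::uint32_t, 8> salt = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
                                                        0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
                                                        0x9efc4947U, 0x5c6bfb31U};

  // The upper 32 bits of the hash pick the block via a multiply-shift.
  std::size_t block_index(std::uint64_t hash) const
  {
    return static_cast<std::size_t>(((hash >> 32) * blocks_.size()) >> 32);
  }

  // The lower 32 bits of the hash pick one bit position per word.
  static block make_mask(std::uint32_t x)
  {
    block mask{};
    for (std::size_t i = 0; i < 8; ++i) {
      std::uint32_t const y = x * salt[i];  // wraps modulo 2^32
      mask[i]               = std::uint32_t{1} << (y >> 27);
    }
    return mask;
  }

  std::vector<block> blocks_;  // value-initialized, so every bit starts at zero
};
```

In a reader, a `false` result from `might_contain` for an equality literal would allow the corresponding row group to be skipped, while `true` only means the row group still has to be read.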