Support reading bloom filters from Parquet files and filter row groups using them #17289

Open
wants to merge 102 commits into
base: branch-25.02

Conversation

@mhaseeb123 (Member) commented Nov 9, 2024

Description

This PR adds support for reading bloom filters from Parquet files and using them to filter row groups based on `col == literal`-style predicate(s), if provided.
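
To illustrate the idea, here is a minimal, self-contained sketch of pruning row groups with one bloom filter per row group. It is not the code in this PR: `toy_bloom_filter`, `mix64`, and `surviving_row_groups` are hypothetical names, the filter is a plain bit-vector filter rather than Parquet's split-block layout, and a splitmix64-style mixer stands in for the xxHash64 hash that the Parquet format specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in 64-bit mixer (splitmix64 finalizer). Assumption: used only for this
// sketch; Parquet bloom filters are specified with xxHash64 instead.
inline std::uint64_t mix64(std::uint64_t x)
{
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Toy bloom filter over int64 values (hypothetical; not the split-block format).
class toy_bloom_filter {
 public:
  explicit toy_bloom_filter(std::size_t num_bits, int num_hashes = 3)
    : bits_(num_bits, false), num_hashes_(num_hashes)
  {
    assert(num_bits > 0 && num_hashes > 0);
  }

  void insert(std::int64_t value)
  {
    for (int i = 0; i < num_hashes_; ++i) { bits_[probe(value, i)] = true; }
  }

  // false: the value is definitely absent. true: the value may be present.
  bool may_contain(std::int64_t value) const
  {
    for (int i = 0; i < num_hashes_; ++i) {
      if (!bits_[probe(value, i)]) { return false; }
    }
    return true;
  }

 private:
  // Double hashing: the i-th probe is h1 + i * h2, reduced modulo the bit count.
  std::size_t probe(std::int64_t value, int i) const
  {
    std::uint64_t const h1 = mix64(static_cast<std::uint64_t>(value));
    std::uint64_t const h2 = (h1 >> 32) | 1u;
    return static_cast<std::size_t>((h1 + static_cast<std::uint64_t>(i) * h2) % bits_.size());
  }

  std::vector<bool> bits_;
  int num_hashes_;
};

// Returns the indices of row groups that may satisfy `col == literal`,
// given one filter per row group for that column.
std::vector<std::size_t> surviving_row_groups(std::vector<toy_bloom_filter> const& filters,
                                              std::int64_t literal)
{
  std::vector<std::size_t> kept;
  for (std::size_t rg = 0; rg < filters.size(); ++rg) {
    if (filters[rg].may_contain(literal)) { kept.push_back(rg); }
  }
  return kept;
}
```

A bloom filter can report false positives but never false negatives, so a row group is skipped only when the literal is certainly absent from it; a surviving row group still has to be read and filtered row by row.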

Related to #17164

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Nov 9, 2024
@mhaseeb123 added the 2 - In Progress (Currently a work in progress), cuIO (cuIO issue), cuco (cuCollections related issue), feature request (New feature or request), and non-breaking (Non-breaking change) labels Nov 9, 2024
@PointKernel (Member) left a comment

LGTM

ast_operator::NOT, _bloom_filter_expr.push(ast::operation{ast_operator::NOT, value})});
}
// For all other expressions, push an always true expression
else {
@mhaseeb123 (Member, Author) commented

@karthikeyann @vuule I added this logic to handle any expressions in the filter that are not of the `col == lit` form. Essentially, it just transforms all of them into an always-true expression.
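
The rule described above can be sketched with a toy expression tree. Everything here is hypothetical and independent of libcudf: `node`, `node_kind`, and `may_match` are illustrative names, exact value sets stand in for the per-column bloom filters, and the sketch evaluates the tree on the fly instead of building a transformed tree. It also covers only `AND`, `OR`, `==`, and `<`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <utility>
#include <vector>

enum class node_kind { COLUMN, LITERAL, EQUAL, LESS, AND, OR };

// Toy expression node (hypothetical; not cudf::ast).
struct node {
  node_kind kind;
  int column_index;                      // meaningful for COLUMN only
  long value;                            // meaningful for LITERAL only
  std::shared_ptr<node const> lhs, rhs;  // meaningful for operations only
};
using node_ptr = std::shared_ptr<node const>;

node_ptr column(int index)
{
  return std::make_shared<node>(node{node_kind::COLUMN, index, 0, nullptr, nullptr});
}
node_ptr literal(long value)
{
  return std::make_shared<node>(node{node_kind::LITERAL, 0, value, nullptr, nullptr});
}
node_ptr binary(node_kind kind, node_ptr lhs, node_ptr rhs)
{
  return std::make_shared<node>(node{kind, 0, 0, std::move(lhs), std::move(rhs)});
}

// Returns false only when the row group certainly has no matching row.
// `values_per_column[i]` is the set of values of column i in one row group
// (assumption: an exact set replaces the probabilistic bloom filter).
bool may_match(node const& n, std::vector<std::set<long>> const& values_per_column)
{
  switch (n.kind) {
    case node_kind::AND:
      return may_match(*n.lhs, values_per_column) && may_match(*n.rhs, values_per_column);
    case node_kind::OR:
      return may_match(*n.lhs, values_per_column) || may_match(*n.rhs, values_per_column);
    case node_kind::EQUAL:
      if (n.lhs->kind == node_kind::COLUMN && n.rhs->kind == node_kind::LITERAL) {
        auto const& present = values_per_column.at(static_cast<std::size_t>(n.lhs->column_index));
        return present.count(n.rhs->value) > 0;
      }
      return true;  // an equality that is not col == lit: cannot prune
    default:
      return true;  // every other expression is treated as always true
  }
}
```

Treating unsupported expressions as always true is the conservative choice: it can only keep a row group that might have been pruned, never drop one that holds matching rows.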

* @brief Collects lists of equality predicate literals in the AST expression, one list per input
* table column. This is used in row group filtering based on bloom filters.
*/
class equality_literals_collector : public ast::detail::expression_transformer {
@mhaseeb123 (Member, Author) commented Dec 13, 2024

No need for an ast::tree in this expression converter as we only visit and collect literals for col == lit expressions.
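
As a rough picture of what such a collector does, the following self-contained sketch walks a toy expression tree and gathers the literals of every `col == lit` node into one list per column, without building any output tree. The types and functions (`node`, `make_op`, `collect_equality_literals`) are hypothetical and are not the libcudf classes named above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

enum class node_kind { COLUMN, LITERAL, EQUAL, OTHER };

// Toy expression node (hypothetical; not cudf::ast).
struct node {
  node_kind kind;
  int column_index;                      // meaningful for COLUMN only
  long value;                            // meaningful for LITERAL only
  std::shared_ptr<node const> lhs, rhs;  // meaningful for EQUAL and OTHER only
};
using node_ptr = std::shared_ptr<node const>;

node_ptr column(int index)
{
  return std::make_shared<node>(node{node_kind::COLUMN, index, 0, nullptr, nullptr});
}
node_ptr literal(long value)
{
  return std::make_shared<node>(node{node_kind::LITERAL, 0, value, nullptr, nullptr});
}
node_ptr make_op(node_kind kind, node_ptr lhs, node_ptr rhs)
{
  return std::make_shared<node>(node{kind, 0, 0, std::move(lhs), std::move(rhs)});
}

// Visits the tree and appends each `col == lit` literal to that column's list.
void collect(node const& n, std::vector<std::vector<long>>& literals)
{
  bool const is_col_eq_lit = n.kind == node_kind::EQUAL && n.lhs && n.rhs &&
                             n.lhs->kind == node_kind::COLUMN &&
                             n.rhs->kind == node_kind::LITERAL;
  if (is_col_eq_lit) {
    auto const col = static_cast<std::size_t>(n.lhs->column_index);
    assert(col < literals.size());
    literals[col].push_back(n.rhs->value);
    return;
  }
  // Any other node: keep visiting children, but collect nothing from the node itself.
  if (n.lhs) { collect(*n.lhs, literals); }
  if (n.rhs) { collect(*n.rhs, literals); }
}

// Returns one list of equality literals per input column.
std::vector<std::vector<long>> collect_equality_literals(node const& root, std::size_t num_columns)
{
  std::vector<std::vector<long>> literals(num_columns);
  collect(root, literals);
  return literals;
}
```

The per-column lists are what a later step would probe against each row group's bloom filter for that column.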

*/
std::reference_wrapper<ast::expression const> visit(ast::literal const& expr) override
{
return expr;
@mhaseeb123 (Member, Author) commented

There is also no need to push any of these to the ast::tree from the child class bloom_filter_expression_converter, as these columns and literals don't participate in the transformed expression tree.

@vuule (Contributor) left a comment

few small comments, looks good overall :)

cpp/src/io/parquet/bloom_filter_reader.cu (resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
@mhaseeb123 added the 4 - Needs Review (Waiting for reviewer to review or respond) label and removed the 3 - Ready for Review (Ready for review by team) label Dec 14, 2024
mhaseeb123 added a commit to mhaseeb123/rapids-cmake that referenced this pull request Dec 18, 2024
Bump cuco by one commit which contains updates to unblock rapidsai/cudf#17289
@mhaseeb123 added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label Dec 18, 2024
rapids-bot bot pushed a commit to rapidsai/rapids-cmake that referenced this pull request Dec 18, 2024
#735)

Simply bump cuco by one commit which contains updates to unblock rapidsai/cudf#17289

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #735
@mhaseeb123 removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label Dec 18, 2024
rapids-bot bot pushed a commit that referenced this pull request Dec 20, 2024
…st::tree` (#17587)

This PR simplifies the StatsAST expression transformer in the Parquet reader's predicate pushdown using `ast::tree` from #17156.

This PR is a follow-up to @bdice's comment on #17289. Similar changes for the `BloomfilterAST` expression converter have been incorporated in PR #17289.

Related to #17164

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #17587
Labels
4 - Needs Review (Waiting for reviewer to review or respond); CMake (CMake build issue); cuco (cuCollections related issue); cuIO (cuIO issue); feature request (New feature or request); libcudf (Affects libcudf (C++/CUDA) code); non-breaking (Non-breaking change); Python (Affects Python cuDF API)
Projects
Status: In Progress
5 participants