Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HOST_UDF aggregation for reduction and segmented reduction #17645

Open
wants to merge 43 commits into
base: branch-25.02
Choose a base branch
from

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Dec 20, 2024

Following #17592, this enables HOST_UDF aggregation in reduction and segmented reduction, allowing to execute a host-side user-defined function (UDF) through libcudf aggregation framework.

Closes #16633.

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Dec 20, 2024
@ttnghia ttnghia self-assigned this Dec 20, 2024
@ttnghia ttnghia requested review from a team as code owners December 20, 2024 19:25
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 29, 2024

The logic for host_udf_base interface is changed in this PR. Instead of one universal interface for aggregation contexts (groupby/reduction/segmented reduction) as designed in the previous work, now there are three separate interfaces for them. Implementation for each context needs to derive from the corresponding interface.

Because the interfaces are no longer designed to be universal for all contexts, they are much simpler, avoiding potential bugs due to universal usage of one interface for different contexts.

Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@@ -1,6 +1,6 @@
/*
*
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
* Copyright (c) 2020-2025, NVIDIA CORPORATION.
Copy link
Contributor Author

@ttnghia ttnghia Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the copyright header for the files that were forgotten in the previous PR. Github CI seems does not check Java files.

cpp/include/cudf/aggregation/host_udf.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/aggregation/host_udf.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/aggregation/host_udf.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/aggregation/host_udf.hpp Outdated Show resolved Hide resolved
@ttnghia ttnghia requested a review from a team as a code owner January 10, 2025 05:14
@github-actions github-actions bot added the CMake CMake build issue label Jan 10, 2025
ttnghia and others added 4 commits January 9, 2025 21:22
@@ -256,7 +256,7 @@ public static ReductionAggregation collectSet() {
* @param nanEquality Flag to specify whether NaN values in floating point column should be considered equal.
*/
public static ReductionAggregation collectSet(NullPolicy nullPolicy,
NullEquality nullEquality, NaNEquality nanEquality) {
NullEquality nullEquality, NaNEquality nanEquality) {
Copy link
Contributor

@sperlingxx sperlingxx Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these style adjustments necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. This is what the IDE automatically reformat. I checked other files and this seems consistent. For example:

public ColumnVector(DType type, long rows, Optional<Long> nullCount,
DeviceMemoryBuffer dataBuffer, DeviceMemoryBuffer validityBuffer,
DeviceMemoryBuffer offsetBuffer, List<DeviceMemoryBuffer> toClose, long[] childHandles) {

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a java standpoint the code matches what was there before.

Copy link
Contributor

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved trivial CMake changes

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really happy with the final outcome of this PR. The finalized APIs are much more accessible and user-friendly. I appreciate your prompt responses to the review comments and the time you took for offline discussions to address all my concerns and answer my questions.

Looks good to me!

cpp/include/cudf/aggregation/host_udf.hpp Outdated Show resolved Hide resolved
@ttnghia ttnghia requested a review from nvdbaranec January 10, 2025 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Status: Burndown
Development

Successfully merging this pull request may close these issues.

[FEA] Allow to run groupby/reduction with externally derived aggregations
7 participants