Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hyper log log plus plus(HLL++) #2522

Open
wants to merge 15 commits into
base: branch-25.02
Choose a base branch
from
Open

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Oct 21, 2024

@res-life res-life requested a review from ttnghia October 21, 2024 12:45
@res-life res-life force-pushed the hll branch 3 times, most recently from b6f5cf5 to 526a61f Compare October 31, 2024 11:34
@res-life res-life changed the title [Do not review] Hyper log log plus plus(HLL++) Hyper log log plus plus(HLL++) Oct 31, 2024
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
{
CUDF_EXPECTS(precision >= 4 && precision <= 18, "HLL++ requires precision in range: [4, 18]");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use std::numeric_limits<>::digits instead of hardcoded values 4 and 18.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuCo hardcoded 4, and Spark also hardcoded 4.

auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
auto d_inputs = cudf::detail::make_device_uvector_async(input_cols, stream, mr);
auto result = cudf::make_numeric_column(
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need such all-valid null mask? How about cudf::mask_state::UNALLOCATED?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested Spark behavior, for approx_count_distinct(null) returns 0.
So the values in result column are always non-null

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, if all rows are valid, we don't need to allocate a null mask.
BTW, we need to pass mr to the returning column (but do not pass it to the intermediate vector/column).

Suggested change
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::UNALLOCATED, stream, mr);

auto result = cudf::make_numeric_column(
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
// evaluate from struct<long, ..., long>
thrust::for_each_n(rmm::exec_policy(stream),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to use exec_policy_nosync as much as possible.

Suggested change
thrust::for_each_n(rmm::exec_policy(stream),
thrust::for_each_n(rmm::exec_policy_nosync(stream),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 34 to 36
* The input sketch values must be given in the format `LIST<INT8>`.
*
* @param input The sketch column which constains `LIST<INT8> values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INT8 or INT64?

Copy link
Collaborator

@ttnghia ttnghia Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, in estimate_from_hll_sketches I see that the input is STRUCT<LONG, LONG, ....> instead of LIST<>. Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's STRUCT<LONG, LONG, ....> consistent with Spark. The input is columnar data, e.g.: sketch 0 is composed of by all the data of the children at index 0.
Updated the function comments, refer to commit

@res-life
Copy link
Collaborator Author

Ready to review except test cases.

@@ -196,6 +196,7 @@ add_library(
src/HashJni.cpp
src/HistogramJni.cpp
src/HostTableJni.cpp
src/HLLPPJni.cpp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to be generic.

Suggested change
src/HLLPPJni.cpp
src/AggregationJni.cpp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to: HLLPPHostUDFJni
AggregationJni is too generic

@@ -204,6 +205,7 @@ add_library(
src/SparkResourceAdaptorJni.cpp
src/SubStringIndexJni.cpp
src/ZOrderJni.cpp
src/HLLPP.cu
Copy link
Collaborator

@ttnghia ttnghia Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about HyperLogLogPP?

Suggested change
src/HLLPP.cu
src/HyperLogLogPP.cu

This name is also applied for the .hpp and *.java files.

@@ -0,0 +1,102 @@
/*
* Copyright (c) 2023-2024, NVIDIA CORPORATION.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2023-2024, NVIDIA CORPORATION.
* Copyright (c) 2024-2025, NVIDIA CORPORATION.

Comment on lines 50 to 51
int64_t shift_mask = MASK << (REGISTER_VALUE_BITS * reg_idx);
int64_t v = (long_10_registers & shift_mask) >> (REGISTER_VALUE_BITS * reg_idx);
Copy link
Collaborator

@ttnghia ttnghia Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int64_t shift_mask = MASK << (REGISTER_VALUE_BITS * reg_idx);
int64_t v = (long_10_registers & shift_mask) >> (REGISTER_VALUE_BITS * reg_idx);
auto const shift_bits = REGISTER_VALUE_BITS * reg_idx;
auto const shift_mask = MASK << shift_bits;
auto const v = (long_10_registers & shift_mask) >> shift_bit;

}

struct estimate_fn {
cudf::device_span<int64_t const*> sketch_longs;
Copy link
Collaborator

@ttnghia ttnghia Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cudf::device_span<int64_t const*> sketch_longs;
cudf::device_span<int64_t const*> sketches;

Comment on lines 57 to 58
int const precision;
int64_t* const out;
Copy link
Collaborator

@ttnghia ttnghia Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now favor non-const members so the functor can be moved by the compiler if needed.
In addition, member variables need to be sorted by their sizes to reduce padding.

Suggested change
int const precision;
int64_t* const out;
int64_t* out;
int precision;


__device__ void operator()(cudf::size_type const idx) const
{
auto const num_regs = 1ull << precision;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be used to compare with signed int later, thus it should not be unsigned here.

Suggested change
auto const num_regs = 1ull << precision;
auto const num_regs = 1 << precision;

rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
{
CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision is bigger than 4.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision is bigger than 4.");
CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision bigger than 4.");

Comment on lines 88 to 89
auto const input_iter = cudf::detail::make_counting_transform_iterator(
0, [&](int i) { return input.child(i).begin<int64_t>(); });
Copy link
Collaborator

@ttnghia ttnghia Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a CUDF_EXPECTS to check for input type too (struct of longs).

Comment on lines 90 to 91
auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
auto d_inputs = cudf::detail::make_device_uvector_async(input_cols, stream, mr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
auto d_inputs = cudf::detail::make_device_uvector_async(input_cols, stream, mr);
auto const h_input_ptrs = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
auto const d_input_ptrs = cudf::detail::make_device_uvector_async(input_cols, stream, cudf::get_current_device_resource_ref());

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cudf::get_current_device_resource_ref()):
Why not use the mr passed in?


#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/utilities/default_stream.hpp>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/memory_resource.hpp>

#include <cudf/utilities/default_stream.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/resource_ref.hpp>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <rmm/resource_ref.hpp>

cudf::column_view const& input,
int precision,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());
Copy link
Collaborator

@ttnghia ttnghia Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* HyperLogLogPlusPlus
*/
public class HLLPP {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class HLLPP {
public class AggregationUtils {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AggregationUtils is too generic, is HyperLogLogPlusPlusHostUDF OK?

Comment on lines 31 to 47
/**
* Compute the approximate count distinct value from sketch values.
* <p>
* The input sketch values must be given in the format `Struct<INT64, INT64, ...>`,
* The num of children is: num_registers_per_sketch / 10 + 1, here 10 means a INT64 contains
* max 10 registers. Register value is 6 bits. The input is columnar data, e.g.: sketch 0
* is composed of by all the data of the children at index 0.
*
* @param input The sketch column which constains Struct<INT64, INT64, ...> values.
* @param precision The num of bits for addressing.
* @return A INT64 column with each value indicates the approximate count distinct value.
*/
public static ColumnVector estimateDistinctValueFromSketches(ColumnView input, int precision) {
return new ColumnVector(estimateDistinctValueFromSketches(input.getNativeView(), precision));
}

private static native long estimateDistinctValueFromSketches(long inputHandle, int precision);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if this Java interface will no longer be needed after converting the code to use HOST_UDF.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to: HyperLogLogPlusPlusHostUDF
It now is used to create UDF and do estimate JNI.

@res-life
Copy link
Collaborator Author

Now, this PR is using Host UDF.
Will fix the comments ASAP.

@res-life
Copy link
Collaborator Author

build

@res-life
Copy link
Collaborator Author

Verified Host UDF successfully via NVIDIA/spark-rapids#11638

@res-life res-life marked this pull request as ready for review December 17, 2024 13:20
@ttnghia
Copy link
Collaborator

ttnghia commented Dec 18, 2024

Need to wait for the dependencies to be merged first before we can build.

int64_t const precision, // num of bits for register addressing, e.g.: 9
int* const registers_output_cache, // num is num_groups * num_registers_per_sketch
int* const registers_thread_cache, // num is num_threads * num_registers_per_sketch
cudf::size_type* const group_lables_thread_cache // save the group lables for each thread
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: labels?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants