Hyper log log plus plus(HLL++) #2522

res-life · 2024-10-21T12:45:50Z

Add support for HyperLogLog++ (HLL++).

Depends on:

Signed-off-by: Chong Gao [email protected]

ttnghia · 2024-11-01T03:14:10Z

src/main/cpp/src/HLLPP.cu

+                                                         rmm::cuda_stream_view stream,
+                                                         rmm::device_async_resource_ref mr)
+{
+  CUDF_EXPECTS(precision >= 4 && precision <= 18, "HLL++ requires precision in range: [4, 18]");


We can use std::numeric_limits<>::digits instead of the hardcoded values 4 and 18.

cuCo hardcodes 4, and Spark also hardcodes 4.
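A minimal host-side sketch of the range check under discussion, using the bounds 4 and 18 from the diff above. The names check_precision, MIN_PRECISION and MAX_PRECISION are hypothetical, and std::invalid_argument merely stands in for the CUDF_EXPECTS macro used in the PR.

```cpp
#include <stdexcept>

// Bounds taken from the CUDF_EXPECTS call quoted in the diff.
constexpr int MIN_PRECISION = 4;
constexpr int MAX_PRECISION = 18;

// Hypothetical helper: rejects precisions outside [4, 18].
void check_precision(int precision)
{
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw std::invalid_argument("HLL++ requires precision in range: [4, 18]");
  }
}
```

Naming the bounds once keeps the check and its error message from drifting apart if the limits change.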

ttnghia · 2024-11-01T03:16:48Z

src/main/cpp/src/HLLPP.cu

+  auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
+  auto d_inputs   = cudf::detail::make_device_uvector_async(input_cols, stream, mr);
+  auto result     = cudf::make_numeric_column(
+    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);


Do we need such an all-valid null mask? How about cudf::mask_state::UNALLOCATED?

I tested Spark's behavior: approx_count_distinct(null) returns 0,
so the values in the result column are always non-null.

I meant that if all rows are valid, we don't need to allocate a null mask.
BTW, we need to pass mr to the returned column (but not to the intermediate vectors/columns).

Suggested change

-    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
+    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::UNALLOCATED, stream, mr);

ttnghia · 2024-11-01T03:17:16Z

src/main/cpp/src/HLLPP.cu

+  auto result     = cudf::make_numeric_column(
+    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
+  // evaluate from struct<long, ..., long>
+  thrust::for_each_n(rmm::exec_policy(stream),


Try to use exec_policy_nosync as much as possible.

Suggested change

-  thrust::for_each_n(rmm::exec_policy(stream),
+  thrust::for_each_n(rmm::exec_policy_nosync(stream),

ttnghia · 2024-11-01T03:19:15Z

src/main/java/com/nvidia/spark/rapids/jni/HLLPP.java

+   * The input sketch values must be given in the format `LIST<INT8>`.
+   *
+   * @param input         The sketch column which constains `LIST<INT8> values.


INT8 or INT64?

In addition, in estimate_from_hll_sketches I see that the input is STRUCT<LONG, LONG, ....> instead of LIST<>. Why?

It's STRUCT<LONG, LONG, ....>, consistent with Spark. The input is columnar data: for example, sketch 0 is composed of the values at index 0 of all the children.
Updated the function comments, refer to commit
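To make the columnar layout described in this reply concrete, here is a small host-side sketch. The struct_of_longs alias and the gather_sketch function are hypothetical stand-ins (plain std::vector instead of a cudf struct column); they only illustrate that the sketch for a given row is spread across all children.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a STRUCT<INT64, INT64, ...> column:
// children[c][row] is the value of child column c at the given row.
using struct_of_longs = std::vector<std::vector<std::int64_t>>;

// Collect the packed longs that together form the sketch stored at `row`,
// i.e. one value from every child column.
std::vector<std::int64_t> gather_sketch(struct_of_longs const& children, std::size_t row)
{
  std::vector<std::int64_t> sketch;
  sketch.reserve(children.size());
  for (auto const& child : children) {
    sketch.push_back(child.at(row));  // at() throws if `row` is out of range
  }
  return sketch;
}
```

With three children holding {1, 2}, {10, 20} and {100, 200}, sketch 0 is {1, 10, 100} and sketch 1 is {2, 20, 200}.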

Signed-off-by: Chong Gao <[email protected]>

res-life · 2024-11-26T10:38:25Z

Ready for review, except for the test cases.

ttnghia · 2024-12-13T21:46:14Z

src/main/cpp/CMakeLists.txt

@@ -196,6 +196,7 @@ add_library(
  src/HashJni.cpp
  src/HistogramJni.cpp
  src/HostTableJni.cpp
+  src/HLLPPJni.cpp


Let's try to be generic.

Suggested change

-  src/HLLPPJni.cpp
+  src/AggregationJni.cpp

Renamed to: HLLPPHostUDFJni
AggregationJni is too generic

ttnghia · 2024-12-13T21:46:42Z

src/main/cpp/CMakeLists.txt

@@ -204,6 +205,7 @@ add_library(
  src/SparkResourceAdaptorJni.cpp
  src/SubStringIndexJni.cpp
  src/ZOrderJni.cpp
+  src/HLLPP.cu


How about HyperLogLogPP?

Suggested change

-  src/HLLPP.cu
+  src/HyperLogLogPP.cu

The same name should also be applied to the .hpp and .java files.

ttnghia · 2024-12-13T21:46:57Z

src/main/cpp/src/HLLPP.cu

@@ -0,0 +1,102 @@
+/*
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.


Suggested change

- * Copyright (c) 2023-2024, NVIDIA CORPORATION.
+ * Copyright (c) 2024-2025, NVIDIA CORPORATION.

ttnghia · 2024-12-13T21:54:23Z

src/main/cpp/src/HLLPP.cu

+  int64_t shift_mask = MASK << (REGISTER_VALUE_BITS * reg_idx);
+  int64_t v          = (long_10_registers & shift_mask) >> (REGISTER_VALUE_BITS * reg_idx);


Suggested change

-  int64_t shift_mask = MASK << (REGISTER_VALUE_BITS * reg_idx);
-  int64_t v          = (long_10_registers & shift_mask) >> (REGISTER_VALUE_BITS * reg_idx);
+  auto const shift_bits = REGISTER_VALUE_BITS * reg_idx;
+  auto const shift_mask = MASK << shift_bits;
+  auto const v          = (long_10_registers & shift_mask) >> shift_bits;
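A minimal host-side sketch of the extraction this suggestion refactors. It assumes MASK is the 6-bit value 0x3F (its definition is not shown in the diff; the 6-bit register width comes from the Javadoc in this PR) and that reg_idx lies in [0, 10). The function name get_register_value is hypothetical.

```cpp
#include <cstdint>

// Assumed constants: each register value occupies 6 bits, so the mask is 0x3F.
constexpr int REGISTER_VALUE_BITS = 6;
constexpr std::int64_t MASK       = (std::int64_t{1} << REGISTER_VALUE_BITS) - 1;

// Extract the register at position reg_idx (0..9) from a long packing 10 registers.
constexpr std::int64_t get_register_value(std::int64_t long_10_registers, int reg_idx)
{
  // With reg_idx <= 9 the shift is at most 54, so the mask never reaches the sign bit.
  return (long_10_registers & (MASK << (REGISTER_VALUE_BITS * reg_idx))) >>
         (REGISTER_VALUE_BITS * reg_idx);
}
```

Ten 6-bit registers use 60 of the 64 bits, which is why the shifted mask stays clear of the sign bit and the right shift never sees a negative value.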

ttnghia · 2024-12-13T21:56:12Z

src/main/cpp/src/HLLPP.cu

+}
+
+struct estimate_fn {
+  cudf::device_span<int64_t const*> sketch_longs;


Suggested change

-  cudf::device_span<int64_t const*> sketch_longs;
+  cudf::device_span<int64_t const*> sketches;

ttnghia · 2024-12-13T21:57:15Z

src/main/cpp/src/HLLPP.cu

+  int const precision;
+  int64_t* const out;


We now favor non-const members so the functor can be moved by the compiler if needed.
In addition, member variables need to be sorted by their sizes to reduce padding.

Suggested change

-  int const precision;
-  int64_t* const out;
+  int64_t* out;
+  int precision;
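The two points in this comment can be checked on the host with standard type traits. The struct names below are hypothetical and reduced to the two members quoted in the diff. The padding illustration adds a third, hypothetical int member, because with only one int and one pointer both orderings usually end up the same size.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical functor with const members, as in the original diff.
struct estimate_fn_const_members {
  int const precision;
  std::int64_t* const out;
};

// Hypothetical functor with plain members, as in the suggestion.
struct estimate_fn_plain_members {
  std::int64_t* out;
  int precision;
};

// Const data members delete the implicit assignment operators.
static_assert(!std::is_copy_assignable<estimate_fn_const_members>::value, "not assignable");
static_assert(!std::is_move_assignable<estimate_fn_const_members>::value, "not assignable");
static_assert(std::is_move_assignable<estimate_fn_plain_members>::value, "assignable");

// Padding illustration: a pointer between two ints forces padding on typical
// 64-bit targets, while placing the largest member first does not.
struct interleaved {
  int a;
  std::int64_t* p;
  int b;
};
struct sorted_by_size {
  std::int64_t* p;
  int a;
  int b;
};
static_assert(sizeof(sorted_by_size) <= sizeof(interleaved), "sorting never grows the struct");
```

On a typical LP64 platform the interleaved layout occupies 24 bytes and the sorted one 16, though the exact sizes are implementation-defined.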

ttnghia · 2024-12-13T21:59:47Z

src/main/cpp/src/HLLPP.cu

+
+  __device__ void operator()(cudf::size_type const idx) const
+  {
+    auto const num_regs = 1ull << precision;


This seems to be compared with a signed int later, so it should not be unsigned here.

Suggested change

-    auto const num_regs = 1ull << precision;
+    auto const num_regs = 1 << precision;
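A short sketch of why the operand type matters here. The function names are hypothetical, and the explicit cast spells out the conversion that the compiler would otherwise apply implicitly when a signed int is compared with an unsigned long long.

```cpp
#include <type_traits>

// The type of a shift expression is the (promoted) type of its left operand.
static_assert(std::is_same<decltype(1ull << 9), unsigned long long>::value, "unsigned result");
static_assert(std::is_same<decltype(1 << 9), int>::value, "signed result");

// Mirrors `i < (1ull << precision)`: the usual arithmetic conversions turn the
// signed `i` into unsigned long long, written out explicitly here.
bool below_num_regs_unsigned(int i, int precision)
{
  return static_cast<unsigned long long>(i) < (1ull << precision);
}

// Mirrors `i < (1 << precision)`: both operands stay signed.
// Assumes precision is small enough (for example <= 18) that the shift fits in an int.
bool below_num_regs_signed(int i, int precision) { return i < (1 << precision); }
```

For a negative index such as -1 the unsigned comparison reports false, because -1 converts to a very large unsigned value, while the signed comparison reports true.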

ttnghia · 2024-12-13T22:22:43Z

src/main/cpp/src/HLLPP.cu

+                                                         rmm::cuda_stream_view stream,
+                                                         rmm::device_async_resource_ref mr)
+{
+  CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision is bigger than 4.");


Suggested change

-  CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision is bigger than 4.");
+  CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision bigger than 4.");

ttnghia · 2024-12-13T22:23:35Z

src/main/cpp/src/HLLPP.cu

+  auto const input_iter = cudf::detail::make_counting_transform_iterator(
+    0, [&](int i) { return input.child(i).begin<int64_t>(); });


We need a CUDF_EXPECTS to check for input type too (struct of longs).

ttnghia · 2024-12-13T22:25:10Z

src/main/cpp/src/HLLPP.cu

+  auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
+  auto d_inputs   = cudf::detail::make_device_uvector_async(input_cols, stream, mr);


Suggested change

-  auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
-  auto d_inputs   = cudf::detail::make_device_uvector_async(input_cols, stream, mr);
+  auto const h_input_ptrs = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
+  auto const d_input_ptrs = cudf::detail::make_device_uvector_async(h_input_ptrs, stream, cudf::get_current_device_resource_ref());

Regarding cudf::get_current_device_resource_ref(): why not use the mr passed in?

ttnghia · 2024-12-13T22:33:13Z

src/main/cpp/src/HLLPP.hpp

+
+#include <cudf/column/column.hpp>
+#include <cudf/column/column_view.hpp>
+#include <cudf/utilities/default_stream.hpp>


Suggested change

 #include <cudf/utilities/default_stream.hpp>
+#include <cudf/utilities/memory_resource.hpp>

ttnghia · 2024-12-13T22:33:25Z

src/main/cpp/src/HLLPP.hpp

+#include <cudf/utilities/default_stream.hpp>
+
+#include <rmm/cuda_stream_view.hpp>
+#include <rmm/resource_ref.hpp>


Suggested change

-#include <rmm/resource_ref.hpp>

ttnghia · 2024-12-13T22:33:43Z

src/main/cpp/src/HLLPP.hpp

+  cudf::column_view const& input,
+  int precision,
+  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
+  rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());


Suggested change

-  rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());
+  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

ttnghia · 2024-12-13T22:34:52Z

src/main/java/com/nvidia/spark/rapids/jni/HLLPP.java

+/**
+ * HyperLogLogPlusPlus
+ */
+public class HLLPP {


Suggested change

-public class HLLPP {
+public class AggregationUtils {

AggregationUtils is too generic. Is HyperLogLogPlusPlusHostUDF OK?

ttnghia · 2024-12-13T22:35:49Z

src/main/java/com/nvidia/spark/rapids/jni/HLLPP.java

+  /**
+   * Compute the approximate count distinct value from sketch values.
+   * <p>
+   * The input sketch values must be given in the format `Struct<INT64, INT64, ...>`,
+   * The num of children is: num_registers_per_sketch / 10 + 1, here 10 means a INT64 contains
+   * max 10 registers. Register value is 6 bits. The input is columnar data, e.g.: sketch 0
+   * is composed of by all the data of the children at index 0.
+   *
+   * @param input         The sketch column which constains Struct<INT64, INT64, ...> values.
+   * @param precision     The num of bits for addressing.
+   * @return A INT64 column with each value indicates the approximate count distinct value.
+   */
+  public static ColumnVector estimateDistinctValueFromSketches(ColumnView input, int precision) {
+    return new ColumnVector(estimateDistinctValueFromSketches(input.getNativeView(), precision));
+  }
+
+  private static native long estimateDistinctValueFromSketches(long inputHandle, int precision);


I think this Java interface will no longer be needed after converting the code to use HOST_UDF.

Rename to: HyperLogLogPlusPlusHostUDF
It is now used to create the UDF and to call the estimate function through JNI.
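The sizing rule stated in the Javadoc above (children = num_registers_per_sketch / 10 + 1) can be sketched as plain integer arithmetic. All names below are hypothetical. The mapping from a register id to a (long, slot) pair is an assumption about the packing order, not something the thread confirms.

```cpp
// Each INT64 packs at most 10 registers of 6 bits each (from the Javadoc).
constexpr int REGISTERS_PER_LONG = 10;

// Number of registers for a given precision (number of addressing bits).
constexpr int num_registers(int precision) { return 1 << precision; }

// Number of INT64 children per sketch, following the Javadoc formula.
constexpr int num_longs_per_sketch(int precision)
{
  return num_registers(precision) / REGISTERS_PER_LONG + 1;
}

// Assumed packing: register r lives in long r / 10, slot r % 10.
struct register_location {
  int long_idx;
  int reg_idx;
};

constexpr register_location locate_register(int register_id)
{
  return register_location{register_id / REGISTERS_PER_LONG, register_id % REGISTERS_PER_LONG};
}
```

For precision 9 this gives 512 registers and 52 longs per sketch. Under the assumed packing, the last register (id 511) falls in long 51, slot 1.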

res-life · 2024-12-17T13:11:38Z

Now, this PR is using Host UDF.
Will fix the comments ASAP.

res-life · 2024-12-17T13:15:00Z

build

res-life · 2024-12-17T13:19:59Z

Verified Host UDF successfully via NVIDIA/spark-rapids#11638

ttnghia · 2024-12-18T21:45:54Z

Need to wait for the dependencies to be merged first before we can build.

sperlingxx · 2024-12-23T02:24:57Z

src/main/cpp/src/hyper_log_log_plus_plus.cu

+  int64_t const precision,                          // num of bits for register addressing, e.g.: 9
+  int* const registers_output_cache,                // num is num_groups * num_registers_per_sketch
+  int* const registers_thread_cache,                // num is num_threads * num_registers_per_sketch
+  cudf::size_type* const group_lables_thread_cache  // save the group lables for each thread


nit: labels?

res-life requested a review from ttnghia October 21, 2024 12:45

res-life force-pushed the hll branch 3 times, most recently from b6f5cf5 to 526a61f Compare October 31, 2024 11:34

res-life changed the title ~~[Do not review] Hyper log log plus plus(HLL++)~~ Hyper log log plus plus(HLL++) Oct 31, 2024

res-life force-pushed the hll branch from 526a61f to b7abf6e Compare October 31, 2024 12:47

ttnghia reviewed Nov 1, 2024

Chong Gao added 3 commits November 21, 2024 13:26

Add HLL++ evaluation function

03c0f5a

Update function comments

df8b223

Fix

2daca3f

res-life force-pushed the hll branch from 11e97a9 to 2daca3f Compare November 21, 2024 07:33

res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53

res-life mentioned this pull request Nov 26, 2024

[WIP] Add support for Hyper Log Log PLus Plus(HLL++) NVIDIA/spark-rapids#11638

Draft

Chong Gao added 2 commits November 26, 2024 15:43

Use exec_policy_nosync instead of exec_policy

3afdfde

Format code; Remove a useless file

956af39

Signed-off-by: Chong Gao <[email protected]>

res-life force-pushed the hll branch from c7da8ed to 956af39 Compare November 26, 2024 07:51

Merge branch 'branch-25.02' into hll

8aaf0f6

ttnghia reviewed Dec 13, 2024

Chong Gao added 3 commits December 17, 2024 20:56

Use UDF

5bfb544

Use UDF

f8c6a02

Use UDF

208d67e

res-life mentioned this pull request Dec 17, 2024

[Do not Review] Support hyper log log plus plus(HLL++) rapidsai/cudf#17133

Closed

10 tasks

res-life marked this pull request as ready for review December 17, 2024 13:20

Chong Gao added 2 commits December 18, 2024 17:17

Address comments

e29d5a1

Merge branch 'branch-25.02' into hll

9f7ec44

Chong Gao added 3 commits December 19, 2024 19:20

Merge branch 'branch-25.02' into hll

3c70a30

Fix compile error

3e22512

Handle null inputs: must ignore the null input values

aa7ca68

sperlingxx reviewed Dec 23, 2024

Rename refactor: Correct spelling errors

f0970c0
