Implement from_json_to_structs (#2510)
* Implement `castStringsToBooleans`

Signed-off-by: Nghia Truong <[email protected]>

* Implement `removeQuotes`

Signed-off-by: Nghia Truong <[email protected]>

* Rewrite using offsets and chars

Signed-off-by: Nghia Truong <[email protected]>

* Fix empty input

Signed-off-by: Nghia Truong <[email protected]>

* Misc

Signed-off-by: Nghia Truong <[email protected]>

* Add `nullifyIfNotQuoted` option for `removeQuotes`

Signed-off-by: Nghia Truong <[email protected]>

* Implement `castStringsToDecimals`

Signed-off-by: Nghia Truong <[email protected]>

* Implement `removeQuotesForFloats`

Signed-off-by: Nghia Truong <[email protected]>

* Fix `removeQuotesForFloats`

Signed-off-by: Nghia Truong <[email protected]>

* Implement `castStringsToIntegers`

Signed-off-by: Nghia Truong <[email protected]>

* Implement non-legacy `castStringsToDates`

Signed-off-by: Nghia Truong <[email protected]>

* WIP for `cast_strings_to_dates_legacy`

Signed-off-by: Nghia Truong <[email protected]>

* Revert "WIP for `cast_strings_to_dates_legacy`"

This reverts commit dcb463e.

* Fix compile issues

Signed-off-by: Nghia Truong <[email protected]>

* WIP: Implement `from_json_to_structs`

Signed-off-by: Nghia Truong <[email protected]>

* Fix cmake

Signed-off-by: Nghia Truong <[email protected]>

* Fix compile issues

Signed-off-by: Nghia Truong <[email protected]>

* Implement `castStringsToFloats`

Signed-off-by: Nghia Truong <[email protected]>

* WIP

Signed-off-by: Nghia Truong <[email protected]>

* WIP: Implementing `fromJSONToStructs`

Signed-off-by: Nghia Truong <[email protected]>

* Fix compile errors

Signed-off-by: Nghia Truong <[email protected]>

* Cleanup

Signed-off-by: Nghia Truong <[email protected]>

* Revert code as we still need it

* Add error check

Signed-off-by: Nghia Truong <[email protected]>

* Add more comments

Signed-off-by: Nghia Truong <[email protected]>

* Cleanup

Signed-off-by: Nghia Truong <[email protected]>

* Return as-is if the column is date/time

Signed-off-by: Nghia Truong <[email protected]>

* Update test

Signed-off-by: Nghia Truong <[email protected]>

* Update cudf

Signed-off-by: Nghia Truong <[email protected]>

* Revert "Update cudf"

This reverts commit 5d07db1.

* Update cudf

* Update cudf

* Change header

* Rewrite JSONUtils.cpp

* Implement a common function for converting column

Signed-off-by: Nghia Truong <[email protected]>

* Rewrite `convert_data_type`

Signed-off-by: Nghia Truong <[email protected]>

* Remove `cast_strings_to_dates`

Signed-off-by: Nghia Truong <[email protected]>

* Implement `convert_data_type`

Signed-off-by: Nghia Truong <[email protected]>

* Fix compile errors

Signed-off-by: Nghia Truong <[email protected]>

* Add `CUDF_FUNC_RANGE();`

Signed-off-by: Nghia Truong <[email protected]>

* Fix schema

Signed-off-by: Nghia Truong <[email protected]>

* Complete `from_json_to_structs`

Signed-off-by: Nghia Truong <[email protected]>

* Fix null mask

Signed-off-by: Nghia Truong <[email protected]>

* Write Javadoc

Signed-off-by: Nghia Truong <[email protected]>

* Rewrite JNI

Signed-off-by: Nghia Truong <[email protected]>

* Remove deprecated function

Signed-off-by: Nghia Truong <[email protected]>

* Revert test

Signed-off-by: Nghia Truong <[email protected]>

* Remove header

Signed-off-by: Nghia Truong <[email protected]>

* Rewrite Javadoc

Signed-off-by: Nghia Truong <[email protected]>

* Rename variable

Signed-off-by: Nghia Truong <[email protected]>

* Rewrite docs

Signed-off-by: Nghia Truong <[email protected]>

* Revert test

Signed-off-by: Nghia Truong <[email protected]>

* Cleanup headers

Signed-off-by: Nghia Truong <[email protected]>

* Cleanup

Signed-off-by: Nghia Truong <[email protected]>

* Rewrite the conversion functions

Signed-off-by: Nghia Truong <[email protected]>

* Move code

Signed-off-by: Nghia Truong <[email protected]>

* Remove call to `make_structs_column`

Signed-off-by: Nghia Truong <[email protected]>

* Cleanup

Signed-off-by: Nghia Truong <[email protected]>

* Optimize conversion further by not materializing a column unless needed

Signed-off-by: Nghia Truong <[email protected]>

* Rewrite docs and change function name

Signed-off-by: Nghia Truong <[email protected]>

* Reorganize code

Signed-off-by: Nghia Truong <[email protected]>

* Handle schema mismatching

Signed-off-by: Nghia Truong <[email protected]>

* Add test

Signed-off-by: Nghia Truong <[email protected]>

* Add another test

Signed-off-by: Nghia Truong <[email protected]>

* Revert "Add another test"

This reverts commit 8a17651.

* Fix schema mismatch

Signed-off-by: Nghia Truong <[email protected]>

* Cleanup

Signed-off-by: Nghia Truong <[email protected]>

* Add another test

Signed-off-by: Nghia Truong <[email protected]>

* Revert "Add another test"

This reverts commit cf9d6bf.

* Revert "Add test"

This reverts commit 553d7d0.

Signed-off-by: Nghia Truong <[email protected]>

* Add prefix `spark_rapids_jni::`

Signed-off-by: Nghia Truong <[email protected]>

* Remove handling for schema mismatching

Signed-off-by: Nghia Truong <[email protected]>

* Avoid materializing a column when converting strings

Signed-off-by: Nghia Truong <[email protected]>

* Revert "Remove handling for schema mismatching"

This reverts commit d2b6fb5.

* Fix handling for schema mismatching in case of `column_view` input

Signed-off-by: Nghia Truong <[email protected]>

---------

Signed-off-by: Nghia Truong <[email protected]>
ttnghia authored Nov 23, 2024
1 parent 0708bce commit 4080f49
Showing 6 changed files with 1,309 additions and 134 deletions.
3 changes: 2 additions & 1 deletion src/main/cpp/CMakeLists.txt
@@ -207,13 +207,14 @@ add_library(
   src/bloom_filter.cu
   src/case_when.cu
   src/cast_decimal_to_string.cu
-  src/format_float.cu
   src/cast_float_to_string.cu
   src/cast_string.cu
   src/cast_string_to_float.cu
   src/datetime_rebase.cu
   src/decimal_utils.cu
+  src/format_float.cu
   src/from_json_to_raw_map.cu
+  src/from_json_to_structs.cu
   src/get_json_object.cu
   src/histogram.cu
   src/json_utils.cu
126 changes: 97 additions & 29 deletions src/main/cpp/src/JSONUtilsJni.cpp
@@ -166,50 +166,118 @@ JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_extractRawMap
   CATCH_STD(env, 0);
 }
 
-JNIEXPORT jlongArray JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_concatenateJsonStrings(
-  JNIEnv* env, jclass, jlong j_input)
+JNIEXPORT jlong JNICALL
+Java_com_nvidia_spark_rapids_jni_JSONUtils_fromJSONToStructs(JNIEnv* env,
+                                                             jclass,
+                                                             jlong j_input,
+                                                             jobjectArray j_col_names,
+                                                             jintArray j_num_children,
+                                                             jintArray j_types,
+                                                             jintArray j_scales,
+                                                             jintArray j_precisions,
+                                                             jboolean normalize_single_quotes,
+                                                             jboolean allow_leading_zeros,
+                                                             jboolean allow_nonnumeric_numbers,
+                                                             jboolean allow_unquoted_control,
+                                                             jboolean is_us_locale)
 {
   JNI_NULL_CHECK(env, j_input, "j_input is null", 0);
+  JNI_NULL_CHECK(env, j_col_names, "j_col_names is null", 0);
+  JNI_NULL_CHECK(env, j_num_children, "j_num_children is null", 0);
+  JNI_NULL_CHECK(env, j_types, "j_types is null", 0);
+  JNI_NULL_CHECK(env, j_scales, "j_scales is null", 0);
+  JNI_NULL_CHECK(env, j_precisions, "j_precisions is null", 0);
 
   try {
     cudf::jni::auto_set_device(env);
-    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
 
-    // Currently, set `nullify_invalid_rows = false` as `concatenateJsonStrings` is used only for
-    // `from_json` with struct schema.
-    auto [joined_strings, delimiter, should_be_nullify] = spark_rapids_jni::concat_json(
-      cudf::strings_column_view{*input_cv}, /*nullify_invalid_rows*/ false);
-
-    // The output array contains 5 elements:
-    // [0]: address of the cudf::column object `is_valid` in host memory
-    // [1]: address of data buffer of the concatenated strings in device memory
-    // [2]: data length
-    // [3]: address of the rmm::device_buffer object (of the concatenated strings) in host memory
-    // [4]: delimiter char
-    auto out_handles = cudf::jni::native_jlongArray(env, 5);
-    out_handles[0] = reinterpret_cast<jlong>(should_be_nullify.release());
-    out_handles[1] = reinterpret_cast<jlong>(joined_strings->data());
-    out_handles[2] = static_cast<jlong>(joined_strings->size());
-    out_handles[3] = reinterpret_cast<jlong>(joined_strings.release());
-    out_handles[4] = static_cast<jlong>(delimiter);
-    return out_handles.get_jArray();
+    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
+    auto const col_names = cudf::jni::native_jstringArray(env, j_col_names).as_cpp_vector();
+    auto const num_children = cudf::jni::native_jintArray(env, j_num_children).to_vector();
+    auto const types = cudf::jni::native_jintArray(env, j_types).to_vector();
+    auto const scales = cudf::jni::native_jintArray(env, j_scales).to_vector();
+    auto const precisions = cudf::jni::native_jintArray(env, j_precisions).to_vector();
+
+    CUDF_EXPECTS(col_names.size() > 0, "Invalid schema data: col_names.");
+    CUDF_EXPECTS(col_names.size() == num_children.size(), "Invalid schema data: num_children.");
+    CUDF_EXPECTS(col_names.size() == types.size(), "Invalid schema data: types.");
+    CUDF_EXPECTS(col_names.size() == scales.size(), "Invalid schema data: scales.");
+    CUDF_EXPECTS(col_names.size() == precisions.size(), "Invalid schema data: precisions.");
+
+    return cudf::jni::ptr_as_jlong(
+      spark_rapids_jni::from_json_to_structs(cudf::strings_column_view{*input_cv},
+                                             col_names,
+                                             num_children,
+                                             types,
+                                             scales,
+                                             precisions,
+                                             normalize_single_quotes,
+                                             allow_leading_zeros,
+                                             allow_nonnumeric_numbers,
+                                             allow_unquoted_control,
+                                             is_us_locale)
+        .release());
   }
   CATCH_STD(env, 0);
 }
 
-JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_makeStructs(
-  JNIEnv* env, jclass, jlongArray j_children, jlong j_is_null)
+JNIEXPORT jlong JNICALL
+Java_com_nvidia_spark_rapids_jni_JSONUtils_convertFromStrings(JNIEnv* env,
+                                                              jclass,
+                                                              jlong j_input,
+                                                              jintArray j_num_children,
+                                                              jintArray j_types,
+                                                              jintArray j_scales,
+                                                              jintArray j_precisions,
+                                                              jboolean allow_nonnumeric_numbers,
+                                                              jboolean is_us_locale)
 {
-  JNI_NULL_CHECK(env, j_children, "j_children is null", 0);
-  JNI_NULL_CHECK(env, j_is_null, "j_is_null is null", 0);
+  JNI_NULL_CHECK(env, j_input, "j_input is null", 0);
+  JNI_NULL_CHECK(env, j_num_children, "j_num_children is null", 0);
+  JNI_NULL_CHECK(env, j_types, "j_types is null", 0);
+  JNI_NULL_CHECK(env, j_scales, "j_scales is null", 0);
+  JNI_NULL_CHECK(env, j_precisions, "j_precisions is null", 0);
 
   try {
     cudf::jni::auto_set_device(env);
-    auto const children =
-      cudf::jni::native_jpointerArray<cudf::column_view>{env, j_children}.get_dereferenced();
-    auto const is_null = *reinterpret_cast<cudf::column_view const*>(j_is_null);
-    return cudf::jni::ptr_as_jlong(spark_rapids_jni::make_structs(children, is_null).release());
+
+    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
+    auto const num_children = cudf::jni::native_jintArray(env, j_num_children).to_vector();
+    auto const types = cudf::jni::native_jintArray(env, j_types).to_vector();
+    auto const scales = cudf::jni::native_jintArray(env, j_scales).to_vector();
+    auto const precisions = cudf::jni::native_jintArray(env, j_precisions).to_vector();
+
+    CUDF_EXPECTS(num_children.size() > 0, "Invalid schema data: num_children.");
+    CUDF_EXPECTS(num_children.size() == types.size(), "Invalid schema data: types.");
+    CUDF_EXPECTS(num_children.size() == scales.size(), "Invalid schema data: scales.");
+    CUDF_EXPECTS(num_children.size() == precisions.size(), "Invalid schema data: precisions.");
+
+    return cudf::jni::ptr_as_jlong(
+      spark_rapids_jni::convert_from_strings(cudf::strings_column_view{*input_cv},
+                                             num_children,
+                                             types,
+                                             scales,
+                                             precisions,
+                                             allow_nonnumeric_numbers,
+                                             is_us_locale)
+        .release());
   }
   CATCH_STD(env, 0);
 }
 
+JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_removeQuotes(
+  JNIEnv* env, jclass, jlong j_input, jboolean nullify_if_not_quoted)
+{
+  JNI_NULL_CHECK(env, j_input, "j_input is null", 0);
+
+  try {
+    cudf::jni::auto_set_device(env);
+    auto const input_cv = reinterpret_cast<cudf::column_view const*>(j_input);
+    return cudf::jni::ptr_as_jlong(
+      spark_rapids_jni::remove_quotes(cudf::strings_column_view{*input_cv}, nullify_if_not_quoted)
+        .release());
+  }
+  CATCH_STD(env, 0);
+}
+
+} // extern "C"
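
For readers, the JNI wrappers in this file pass the schema as parallel arrays (`col_names`, `num_children`, `types`, `scales`, `precisions`) and check only that the arrays are non-empty and of equal length. The following is a minimal host-only sketch of that idea, not code from this commit. It assumes the arrays describe a nested schema flattened in depth-first order, which the diff does not state. `FlatSchema`, `validate_sizes`, `skip_subtree`, and `count_top_level` are hypothetical names, and plain C++ exceptions stand in for `CUDF_EXPECTS`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the parallel schema arrays (not part of the commit).
struct FlatSchema {
  std::vector<std::string> col_names;
  std::vector<int> num_children;
  std::vector<int> types;       // placeholder integer type ids
  std::vector<int> scales;
  std::vector<int> precisions;
};

// Mirrors the size checks done in the JNI wrapper: every array must be
// non-empty and have the same length as col_names.
inline void validate_sizes(FlatSchema const& s)
{
  auto const n = s.col_names.size();
  if (n == 0) { throw std::invalid_argument("Invalid schema data: col_names."); }
  if (s.num_children.size() != n) {
    throw std::invalid_argument("Invalid schema data: num_children.");
  }
  if (s.types.size() != n) { throw std::invalid_argument("Invalid schema data: types."); }
  if (s.scales.size() != n) { throw std::invalid_argument("Invalid schema data: scales."); }
  if (s.precisions.size() != n) {
    throw std::invalid_argument("Invalid schema data: precisions.");
  }
}

// ASSUMPTION: depth-first (pre-order) layout. Returns the index one past the
// subtree rooted at `idx`, throwing if the child counts run off the array.
inline std::size_t skip_subtree(FlatSchema const& s, std::size_t idx)
{
  if (idx >= s.num_children.size()) {
    throw std::out_of_range("Invalid schema data: child count exceeds array size.");
  }
  auto const n_children = s.num_children[idx];
  if (n_children < 0) { throw std::invalid_argument("Invalid schema data: negative count."); }
  std::size_t next = idx + 1;
  for (int i = 0; i < n_children; ++i) {
    next = skip_subtree(s, next);
  }
  return next;
}

// Counts top-level columns by walking the flattened arrays subtree by subtree.
inline std::size_t count_top_level(FlatSchema const& s)
{
  validate_sizes(s);
  std::size_t idx   = 0;
  std::size_t count = 0;
  while (idx < s.col_names.size()) {
    idx = skip_subtree(s, idx);
    ++count;
  }
  return count;
}
```

Under that assumption, a schema with a leaf column `a` and a struct column `b` holding two leaves `c` and `d` would be passed as `col_names = {a, b, c, d}` and `num_children = {0, 2, 0, 0}`, giving two top-level columns. Keeping the arrays parallel means only primitive arrays cross the JNI boundary, at the cost of having to re-validate their consistency on the native side.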