add concat_json #2364

karthikeyann · 2024-08-30T04:37:59Z

Concatenates the JSON strings of a string column into a single string data buffer, and returns a delimiter that is not present in any of the strings.

unit tests

ttnghia · 2024-09-03T17:39:26Z

src/main/cpp/src/from_json_to_raw_map.cu

+                                      input_scv.chars_begin(stream),
+                                      d_histogram.begin(),
+                                      num_levels,
+                                      lower_level,
+                                      upper_level,
+                                      input_scv.chars_size(stream),


input_scv.chars_begin(stream) and input_scv.chars_size(stream) are each called twice, here and below, so there are 4 stream syncs. We can do better by calling the underlying memcpy manually, so that there is ultimately just 1 stream sync.

Oh, I was wrong: chars_begin doesn't sync the stream. So we only need to call chars_size once, outside the function arguments, and cache the result.
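To illustrate the suggestion, here is a minimal host-only sketch. MockStringsView is a hypothetical stand-in, not the cudf strings_column_view API: its chars_size() merely counts how often a "sync" would occur, and process_cached is an invented name. It only shows the pattern of reading the size once and reusing it across both calls.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a device strings view; NOT the cudf API.
// chars_size() models an accessor that must synchronize the stream to read
// a value back from the device, and counts how often that happens.
struct MockStringsView {
  explicit MockStringsView(std::string s) : chars(std::move(s)) {}

  const char* chars_begin() const { return chars.data(); }  // no sync needed

  std::size_t chars_size() const
  {
    ++sync_count;  // models a device-to-host copy followed by a stream sync
    return chars.size();
  }

  std::string chars;
  mutable int sync_count = 0;
};

// Models two passes over the data (e.g. two kernel launches) that both need
// the begin pointer and the size. The size is read once and reused, so only
// one sync occurs no matter how many passes follow.
inline std::size_t process_cached(const MockStringsView& view)
{
  auto const num_chars = view.chars_size();  // the single sync
  auto const* begin    = view.chars_begin();
  std::size_t touched  = 0;
  for (int pass = 0; pass < 2; ++pass) {
    for (std::size_t i = 0; i < num_chars; ++i) { touched += (begin[i] != '\0'); }
  }
  return touched;
}
```

With this shape, adding further passes does not add further syncs, since none of them query the size again.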


ttnghia · 2024-09-03T17:54:13Z

src/main/cpp/src/from_json_to_raw_map.cu

+  auto first_non_zero_pos =
+    thrust::find(rmm::exec_policy(stream), zero_level + '\n', d_histogram.end(), 0);


I see that this searches for the value 0 in the range [zero_level + 10, end()) (that is, [d_histogram.begin() - 127 + 10, end())). Why is it called first_non_zero_pos? And why do we have to offset by zero_level + '\n'?

This finds a character that has a zero histogram count, so maybe it should be called first_zero_count_pos?

I am starting the search at '\n' so that if '\n' is not present in the input, it is selected first, since it is the most convenient delimiter to use.
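The search discussed in this thread can be sketched on the host with the standard library. This is an illustration under assumptions, not the PR's code: the function names are invented, the histogram is built with a plain loop instead of a device histogram, and the wrap-around to bytes below '\n' is an addition that the quoted snippet does not have.

```cpp
#include <algorithm>
#include <array>
#include <optional>
#include <string>
#include <vector>

// Counts occurrences of each byte value across all rows.
inline std::array<int, 256> byte_histogram(const std::vector<std::string>& rows)
{
  std::array<int, 256> hist{};
  for (auto const& row : rows) {
    for (unsigned char c : row) { ++hist[c]; }
  }
  return hist;
}

// Returns the first byte with a zero count, searching from '\n' upward so
// that '\n' wins whenever it is unused. If nothing at or above '\n' is free,
// wraps around to the bytes below '\n' (an assumption of this sketch).
// Returns an empty optional if every byte value occurs in the input.
inline std::optional<char> find_unused_delimiter(const std::vector<std::string>& rows)
{
  auto const hist  = byte_histogram(rows);
  auto const start = hist.begin() + '\n';
  auto pos         = std::find(start, hist.end(), 0);
  if (pos == hist.end()) {
    pos = std::find(hist.begin(), start, 0);
    if (pos == start) { return std::nullopt; }
  }
  return static_cast<char>(pos - hist.begin());
}
```

Because the search finds a zero count, a name along the lines of first_zero_count_pos describes the result more accurately than first_non_zero_pos.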

GregoryKimball · 2024-09-25T17:21:33Z

@revans2 would you please share your review?

revans2 · 2024-09-25T18:34:29Z

@GregoryKimball I know that @ttnghia is working on a version of this with fixes for a number of issues. I don't really want to muddy the waters too much right now and would rather look at his.

But a few things to point out.

I was not able to get the histogram to actually select anything but '\n'. I don't know why, but all of the counts returned were 0 when I started to debug it.

The selection process for the delimiter does not include a disallow list (https://github.com/rapidsai/cudf/blob/ef270827cc3e4f336258d1e1ad4b7f633656409b/cpp/include/cudf/io/json.hpp#L411-L428), so even if it did work to select a delimiter, it could fail if it ended up selecting one that is not allowed.
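A disallow list could be folded into the selection roughly as follows. This is a sketch under assumptions: the set of disallowed characters below is illustrative (JSON structural characters, quotes, backslash, and whitespace other than '\n'), and the authoritative list is the one at the linked lines of json.hpp. The function names are invented for the example.

```cpp
#include <array>
#include <optional>
#include <string_view>

// Illustrative disallow list (an assumption of this sketch; consult the
// linked json.hpp lines for the real one).
inline bool is_disallowed_delimiter(char c)
{
  constexpr std::string_view disallowed = "{}[],:\"'\\ \t\r";
  return disallowed.find(c) != std::string_view::npos;
}

// Given per-byte counts, picks the first byte at or after '\n' that is both
// unused in the input and permitted as a delimiter. Returns an empty
// optional when no such byte exists.
inline std::optional<char> pick_allowed_delimiter(const std::array<int, 256>& hist)
{
  for (int b = '\n'; b < 256; ++b) {
    if (hist[b] == 0 && !is_disallowed_delimiter(static_cast<char>(b))) {
      return static_cast<char>(b);
    }
  }
  return std::nullopt;
}
```

In this sketch, if bytes 10 through 12 are all in use, byte 13 ('\r') is skipped even when its count is zero, and the search moves on to byte 14.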

The code does not include a way to know if the input record was an empty string, or a string with just spaces in it. Spark treats an empty string as special for from_json and will return null at the top level of the returned struct for them instead of returning a struct with null values in it.

Empty strings are also filtered out. The concat code will replace nulls with a placeholder, but it will not replace empty strings when it concats them. The JSON tokenization/parsing code filters out empty lines (this is expected behavior for reading JSON from a file, but it is not acceptable for from_json). This can result in us getting the wrong number of rows back.
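The row-count problem can be reproduced with a small host-only model. Everything here is hypothetical and only models the behavior described above: join_rows stands in for the concatenation step, count_parsed_records for a line-based parser that skips empty lines, and blank_row_mask for one possible way to remember which input rows were empty or contained only spaces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins rows with a delimiter, as a stand-in for the concatenation step.
inline std::string join_rows(const std::vector<std::string>& rows, char delim)
{
  std::string out;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (i > 0) { out.push_back(delim); }
    out += rows[i];
  }
  return out;
}

// Counts non-empty lines, modeling a line-based parser that skips empty lines.
inline std::size_t count_parsed_records(const std::string& buffer, char delim)
{
  std::size_t count = 0;
  std::size_t start = 0;
  while (start <= buffer.size()) {
    auto end = buffer.find(delim, start);
    if (end == std::string::npos) { end = buffer.size(); }
    if (end > start) { ++count; }
    start = end + 1;
  }
  return count;
}

// Marks rows that are empty or contain only spaces, so a caller could map
// them to a top-level null and substitute a placeholder before joining.
inline std::vector<bool> blank_row_mask(const std::vector<std::string>& rows)
{
  std::vector<bool> mask;
  mask.reserve(rows.size());
  for (auto const& row : rows) {
    mask.push_back(row.find_first_not_of(' ') == std::string::npos);
  }
  return mask;
}
```

In this model, any input that contains an empty row yields fewer parsed records than input rows, which is the mismatch described above. Recording a mask before concatenation is one way to keep the output aligned with the input.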

ttnghia · 2024-10-01T05:52:06Z

Closing this as it is replaced by #2457.

add concat_jsons

cd235e2

ttnghia reviewed Sep 3, 2024


address review comments

a73e8dc

karthikeyann requested a review from ttnghia September 6, 2024 17:47

karthikeyann mentioned this pull request Sep 6, 2024

[FEA] Find a way to support String column input/fixup for JSON parsing rapidsai/cudf#15277

Closed

karthikeyann marked this pull request as ready for review September 24, 2024 18:46

ttnghia closed this Oct 1, 2024
