add concat_json #2364

Closed

Conversation

karthikeyann
Contributor

Concatenates JSON from a strings column into a single string data buffer, and returns a delimiter that is not present in the strings.

  • unit tests
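A minimal host-side sketch of the described behavior, with hypothetical names (the PR itself does this on-device): count every byte of the input rows, pick a byte whose count is zero as the delimiter, then join the rows with it.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Host-side sketch only, not the PR's device code: histogram every byte of
// the input rows, pick a byte value whose count is zero as the delimiter,
// then join the rows with that delimiter.
std::optional<std::pair<std::string, char>> concat_json_sketch(
  std::vector<std::string> const& rows)
{
  std::array<std::size_t, 256> histogram{};  // one bin per possible byte value
  for (auto const& row : rows) {
    for (unsigned char c : row) { ++histogram[c]; }
  }

  // Start the search at '\n' so that '\n' is chosen whenever it is absent.
  int delimiter = -1;
  for (int c = '\n'; c < 256; ++c) {
    if (histogram[c] == 0) { delimiter = c; break; }
  }
  if (delimiter < 0) { return std::nullopt; }  // every byte value occurs

  std::string joined;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (i > 0) { joined += static_cast<char>(delimiter); }
    joined += rows[i];
  }
  return std::make_pair(joined, static_cast<char>(delimiter));
}
```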

Comment on lines +741 to +746
input_scv.chars_begin(stream),
d_histogram.begin(),
num_levels,
lower_level,
upper_level,
input_scv.chars_size(stream),
Collaborator

input_scv.chars_begin(stream) and input_scv.chars_size(stream) are each called twice, here and below, so there are 4 stream syncs. We can do better by manually calling the underlying memcpy, to ultimately have just 1 stream sync.

Collaborator

Oh, I was wrong: chars_begin doesn't have a stream sync. So we only need to cache chars_size outside of the function arguments.
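A minimal sketch of the suggested fix, assuming the cudf strings_column_view accessors behave as described in this thread (chars_size(stream) performs a device-to-host copy with a sync, while chars_begin(stream) just returns a device pointer): hoist both calls out of the argument lists and reuse the cached values.

```cpp
#include <cudf/strings/strings_column_view.hpp>
#include <rmm/cuda_stream_view.hpp>

// Sketch only: compute the accessor results once and reuse the cached
// values, so the single stream sync from chars_size() happens only once.
void build_histogram(cudf::strings_column_view const& input_scv,
                     rmm::cuda_stream_view stream)
{
  auto const d_chars    = input_scv.chars_begin(stream);  // no stream sync
  auto const chars_size = input_scv.chars_size(stream);   // one stream sync

  // ... pass d_chars and chars_size to the histogram call here and to the
  // later uses below, instead of calling the accessors again ...
  (void)d_chars;
  (void)chars_size;
}
```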

Comment on lines 759 to 760
auto first_non_zero_pos =
thrust::find(rmm::exec_policy(stream), zero_level + '\n', d_histogram.end(), 0);
@ttnghia
Collaborator

ttnghia commented Sep 3, 2024

I see that this is finding the value 0 in the range [zero_level + 10, end()) (i.e., [d_histogram.begin() - 127 + 10, end())). Why is it called first_non_zero_pos? And why do we have to offset with zero_level + '\n'?

Collaborator

This is finding a character that doesn't have any histogram count, so maybe it should be called first_zero_count_pos?

Contributor Author

I am offsetting to '\n' so that if '\n' is not present in the input, the search finds its zero-count bin first, making it convenient to use '\n' as the delimiter.
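A host-side analogue of the reviewed lines, assuming zero_level points at the histogram bin for character value 0: starting the search at zero_level + '\n' means the '\n' bin is examined first, so '\n' wins whenever its count is zero. The variable name follows the rename suggested above.

```cpp
#include <thrust/execution_policy.h>
#include <thrust/find.h>
#include <vector>

// Returns the chosen delimiter's character value, or -1 if every bin from
// '\n' upward has a nonzero count. zero_level is assumed to be the bin for
// character value 0.
int pick_delimiter(std::vector<int> const& d_histogram)
{
  auto const zero_level = d_histogram.begin();
  auto const first_zero_count_pos =
    thrust::find(thrust::host, zero_level + '\n', d_histogram.end(), 0);
  return first_zero_count_pos == d_histogram.end()
           ? -1
           : static_cast<int>(first_zero_count_pos - zero_level);
}
```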

@GregoryKimball
Collaborator

@revans2 would you please share your review?

@revans2
Collaborator

revans2 commented Sep 25, 2024

@GregoryKimball I know that @ttnghia is working on a version of this with fixes for a number of issues. I don't really want to muddy the waters too much right now and would rather look at his version.

But there are a few things to point out.

I was not able to get the histogram to actually work in selecting anything but '\n'. I don't know why, but all of the counts returned were 0 when I started to debug it.

The selection process for the delimiter does not include a disallow list (https://github.com/rapidsai/cudf/blob/ef270827cc3e4f336258d1e1ad4b7f633656409b/cpp/include/cudf/io/json.hpp#L411-L428), so even if it did work to select a delimiter, it could fail if it ended up selecting one that is not allowed to be there.

The code does not include a way to know whether the input record was an empty string or a string with just spaces in it. Spark treats an empty string as special for from_json and will return null at the top level of the returned struct for such rows, instead of returning a struct with null values in it.

Empty strings are also filtered out. The concat code will replace nulls with a placeholder, but it will not replace empty strings when it concatenates them. The JSON tokenization/parsing code then filters out empty lines (this is expected behavior for reading JSON from a file, but it is not acceptable for from_json). This can result in us getting the wrong number of rows back.
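To make that last point concrete, here is a small self-contained illustration (plain C++, made-up data) of how an empty input row changes the row count once a line-oriented reader filters out empty lines:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
  std::vector<std::string> const rows{R"({"a":1})", "", R"({"a":2})"};

  // Concatenate with '\n'; the empty row yields two consecutive delimiters.
  std::string concatenated;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (i > 0) { concatenated += '\n'; }
    concatenated += rows[i];
  }

  // Emulate a line-oriented JSON reader that filters out empty lines.
  std::istringstream in{concatenated};
  std::string line;
  int parsed = 0;
  while (std::getline(in, line)) {
    if (!line.empty()) { ++parsed; }
  }

  // Prints "rows in: 3, rows out: 2" -- the empty row is silently dropped.
  std::cout << "rows in: " << rows.size() << ", rows out: " << parsed << '\n';
}
```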

@ttnghia
Collaborator

ttnghia commented Oct 1, 2024

Closing this as it is replaced by #2457.

ttnghia closed this Oct 1, 2024