-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add concat_json #2364
add concat_json #2364
Conversation
input_scv.chars_begin(stream), | ||
d_histogram.begin(), | ||
num_levels, | ||
lower_level, | ||
upper_level, | ||
input_scv.chars_size(stream), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input_scv.chars_begin(stream)
and input_scv.chars_size(stream)
are called here and below, each is called twice. So there are 4 stream syncs. We can do better by manually calling the underlying memcpy
to ultimately have just 1 stream sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I was wrong. chars_begin
doesn't have stream sync. So we only need to cache chars_size
out of the function parameters.
auto first_non_zero_pos = | ||
thrust::find(rmm::exec_policy(stream), zero_level + '\n', d_histogram.end(), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this is finding number 0
in the range [zero_level + 10, end())
([d_histogram.begin() - 127 + 10, end())
). Why is it called first_non_zero_pos
? And why do we have to offset with zero_level+ \n
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is finding a character that doesn't have any histogram count so maybe it should be called first_zero_count_pos
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am offsetting to \n
so that if \n
is not present, it's convenient to use \n
.
@revans2 would you please share your review? |
@GregoryKimball I know that @ttnghia is working on a version of this with fixes for a number of issues. I don't really want to muddy the waters too much right now and would rather look at his. But a few things to point out. I was not able to get the histogram to actually work in selecting anything but '\n' I don't know why but all of the counts returned were 0 when I started to debug it. The selection process for the the delimiter does not include a disallow list, https://github.com/rapidsai/cudf/blob/ef270827cc3e4f336258d1e1ad4b7f633656409b/cpp/include/cudf/io/json.hpp#L411-L428 so even if it did work to select a delimiter that is could fail if it ended up selecting one that is not allowed to be there. The code does not include a way to know if the input record was an empty string, or a string with just spaces in it. Spark treats an empty string as special for from_json and will return null at the top level of the returned struct for them instead of returning a struct with null values in it. Empty strings are also filtered out. The concat code will replace nulls with a placeholder, but it will not replace empty strings when it concats them. Thh JSON tokenization/parsing code filters out empty lines (this is expected behavior for reading JSON from a file, but it is not acceptable for from_json. This can result in us getting the wrong number of rows back. |
Close this as it is replaced by #2457. |
Concatenates json from string column to single string data buffer, and return a delimiter that's not present in the string.