-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the ORC decoding bug for the timestamp data #17570
base: branch-25.02
Are you sure you want to change the base?
Fix the ORC decoding bug for the timestamp data #17570
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
43fdca1
to
88cfa09
Compare
/ok to test |
88cfa09
to
6a8280f
Compare
/ok to test |
6a8280f
to
e9f4328
Compare
/ok to test |
2 similar comments
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some old comment, not sure if applicable still
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple small suggestions.
ab4221d
to
0312873
Compare
Should we run a benchmark on this patch to see how much performance impact it causes? |
Are there Spark-RAPIDS benchmarks that we can (also) run to check the impact? |
… selected snappy data with the correct uncompressed version
0312873
to
22b6e70
Compare
Performance impact
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a couple of small comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments. Overall looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be on vacation for the next week and I don't want to block this PR, so I'm just leaving comments without requesting blocking changes. Feel free to ping me if you have thoughts though!
cpp/src/io/orc/stripe_data.cu
Outdated
@@ -640,9 +783,14 @@ static __device__ uint32_t Integer_RLEv2(orc_bytestream_s* bs, | |||
T* vals, | |||
uint32_t maxvals, | |||
int t, | |||
bool has_buffered_values = false) | |||
bool has_buffered_values = false, | |||
run_cache_manager* run_cache_manager_inst = nullptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use a cuda::std::optional
instead?
Aside: if we had C++23 the optional usage in this file would be a great use case for std::optional::and_then
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A naive question from someone with very limited knowledge of our I/O: I only see one usage of this function outside of the gpuDecodeOrcColumnData
kernel, and that's inside gpuDecodeNullsAndStringDictionaries
. Are guaranteed to not need this cache in that case (again, asking naively since I don't really know what that function does).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't studied the function gpuDecodeNullsAndStringDictionaries
yet, and I'm on the conservative side applying the cache to existing code.
I'm not quite sure about how to use cuda::std::optional
and what's its advantage when it comes to function default arguments. Since the function will have side effect on the optional object, should I pass it by reference cuda::std::optional<X>&
? But this way I can't give it a default value cuda::std::nullopt
, and have to pass cuda::std::nullopt
to all function calls that don't need X
. Is my understanding correct?
Description
This PR introduces a band-aid class
run_cache_manager
to handle an exceptional case in TIMESTAMP data type, where the DATA stream (seconds) is processed ahead of SECONDARY stream (nanoseconds) and the excess rows are lost. The fix usesrun_cache_manager
to cache the potentially missed data from the DATA stream and let them be used in the next decoding iteration, thus preventing data loss.Closes #17155
Checklist