
Fix the ORC decoding bug for the timestamp data #17570

Open · wants to merge 18 commits into branch-25.02
Conversation

@kingcrimsontianyu (Contributor) commented Dec 10, 2024

Description

This PR introduces a band-aid class, run_cache_manager, to handle an edge case in the TIMESTAMP data type, where the DATA stream (seconds) is processed ahead of the SECONDARY stream (nanoseconds) and the excess rows are lost. The fix uses run_cache_manager to cache the potentially missed values from the DATA stream and replay them in the next decoding iteration, thus preventing data loss.
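As a host-side illustration of the caching idea described above (the names and structure here are hypothetical, not the actual libcudf implementation): values decoded from the DATA stream that cannot be consumed in the current pass are stashed, then prepended to the next pass's batch.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the run-cache idea: hold decoded values that could
// not be consumed in the current decoding pass and replay them at the start
// of the next one, so the excess rows are not lost.
struct run_cache {
  std::vector<long> cached;

  // Prepend any cached values to this pass's freshly decoded batch.
  std::vector<long> restore(std::vector<long> decoded) {
    std::vector<long> out = std::move(cached);
    cached.clear();
    out.insert(out.end(), decoded.begin(), decoded.end());
    return out;
  }

  // Keep only the first `consumable` values; stash the excess for later.
  std::vector<long> trim_and_cache(std::vector<long> vals, std::size_t consumable) {
    if (vals.size() > consumable) {
      cached.assign(vals.begin() + consumable, vals.end());
      vals.resize(consumable);
    }
    return vals;
  }
};
```

In the actual kernel the cache must of course live in device-accessible memory and be coordinated across threads; this sketch only shows the data-flow invariant (nothing decoded is ever dropped between passes).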

Closes #17155

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@kingcrimsontianyu kingcrimsontianyu added bug Something isn't working non-breaking Non-breaking change labels Dec 10, 2024
@kingcrimsontianyu kingcrimsontianyu self-assigned this Dec 10, 2024

copy-pr-bot bot commented Dec 10, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 10, 2024
@kingcrimsontianyu (Contributor, Author) commented: /ok to test
(the same comment was posted twice more to re-trigger CI)

@vuule vuule self-requested a review December 13, 2024 22:03
@kingcrimsontianyu (Contributor, Author) commented: /ok to test
(2 similar comments followed)

@github-actions github-actions bot added the Python Affects Python cuDF API. label Dec 17, 2024
@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review December 17, 2024 20:07
@kingcrimsontianyu kingcrimsontianyu requested review from a team as code owners December 17, 2024 20:07
@vuule (Contributor) left a comment
some old comment, not sure if applicable still

(3 review threads on cpp/src/io/orc/stripe_data.cu, resolved)
@Matt711 (Contributor) left a comment
Just a couple small suggestions.

(review threads on cpp/src/io/orc/stripe_data.cu and python/cudf/cudf/tests/test_orc.py, resolved)
@ttnghia (Contributor) commented Dec 18, 2024

Should we run a benchmark on this patch to see how much performance impact it causes?

@vuule (Contributor) commented Dec 18, 2024

> Should we run a benchmark on this patch to see how much performance impact it causes?

Are there Spark-RAPIDS benchmarks that we can (also) run to check the impact?

@kingcrimsontianyu (Contributor, Author) commented Dec 19, 2024

Performance impact

  • Running the ORC benchmark

    • Command

      ORC_READER_NVBENCH -d 0 -b orc_read_io_compression \
      -a compression=SNAPPY -a io=HOST_BUFFER -a cardinality=0 -a run_length=1 --min-samples 40

    • Result

      time                                             branch-25.02           this PR
      gpuDecodeOrcColumnData kernel (average/median)   9.039 ms / 5.800 ms    8.950 ms / 5.814 ms
      CPU time                                         73.268 ms              72.442 ms

      For the average kernel time, Welch's t-test gives a p value of 0.83, indicating no statistically significant difference in time.

  • Reading the reported "buggy" data

    • Uses the ORC I/O example program

    • The gpuDecodeOrcColumnData kernel is called only once, so the measurement is subject to large uncertainty

    • Result

      time                                             branch-25.02    this PR
      gpuDecodeOrcColumnData kernel (single call)      210.888 μs      216.183 μs

cc @vuule @ttnghia
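For reference, the p value quoted above comes from Welch's unequal-variances t-test. A minimal sketch of the t statistic it is based on (the sample data below are made up for illustration; obtaining the p value additionally requires comparing |t| against a t-distribution with Welch–Satterthwaite degrees of freedom, which is omitted here):

```cpp
#include <cmath>
#include <vector>

// Sample mean.
double mean(std::vector<double> const& x) {
  double s = 0.0;
  for (double v : x) s += v;
  return s / x.size();
}

// Unbiased sample variance (n - 1 denominator).
double sample_variance(std::vector<double> const& x) {
  double m = mean(x), s = 0.0;
  for (double v : x) s += (v - m) * (v - m);
  return s / (x.size() - 1);
}

// Welch's t statistic: difference of means over the combined standard error,
// without assuming the two samples share a variance.
double welch_t(std::vector<double> const& a, std::vector<double> const& b) {
  double se = std::sqrt(sample_variance(a) / a.size() +
                        sample_variance(b) / b.size());
  return (mean(a) - mean(b)) / se;
}
```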

@vuule vuule self-requested a review December 19, 2024 19:53
@vuule (Contributor) left a comment
Looks good, just a couple of small comments

(5 review threads on cpp/src/io/orc/stripe_data.cu, resolved)
@mhaseeb123 (Member) left a comment
Some comments. Overall looks good.

@vyasr (Contributor) left a comment
I'll be on vacation for the next week and I don't want to block this PR, so I'm just leaving comments without requesting blocking changes. Feel free to ping me if you have thoughts though!

@@ -640,9 +783,14 @@ static __device__ uint32_t Integer_RLEv2(orc_bytestream_s* bs,
                                          T* vals,
                                          uint32_t maxvals,
                                          int t,
-                                         bool has_buffered_values = false)
+                                         bool has_buffered_values = false,
+                                         run_cache_manager* run_cache_manager_inst = nullptr,
Inline comment (Contributor):
Could we use a cuda::std::optional instead?

Aside: if we had C++23 the optional usage in this file would be a great use case for std::optional::and_then.

Inline comment (Contributor):
A naive question from someone with very limited knowledge of our I/O: I only see one usage of this function outside of the gpuDecodeOrcColumnData kernel, and that's inside gpuDecodeNullsAndStringDictionaries. Are we guaranteed not to need this cache in that case (again, asking naively since I don't really know what that function does)?

Reply (Contributor, Author):
I haven't studied the gpuDecodeNullsAndStringDictionaries function yet, so I'm being conservative about where the cache is applied in the existing code.

I'm not quite sure how to use cuda::std::optional here, or what its advantage is for function default arguments. Since the function has side effects on the optional object, should I pass it by reference as cuda::std::optional<X>&? But then I can't give it a default value of cuda::std::nullopt, and would have to pass cuda::std::nullopt at every call site that doesn't need X. Is my understanding correct?
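The trade-off being discussed can be sketched on the host with std::optional (cuda::std::optional behaves analogously in device code; all names below are hypothetical, not the PR's actual code). A mutable lvalue reference to an optional cannot be defaulted to std::nullopt, since nullopt is a prvalue that cannot bind to a non-const reference, whereas a raw pointer parameter can simply default to nullptr:

```cpp
#include <optional>

// Hypothetical stand-in for the cache object the callee may mutate.
struct cache {
  int pending = 0;
};

// Pointer parameter: call sites that don't need the cache omit the argument
// entirely, and the callee writes through the pointer only when it is set.
int decode_ptr(int v, cache* c = nullptr) {
  if (c != nullptr) { c->pending += v; }
  return v;
}

// Optional by mutable reference: no default is possible, so every call site
// must construct and pass an optional, even when it doesn't need the cache.
int decode_opt(int v, std::optional<cache>& c) {
  if (c.has_value()) { c->pending += v; }
  return v;
}
```

This is why a defaulted raw pointer is the lighter-weight choice when the argument is genuinely optional and mutated: the optional-reference variant forces boilerplate on every caller without adding safety.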

(5 review threads on cpp/src/io/orc/stripe_data.cu, resolved)
Labels
  • bug: Something isn't working
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[BUG] Misaligned timestamps produced by ORC reader
6 participants