
Fix the ORC decoding bug for the timestamp data #17570

Open · wants to merge 18 commits into branch-25.02
Conversation

@kingcrimsontianyu (Contributor) commented Dec 10, 2024

Description

This PR introduces a band-aid class, run_cache_manager, to handle an edge case in the TIMESTAMP data type, where the DATA stream (seconds) is processed ahead of the SECONDARY stream (nanoseconds) and the excess rows are lost. The fix uses run_cache_manager to cache the potentially missed values from the DATA stream and replay them in the next decoding iteration, thus preventing data loss.
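As a host-side illustration of the caching idea described above (the names and structure here are hypothetical, not the actual libcudf implementation): values decoded from the DATA stream that cannot be consumed in the current pass are stashed, then prepended to the next pass's batch.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the run-cache idea: hold decoded values that could
// not be consumed in the current decoding pass and replay them at the start
// of the next one, so the excess rows are not lost.
struct run_cache {
  std::vector<long> cached;

  // Prepend any cached values to this pass's freshly decoded batch.
  std::vector<long> restore(std::vector<long> decoded) {
    std::vector<long> out = std::move(cached);
    cached.clear();
    out.insert(out.end(), decoded.begin(), decoded.end());
    return out;
  }

  // Keep only the first `consumable` values; stash the excess for later.
  std::vector<long> trim_and_cache(std::vector<long> vals, std::size_t consumable) {
    if (vals.size() > consumable) {
      cached.assign(vals.begin() + consumable, vals.end());
      vals.resize(consumable);
    }
    return vals;
  }
};
```

In the actual kernel the cache must of course live in device-accessible memory and be coordinated across threads; this sketch only shows the data-flow invariant (nothing decoded is ever dropped between passes).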

Closes #17155

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@kingcrimsontianyu kingcrimsontianyu added bug Something isn't working non-breaking Non-breaking change labels Dec 10, 2024
@kingcrimsontianyu kingcrimsontianyu self-assigned this Dec 10, 2024

copy-pr-bot bot commented Dec 10, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 10, 2024
@kingcrimsontianyu (Contributor, Author) commented: /ok to test
(the same comment was posted twice more to re-trigger CI)

@vuule vuule self-requested a review December 13, 2024 22:03
@kingcrimsontianyu (Contributor, Author) commented: /ok to test
(2 similar comments followed)

@github-actions github-actions bot added the Python Affects Python cuDF API. label Dec 17, 2024
@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review December 17, 2024 20:07
@kingcrimsontianyu kingcrimsontianyu requested review from a team as code owners December 17, 2024 20:07
@vuule (Contributor) left a comment
some old comment, not sure if applicable still

(3 review threads on cpp/src/io/orc/stripe_data.cu, resolved)
@Matt711 (Contributor) left a comment
Just a couple small suggestions.

(review threads on cpp/src/io/orc/stripe_data.cu and python/cudf/cudf/tests/test_orc.py, resolved)
@ttnghia (Contributor) commented Dec 18, 2024

Should we run a benchmark on this patch to see how much performance impact it causes?

@vuule (Contributor) commented Dec 18, 2024

> Should we run a benchmark on this patch to see how much performance impact it causes?

Are there Spark-RAPIDS benchmarks that we can (also) run to check the impact?

@kingcrimsontianyu (Contributor, Author) commented Dec 19, 2024

Performance impact

  • Running the ORC benchmark

    • Command

      ORC_READER_NVBENCH -d 0 -b orc_read_io_compression \
      -a compression=SNAPPY -a io=HOST_BUFFER -a cardinality=0 -a run_length=1 --min-samples 40

    • Result

      time                                             branch-25.02           this PR
      gpuDecodeOrcColumnData kernel (average/median)   9.039 ms / 5.800 ms    8.950 ms / 5.814 ms
      CPU time                                         73.268 ms              72.442 ms

      For the average kernel time, Welch's t-test gives a p value of 0.83, indicating no statistically significant difference in time.

  • Reading the reported "buggy" data

    • Uses the ORC I/O example program

    • The gpuDecodeOrcColumnData kernel is called only once, so the measurement is subject to large uncertainty

    • Result

      time                                             branch-25.02    this PR
      gpuDecodeOrcColumnData kernel (single call)      210.888 μs      216.183 μs

cc @vuule @ttnghia
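For reference, the p value quoted above comes from Welch's unequal-variances t-test. A minimal sketch of the t statistic it is based on (the sample data below are made up for illustration; obtaining the p value additionally requires comparing |t| against a t-distribution with Welch–Satterthwaite degrees of freedom, which is omitted here):

```cpp
#include <cmath>
#include <vector>

// Sample mean.
double mean(std::vector<double> const& x) {
  double s = 0.0;
  for (double v : x) s += v;
  return s / x.size();
}

// Unbiased sample variance (n - 1 denominator).
double sample_variance(std::vector<double> const& x) {
  double m = mean(x), s = 0.0;
  for (double v : x) s += (v - m) * (v - m);
  return s / (x.size() - 1);
}

// Welch's t statistic: difference of means over the combined standard error,
// without assuming the two samples share a variance.
double welch_t(std::vector<double> const& a, std::vector<double> const& b) {
  double se = std::sqrt(sample_variance(a) / a.size() +
                        sample_variance(b) / b.size());
  return (mean(a) - mean(b)) / se;
}
```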

@vuule vuule self-requested a review December 19, 2024 19:53
@vuule (Contributor) left a comment
Looks good, just a couple of small comments

(5 review threads on cpp/src/io/orc/stripe_data.cu, resolved)
@mhaseeb123 (Member) left a comment
Some comments. Overall looks good.

@vyasr (Contributor) left a comment
I'll be on vacation for the next week and I don't want to block this PR, so I'm just leaving comments without requesting blocking changes. Feel free to ping me if you have thoughts though!

@@ -640,9 +783,14 @@ static __device__ uint32_t Integer_RLEv2(orc_bytestream_s* bs,
                                          T* vals,
                                          uint32_t maxvals,
                                          int t,
-                                         bool has_buffered_values = false)
+                                         bool has_buffered_values = false,
+                                         run_cache_manager* run_cache_manager_inst = nullptr,
Inline comment (Contributor):
Could we use a cuda::std::optional instead?

Aside: if we had C++23 the optional usage in this file would be a great use case for std::optional::and_then.

Inline comment (Contributor):
A naive question from someone with very limited knowledge of our I/O: I only see one usage of this function outside of the gpuDecodeOrcColumnData kernel, and that's inside gpuDecodeNullsAndStringDictionaries. Are we guaranteed not to need this cache in that case (again, asking naively since I don't really know what that function does)?

Reply (Contributor, Author):
I haven't studied the gpuDecodeNullsAndStringDictionaries function yet, so I'm being conservative about where the cache is applied in the existing code.

I'm not quite sure how to use cuda::std::optional here, or what its advantage is for function default arguments. Since the function has side effects on the optional object, should I pass it by reference as cuda::std::optional<X>&? But then I can't give it a default value of cuda::std::nullopt, and would have to pass cuda::std::nullopt at every call site that doesn't need X. Is my understanding correct?
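The trade-off being discussed can be sketched on the host with std::optional (cuda::std::optional behaves analogously in device code; all names below are hypothetical, not the PR's actual code). A mutable lvalue reference to an optional cannot be defaulted to std::nullopt, since nullopt is a prvalue that cannot bind to a non-const reference, whereas a raw pointer parameter can simply default to nullptr:

```cpp
#include <optional>

// Hypothetical stand-in for the cache object the callee may mutate.
struct cache {
  int pending = 0;
};

// Pointer parameter: call sites that don't need the cache omit the argument
// entirely, and the callee writes through the pointer only when it is set.
int decode_ptr(int v, cache* c = nullptr) {
  if (c != nullptr) { c->pending += v; }
  return v;
}

// Optional by mutable reference: no default is possible, so every call site
// must construct and pass an optional, even when it doesn't need the cache.
int decode_opt(int v, std::optional<cache>& c) {
  if (c.has_value()) { c->pending += v; }
  return v;
}
```

This is why a defaulted raw pointer is the lighter-weight choice when the argument is genuinely optional and mutated: the optional-reference variant forces boilerplate on every caller without adding safety.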

(5 review threads on cpp/src/io/orc/stripe_data.cu, resolved)
Labels
  • bug: Something isn't working
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[BUG] Misaligned timestamps produced by ORC reader
6 participants