Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JNI: throw CUDA errors more specifically #10551

Merged
merged 29 commits into from
Apr 24, 2022

Conversation

sperlingxx
Copy link
Contributor

@sperlingxx sperlingxx commented Mar 31, 2022

This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870, which enables cuDF JNI to throw CUDA errors with specific error code. This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others.

With this improvement, it is supposed to be easier to track CUDA errors triggered by JVM APIs.

@sperlingxx sperlingxx added Java Affects Java cuDF API. 4 - Needs cuDF (Java) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 31, 2022
@sperlingxx sperlingxx requested a review from jlowe March 31, 2022 11:46
@sperlingxx sperlingxx requested a review from a team as a code owner March 31, 2022 11:46
@codecov
Copy link

codecov bot commented Mar 31, 2022

Codecov Report

Merging #10551 (50bfc2c) into branch-22.06 (65b1cbd) will increase coverage by 0.04%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10551      +/-   ##
================================================
+ Coverage         86.35%   86.39%   +0.04%     
================================================
  Files               142      142              
  Lines             22335    22302      -33     
================================================
- Hits              19287    19268      -19     
+ Misses             3048     3034      -14     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 89.43% <0.00%> (-0.02%) ⬇️
python/cudf/cudf/core/frame.py 93.41% <0.00%> (ø)
python/cudf/cudf/core/series.py 95.15% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 93.75% <0.00%> (+<0.01%) ⬆️
python/cudf/cudf/utils/utils.py 90.35% <0.00%> (+0.06%) ⬆️
python/cudf/cudf/core/column/string.py 89.21% <0.00%> (+0.10%) ⬆️
python/cudf/cudf/testing/_utils.py 93.98% <0.00%> (+0.13%) ⬆️
python/cudf/cudf/core/multiindex.py 92.28% <0.00%> (+0.13%) ⬆️
python/cudf/cudf/core/reshape.py 90.00% <0.00%> (+0.17%) ⬆️
python/cudf/cudf/core/column/categorical.py 89.97% <0.00%> (+0.20%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65b1cbd...50bfc2c. Read the comment docs.

if (jt != NULL) { \
env->Throw(jt); \
} \
return ret_val; \
} \
}

#define JNI_CUDA_CHECK(env, cuda_status) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practice would be to put this in a do{...} while(0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Thanks for advice!

Comment on lines 757 to 759
// Build the error message in the format of cudf::cuda_error, so that cudf::jni::CUDA_ERROR_CLASS
// can parse both of them.
std::string n_msg = "CUDA error encountered at: " + std::string{file} + ":" +
Copy link
Contributor

@jrhemstad jrhemstad Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't count on the contents of that exception message being stable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather see cudf::cuda_error updated to allow the extraction of the CUDA error ID rather than relying on parsing. In general we should be moving away from string-scraping for error identification, not adding more instances of it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10553

Comment on lines 175 to 186
private static final Set<CudaError> stickyErrors = new HashSet<CudaError>(){{
add(CudaError.cudaErrorIllegalAddress);
add(CudaError.cudaErrorLaunchTimeout);
add(CudaError.cudaErrorHardwareStackError);
add(CudaError.cudaErrorIllegalInstruction);
add(CudaError.cudaErrorMisalignedAddress);
add(CudaError.cudaErrorInvalidAddressSpace);
add(CudaError.cudaErrorInvalidPc);
add(CudaError.cudaErrorLaunchFailure);
add(CudaError.cudaErrorExternalDevice);
add(CudaError.cudaErrorUnknown);
}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't very robust. I described a more robust way to detect sticky errors here: #10200 (comment)

Soon I hope to have libcudf throw a separate exception type for sticky errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, it would be good to align this with what's going on in that other issue.

Copy link
Contributor Author

@sperlingxx sperlingxx Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with @jrhemstad as well. So, shall we pend the JNI-side work for the time being until the libcudf is enhanced in terms of CUDA error handling?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RAPIDS Accelerator is already addressing this for the short-term at NVIDIA/spark-rapids#5118. Therefore I'd rather we take the time here to to leverage a proper interface in libcudf rather than rush this in and then need to change it when libcudf refines its exception handling soon afterwards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jlowe, I reworked the PR. For now, it pushes down the sticky error detection to libcudf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I am stuck on how to trigger a fatal CUDA error through the unit test.

@sperlingxx sperlingxx changed the title JNI: throw CUDA errors more specificially JNI: throw CUDA errors more specifically Apr 13, 2022
java/src/main/java/ai/rapids/cudf/CudaException.java Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/src/CudaJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/CudaJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is now, this needs to be marked as a breaking change, since we're either changing the names or removing interfaces in `java/src/main/native/include/jni_utils.hpp". That header is used in other projects, like cuspatial. It would be nice if we didn't change too much here and break things unnecessarily.

java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/src/RmmJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/src/RmmJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
java/src/main/native/include/jni_utils.hpp Outdated Show resolved Hide resolved
@sperlingxx
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ae7e979 into rapidsai:branch-22.06 Apr 24, 2022
@sperlingxx sperlingxx deleted the break_down_catch_std branch April 24, 2022 01:53
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Java) Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants