-
Notifications
You must be signed in to change notification settings - Fork 918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JNI: throw CUDA errors more specifically #10551
JNI: throw CUDA errors more specifically #10551
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10551 +/- ##
================================================
+ Coverage 86.35% 86.39% +0.04%
================================================
Files 142 142
Lines 22335 22302 -33
================================================
- Hits 19287 19268 -19
+ Misses 3048 3034 -14
Continue to review full report at Codecov.
|
if (jt != NULL) { \ | ||
env->Throw(jt); \ | ||
} \ | ||
return ret_val; \ | ||
} \ | ||
} | ||
|
||
#define JNI_CUDA_CHECK(env, cuda_status) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best practice would be to put this in a do{...} while(0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. Thanks for advice!
// Build the error message in the format of cudf::cuda_error, so that cudf::jni::CUDA_ERROR_CLASS | ||
// can parse both of them. | ||
std::string n_msg = "CUDA error encountered at: " + std::string{file} + ":" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't count on the contents of that exception message being stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather see cudf::cuda_error
updated to allow the extraction of the CUDA error ID rather than relying on parsing. In general we should be moving away from string-scraping for error identification, not adding more instances of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #10553
private static final Set<CudaError> stickyErrors = new HashSet<CudaError>(){{ | ||
add(CudaError.cudaErrorIllegalAddress); | ||
add(CudaError.cudaErrorLaunchTimeout); | ||
add(CudaError.cudaErrorHardwareStackError); | ||
add(CudaError.cudaErrorIllegalInstruction); | ||
add(CudaError.cudaErrorMisalignedAddress); | ||
add(CudaError.cudaErrorInvalidAddressSpace); | ||
add(CudaError.cudaErrorInvalidPc); | ||
add(CudaError.cudaErrorLaunchFailure); | ||
add(CudaError.cudaErrorExternalDevice); | ||
add(CudaError.cudaErrorUnknown); | ||
}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't very robust. I described a more robust way to detect sticky errors here: #10200 (comment)
Soon I hope to have libcudf throw a separate exception type for sticky errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, it would be good to align this with what's going on in that other issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree with @jrhemstad as well. So, shall we pend the JNI-side work for the time being until the libcudf is enhanced in terms of CUDA error handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RAPIDS Accelerator is already addressing this for the short-term at NVIDIA/spark-rapids#5118. Therefore I'd rather we take the time here to to leverage a proper interface in libcudf rather than rush this in and then need to change it when libcudf refines its exception handling soon afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jlowe, I reworked the PR. For now, it pushes down the sticky error detection to libcudf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I am stuck on how to trigger a fatal CUDA error through the unit test.
Signed-off-by: sperlingxx <[email protected]>
…cudf into break_down_catch_std
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is now, this needs to be marked as a breaking change, since we're either changing the names or removing interfaces in `java/src/main/native/include/jni_utils.hpp". That header is used in other projects, like cuspatial. It would be nice if we didn't change too much here and break things unnecessarily.
Co-authored-by: Jason Lowe <[email protected]>
Co-authored-by: Jason Lowe <[email protected]>
@gpucibot merge |
This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870, which enables cuDF JNI to throw CUDA errors with specific error code. This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others.
With this improvement, it is supposed to be easier to track CUDA errors triggered by JVM APIs.