[BUG] Java OOM when testing non-UTC time zone with lots of cases fallback. #9829
Comments
There is more context here. One point that may be related is the way we run those test cases. When we set a non-UTC time zone, most cases fall back to the CPU, which may lead to higher Java heap memory demand. If that's the case, it's more likely a test-setting problem than a RAPIDS plugin bug.
When TEST_PARALLEL > 1, PYSP_TEST_spark_rapids_memory_gpu_allocSize is set to 1536m, and the following GPU error occurs:
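The environment described above can be sketched as follows. This is a hedged reproduction sketch, not the project's actual CI script: the `run_pyspark_from_build.sh` invocation is commented out and the TEST_PARALLEL value of 4 is an assumption (any value greater than 1 triggers the reduced per-worker allocation).

```python
import os

# Sketch of the pre-merge test environment described above.
# TEST_PARALLEL controls how many pytest workers run in parallel; when it is
# greater than 1, the GPU allocation size per worker is reduced to 1536m.
env = dict(os.environ)
env["TEST_PARALLEL"] = "4"  # assumption: any value > 1
env["PYSP_TEST_spark_rapids_memory_gpu_allocSize"] = "1536m"

# subprocess.run(["./integration_tests/run_pyspark_from_build.sh"], env=env)
```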
Regarding the GPU OOM: I checked the passed CIs for the UTC time zone, and they also report OOM:
Refer to CI 8525 and 8527: select Blue Ocean, then Premerge CI 2. @revans2 Is this an issue?
From the log, Parquet is trying to split its batch into a "< 10MB" batch, which is reasonable. Is there any error message about a memory leak, like the one below?
Per offline discussion with @res-life and @GaryShen2008, some further information:
A few follow-ups, listed below, are needed to further narrow down the issue:
cc @pxLi
Tested after this commit; the OOM disappears.
There are a number of different issues listed here, and I think we need to split them up into separate issues so we can address them properly.

The original issue is a Java heap OOM when running the tests. The simplest way to debug this is to get a heap dump on OOM.

As for the GPU out-of-memory error (#9829 (comment)): that one is more concerning, and we should try to understand what is happening here. The "split" on a Parquet read is a hack that does not do much, but it indicates that we really ran out of GPU memory and need to do some debugging to see what happened. It could be a memory leak, or it could be something else that is problematic. That should be filed as a separate issue, and I would suggest that we first look for leaks (#9829 (comment)), but with
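The heap-dump suggestion above can be sketched as a Spark configuration fragment. `-XX:+HeapDumpOnOutOfMemoryError` and `-XX:HeapDumpPath` are standard HotSpot JVM flags, and `spark.driver.extraJavaOptions` / `spark.executor.extraJavaOptions` are standard Spark confs; the dump path here is an arbitrary assumption. The resulting `.hprof` file can be inspected with a heap-analysis tool such as Eclipse MAT.

```python
# Sketch (not the project's actual CI config) of enabling a heap dump on OOM.
heap_dump_opts = (
    "-XX:+HeapDumpOnOutOfMemoryError "
    "-XX:HeapDumpPath=/tmp/heap-dumps"  # assumption: any writable path
)

# Confs to pass to spark-submit / SparkSession so both the driver JVM
# (where the Python tests' callbacks run) and executors dump on OOM.
spark_confs = {
    "spark.driver.extraJavaOptions": heap_dump_opts,
    "spark.executor.extraJavaOptions": heap_dump_opts,
}
```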
It sounds like ExecutionPlanCaptureCallback has a memory leak in it. |
Okay, we should keep this open, but I am not as concerned about this as I was initially. We can drop the priority on it.
@wjxiz1992 I'm not sure how ExecutionPlanCaptureCallback is tied into the problem if we are just running test_explode_array_data. That test does not perform plan capturing, nor does it call the methods in asserts.py that leverage ExecutionPlanCaptureCallback (it's just calling assert_gpu_and_cpu_are_equal_collect, not one of the assert forms with plan capture). Therefore ExecutionPlanCaptureCallback's methods should not be getting called, with the exception of

Indeed, I ran test_explode_array_data with a debugger attached to the JVM and breakpoints set on every ExecutionPlanCaptureCallback method, but only
I am still investigating, with no further conclusion yet. But it's leaked in
Yes, I totally agree that the OOM is due to leaked plans in ExecutionPlanCaptureCallback. What's likely is that there's a test that needs a plan capture; it calls

That all makes sense, but what I don't get is how to reproduce it solely with test_explode_array_data. That test doesn't trigger capturing in any way that I can see (and debugger breakpoints prove it, at least in my environment). In order for ExecutionPlanCaptureCallback to start leaking plans, it needs to start capturing first, and I don't see how test_explode_array_data will do that on its own.
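The hypothesized leak pattern discussed above can be illustrated with a minimal sketch. This is written in Python for readability, not the plugin's actual Scala code, and the class/method names are hypothetical: once capturing starts, every executed plan is appended, so if a test begins a capture but never pops the captured plans, they accumulate on the heap.

```python
# Illustration of the hypothesized leak, not the plugin's real callback.
class PlanCaptureCallback:
    def __init__(self):
        self._capturing = False
        self._captured = []  # grows without bound if never popped

    def start_capture(self):
        # A test requesting plan capture flips this flag on.
        self._capturing = True

    def on_success(self, plan):
        # Called for every completed query; leaks if pop_captured is skipped.
        if self._capturing:
            self._captured.append(plan)

    def pop_captured(self):
        # The well-behaved path: drain the buffer and stop capturing.
        self._capturing = False
        plans, self._captured = self._captured, []
        return plans
```

If `start_capture` is ever called without a matching `pop_captured`, every subsequent test's plans keep accumulating, which would explain a slow Java heap OOM across a long test run.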
Yes, your guess is correct. The
Sorry, I missed some important things. I also did a test where I only ran this. I don't think it's related to
Describe the bug
Premerge reports OOM when running CI.
Refer to: #9773
Errors are:
Steps/Code to reproduce bug
Based on this PR #9773.
Additional context
Locally I set TEST_PARALLEL to 1 on Spark 341; it also fails.
Changes in #9652
Changes in #9773: