pydrake: Memory Leak While Generating Training Images (leaking Diagrams w/ MultibodyPlant, SceneGraph, etc) #14387
Comments
I am working with the OP to produce a better repro. I will reassign when we have it. |
I was able to create the simplest case of the issue that I was running into in Colab. By changing the drake version from 'latest' to pinned dates, I found that '20201119' is the last nightly build that does not run out of RAM ('20201120' does). Additionally, there is a deprecation warning that I receive in the same versions as the error:
|
This is likely related to PR #14375 or #14376 which merged on 11/20. |
For clarity, you're saying the deprecation warning is correlated with the leak? Assuming that's the case, I have two thoughts:
|
Most likely... Since we have that reference cycle, the diagram will be staying alive, including the My money is that my PR, #14356 (b1d0617), is the main offender here. @dhoule36 is there any chance that you have time to see if code before this commit (a) doesn't crash and (b) doesn't leak memory as you run stuff w/in a loop? Suggested commit:
So I'd use:
|
Confirmed; I'm like 99.99% sure that #14356 is the offender. Which means, we're stuck between a rock and hard place. I don't have a good idea on how to easily proceed :( (See updated permalink for more details) |
I don't think they're correlated. I followed your suggestion to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera, and while there is no longer a warning, I am still seeing the memory increase. @EricCousineau-TRI I'm just seeing this now. Did you still want me to try out the code? It looks like you may have already done so. |
Nah, we should be good now, thank you for checking! |
Possible solutions:
1. Revert #14356 and say beware
2. Just keep the leak?
3. Get overly creative with pybind11 and see if we can somehow transmit keep alive in the specific instance of DiagramBuilder -> Diagram.
(3) is probably feasible, but will be ridden with std::function<>, type erasure, and nastiness all around... or shared_ptr (#13058). |
I'm honestly not sure what's best. All three options seem very undesirable. Would appreciate thoughts from the other developers. (Sent by email on Tue, Dec 1, 2020, in reply to Eric Cousineau's comment above.) |
For my part, I'm missing part of the backstory here. Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth. I assume that |
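To ground that point, here is a minimal pure-Python illustration (my own sketch, not from the thread) that the cycle collector does reclaim ordinary reference cycles, so cycles alone do not explain unbounded growth:

```python
import gc
import weakref

class Node:
    pass

a = Node()
b = Node()
a.other = b
b.other = a        # a <-> b reference cycle; refcounts alone never reach zero
probe = weakref.ref(a)

del a, b           # the cycle is now unreachable from the program
gc.collect()       # the cycle collector traverses __dict__ and breaks the cycle
assert probe() is None  # both objects were reclaimed
```

The delay the comment mentions is real: without the `gc.collect()` call, the pair lingers until the collector's next automatic pass, but it is still reclaimed eventually.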
Will be investigating this (in a slow-burn fashion). See: pybind/pybind11#2761 |
@ggould-tri ran into this, but in a different form. He had a segfault on his machine; on my machine, it was |
Yes, I can confirm that in my case reusing a single DrakeLcm fixes my issue, implying that something in DrakeLcm scope is leaked (at least the receive thread, based on gdb |
I also recently encountered this issue; my setup is very similar to @dhoule36's. I was collecting rollouts in a veggie-scooping task, and I had to re-build the diagram before every rollout to import new geometries. Memory grew by about 300 MB every time, so I had to revert #14356. If we don't have to re-build every time, that also partially solves the issue. |
Having accumulated a lot more Python knowledge since I wrote that, my current hypothesis / understanding is that python only collects cycles through Python containers (list, set, dict), so we are going to need to rework our C++ containers that participate in cycles (like Diagram and Simulator) to expose their children as python |
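A small illustration of that hypothesis (my sketch): CPython's collector can only traverse references an object exposes through its GC support, which is exactly what Python-level attributes and containers provide. References held purely on the C++ side have no such traversal hook:

```python
import gc

class PyHolder:
    """Owned children stored in ordinary Python attributes."""
    def __init__(self):
        self.children = []

holder = PyHolder()
child = object()
holder.children.append(child)

# The instance opts in to cycle detection...
assert gc.is_tracked(holder)
# ...and the collector can enumerate the references it holds. A reference
# held only inside a C++ object has no traversal hook, so the collector
# cannot see, and therefore cannot break, cycles that pass through it.
assert holder.children in gc.get_referents(holder.__dict__)
```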
Not sure if it was mentioned on this issue, but just as FYI, reference cycles are an issue here because they seem to be declared in a way that the Python GC cannot detect / prune, i.e. EDIT: Example exported from Anzu |
Next tasks:
(1) Make a reproducer demo program that uses
(2) Survey for the latest instrumentation & diagnostic tools to help give us leverage and observability. (Rico)
(3) Confirm that we see leak(s) on the reproducer demo, at what kind of rate(s) for each level.
(4) Create toy bindings (simpler than the entire systems framework, etc.) that share the same architecture, and confirm that the toy likewise reproduces the leak(s).
(5) Iterate on solutions using the toy bindings. My first guess is that we will need to store owned children like "Simulator has-a Diagram" and "Diagram has-a LeafSystem(s)" in a
In the marginal dead time, work on #20491 so that our edit-test cycle on the real problem will be faster. (Rico)
In the marginal dead time, work on #20260 (or something similar, like sharding the docs) so that our edit-test cycle on the real problem will be faster. (Jeremy) |
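A toy sketch of the guess in (5), assuming (the comment is truncated) that the fix amounts to holding owned children in a native Python container the collector can traverse. The class names mirror Drake's, but this is plain Python of my own, not Drake code:

```python
import gc
import weakref

class LeafSystem:
    def __init__(self):
        self.parent = None

class Diagram:
    def __init__(self):
        # Hypothetical fix shape: ownership edges live in a plain Python
        # list, so the cycle collector can traverse Diagram -> children.
        self._owned_systems = []

    def add_system(self, system):
        self._owned_systems.append(system)
        system.parent = self  # back-pointer creates a Diagram <-> child cycle
        return system

diagram = Diagram()
leaf = diagram.add_system(LeafSystem())
probe = weakref.ref(diagram)

del diagram, leaf
gc.collect()
assert probe() is None  # the cycle is visible to the GC, so it gets pruned
```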
Regarding (1) and (3), I have a simple example showing leakage using drake/tmp/test/pybind_lifecycle_test.py, lines 47 to 84 at cfd1f53.
Note that the "hack" in this case (removing pybind11 patients) shows that the objects are appropriately freed. |
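The check described there follows a standard lifecycle-test pattern, which can be sketched in pure Python (my sketch; the real test lives in pybind_lifecycle_test.py):

```python
import gc
import weakref

def assert_collected(factory):
    """Build an object, drop the last reference, and verify the GC frees it."""
    obj = factory()
    probe = weakref.ref(obj)
    del obj
    gc.collect()
    assert probe() is None, "object leaked (still reachable after gc.collect())"

class Widget:
    pass

assert_collected(Widget)  # passes: nothing keeps Widget instances alive
```

A keep-alive "patient" held outside the object (as pybind11 does) would make the `assert` fire, which is the shape of leak the test file exercises.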
The goal for (1) is to make a reproducer akin to what a user would do, so |
I am a student at MIT in Russ Tedrake's Robotic Manipulation course. I had been able to successfully run one of his scripts (http://manipulation.csail.mit.edu/manipulation/clutter_maskrcnn_data.py) to generate training images within the Google Colab environment, until yesterday. Google Colab now crashes after generating about 90 images, saying that it has used all of the RAM (Colab Pro provides 25 GB).
With Colab, I use the following lines to install drake:
I am able to successfully generate 1,000 images with the script by installing a previous build:
setup_manipulation(manipulation_sha='fa5bcfb6367cd0cfda0e3d11e11854d68b39478a', drake_version='20201118', drake_build='nightly')
The script itself, linked above, calls a generate_image function repeatedly; I noticed that the inactive memory grows by about 300 MB with every execution. It uses a multibody plant, adding an RGB-D camera for capturing the training images. It drops random objects from the YCB dataset into a bin before grabbing the image from the camera's output port.
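A harness along these lines makes the per-iteration growth measurable. This is my own sketch; the stub stands in for the course script's generate_image (the real one builds a diagram and renders), and the `resource` module is Unix-only:

```python
import gc
import resource

def peak_rss_mb():
    # ru_maxrss is reported in KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def generate_image():
    # Stub for the course script's generate_image(), which builds a
    # diagram, drops YCB objects, and renders; here we just allocate.
    return bytearray(5_000_000)

baseline = peak_rss_mb()
for i in range(5):
    generate_image()
    gc.collect()  # a real leak keeps growing even after a full collection
    print(f"iteration {i}: peak RSS grew {peak_rss_mb() - baseline:.1f} MB")
```

If peak RSS keeps climbing across iterations even after `gc.collect()`, the references are being held somewhere the collector cannot reach, which matches the behavior reported above.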
Let me know if you need me to provide any more information!