pydrake: Memory Leak While Generating Training Images (leaking Diagrams w/ MultibodyPlant, SceneGraph, etc) #14387

dhoule36 · 2020-11-29T01:09:07Z

I am a student at MIT in Russ Tedrake's Robotic Manipulation course. Until yesterday, I was able to run one of his scripts (http://manipulation.csail.mit.edu/manipulation/clutter_maskrcnn_data.py) to generate training images in the Google Colab environment. Colab now crashes after generating about 90 images, reporting that it has used all available RAM (Colab Pro provides 25 GB).

With Colab, I use the following lines to install drake:

# Install drake.
import importlib.util
import sys
from urllib.request import urlretrieve

if 'google.colab' in sys.modules and importlib.util.find_spec('manipulation') is None:
    urlretrieve("http://manipulation.csail.mit.edu/scripts/setup/setup_manipulation_colab.py",
                "setup_manipulation_colab.py")
    from setup_manipulation_colab import setup_manipulation
    setup_manipulation(manipulation_sha='master', drake_version='latest', drake_build='continuous')

I am able to successfully generate 1,000 images with the script by installing a previous build:
setup_manipulation(manipulation_sha='fa5bcfb6367cd0cfda0e3d11e11854d68b39478a', drake_version='20201118', drake_build='nightly')

The script itself, linked above, calls a generate_image function repeatedly, and I noticed that the inactive memory grows by about 300 MB with every call. It uses a MultibodyPlant and adds an RGB-D camera for capturing the training images. It drops random objects from the YCB dataset into a bin before grabbing the image from the camera's output port.
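One way to quantify the per-call growth described above is to sample process memory around each call. The following is a minimal sketch, not part of the original script: `peak_rss` and `track_growth` are hypothetical helper names, and it assumes a Unix-like system where the standard-library `resource` module is available. Because `ru_maxrss` reports the peak resident set size, the deltas can only be zero or positive; a steady positive delta on every iteration is a hint that memory is not being released.

```python
import resource


def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def track_growth(step, iterations, read_memory=peak_rss):
    """Call step(i) repeatedly and return the memory delta observed per call."""
    deltas = []
    previous = read_memory()
    for i in range(iterations):
        step(i)
        current = read_memory()
        deltas.append(current - previous)
        previous = current
    return deltas
```

Passing the script's generate_image as `step` (for example `track_growth(generate_image, 20)`) would give one delta per generated image. The `read_memory` parameter is there so a different probe, such as one based on psutil, can be swapped in.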

Let me know if you need me to provide any more information!


RussTedrake · 2020-11-30T15:03:47Z

I am working with the OP to produce a better repro. I will reassign when we have it.

dhoule36 · 2020-11-30T18:02:11Z

I was able to reduce the issue I was running into in Colab to the minimal case below. By changing the drake version from 'latest' to dated nightly builds, I found that '20201119' is the last nightly build that does not run out of RAM ('20201120' does). Additionally, the builds that run out of RAM also emit a deprecation warning:

Deprecated:
CameraProperties are being deprecated. Please use the RenderCamera
variant. This will be removed from Drake on or after 2021-03-01.

import importlib.util
import sys
from urllib.request import urlretrieve


# Install drake.
if 'google.colab' in sys.modules and importlib.util.find_spec('pydrake') is None:
  version='latest'
  build='nightly'
  urlretrieve(f"https://drake-packages.csail.mit.edu/drake/{build}/drake-{version}/setup_drake_colab.py", "setup_drake_colab.py")
  from setup_drake_colab import setup_drake
  setup_drake(version=version, build=build)

import psutil

import os
import numpy as np
from tqdm import tqdm

import pydrake.all
from pydrake.all import RigidTransform, RollPitchYaw

ycb = ['003_cracker_box.sdf', '004_sugar_box.sdf', '005_tomato_soup_can.sdf', '006_mustard_bottle.sdf', '009_gelatin_box.sdf', '010_potted_meat_can.sdf']
path = '/tmp/clutter_maskrcnn_data'
num_images = 10000

rng = np.random.default_rng()  # this is for python
generator = pydrake.common.RandomGenerator(rng.integers(1000))  # for c++

def generate_image(image_num):
    builder = pydrake.systems.framework.DiagramBuilder()
    plant, scene_graph = pydrake.multibody.plant.AddMultibodyPlantSceneGraph(builder, time_step=0.0005)
    parser = pydrake.multibody.parsing.Parser(plant)
    parser.AddModelFromFile(pydrake.common.FindResourceOrThrow(
        "drake/examples/manipulation_station/models/bin.sdf"))
    plant.WeldFrames(plant.world_frame(), plant.GetFrameByName("bin_base"))
    inspector = scene_graph.model_inspector()

    instance_id_to_class_name = dict()

    for object_num in range(rng.integers(1,10)):
        this_object = ycb[rng.integers(len(ycb))]
        class_name = os.path.splitext(this_object)[0]
        sdf = pydrake.common.FindResourceOrThrow("drake/manipulation/models/ycb/sdf/" + this_object)
        instance = parser.AddModelFromFile(sdf, f"object{object_num}")

        frame_id = plant.GetBodyFrameIdOrThrow(
            plant.GetBodyIndices(instance)[0])
        geometry_ids = inspector.GetGeometries(
            frame_id, pydrake.geometry.Role.kPerception)
        for geom_id in geometry_ids:
            instance_id_to_class_name[int(inspector.GetPerceptionProperties(geom_id).GetProperty("label", "id"))] = class_name

    plant.Finalize()

    renderer = "my_renderer"
    scene_graph.AddRenderer(
        renderer, pydrake.geometry.render.MakeRenderEngineVtk(pydrake.geometry.render.RenderEngineVtkParams()))
    properties = pydrake.geometry.render.DepthCameraProperties(width=640,
                                        height=480,
                                        fov_y=np.pi / 4.0,
                                        renderer_name=renderer,
                                        z_near=0.1,
                                        z_far=10.0)
    camera = builder.AddSystem(
        pydrake.systems.sensors.RgbdSensor(parent_id=scene_graph.world_frame_id(),
                    X_PB=RigidTransform(
                        RollPitchYaw(np.pi, 0, np.pi/2.0),
                        [0, 0, .8]),
                    properties=properties,
                    show_window=False))

for image_num in range(num_images):
    print(f"Current Memory: {psutil.virtual_memory()}")
    generate_image(image_num)

sherm1 · 2020-11-30T18:19:14Z

This is likely related to PR #14375 or #14376 which merged on 11/20.
cc @SeanCurtis-TRI @rpoyner-tri

SeanCurtis-TRI · 2020-11-30T18:39:00Z

> Additionally, it seems that there is a deprecation warning that I receive in the same versions as the error.

For clarity, you're saying the deprecation warning is correlated with the leak?

Assuming that's the case, I have two thoughts:

1. Update your code to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera. The Drake examples have been updated.
2. @EricCousineau-TRI could this be related to the nature of our conservative exercise of the python binding's keep alive protocols?

EricCousineau-TRI · 2020-11-30T23:13:41Z

> @EricCousineau-TRI could this be related to the nature of our conservative exercise of the python binding's keep alive protocols?

Most likely... Since we have that reference cycle, the diagram will be staying alive, including the SceneGraph and all related assets.

My money is that my PR, #14356 (b1d0617), is the main offender here.

@dhoule36 is there any chance that you have time to see if code before this commit (a) doesn't crash and (b) doesn't leak memory as you run stuff w/in a loop?

Suggested commit:

$ git log -n 1 --oneline --no-decorate b1d0617~
commit aa02a6bd9765478d6f16c448239a4e2fa9474041 (HEAD)
Author: Eric Cousineau <[email protected]>
Date:   Thu Nov 19 16:41:17 2020 -0500

    py examples: Ensure manipulation_station_py.cc imports dep modules (#14370)

So I'd use:

version = '20201118'
build = 'nightly'
...
setup_drake(version=version, build=build)

EricCousineau-TRI · 2020-11-30T23:25:25Z

Trying on my end:
https://colab.research.google.com/github/EricCousineau-TRI/repro/blob/3d297d22117941d773a954547fc47f673987a111/drake_stuff/issues/issue_14387/repro.ipynb

EricCousineau-TRI · 2020-11-30T23:41:51Z

Confirmed; I'm like 99.99% sure that #14356 is the offender.
In @dhoule36's notebook, I see memory increase using 20201120 (after that commit), but not on 20201119 (before that commit).

Which means, we're stuck between a rock and a hard place.
If we revert, then it means segfaults.
If we keep it, then it means memory leaks, which is really bad for data gen.

I don't have a good idea on how to easily proceed :(
Ideas?

(See updated permalink for more details)

dhoule36 · 2020-12-01T19:30:08Z

@SeanCurtis-TRI

> For clarity, you're saying the deprecation warning is correlated with the leak?

I don't think they're correlated. I followed your suggestion to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera, and while there is no longer a warning, I am still seeing the memory increase.

@EricCousineau-TRI I'm just seeing this now. Did you still want me to try out the code? It looks like you may have already done so.

EricCousineau-TRI · 2020-12-01T21:18:13Z

> Did you still want me to try out the code? It looks like you may have already done so.

Nah, we should be good now, thank you for checking!

EricCousineau-TRI · 2020-12-01T21:20:11Z

Possible solutions:

1. Revert #14356 (py systems: Add keep_alive cycle to DiagramBuilder.AddSystem) and say beware
2. Just keep the leak?
3. Get overly creative with pybind11 and see if we can somehow transmit keep alive in the specific instance of DiagramBuilder -> Diagram.

(3) is probably feasible, but will be ridden with std::function<>, type erasure, and nastiness all around... or shared_ptr (#13058).

RussTedrake · 2020-12-06T17:11:08Z

I’m honestly not sure what’s best. All three options seem very undesirable. Would appreciate thoughts from the other developers.


jwnimmer-tri · 2020-12-09T03:50:43Z

For my part, I'm missing part of the backstory here.

Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.

I assume that keep_alive is implemented in a way that does not allow python to detect the cycles. What are the details of how keep_alive is implemented, and why doesn't it play nice with the gc?

EricCousineau-TRI · 2020-12-30T21:40:55Z

Will be investigating this (in a slow-burn fashion). See: pybind/pybind11#2761

EricCousineau-TRI · 2021-02-26T17:21:29Z

@ggould-tri ran into this, but in a different form. He had a segfault on his machine; on my machine, it was Too many open files from too many LCM instances being created and not destroyed. Workaround is to save 'em:
EricCousineau-TRI/repro@4a1c9efd

ggould-tri · 2021-02-26T17:35:04Z

Yes, I can confirm that in my case reusing a single DrakeLcm fixes my issue, implying that something in DrakeLcm scope is leaked (at least the receive thread, based on gdb thread apply all bt results on the core dump)
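The workaround described above (create the expensive object once and reuse it across iterations) can be sketched generically. This is an illustration only, not Drake code: `ExpensiveResource`, `run_rollout`, and `run_all` are hypothetical stand-ins, with `ExpensiveResource` playing the role of an object such as DrakeLcm that owns threads or file descriptors.

```python
class ExpensiveResource:
    """Hypothetical stand-in for an object that owns threads or file handles."""

    instances_created = 0

    def __init__(self):
        type(self).instances_created += 1


def run_rollout(index, resource):
    """Hypothetical per-iteration work that only borrows the shared resource."""
    return (index, id(resource))


def run_all(num_rollouts):
    """Create the resource once, outside the loop, and reuse it every time."""
    shared = ExpensiveResource()
    return [run_rollout(i, shared) for i in range(num_rollouts)]
```

With this structure, even if each constructed instance were leaked, only one instance would ever exist, so the leak would stay bounded.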

allenzren · 2022-04-05T19:58:43Z

I also recently encountered this issue; my setup is very similar to @dhoule36's. I was collecting rollouts in a veggie-scooping task and had to re-build before every rollout to import new geometries. Memory grew by about 300 MB every time, so I had to revert #14356.

If we don't have to re-build every time, that also partially solves the issue.

jwnimmer-tri · 2024-09-18T18:30:25Z

> Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.

Having accumulated a lot more Python knowledge since I wrote that, my current hypothesis / understanding is that python only collects cycles through Python containers (list, set, dict), so we are going to need to rework our C++ containers that participate in cycles (like Diagram and Simulator) to expose their children as python lists, with the C++ storage weak-aliasing the python storage instead of the other way around.

EricCousineau-TRI · 2024-09-18T18:49:33Z

Not sure if it was mentioned on this issue, but just as FYI, reference cycles are an issue here because they seem to be declared in a way that the Python GC cannot detect / prune, i.e.
https://github.com/RobotLocomotion/pybind11/blob/51d715e037386fcdbeda75ffab15f02f8e4388d8/include/pybind11/pybind11.h#L2701-L2729
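The difference between a cycle the Python GC can see and a strong reference it cannot see can be reproduced without pydrake. The sketch below is an assumption-laden illustration for CPython only: it uses `ctypes.pythonapi.Py_IncRef` to stand in for native code that holds a strong reference without exposing it to the collector. It is not how pybind11 is implemented; `Node`, `make_visible_cycle`, `make_hidden_reference`, and `release_hidden_reference` are hypothetical names.

```python
import ctypes
import gc
import weakref

# CPython C-API entry points, used here only to mimic native code.
_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None
_decref = ctypes.pythonapi.Py_DecRef
_decref.argtypes = [ctypes.py_object]
_decref.restype = None


class Node:
    """Plain Python object that can point at another Node."""

    def __init__(self):
        self.other = None


def make_visible_cycle():
    """Two objects that reference each other through ordinary attributes."""
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


def make_hidden_reference():
    """One object whose refcount is bumped from outside the object graph."""
    node = Node()
    # The count goes up, but no Python container records who owns it,
    # so the collector must assume the object is still in use.
    _incref(node)
    return weakref.ref(node)


def release_hidden_reference(ref):
    """Drop the extra reference taken in make_hidden_reference()."""
    node = ref()
    if node is not None:
        _decref(node)


visible = make_visible_cycle()
hidden = make_hidden_reference()
gc.collect()
visible_collected = visible() is None
hidden_collected = hidden() is None

release_hidden_reference(hidden)
gc.collect()
hidden_collected_after_release = hidden() is None
```

On CPython, the attribute-based cycle is expected to be reclaimed by `gc.collect()`, while the object with the hidden reference survives until that reference is explicitly released.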

EDIT: Example exported from Anzu
master...EricCousineau-TRI:drake:feature-clear-patients-example

jwnimmer-tri · 2024-09-19T16:48:02Z

Next tasks:

(1) Make a reproducer demo program that uses pydrake, with a "level" flag to choose how complicated of an example it should be. The simplest level will be something like an empty Diagram + Simulator without even advancing. The most complicated level will have render engines, etc. something like examples/hardware_sim. (Jeremy)

(2) Survey for the latest instrumentation & diagnostic tools to help give us leverage and observability. (Rico)

(3) Confirm that we see leak(s) on the reproducer demo, at what kind of rate(s) for each level.

(4) Create toy bindings (simpler than the entire systems framework & etc) that share the same architecture, and confirm that the toy likewise reproduces the leak(s).

(5) Iterate on solutions using the toy bindings. My first guess is that we will need to store owned children like "Simulator has-a Diagram" and "Diagram has-a LeafSystem(s)" in a py::list attribute of the bound parent, so that normal GC cycle detection will apply.

In the marginal dead time, work on #20491 so that our edit-test cycle on the real problem will be faster. (Rico)

In the marginal dead time, work on #20260 (or something similar, like sharding the docs) so that our edit-test cycle on the real problem will be faster. (Jeremy)

EricCousineau-TRI · 2024-09-19T22:01:59Z

Regarding (1) and (3), I have a simple example showing leakage using weakref.ref() in my above branch:

drake/tmp/test/pybind_lifecycle_test.py

Lines 47 to 84 in cfd1f53

    
    def test_make_diagram(self):
        builder_ref, diagram_ref, system_ref = make_diagram()
        # Dereference weakref's.
        builder = builder_ref()
        diagram = diagram_ref()
        system = system_ref()
        # Reference cycle prevents objects from being GC'd.
        self.assertIsNotNone(builder)
        self.assertIsNotNone(diagram)
        self.assertIsNotNone(system)

    def test_get_patients(self):
        builder_ref, diagram_ref, system_ref = make_diagram()
        # Dereference weakref's.
        builder = builder_ref()
        diagram = diagram_ref()
        system = system_ref()
        # As shown here, the cycle is formed between `builder` and `system`.
        self.assertListIs(GetPatients(builder), [system, diagram])
        self.assertListIs(GetPatients(diagram), [])
        self.assertListIs(GetPatients(system), [builder])

    def test_clear_patients(self):
        builder_ref, diagram_ref, system_ref = make_diagram()
        # Clear cycles for builder.
        # WARNING: This may cause use-after-free errors. Use this with caution!
        # Notes:
        # - Depending on how your code operates, and what accessor you use,
        #   you may need to clear patients on other objects as well.
        # - It may be difficult to free everything if you have sub-builders /
        #   diagrams that are constructed in Python.
        # - Lifetime cycles may not occur if `builder.Build()` is called in
        #   C++.
        ClearPatients(builder_ref())
        # Now objects are GC'd.
        self.assertIsNone(builder_ref())
        self.assertIsNone(diagram_ref())
        self.assertIsNone(system_ref())

Note that the "hack" in this case (removing pybind11 patients) shows that the objects are appropriately freed.

jwnimmer-tri · 2024-09-19T22:11:04Z

The goal for (1) is to make a reproducer akin to what a user would do, so import weakref cannot be part of that story.

rpoyner-tri · 2024-10-24T21:23:04Z

Progress:
#22059
#22075
#22068

RussTedrake self-assigned this Nov 30, 2020

ggould-tri added the unused team: robot locomotion group label Nov 30, 2020

EricCousineau-TRI added the component: pydrake Python API and its supporting Starlark macros label Dec 1, 2020

EricCousineau-TRI mentioned this issue Dec 30, 2020

[FEAT] keep_alive_impl should admit reference cycles, and ideally be released by GC? pybind/pybind11#2761

Open

EricCousineau-TRI assigned EricCousineau-TRI and unassigned RussTedrake Dec 30, 2020

EricCousineau-TRI changed the title ~~Memory Leak While Generating Training Images~~ pydrake: Memory Leak While Generating Training Images (leaking Diagrams w/ MultibodyPlant, SceneGraph, etc) Feb 26, 2021

jwnimmer-tri added the priority: backlog label Nov 12, 2021

jwnimmer-tri removed the unused team: robot locomotion group label May 3, 2022

EricCousineau-TRI mentioned this issue Jan 26, 2023

[pybind] Keep alive, refererence (tr.) usages and semnatics are confusing / possibly incorrect #18656

Open

jwnimmer-tri mentioned this issue Apr 23, 2023

[workspace] Switch pybind11 to upstream #19250

Draft

5 tasks

jwnimmer-tri added the type: bug label May 22, 2023

jwnimmer-tri self-assigned this Jun 8, 2023

jwnimmer-tri added this to #dynamics (Drake board) Aug 27, 2024

jwnimmer-tri moved this to Todo in #dynamics (Drake board) Sep 3, 2024

jwnimmer-tri added priority: high and removed priority: backlog labels Sep 3, 2024

jwnimmer-tri unassigned EricCousineau-TRI Sep 16, 2024

jwnimmer-tri mentioned this issue Sep 25, 2024

[pydrake] Add example of memory leaks #21951

Merged

jwnimmer-tri moved this from Todo to In Progress in #dynamics (Drake board) Sep 30, 2024

jwnimmer-tri mentioned this issue Oct 3, 2024

[pydrake] Add more examples of memory leaks #21986

Merged

jwnimmer-tri assigned rpoyner-tri Oct 7, 2024

This was referenced Oct 10, 2024

Python gc fixes prototype #22022

Draft

[visualization] Fix meshcat control deletion paths #22028

Merged

jwnimmer-tri removed their assignment Nov 1, 2024

rpoyner-tri mentioned this issue Nov 6, 2024

[systems] Add diagram life support #22132

Merged

rpoyner-tri mentioned this issue Nov 20, 2024

[pydrake] Fix diagram memory leaks #22221

Open
