Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pydrake: Memory Leak While Generating Training Images (leaking Diagrams w/ MultibodyPlant, SceneGraph, etc) #14387

Open
dhoule36 opened this issue Nov 29, 2020 · 22 comments
Assignees
Labels
component: pydrake Python API and its supporting Starlark macros priority: high type: bug

Comments

@dhoule36
Copy link

I am a student at MIT in Russ Tedrake's Robotic Manipulation course. I have been able to successfully run one of his scripts (http://manipulation.csail.mit.edu/manipulation/clutter_maskrcnn_data.py) to generate training images within the Google Colab environment, until yesterday. Google Colab now crashes after generating about 90 images saying that it has used all of the RAM, where Colab Pro has 25GB.

With Colab, I use the following lines to install drake:

# Install drake.
if 'google.colab' in sys.modules and importlib.util.find_spec('manipulation') is None:
    urlretrieve(f"http://manipulation.csail.mit.edu/scripts/setup/setup_manipulation_colab.py",
                "setup_manipulation_colab.py")
    from setup_manipulation_colab import setup_manipulation
    setup_manipulation(manipulation_sha='master', drake_version='latest', drake_build='continuous')

I am able to successfully generate 1,000 images with the script by installing a previous build:
setup_manipulation(manipulation_sha='fa5bcfb6367cd0cfda0e3d11e11854d68b39478a', drake_version='20201118', drake_build='nightly')

The script itself, linked above, calls a generate_image function repeatedly, where I noticed that the inactive memory grows by about 300MB with every execution. It uses a multibody plant, adding an rgbd camera for capturing the training images. It drops random objects from the ycb dataset into a bin before grabbing the image from the camera's output port.

Let me know if you need me to provide any more information!

@RussTedrake RussTedrake self-assigned this Nov 30, 2020
@RussTedrake
Copy link
Contributor

I am working with the OP to produce a better repro. I will reassign when we have it.

@dhoule36
Copy link
Author

I was able to create the simplest case of the issue that I was running into in Colab. I found that changing the drake version from 'latest' to '20201119' was the last nightly build that successfully does not run out of RAM ('20201120' does). Additionally, it seems that there is a deprecation warning that I receive in the same versions as the error:

Deprecated:
CameraProperties are being deprecated. Please use the RenderCamera
variant. This will be removed from Drake on or after 2021-03-01.

import importlib
import sys
from urllib.request import urlretrieve


# Install drake.
if 'google.colab' in sys.modules and importlib.util.find_spec('pydrake') is None:
  version='latest'
  build='nightly'
  urlretrieve(f"https://drake-packages.csail.mit.edu/drake/{build}/drake-{version}/setup_drake_colab.py", "setup_drake_colab.py")
  from setup_drake_colab import setup_drake
  setup_drake(version=version, build=build)

import psutil

import os
import numpy as np
from tqdm import tqdm

import pydrake.all
from pydrake.all import RigidTransform, RollPitchYaw

ycb = ['003_cracker_box.sdf', '004_sugar_box.sdf', '005_tomato_soup_can.sdf', '006_mustard_bottle.sdf', '009_gelatin_box.sdf', '010_potted_meat_can.sdf']
path = '/tmp/clutter_maskrcnn_data'
num_images = 10000

rng = np.random.default_rng()  # this is for python
generator = pydrake.common.RandomGenerator(rng.integers(1000))  # for c++

def generate_image(image_num):
    builder = pydrake.systems.framework.DiagramBuilder()
    plant, scene_graph = pydrake.multibody.plant.AddMultibodyPlantSceneGraph(builder, time_step=0.0005)
    parser = pydrake.multibody.parsing.Parser(plant)
    parser.AddModelFromFile(pydrake.common.FindResourceOrThrow(
        "drake/examples/manipulation_station/models/bin.sdf"))
    plant.WeldFrames(plant.world_frame(), plant.GetFrameByName("bin_base"))
    inspector = scene_graph.model_inspector()

    instance_id_to_class_name = dict()

    for object_num in range(rng.integers(1,10)):
        this_object = ycb[rng.integers(len(ycb))]
        class_name = os.path.splitext(this_object)[0]
        sdf = pydrake.common.FindResourceOrThrow("drake/manipulation/models/ycb/sdf/" + this_object)
        instance = parser.AddModelFromFile(sdf, f"object{object_num}")

        frame_id = plant.GetBodyFrameIdOrThrow(
            plant.GetBodyIndices(instance)[0])
        geometry_ids = inspector.GetGeometries(
            frame_id, pydrake.geometry.Role.kPerception)
        for geom_id in geometry_ids:
            instance_id_to_class_name[int(inspector.GetPerceptionProperties(geom_id).GetProperty("label", "id"))] = class_name

    plant.Finalize()

    renderer = "my_renderer"
    scene_graph.AddRenderer(
        renderer, pydrake.geometry.render.MakeRenderEngineVtk(pydrake.geometry.render.RenderEngineVtkParams()))
    properties = pydrake.geometry.render.DepthCameraProperties(width=640,
                                        height=480,
                                        fov_y=np.pi / 4.0,
                                        renderer_name=renderer,
                                        z_near=0.1,
                                        z_far=10.0)
    camera = builder.AddSystem(
        pydrake.systems.sensors.RgbdSensor(parent_id=scene_graph.world_frame_id(),
                    X_PB=RigidTransform(
                        RollPitchYaw(np.pi, 0, np.pi/2.0),
                        [0, 0, .8]),
                    properties=properties,
                    show_window=False))

for image_num in range(num_images):
    print(f"Current Memory: {psutil.virtual_memory()}")
    generate_image(image_num)

@sherm1
Copy link
Member

sherm1 commented Nov 30, 2020

This is likely related to PR #14375 or #14376 which merged on 11/20.
cc @SeanCurtis-TRI @rpoyner-tri

@SeanCurtis-TRI
Copy link
Contributor

Additionally, it seems that there is a deprecation warning that I receive in the same versions as the error.

For clarity, you're saying the deprecation warning is correlated with the leak?

Assuming that's the case, I have two thoughts:

  1. Update your code to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera. The Drake examples have been updated.
  2. @EricCousineau-TRI could this be related to the nature of our conservative exercise of the python binding's keep alive protocols?

@EricCousineau-TRI
Copy link
Contributor

EricCousineau-TRI commented Nov 30, 2020

@EricCousineau-TRI could this be related to the nature of our conservative exercise of the python binding's keep alive protocols?

Most likely... Since we have that reference cycle, the diagram will be staying alive, including the SceneGraph and all related assets.

My money is that my PR, #14356 (b1d0617), is the main offender here.

@dhoule36 is there any chance that you have time to see if code before this commit (a) doesn't crash and (b) doesn't leak memory as you run stuff w/in a loop?

Suggested commit:

$ git log -n 1 --oneline --no-decorate b1d0617~
commit aa02a6bd9765478d6f16c448239a4e2fa9474041 (HEAD)
Author: Eric Cousineau <[email protected]>
Date:   Thu Nov 19 16:41:17 2020 -0500

    py examples: Ensure manipulation_station_py.cc imports dep modules (#14370)

So I'd use:

version = '20201118'
build = 'nightly'
...
setup_drake(version=version, build=build)

@EricCousineau-TRI
Copy link
Contributor

EricCousineau-TRI commented Nov 30, 2020

Confirmed; I'm like 99.99% sure that #14356 is the offender.
In @dhoule36's notebook, I see memory increase using 20201120 (after that commit), but not on 20201119 (before that commit).

Which means, we're stuck between a rock and hard place.
If we revert, then it means segfaults.
If we keep it, then it means memory leaks, which is really bad for data gen.

I don't have a good idea on how to easily proceed :(
Ideas?

(See updated permalink for more details)

@dhoule36
Copy link
Author

dhoule36 commented Dec 1, 2020

@SeanCurtis-TRI

For clarity, you're saying the deprecation warning is correlated with the leak?

I don't think they're correlated. I followed your suggestion to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera, and while there is no longer a warning, I am still seeing the memory increase.

@EricCousineau-TRI I'm just seeing this now. Did you still want me to try out the code? It looks like you may have already done so.

@EricCousineau-TRI
Copy link
Contributor

Did you still want me to try out the code? It looks like you may have already done so.

Nah, we should be good now, thank you for checking!

@EricCousineau-TRI EricCousineau-TRI added the component: pydrake Python API and its supporting Starlark macros label Dec 1, 2020
@EricCousineau-TRI
Copy link
Contributor

Possible solutions:

  1. Revert py systems: Add keep_alive cycle to DiagramBuilder.AddSystem #14356 and say beware
  2. Just keep the leak?
  3. Get overly creative with pybind11 and see if we can somehow transmit keep alive in the specific instance of DiagramBuilder -> Diagram.

(3) is probably feasible, but will be ridden with std::function<>, type erasure, and nastiness all around... or shared_ptr (#13058).

@RussTedrake
Copy link
Contributor

RussTedrake commented Dec 6, 2020 via email

@jwnimmer-tri
Copy link
Collaborator

For my part, I'm missing part of the backstory here.

Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.

I assume that keep_alive is implemented in a way that does not allow python to detect the cycles. What are the details of how keep_alive is implemented, and why is doesn't play nice with the gc?

@EricCousineau-TRI
Copy link
Contributor

Will be investigating this (in a slow-burn fashion). See: pybind/pybind11#2761

@EricCousineau-TRI EricCousineau-TRI changed the title Memory Leak While Generating Training Images pydrake: Memory Leak While Generating Training Images (leaking Diagrams w/ MultibodyPlant, SceneGraph, etc) Feb 26, 2021
@EricCousineau-TRI
Copy link
Contributor

EricCousineau-TRI commented Feb 26, 2021

@ggould-tri ran into this, but in a different form. He had a segfault on his machine; on my machine, it was Too many open files from too many LCM instances being created and not destroyed. Workaround is to save 'em:
EricCousineau-TRI/repro@4a1c9efd

@ggould-tri
Copy link
Contributor

Yes, I can confirm that in my case reusing a single DrakeLcm fixes my issue, implying that something in DrakeLcm scope is leaked (at least the receive thread, based on gdb thread apply all bt results on the core dump)

@allenzren
Copy link
Contributor

allenzren commented Apr 5, 2022

I also recently encountered this issue; my setup is very similar to @dhoule36's. I was collecting rollouts in scooping veggie task and I had to re-build before every rollout to import new geometries. Memory grew 300mb every time, so I had to revert #14356.

If we don't have to re-build every time, that also partially solves the issue.

@jwnimmer-tri
Copy link
Collaborator

Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.

Having accumulated a lot more Python knowledge since I wrote that, my current hypothesis / understanding is that python only collects cycles through Python containers (list, set, dict), so we are going to need to rework our C++ containers that participate in cycles (like Diagram and Simulator) to expose their children as python lists, with the C++ storage weak-aliasing the python storage instead of the other way around.

@EricCousineau-TRI
Copy link
Contributor

EricCousineau-TRI commented Sep 18, 2024

Not sure if it was mentioned on this issue, but just as FYI, reference cycles are an issue here because they seem to be declared in a way that the Python GC cannot detect / prune, i.e.
https://github.com/RobotLocomotion/pybind11/blob/51d715e037386fcdbeda75ffab15f02f8e4388d8/include/pybind11/pybind11.h#L2701-L2729

EDIT: Example exported from Anzu
master...EricCousineau-TRI:drake:feature-clear-patients-example

@jwnimmer-tri
Copy link
Collaborator

jwnimmer-tri commented Sep 19, 2024

Next tasks:

(1) Make a reproducer demo program that uses pydrake, with a "level" flag to choose how complicated of an example it should be. The simplest level will be something like an empty Diagram + Simulator without even advancing. The most complicated level will have render engines, etc. something like examples/hardware_sim. (Jeremy)

(2) Survey for the latest instrumentation & diagnostic tools to help give us leverage and observability. (Rico)

(3) Confirm that we see leak(s) on the reproducer demo, at what kind of rate(s) for each level.

(4) Create toy bindings (simpler than the entire systems framework & etc) that share the same architecture, and confirm that the toy likewise reproduces the leak(s).

(5) Iterate on solutions using the toy bindings. My first guess is that we will need to store owned children like "Simulator has-a Diagram" and "Diagram has-a LeafSystem(s)" in a py::list attribute of the bound parent, so that normal GC cycle detection will apply.

In the marginal dead time, work on #20491 so that our edit-test cycle on the real problem will be faster. (Rico)

In the marginal dead time, work on #20260 (or something similar, like sharding the docs) so that our edit-test cycle on the real problem will be faster. (Jeremy)

@EricCousineau-TRI
Copy link
Contributor

Regarding (1) and (3), I have a simple example showing leakage using weakref.ref() in my above branch:

def test_make_diagram(self):
builder_ref, diagram_ref, system_ref = make_diagram()
# Dereference weakref's.
builder = builder_ref()
diagram = diagram_ref()
system = system_ref()
# Reference cycle prevents objects from being GC'd.
self.assertIsNotNone(builder)
self.assertIsNotNone(diagram)
self.assertIsNotNone(system)
def test_get_patients(self):
builder_ref, diagram_ref, system_ref = make_diagram()
# Dereference weakref's.
builder = builder_ref()
diagram = diagram_ref()
system = system_ref()
# As shown here, the cycle is formed between `builder` and `system`.
self.assertListIs(GetPatients(builder), [system, diagram])
self.assertListIs(GetPatients(diagram), [])
self.assertListIs(GetPatients(system), [builder])
def test_clear_patients(self):
builder_ref, diagram_ref, system_ref = make_diagram()
# Clear cycles for builder.
# WARNING: This may cause use-after-free errors. Use this with caution!
# Notes:
# - Depending on how your code operates, and what accessor you use,
# you may need to clear patients on other objects as well.
# - It be difficult to free everything if you have sub-builders /
# diagrams that are constructed in Python.
# - Lifetime cycles may not occur if `builder.Build()` is called in
# C++.
ClearPatients(builder_ref())
# Now objects are GC'd.
self.assertIsNone(builder_ref())
self.assertIsNone(diagram_ref())
self.assertIsNone(system_ref())

Note that the "hack" in this case (removing pybind11 patients) shows that the objects are appropriately freed.

@jwnimmer-tri
Copy link
Collaborator

The goal for (1) is to make a reproducer akin to what a user would do, so import weakref cannot be part of that story.

@rpoyner-tri
Copy link
Contributor

Progress:
#22059
#22075
#22068

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: pydrake Python API and its supporting Starlark macros priority: high type: bug
Projects
Status: In Progress
Development

No branches or pull requests

9 participants