Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bindings] Add custom ref_cycle annotation #22068

Merged
merged 1 commit into from
Oct 25, 2024

Conversation

rpoyner-tri
Copy link
Contributor

@rpoyner-tri rpoyner-tri commented Oct 23, 2024

Introduce the new annotation internal::ref_cycle<M, N>(). It will eventually replace existing cyclic py::keep_alive<>() annotations (which do their job, but leak their participants forever). The participants of ref_cycle<>() will be garbage collectable, so that applications can run loops that use various drake components without exhausting memory.

This patch just adds the implementation and its unit test.

Full WIP at #22022.


This change is Reviewable

@rpoyner-tri rpoyner-tri added the release notes: none This pull request should not be mentioned in the release notes label Oct 23, 2024
Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+(release notes: none)
+@jwnimmer-tri or delegate for feature review.
FYI @EricCousineau-TRI

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: feature. After the next push, I suggest assigning Eric as a second feature reviewer, prior to platform review.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: 8 unresolved discussions, needs at least two assigned reviewers


bindings/pydrake/common/ref_cycle_pybind.h line 16 at r1 (raw file):

  Wherever pybind11::keep_alive<M, N>() can be used, ref_cycle<M, N>() can be
  used similarly.
 */

BTW When I read this API doc before the implementation, for some reason I took ref_cycle<M, N> to mean a one-way reference, i.e., the same semantics as keep_alive<M, N>. Of course the name "cycle" should have clued me otherwise, but I suppose its worth adding really clear text in the docs here:

  • M keeps N alive, and N keeps M alive.
  • Neither one is collected until both are unreachable.
    *The template order of M vs N does not matter.

bindings/pydrake/common/ref_cycle_pybind.h line 26 at r1 (raw file):

/* This function creates the cycle, given peers as pybind11::handles. */
void do_ref_cycle_impl(pybind11::handle p0, pybind11::handle p1);

I don't see any reason why this should be declared in the header? It is only ever called from the cc file.


bindings/pydrake/common/ref_cycle_pybind.cc line 31 at r1 (raw file):

  // if it is not yet existing.
  auto make_link = [](handle a, handle b) {
    static const char refcycle_peers[] = "_pydrake_ref_cycle_peers";

BTW Given the consequences of a user ever accessing this (the DEMAND on line 44 will fail), maybe it should have an even scarier name, like a double underscore or "do not use" or something?


bindings/pydrake/common/ref_cycle_pybind.cc line 63 at r1 (raw file):

      return call.args[n - 1];
    }
    return handle();

Why doesn't this throw an exception? It seems to me like we only reach this spot if a developer types a number wrong in a bindings file.

If it's because it already trips the "Could not activate" clause up top, that should be more clear. But in that case, it seems like if we throw here, then we could give a more useful exception message, e.g., with the name of the function in the error message. It might be best for this function to check that truth condition (for all of its return branches?) so that developer errors are clear.


bindings/pydrake/common/test/ref_cycle_test.py line 28 at r1 (raw file):

            # Recall that sys.getrefcount() adds 1 to the actual count.
            self.assertEqual(sys.getrefcount(x.__dict__), 2)
            self.assertEqual(sys.getrefcount(x._pydrake_ref_cycle_peers), 2)

BTW more clear?

Suggestion:

            self.assertEqual(sys.getrefcount(x.__dict__) - 1, 1)
            self.assertEqual(sys.getrefcount(x._pydrake_ref_cycle_peers) - 1, 1)

bindings/pydrake/common/test/ref_cycle_test.py line 73 at r1 (raw file):

        self.check_no_cycle(dut, peer)
        # Annotated call dies because dut is not py::dynamic_attr().
        with self.assertRaisesRegex(SystemExit, ".*PyType_IS_GC.*"):

BTW Here is an example where "AddIsCycle" in the message would be a nice win for Drake developers.

Ditto throughout.


bindings/pydrake/common/test/ref_cycle_test.py line 89 at r1 (raw file):

        # Un-annotated call is fine.
        self.assertIsNone(dut.ReturnNullIs())
        # Annotated call does not die because one peer is missing.

This result is fine in case it's hard to implement otherwise, but really this situation (using ref_cycle on two classes where either one is not dynamic-attr) is a mistake by the drake developer and should definitely throw, if we can swing it.

Failing that, repeating my comment here into the test case for posterity (that we would hope for a throw) would be fine, too.

Ditto throughout.


bindings/pydrake/common/test/ref_cycle_test_util_py.cc line 29 at r1 (raw file):

};
using IsDynamic = TestDummyBase<true>;
using NotDynamic = TestDummyBase<false>;

nit These lines duplicate the in-class using above. To better communicate they we are using the same meaning everywhere, these should cite the nested types instead of defining them de-novo.

Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 unresolved discussions, needs at least two assigned reviewers


bindings/pydrake/common/ref_cycle_pybind.h line 26 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

I don't see any reason why this should be declared in the header? It is only ever called from the cc file.

Done.


bindings/pydrake/common/ref_cycle_pybind.cc line 63 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Why doesn't this throw an exception? It seems to me like we only reach this spot if a developer types a number wrong in a bindings file.

If it's because it already trips the "Could not activate" clause up top, that should be more clear. But in that case, it seems like if we throw here, then we could give a more useful exception message, e.g., with the name of the function in the error message. It might be best for this function to check that truth condition (for all of its return branches?) so that developer errors are clear.

Done.


bindings/pydrake/common/test/ref_cycle_test.py line 89 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

This result is fine in case it's hard to implement otherwise, but really this situation (using ref_cycle on two classes where either one is not dynamic-attr) is a mistake by the drake developer and should definitely throw, if we can swing it.

Failing that, repeating my comment here into the test case for posterity (that we would hope for a throw) would be fine, too.

Ditto throughout.

This is not a case that should throw. I'm following the lead of the py::keep_alive<>() implementers here, as far as the action to take. In this case, we happen to be using invalid types, but the check for None in either peer index position has to supersede that.

Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@EricCousineau-TRI for second feature review.

Reviewable status: 3 unresolved discussions, LGTM missing from assignee EricCousineau-TRI(platform)

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: LGTM missing from assignee EricCousineau-TRI(platform)


bindings/pydrake/common/test/ref_cycle_test.py line 89 at r1 (raw file):

Previously, rpoyner-tri (Rick Poyner (rico)) wrote…

This is not a case that should throw. I'm following the lead of the py::keep_alive<>() implementers here, as far as the action to take. In this case, we happen to be using invalid types, but the check for None in either peer index position has to supersede that.

Yes, now I realize we're only using the dynamic type, not the static type. If we were checking the static type we could complain, but we don't always have that, and it's not easy to obtain even when we do.

Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: LGTM missing from assignee EricCousineau-TRI(platform)


bindings/pydrake/common/ref_cycle_pybind.h line 26 at r1 (raw file):

Previously, rpoyner-tri (Rick Poyner (rico)) wrote…

Done.

I will likely need to re-declare this later, to implement the (ZOMGWTFBBQ) AddMultibodyPlantSceneGraph bindings. That's fine -- I'll just plumb it when I need it.

Copy link
Contributor

@EricCousineau-TRI EricCousineau-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: with a small design-level FYI

Reviewed 3 of 5 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),EricCousineau-TRI(platform)


bindings/pydrake/common/ref_cycle_pybind.h line 14 at r3 (raw file):

   https://pybind11.readthedocs.io/en/stable/advanced/functions.html#additional-call-policies

  `ref_cycle` creates a reference count cycle that Python's cyclic garbage

FYI Given this design, this will prevent cycles when replaced directly.
However, we could still eventually have non-discoverable cycles if we somehow design bindings that allow transitive keep_alive calls to be strung together.
The more holistic solution would be to replace keep_alive with this custom implementation (but only in one direction), s.t. it's no longer an issue.

However, the cost is that anything wanting to extend lifetimes must use dynamic_attr() declarations.

Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),EricCousineau-TRI(platform)


bindings/pydrake/common/ref_cycle_pybind.h line 14 at r3 (raw file):

Previously, EricCousineau-TRI (Eric Cousineau) wrote…

FYI Given this design, this will prevent cycles when replaced directly.
However, we could still eventually have non-discoverable cycles if we somehow design bindings that allow transitive keep_alive calls to be strung together.
The more holistic solution would be to replace keep_alive with this custom implementation (but only in one direction), s.t. it's no longer an issue.

However, the cost is that anything wanting to extend lifetimes must use dynamic_attr() declarations.

Good point. To restate things, it's still possible to accidentally create immortal objects by arranging py::keep_alive links in a loop.

I suppose I was at least unconsciously aware of that, but I think the danger is slight, or at least no greater than it is for any other pybind11 users. I think I am content with the notion that we make obvious badness better, and remain aware of potential problems.

If we were to discover a new immortality problem, I would recommend:

  • (maybe) build some instrumentation that can dump out graphs of gc-trackable and py::keep_alive reference relations.
  • Modify this code to support a call policy that makes a one-way gc-trackable reference relation.

I don't think either of those is difficult, if we hit the problem.

Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needs platform: +@SeanCurtis-TRI for platform review, by schedule.

Reviewable status: LGTM missing from assignee SeanCurtis-TRI(platform)

Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignee SeanCurtis-TRI(platform)

a discussion (no related file):
a thread to disable the squash button while awaiting platform review.


Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:LGTM:

Reviewed 3 of 5 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: 8 unresolved discussions


bindings/pydrake/common/ref_cycle_pybind.h line 16 at r3 (raw file):

  `ref_cycle` creates a reference count cycle that Python's cyclic garbage
  collection can find and collect, once the cycle's objects are no longer
  reachable. Both peer objects must by pybind11 classes that were defined with

nit: typo

Suggestion:

be

bindings/pydrake/common/ref_cycle_pybind.h line 17 at r3 (raw file):

  collection can find and collect, once the cycle's objects are no longer
  reachable. Both peer objects must by pybind11 classes that were defined with
  the dynamic_attr() annotation.

BTW It's worth elaborating on this requirement. It's detected at runtime with an exception. (Although, the exception message doesn't allude to this language.) I'd like to see this documentation and that error message converge a bit closer.

Code quote:

dynamic_attr() annotation

bindings/pydrake/common/ref_cycle_pybind.h line 26 at r3 (raw file):

  * M keeps N alive, and N keeps M alive.
  * Neither object is finalized until:

BTW I realize it's a python term of art, but given that this is in the context of Drake, the term "finalize" is a bit overloaded. Perhaps a bit less terse?

Code quote:

finalized

bindings/pydrake/common/ref_cycle_pybind.h line 68 at r3 (raw file):

 private:
  static constexpr bool needs_result() { return Peer0 == 0 || Peer1 == 0; }

BTW Initially, this whole thing was completely opaque to me (as a non-member of the Jeremy/Eric/Rico group). Initially, I was calling for a life ring.

After reading this a bit more, I have a sense of what this is about. I think it's still worth making a note for drake developers that aren't steeped in pybind.


bindings/pydrake/common/ref_cycle_pybind.cc line 37 at r3 (raw file):

      DRAKE_DEMAND(PyType_IS_GC(Py_TYPE(peers.ptr())));
      a.attr(refcycle_peers) = peers;
      Py_DECREF(peers.ptr());

BTW An explanation of why this explicit reference management is necessary would be nice.


bindings/pydrake/common/test/ref_cycle_test.py line 25 at r3 (raw file):

    # python call) cost.
    @functools.cache
    def wrapped_refcount_cost():

BTW CRAZY!


bindings/pydrake/common/test/ref_cycle_test_util_py.cc line 12 at r3 (raw file):

namespace {

template <bool AmIDynamic>

BTW While this only appears twice, bool expressions are typically statements and not questions. However, given that you've taken the most likely candidate for predicate and turned it into a thing, meh.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 5 unresolved discussions


bindings/pydrake/common/ref_cycle_pybind.h line 68 at r3 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

BTW Initially, this whole thing was completely opaque to me (as a non-member of the Jeremy/Eric/Rico group). Initially, I was calling for a life ring.

After reading this a bit more, I have a sense of what this is about. I think it's still worth making a note for drake developers that aren't steeped in pybind.

I suppose what's missing is "zero refers to the return value" with a citation to here?

Copy link
Contributor Author

@rpoyner-tri rpoyner-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),SeanCurtis-TRI(platform),EricCousineau-TRI(platform)


bindings/pydrake/common/ref_cycle_pybind.cc line 37 at r3 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

BTW An explanation of why this explicit reference management is necessary would be nice.

got rid of it.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),SeanCurtis-TRI(platform),EricCousineau-TRI(platform)

Copy link
Contributor

@EricCousineau-TRI EricCousineau-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),SeanCurtis-TRI(platform),EricCousineau-TRI(platform)


bindings/pydrake/common/ref_cycle_pybind.h line 18 at r4 (raw file):

  reachable.

  Both peer objects must be pybind11 classes that were defined with the

BTW From reading the code, I'm not sure if it stictly "must" be a pybind11 class.
For example, someone could define their own Python class and pass it here.


bindings/pydrake/common/ref_cycle_pybind.h line 86 at r4 (raw file):

 private:
  // Returns true if either template parameter denotes the return value.
  static constexpr bool needs_result() { return Peer0 == 0 || Peer1 == 0; }

BTW result() is ambiguous. Consider renaming to needs_return_value(). I was initially confused until I had read the source code in my first review.


bindings/pydrake/common/ref_cycle_pybind.cc line 33 at r4 (raw file):

    // N.B.: `object` does automatic ref counting; `handle` does not.
    if (!hasattr(a, refcycle_peers)) {
      auto new_set = pybind11::reinterpret_steal<object>(PySet_New(nullptr));

BTW This could be simplified by using py::set().

Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: 3 unresolved discussions


bindings/pydrake/common/ref_cycle_pybind.h line 68 at r3 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

I suppose what's missing is "zero refers to the return value" with a citation to here?

The new documentation does that, and @EricCousineau-TRI 's comment disambiguating "result" from "return value" also captures it.


bindings/pydrake/common/ref_cycle_pybind.h line 86 at r4 (raw file):

Previously, EricCousineau-TRI (Eric Cousineau) wrote…

BTW result() is ambiguous. Consider renaming to needs_return_value(). I was initially confused until I had read the source code in my first review.

Eric has succinctly summarized my confusion.

Introduce the new annotation internal::ref_cycle<M, N>(). It will eventually
replace existing cyclic py::keep_alive<>() annotations (which do their job, but
leak their participants forever). The participants of ref_cycle<>() will be
garbage collectible, so that applications can run loops that use various drake
components without exhausting memory.

This patch just adds the implementation and its unit test.
Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),SeanCurtis-TRI(platform),EricCousineau-TRI(platform)

Copy link
Contributor

@EricCousineau-TRI EricCousineau-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),SeanCurtis-TRI(platform),EricCousineau-TRI(platform)

@rpoyner-tri rpoyner-tri merged commit 8bb6da2 into RobotLocomotion:master Oct 25, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release notes: none This pull request should not be mentioned in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants