If a node is terminated improperly the registered_collections is not initialized for the node replacing it when using --dist loadgroup #1189

Open
orlp opened this issue Mar 21, 2025 · 0 comments

orlp commented Mar 21, 2025

I don't have a minimal reproducer yet, but when a test terminates its worker improperly (e.g. by calling sys.exit(1)), the scheduler looks up the replacement worker in registered_collections even though that worker has no entry there. For example, running pytest -n 2 --dist loadgroup produces the following:

....................................................................................... [ 25%]
....................................................................................... [ 25%]
....................................................................................... [ 26%]
...............................................................................[gw0] node down: Not properly terminated
F
replacing crashed worker gw0
collecting: 2/3 workers[gw1] node down: Not properly terminated
attempted to index with <WorkerController gw2>

replacing crashed worker gw1
collecting: 3/4 workersattempted to index with <WorkerController gw3>
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/_pytest/main.py", line 283, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>                          ^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/_pytest/main.py", line 337, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 139, in _multicall
INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 122, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/_pytest/logging.py", line 805, in pytest_runtestloop
INTERNALERROR>     return (yield)  # Run all the tests.
INTERNALERROR>             ^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 122, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/_pytest/terminal.py", line 673, in pytest_runtestloop
INTERNALERROR>     result = yield
INTERNALERROR>              ^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>           ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
INTERNALERROR>     self.loop_once()
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/xdist/dsession.py", line 163, in loop_once
INTERNALERROR>     call(**kwargs)
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/xdist/dsession.py", line 306, in worker_collectionfinish
INTERNALERROR>     self.sched.schedule()
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/xdist/scheduler/loadscope.py", line 359, in schedule
INTERNALERROR>     self._reschedule(node)
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/xdist/scheduler/loadscope.py", line 341, in _reschedule
INTERNALERROR>     self._assign_work_unit(node)
INTERNALERROR>   File "/Users/orlp/programming/rust/polars/.venv/lib/python3.11/site-packages/xdist/scheduler/loadscope.py", line 276, in _assign_work_unit
INTERNALERROR>     worker_collection = self.registered_collections[node]
INTERNALERROR>                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
INTERNALERROR> KeyError: <WorkerController gw3>

Note the following lines:

attempted to index with <WorkerController gw2>
attempted to index with <WorkerController gw3>

These are the replacement workers for the crashed workers gw0 and gw1. These lines were printed by the following debug statement I added in loadscope.py:

        # Ask the node to execute the workload
        try:
            worker_collection = self.registered_collections[node]
        except:
            print("attempted to index with", node)
            raise
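
The traceback suggests a lookup on a node that has been added but has not yet registered its collection. Here is a toy model of that situation. It is a sketch under that assumption and is not xdist's code: `ToyScheduler`, `known_nodes`, `register_collection`, `assign_unguarded` and `assign_guarded` are made-up names, and only `registered_collections` comes from the traceback. The guarded variant only shows the shape of a membership check. It is not a proposed fix.

```python
class ToyScheduler:
    """Toy stand-in for a scheduler. NOT xdist's implementation."""

    def __init__(self):
        self.known_nodes = []             # nodes the scheduler iterates over
        self.registered_collections = {}  # node -> test ids it has collected

    def add_node(self, node):
        # A node becomes known as soon as it is started.
        self.known_nodes.append(node)

    def register_collection(self, node, test_ids):
        # A node only gets an entry here once its collection has finished.
        self.registered_collections[node] = list(test_ids)

    def assign_unguarded(self):
        # Indexes the dict for every known node; raises KeyError for a node
        # that has not registered a collection yet.
        return {n: self.registered_collections[n] for n in self.known_nodes}

    def assign_guarded(self):
        # Skips nodes that have not registered a collection yet.
        return {
            n: self.registered_collections[n]
            for n in self.known_nodes
            if n in self.registered_collections
        }


def demo():
    sched = ToyScheduler()
    sched.add_node("gw1")
    sched.register_collection("gw1", ["test_a", "test_b"])
    sched.add_node("gw3")  # hypothetical replacement worker, still collecting

    try:
        sched.assign_unguarded()
    except KeyError as exc:
        missing = exc.args[0]
    else:
        missing = None

    return missing, sched.assign_guarded()
```

Calling `demo()` returns the missing key from the unguarded lookup together with the result of the guarded one, which only contains the node that finished collecting.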

Turning off --dist loadgroup fixes the issue. I've reproduced with pytest==8.3.5.
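
A possible starting point for a reproducer is sketched below. It has not been confirmed to trigger the error. The script writes a hypothetical test module in which two grouped tests kill their worker. The file name, test names, group names and the number of passing tests are all made up, and `os._exit(1)` is an assumption, chosen instead of `sys.exit(1)` because it ends the process immediately without any cleanup.

```python
import pathlib
import tempfile
import textwrap

# Hypothetical test module: many passing tests plus two grouped tests that
# kill their worker process. All names and counts here are illustrative.
REPRO_SOURCE = textwrap.dedent(
    '''
    import os

    import pytest


    @pytest.mark.parametrize("i", range(300))
    def test_passes(i):
        pass


    @pytest.mark.xdist_group(name="crash_a")
    def test_crash_a():
        os._exit(1)  # end the worker process abruptly, skipping cleanup


    @pytest.mark.xdist_group(name="crash_b")
    def test_crash_b():
        os._exit(1)  # second crash, so that a second worker gets replaced
    '''
).lstrip()


def write_repro(directory):
    """Write the candidate reproducer into `directory` and return its path."""
    path = pathlib.Path(directory) / "test_repro.py"
    path.write_text(REPRO_SOURCE)
    return path


if __name__ == "__main__":
    target = write_repro(tempfile.mkdtemp())
    print(f"wrote {target}")
    print(f"try: pytest -n 2 --dist loadgroup {target}")
```

Running the generated file with and without --dist loadgroup would show whether the scheduler mode makes a difference for this candidate.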
