Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix grcp hangs and upgrade grpcio to 1.62.2 #909

Open
wants to merge 28 commits into
base: feature/pdeathsigger
Choose a base branch
from

Conversation

slicklash
Copy link
Contributor

@slicklash slicklash commented Jun 10, 2024

Problem

  • grcp hangs (broken threading control) when multiple CRI clients are trying to query newly created container with (thd.cc:160 pthread_create failed: Invalid argument source )

Solution

use single CRI client

@slicklash slicklash force-pushed the grpcio-1-62-2 branch 4 times, most recently from 2faadf2 to a267901 Compare June 11, 2024 12:31
@slicklash slicklash force-pushed the grpcio-1-62-2 branch 11 times, most recently from a1806e9 to a411db4 Compare June 17, 2024 08:24
@slicklash slicklash marked this pull request as ready for review June 17, 2024 15:43
@slicklash slicklash requested review from Jongy June 17, 2024 15:44
@Jongy
Copy link
Contributor

Jongy commented Sep 4, 2024

@slicklash please ping @d3dave and me when it's ready for re-review?

@slicklash
Copy link
Contributor Author

@slicklash please ping @d3dave and me when it's ready for re-review?

Sure. Introduction of pdeathsigger has broken many things. I am fixing them one by one.

@slicklash slicklash force-pushed the grpcio-1-62-2 branch 2 times, most recently from c643b91 to 664141c Compare September 4, 2024 16:08
@slicklash
Copy link
Contributor Author

@Jongy @d3dave ready

@@ -144,6 +145,7 @@ def test_perf_comm_change(
I'm not sure it can be done, i.e is this info even kept anywhere).
"""
with system_profiler as profiler:
time.sleep(2)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the sleeps necessary? Is there a way to make the test deterministic instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are very flaky. The conditions for them to pass are:

  • build container
  • run native app
  • app must change comm before 1st profiling

To make this deterministic we need signaling from the app that it actually changed comm. It will add more complexity. Code base has very low commit frequency and is not worth the trouble.

Copy link
Contributor

@d3dave d3dave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open a separate PR for preexec_fn and pdeathsigger change as it's independent of grpcio version we are using. Let's keep the CRI client change along with the grpcio version PR.

@slicklash slicklash changed the base branch from master to feature/pdeathsigger October 8, 2024 22:00
@slicklash slicklash changed the title grpcio 1.62.2 fix grcp hangs and upgrade grpcio to 1.62.2 Oct 8, 2024
@slicklash
Copy link
Contributor Author

Please open a separate PR for preexec_fn and pdeathsigger change as it's independent of grpcio version we are using. Let's keep the CRI client change along with the grpcio version PR.

moved to #931

@Jongy
Copy link
Contributor

Jongy commented Oct 10, 2024

@slicklash I'm unsure what this PR solves now. gProfiler already creates only a single ContainersClient object, only during tests we create more than one (and in diagnostics.py mode).

Inside the single ContainersClient, we do create several CriClients but only upon startup (as we try v1 and v1alpha2 and check containerd and crio).

@slicklash
Copy link
Contributor Author

@slicklash I'm unsure what this PR solves now. gProfiler already creates only a single ContainersClient object, only during tests we create more than one (and in diagnostics.py mode).

Inside the single ContainersClient, we do create several CriClients but only upon startup (as we try v1 and v1alpha2 and check containerd and crio).

Tests show the problem with implementation of ContainerNamesClient itself. There should be only one live CRI instance in multi-threaded context otherwise gRPC hangs. Tests are part of code base and need to be maintained with same care as production code.

@Jongy
Copy link
Contributor

Jongy commented Oct 13, 2024

Tests show the problem with implementation of ContainerNamesClient itself. There should be only one live CRI instance in multi-threaded context otherwise gRPC hangs. Tests are part of code base and need to be maintained with same care as production code.

I see, so this fixes the ContainerNamesClient code and effectively fixes the tests.

Comment on lines 33 to 36
global _containers_client
try:
self._containers_client: Optional[ContainersClient] = ContainersClient()
logger.info(f"Discovered container runtimes: {self._containers_client.get_runtimes()}")
if _containers_client is None:
_containers_client = ContainersClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extract this to a method get_containers_client which will be locked (to avoid any concurrency issues) and maintain the cache.

Then, ContainerNamesClient runs self._containers_client = get_containers_client().

Copy link
Contributor

@Jongy Jongy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@slicklash slicklash requested a review from Jongy October 24, 2024 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants