fix: Improve PythonAsyncEngine error handling and Increase Tokio thread count #129

ryanolson · 2025-02-07T20:39:11Z

What does the PR do?

Improved error handling in the PythonAsyncEngine
Update the default Tokio Runtime to 16 async threads and 16 offload threads.

The default single-threaded Tokio runtime was being starved, likely due to GIL contention in the Python soak.py test.

When we grab the GIL, we are actually blocking our Rust Tokio async threads. We might consider scheduling GIL calls to the offload threads meant for blocking calls.

We might also consider using a separate single-threaded Tokio runtime to handle all tasks that will touch the GIL.

A third, and likely best, option is to use a static Tokio mutex that must be acquired before accessing the GIL, thus putting a yielding Tokio lock around the GIL.

The multi-threaded runtime is sufficient for now to unblock the soak test, but long-term strategies for GIL handling need to be considered.
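
To picture the second option above (one dedicated thread that owns all GIL-touching work), here is a minimal std-only sketch. It is not the code in this PR: `GilWorker` and `run` are hypothetical names, a plain OS thread stands in for the separate single-threaded Tokio runtime, and the closures stand in for calls that would acquire the GIL.

```rust
use std::sync::mpsc;
use std::thread;

// A job is any closure that can be shipped to the dedicated thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical handle to the single thread that runs every GIL-touching job.
pub struct GilWorker {
    jobs: mpsc::Sender<Job>,
}

impl GilWorker {
    /// Spawn the dedicated thread; it exits once every sender is dropped.
    pub fn spawn() -> Self {
        let (jobs, rx) = mpsc::channel::<Job>();
        thread::spawn(move || {
            for job in rx {
                job();
            }
        });
        GilWorker { jobs }
    }

    /// Run `f` on the dedicated thread and wait for its result.
    /// Returns `None` if the worker thread is gone.
    pub fn run<T, F>(&self, f: F) -> Option<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (reply_tx, reply_rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore a failed reply: the caller may have stopped waiting.
            let _ = reply_tx.send(f());
        });
        self.jobs.send(job).ok()?;
        reply_rx.recv().ok()
    }
}
```

In async code the caller would presumably await a oneshot-style reply instead of blocking on `recv`, so that the async worker threads stay free while the dedicated thread holds the GIL.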

Checklist

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

Related PRs:

Where should the reviewer start?

If the channel is closed, the send returns an error. We were not checking the return status; we were simply unwrapping the results.
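
To make the failure mode concrete, here is a minimal sketch that uses the standard library's `std::sync::mpsc` channel in place of whatever channel type the engine actually uses (an assumption for illustration). `forward_response` and `demo` are hypothetical helpers; the point is only that `send` returns a `Result` that can be handled instead of unwrapped.

```rust
use std::sync::mpsc::{self, Sender};

/// Hypothetical helper: forward one response, reporting a closed channel
/// as an error instead of panicking via `unwrap()`.
pub fn forward_response(tx: &Sender<String>, response: String) -> Result<(), String> {
    tx.send(response)
        // `SendError` hands back the value that could not be delivered.
        .map_err(|e| format!("response channel closed; dropping response: {:?}", e.0))
}

/// Exercise both outcomes: receiver alive, then receiver dropped.
pub fn demo() -> (Result<(), String>, Result<(), String>) {
    let (tx, rx) = mpsc::channel::<String>();
    let while_open = forward_response(&tx, "first".to_string());
    drop(rx); // the consumer goes away, which closes the channel
    let after_close = forward_response(&tx, "second".to_string());
    (while_open, after_close)
}
```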

Test plan:

CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

runtime/rust/python-wheel/tests/soak.py

rmccorm4 · 2025-02-10T01:17:55Z

runtime/rust/src/config.rs

+            max_worker_threads: 16,
+            max_blocking_threads: 16,


Few notes/questions:

Should these be env vars similar to the from_settings support for configurability? ex:

TRITON_RUNTIME_NUM_WORKER_THREADS=16 TRITON_RUNTIME_MAX_BLOCKING_THREADS=16

It looks like there is already a native tokio env var for worker threads: TOKIO_WORKER_THREADS - but I'm fine with defining our own for consistency with other config settings

From the docs, it looks like the default behavior, if left unset, is to use the number of CPU cores rather than applying one fixed number across different hardware and scenarios. Do we want to define a fixed number here?

The default value is the number of cores available to the system.

Docs seem to say that worker_threads will spawn/use that many threads - so I think it's more aptly named "num_worker_threads" than "max_worker_threads" (which implies it may not spawn/use them all). For max_blocking_threads, I think that name fits well based on the docs:

Unlike the worker_threads, they are not always active and will exit if left idle for too long. You can change this timeout duration with thread_keep_alive.

Let's make these configurable.

The TRITON_RUNTIME_ environment variables are the correct ones. We can also use TOML files in specific paths to override compiled defaults.

We could rename NUM_WORKER_THREADS to NUM_ASYNC_THREADS.
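
A rough sketch of what such an override could look like, using only the standard library. The variable names are the ones proposed earlier in this thread and may not be final, `RuntimeConfig` is a hypothetical stand-in for the struct in config.rs, and the TOML layer is left out.

```rust
use std::env;

/// Hypothetical mirror of the two fields shown in the diff above.
#[derive(Debug, PartialEq)]
pub struct RuntimeConfig {
    pub max_worker_threads: usize,
    pub max_blocking_threads: usize,
}

/// Parse an optional raw value, falling back to `default` when the value is
/// missing, not a number, or zero.
pub fn parse_threads(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(default)
}

impl RuntimeConfig {
    /// Read the variable names proposed in this thread (assumed, not final),
    /// keeping the compiled default of 16 when a variable is unset or invalid.
    pub fn from_env() -> Self {
        let worker = env::var("TRITON_RUNTIME_NUM_WORKER_THREADS").ok();
        let blocking = env::var("TRITON_RUNTIME_MAX_BLOCKING_THREADS").ok();
        RuntimeConfig {
            max_worker_threads: parse_threads(worker.as_deref(), 16),
            max_blocking_threads: parse_threads(blocking.as_deref(), 16),
        }
    }
}
```

Keeping the parsing in a pure function makes the fallback behavior easy to unit test without touching the process environment.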

Added #146 as a good intro task for someone to pick up - but feel free to commit it if you already have it done


nnshah1 · 2025-02-10T16:15:37Z

@rmccorm4 LGTM; I'm good with your review.

nnshah1 · 2025-02-10T16:17:57Z

runtime/rust/python-wheel/tests/soak.py

can we run this via pytest - maybe nightly?

I think so - we can test-ify and add the pytest.mark.nightly marker here - but I think the CI needs an update to run a nightly scheduled job @nv-anants

DLIS-7968

DLIS-7969

I gated the Rust soak.rs test behind a feature flag, so it does not run by default. To run it:

cargo test --features integration
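
For readers unfamiliar with feature-gated tests, the sketch below shows one common way to get this behavior. It assumes Cargo.toml declares `integration = []` under `[features]`; the module and test names are hypothetical, and the PR may gate soak.rs differently (for example with `required-features`).

```rust
/// Reports whether the crate was compiled with the `integration` feature,
/// i.e. via `cargo test --features integration`.
pub fn integration_enabled() -> bool {
    cfg!(feature = "integration")
}

// Compiled only for `cargo test --features integration`; a plain
// `cargo test` never builds or runs this module.
#[cfg(all(test, feature = "integration"))]
mod integration_tests {
    #[test]
    fn soak_against_live_services() {
        // Hypothetical body: would connect to running etcd and NATS instances.
        assert!(super::integration_enabled());
    }
}
```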

We'll want a marker for tests that depend on etcd/NATS.

nnshah1

LGTM- defer to @rmccorm4 for final review


rmccorm4 · 2025-02-10T20:41:05Z

runtime/rust/python-wheel/rust/engine.rs

+                                let msg = format!("critical error: failed to offload the python async generator to a new thread: {}", e);
+                                log::error!(request_id, "{}", msg);
+                                msg
+                            }


Is it possible to match none of the cases here? If so, should we have a default case? Or is it guaranteed to be a ResponseProcessingError, so that compilation fails if a new error is added without a corresponding match arm?

The latter. It's good practice not to have a default, because the compiler will then enforce that every possible arm of the match statement is defined.

If you add another variant to the enum, you will see a compiler error on the match, and anywhere else in the code base where it is matched. In this case the enum is private, so it's local to the file for now.
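
A small illustration of the point about exhaustive matching. The variants here are hypothetical, since the real `ResponseProcessingError` variants are not shown in this thread; only the second message is taken from the diff above.

```rust
/// Hypothetical error enum standing in for the private one in engine.rs.
#[derive(Debug)]
enum ResponseProcessingError {
    PythonException(String),
    OffloadError(String),
}

/// No `_ =>` arm: adding a variant to the enum makes this `match` fail to
/// compile until the new case is handled here.
fn describe(err: &ResponseProcessingError) -> String {
    match err {
        ResponseProcessingError::PythonException(e) => {
            format!("python exception raised in the async generator: {}", e)
        }
        ResponseProcessingError::OffloadError(e) => {
            format!(
                "critical error: failed to offload the python async generator to a new thread: {}",
                e
            )
        }
    }
}
```

With a wildcard arm, a newly added variant would silently fall into the default branch; without one, the compiler reports a non-exhaustive patterns error at every such `match`.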

handle errors in the error handler

b82de18

ryanolson temporarily deployed to GITLAB February 7, 2025 20:39 — with GitHub Actions Inactive

ryanolson marked this pull request as draft February 7, 2025 20:39

ryanolson temporarily deployed to GITLAB February 7, 2025 20:44 — with GitHub Actions Inactive

ryanolson added 3 commits February 7, 2025 23:32

adding soak test

2d6f1be

improving error handling and logging in the python async engine

6a858c3

make the default runtime be multi-threaded

93b17f7

ryanolson temporarily deployed to GITLAB February 8, 2025 07:34 — with GitHub Actions Inactive

ryanolson temporarily deployed to GITLAB February 8, 2025 07:35 — with GitHub Actions Inactive

typos/pre-commit

2cc08d7

ryanolson temporarily deployed to GITLAB February 8, 2025 07:43 — with GitHub Actions Inactive

ryanolson requested review from nnshah1 and rmccorm4 February 8, 2025 07:43

ryanolson temporarily deployed to GITLAB February 8, 2025 07:43 — with GitHub Actions Inactive

ryanolson marked this pull request as ready for review February 8, 2025 08:05

Merge branch 'main' into ryan/250207-push-fixes

fad2e4b

ryanolson temporarily deployed to GITLAB February 8, 2025 17:35 — with GitHub Actions Inactive

rmccorm4 reviewed Feb 10, 2025

runtime/rust/python-wheel/tests/soak.py (outdated, resolved)

rmccorm4 reviewed Feb 10, 2025

rmccorm4 changed the title ~~fix: improved error handling~~ fix: Improve PythonAsyncEngine error handling and Increase Tokio thread count Feb 10, 2025

Update runtime/rust/python-wheel/tests/soak.py

78dfb10

Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Ryan Olson <[email protected]>

ryanolson temporarily deployed to GITLAB February 10, 2025 16:08 — with GitHub Actions Inactive

ryanolson temporarily deployed to GITLAB February 10, 2025 16:09 — with GitHub Actions Inactive

nnshah1 reviewed Feb 10, 2025

nnshah1 previously approved these changes Feb 10, 2025

offloading gil access in the response path

fa9d73f

ryanolson dismissed nnshah1’s stale review via fa9d73f February 10, 2025 17:00

ryanolson temporarily deployed to GITLAB February 10, 2025 17:00 — with GitHub Actions Inactive

ryanolson temporarily deployed to GITLAB February 10, 2025 17:01 — with GitHub Actions Inactive

rmccorm4 mentioned this pull request Feb 10, 2025

Add environment variables for tokio runtime thread counts #146

Closed

Merge branch 'main' into ryan/250207-push-fixes

3b01fa0

rmccorm4 temporarily deployed to GITLAB February 10, 2025 17:24 — with GitHub Actions Inactive

rmccorm4 temporarily deployed to GITLAB February 10, 2025 17:25 — with GitHub Actions Inactive

ryanolson added 2 commits February 10, 2025 09:42

adding feature flag for integration tests which rely on etcd and nats; adding soak test

ca63e26

Merge branch 'ryan/250207-push-fixes' of github.com:triton-inference-server/triton_distributed into ryan/250207-push-fixes

abc3637

ryanolson temporarily deployed to GITLAB February 10, 2025 17:43 — with GitHub Actions Inactive

ryanolson temporarily deployed to GITLAB February 10, 2025 17:44 — with GitHub Actions Inactive

ryanolson self-assigned this Feb 10, 2025

ryanolson requested review from rmccorm4 and nnshah1 February 10, 2025 20:00

Merge branch 'main' into ryan/250207-push-fixes

f32d463

rmccorm4 temporarily deployed to GITLAB February 10, 2025 20:22 — with GitHub Actions Inactive

rmccorm4 temporarily deployed to GITLAB February 10, 2025 20:23 — with GitHub Actions Inactive

Merge branch 'main' into ryan/250207-push-fixes

ec67830

rmccorm4 temporarily deployed to GITLAB February 10, 2025 20:34 — with GitHub Actions Inactive

rmccorm4 temporarily deployed to GITLAB February 10, 2025 20:35 — with GitHub Actions Inactive

This was referenced Feb 10, 2025

Add rust integration tests (ex: with etcd/nats) #150

Open

Add rust integration (ex: etcd/nats) tests #151

Closed

rmccorm4 reviewed Feb 10, 2025

rmccorm4 approved these changes Feb 10, 2025

Merge branch 'main' into ryan/250207-push-fixes

106300e

rmccorm4 temporarily deployed to GITLAB February 10, 2025 21:34 — with GitHub Actions Inactive

rmccorm4 temporarily deployed to GITLAB February 10, 2025 21:36 — with GitHub Actions Inactive

ryanolson merged commit 99c126a into main Feb 10, 2025
6 checks passed

ryanolson deleted the ryan/250207-push-fixes branch February 10, 2025 21:55
