Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve PythonAsyncEngine error handling and Increase Tokio thread count #129

Merged
merged 14 commits into from
Feb 10, 2025

Conversation

ryanolson
Copy link
Contributor

@ryanolson ryanolson commented Feb 7, 2025

What does the PR do?

  • Improved error handling in the PythonAsyncEngine
  • Update the default Tokio Runtime to 16 async threads and 16 offload threads.

The default single threaded Tokio runtime was being starved, likely due to GIL contention in the python soak.py test.

When we grab the GIL, we are actually blocking our rust tokio async threads. We might consider scheduling GIL calls ot the offload threads meant for blocking calls.

We might also consider using separate single threaded tokio runtime to handle all task that will touch the GIL.

A third, and likely best option is to use static tokio mutex handle which need to be acquired before accessing the GIL, thus put a yielding tokio lock around the GIL.

The multi-threaded RT is sufficient now to unblock the soak test, but long-term strategies on GIL handling need to be considered.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

If the channel is closed, the send errors. We were not checking the return status and unwrapping them.

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@ryanolson ryanolson requested review from nnshah1 and rmccorm4 February 8, 2025 07:43
@ryanolson ryanolson marked this pull request as ready for review February 8, 2025 08:05
Comment on lines +121 to +122
max_worker_threads: 16,
max_blocking_threads: 16,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few notes/questions:

  1. Should these be env vars similar to the from_settings support for configurability? ex:
TRITON_RUNTIME_NUM_WORKER_THREADS=16
TRITON_RUNTIME_MAX_BLOCKING_THREADS=16
  1. It looks like there is already a native tokio env var for worker threads: TOKIO_WORKER_THREADS - but I'm fine with defining our own for consistency with other config settings

  2. From the docs - it looks like the default behavior, if left unset, is to use the num cpu cores rather than trying to apply a fixed number to various hardwares and scenarios. Do we want to define a fixed number here?

The default value is the number of cores available to the system.

  1. Docs seem to say that worker_threads will spawn/use that many threads - so I think it's more aptly named "num_worker_threads" than "max_worker_threads" (which implies it may not spawn/use them all). For max_blocking_threads, I think that name fits well based on the docs:

Unlike the worker_threads, they are not always active and will exit if left idle for too long. You can change this timeout duration with thread_keep_alive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do make these configurable -

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TRITON_RUNTIME_ envs are the correct envs. We can also use TOML files in specific paths to override compiled defaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could rename NUM_WORKER_THREADS to NUM_ASYNC_THREADS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #146 as a good intro task for someone to pick up - but feel free to commit it if you already have it done

@rmccorm4 rmccorm4 changed the title fix: improved error handling fix: Improve PythonAsyncEngine error handling and Increase Tokio thread count Feb 10, 2025
Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
@nnshah1
Copy link
Collaborator

nnshah1 commented Feb 10, 2025

@rmccorm4 - LGTM - good with your review -

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we run this via pytest - maybe nightly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so - we can test-ify and add the pytest.mark.nightly marker here - but I think the CI needs an update to run a nightly scheduled job @nv-anants

  • DLIS-7968
  • DLIS-7969

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i marked the rust soak.rs test as a feature, so by default it will not run.

cargo test --features integration

we'll want a mark for test that depend on etcd/nats

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #150

nnshah1
nnshah1 previously approved these changes Feb 10, 2025
Copy link
Collaborator

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM- defer to @rmccorm4 for final review

let msg = format!("critical error: failed to offload the python async generator to a new thread: {}", e);
log::error!(request_id, "{}", msg);
msg
}
Copy link
Contributor

@rmccorm4 rmccorm4 Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to match none of the cases here? if so, should we have a default case here? Or it's guaranteed to be a ResponseProcessingError and will compile-time fail if new errors are added but no corresponding match case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the latter. its good practice not to have a default, because the compiler will enforce that all possible arms of the match statement be defined.

if you add another arm in the enum, you will see a compiler error on the match, or any where in the code base this is used. in this case the enum is private, so it's static to the file for now.

@ryanolson ryanolson merged commit 99c126a into main Feb 10, 2025
6 checks passed
@ryanolson ryanolson deleted the ryan/250207-push-fixes branch February 10, 2025 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants