-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Windows Threading Issues #385
base: main
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #385 +/- ##
===================================
- Coverage 91% 91% -1%
===================================
Files 25 25
Lines 1783 1800 +17
===================================
+ Hits 1631 1635 +4
- Misses 152 165 +13 |
hi @FrsECM, thank you so much for the PR! Could you also add additional information about why setting |
Sure. By default, the number of worker is set to 1 and it crashes when we have multiples worker per devices on windows. |
I have to investigate a little more on another issue that cause [WinError 10022]. It seems that in some case (not completely reproductible), the socket is not ready to listen when we start uvicorn servers. |
@aniketmaurya did you have a look on the PR ? It seems to fix my problem on windows. It doesn't fix issue number #372 but it allows to make multiple worker on windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thanks for creating the fix 🚀
… into bugfix/windows_multiple_workers
for more information, see https://pre-commit.ci
@aniketmaurya i got an idea for #372. Unlike on Linux, i discovered that inference workers were stopped first instead of uvicorn's. A fix is to join on inference worker on windows and to cleanly give a signal to uvicorn in order to end threads properly. I renamed some variables to be a little more explicit about their content. |
Fix Type hint Co-authored-by: Aniket Maurya <[email protected]>
Remove windows comment. Co-authored-by: Aniket Maurya <[email protected]>
for more information, see https://pre-commit.ci
@FrsECM it seems like the Windows tests are stuck and have timed out. |
Hum. Curious. On my windows machine I don’t have issues.
I don’t know well the test stack.
I'll try to have a look tomorrow.
Envoyé à partir de Outlook pour iOS<https://aka.ms/o0ukef>
…________________________________
De : Aniket Maurya ***@***.***>
Envoyé : Thursday, December 12, 2024 8:09:16 PM
À : Lightning-AI/LitServe ***@***.***>
Cc : François Ponchon ***@***.***>; Mention ***@***.***>
Objet : Re: [Lightning-AI/LitServe] Fix Windows Threading Issues (PR #385)
@FrsECM<https://github.com/FrsECM> it seems like the Windows tests are stuck and have timed out.
—
Reply to this email directly, view it on GitHub<#385 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AGG5F7H2TWURUOQHJ6OVXRT2FHNNZAVCNFSM6AAAAABTBG2HB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMZZHAYDQMJZHA>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@aniketmaurya i did a first try and suspected httpx>=0.28.0. But i was not up to date. On python 3.10.15, i have no issues on my computer, every tests are running. (litserve) PS C:\BUSCODE\packages\LitServe> python -m pytest
C:\Users\F296849\AppData\Local\miniforge3\envs\litserve\lib\site-packages\pytest_asyncio\plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================================== test session starts ====================================================================
platform win32 -- Python 3.10.15, pytest-8.3.4, pluggy-1.5.0
rootdir: C:\BUSCODE\packages\LitServe
configfile: pytest.ini
plugins: anyio-4.6.2.post1, asyncio-0.25.0, cov-6.0.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 169 items
tests\e2e\test_e2e.py ............. [ 7%]
tests\test_auth.py .... [ 10%]
tests\test_batch.py ........... [ 16%]
tests\test_callbacks.py .... [ 18%]
tests\test_cli.py .. [ 20%]
tests\test_compression.py . [ 20%]
tests\test_connector.py s..ssssssss. [ 27%]
tests\test_docker_builder.py .. [ 28%]
tests\test_examples.py .......... [ 34%]
tests\test_form.py ... [ 36%]
tests\test_lit_server.py .........sssss..ss........... [ 53%]
tests\test_litapi.py .................. [ 64%]
tests\test_logger.py ........ [ 69%]
tests\test_logging.py ... [ 71%]
tests\test_loops.py ............. [ 78%]
tests\test_middlewares.py ... [ 80%]
tests\test_pydantic.py . [ 81%]
tests\test_readme.py s [ 81%]
tests\test_schema.py .. [ 82%]
tests\test_simple.py ....... [ 86%]
tests\test_specs.py .................. [ 97%]
tests\test_torch.py .s [ 98%]
tests\test_utils.py .. [100%]
================================================= 151 passed, 18 skipped, 50 warnings in 204.37s (0:03:24) ================================================== On python 3.11.11, i have no issues on my computer, every tests are running. (litserve3.11) PS C:\BUSCODE\packages\LitServe> python -m pytest
C:\Users\F296849\AppData\Local\miniforge3\envs\litserve3.11\Lib\site-packages\pytest_asyncio\plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================================== test session starts ====================================================================
platform win32 -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: C:\BUSCODE\packages\LitServe
configfile: pytest.ini
plugins: anyio-4.7.0, asyncio-0.25.0, cov-6.0.0, retry-1.6.3
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 169 items
tests\e2e\test_e2e.py ............. [ 7%]
tests\test_auth.py .... [ 10%]
tests\test_batch.py ........... [ 16%]
tests\test_callbacks.py .... [ 18%]
tests\test_cli.py .. [ 20%]
tests\test_compression.py . [ 20%]
tests\test_connector.py s..ssssssss. [ 27%]
tests\test_docker_builder.py .. [ 28%]
tests\test_examples.py .......... [ 34%]
tests\test_form.py ... [ 36%]
tests\test_lit_server.py .........sssss..ss........... [ 53%]
tests\test_litapi.py .................. [ 64%]
tests\test_logger.py ........ [ 69%]
tests\test_logging.py ... [ 71%]
tests\test_loops.py ............. [ 78%]
tests\test_middlewares.py ... [ 80%]
tests\test_pydantic.py . [ 81%]
tests\test_readme.py s [ 81%]
tests\test_schema.py .. [ 82%]
tests\test_simple.py ....... [ 86%]
tests\test_specs.py .................. [ 97%]
tests\test_torch.py .s [ 98%]
tests\test_utils.py .. [100%]
================================================= 151 passed, 18 skipped, 48 warnings in 205.77s (0:03:25) ================================================== |
@aniketmaurya I tried to increase the timeout. The usage of threading instead of processes and the absence of uvloop on windows makes it slower. It may also depend on the runner. |
@@ -48,7 +48,7 @@ jobs: | |||
pip list | |||
|
|||
- name: Tests | |||
timeout-minutes: 10 | |||
timeout-minutes: 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's switch it back after figuring out the reason CI is stuck since we don't want to run tests for 30 mins.
timeout-minutes: 30 | |
timeout-minutes: 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems CI is stuck since this commit 2bbed42
Maybe due to version 3.11 of python for a reason i ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, since it's only specific to Python 3.11 on Windows latest, it probably means that something is not working as expected.
hi @FrsECM, Happy New Year! How is it going here? Please let me know if you need any help? |
Hi !
Happy new year !
Unfortunately I’m unable to reproduce what happens on the stuck CICD locally.
And I don’t have ressource to do this locally. (I can only use docker through wsl (eg. linux) with my company’s computer.
For me the fix would be to increase the timeout since the issue was present before my proposal to fix multi worker problems.
I suspect a problem with virtualization since my computer is not a VM.
Envoyé à partir de Outlook pour iOS<https://aka.ms/o0ukef>
…________________________________
De : Aniket Maurya ***@***.***>
Envoyé : Monday, January 6, 2025 12:06:13 AM
À : Lightning-AI/LitServe ***@***.***>
Cc : François Ponchon ***@***.***>; Mention ***@***.***>
Objet : Re: [Lightning-AI/LitServe] Fix Windows Threading Issues (PR #385)
hi @FrsECM<https://github.com/FrsECM>, Happy New Year! How is it going here? Please let me know if you need any help?
—
Reply to this email directly, view it on GitHub<#385 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AGG5F7AL7MTOLQDFOMNAV4T2JG3GLAVCNFSM6AAAAABTBG2HB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNZRG44DCNZSHE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
logger.debug("Enable Windows explicit socket sharing...") | ||
# We make sure sockets is listening... | ||
# It prevents further [WinError 10022] | ||
[sock.listen(config.backlog) for sock in sockets] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[sock.listen(config.backlog) for sock in sockets] | |
for sock in sockets: | |
sock.listen(config.backlog) |
no need to create a list only to discard it afterwards :)
@@ -232,6 +235,7 @@ def __init__( | |||
self.model_metadata = model_metadata | |||
self._connector = _Connector(accelerator=accelerator, devices=devices) | |||
self._callback_runner = CallbackRunner(callbacks) | |||
self._uvicorn_servers = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's init this immediately as an empty list? no need to first have None
here and properly init later on. Also maybe type it?
Before submitting
As a user, i need to serve a model with multiple worker per device on a windows machine
What does this PR do?
This PR propose a fix to bug from Issue #384.
Uvicorn reference: Source
For example, with this simple server :
Bellow, you'll find the difference between before and after the PR :
Before the PR
After the PR
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃