Network failures with self-hosted servers #7512

Closed
zanieb opened this issue Nov 11, 2022 · 28 comments
Labels: bug (Something isn't working), upstream dependency (An upstream issue caused by a bug in one of our dependencies)

Comments

@zanieb
Contributor

zanieb commented Nov 11, 2022

This is a tracking issue for various reports of network failures when self-hosting Prefect Orion.

Notably, these reports seem to be concentrated in Prefect 2.6.6.

If adding a report to this issue, please include the following information:

  • If using Prefect's official Docker images for the client or server, provide the image tags
  • On the server, we are interested in the Prefect version, the database, and the server library versions:
      prefect version
      pip freeze | grep -E '(uvicorn|starlette)'
  • On the client, we are interested in the Prefect version and the client HTTP library versions:
      prefect version
      pip freeze | grep -E '(httpx|httpcore)'
  • Please include the full traceback for the error
  • Check for any related error logs on the server
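
For anyone who cannot easily run pip inside their container, the version information requested above could also be gathered from Python. This is only a convenience sketch: the helper name collect_versions is hypothetical, and the package list simply mirrors the libraries named in the list above.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("prefect", "uvicorn", "starlette", "httpx", "httpcore")):
    """Return a mapping of package name to installed version (None if not installed)."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Run it once on the server and once on the client, since the two environments often differ.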

Related to:

@zanieb zanieb added the bug Something isn't working label Nov 11, 2022
@carlo-catalyst

carlo-catalyst commented Nov 11, 2022

Note

  • We observed this with flows calling run_deployment
  • Looks like our agents were inadvertently running on 2.6.4, but the server was on 2.6.6.

Server

  • We run with Docker using this image: prefecthq/prefect:2.6.6-python3.9
  • Using Postgres

Agent

  • Our own custom image, with dependencies installed
  • Using that image, pulled these:

# pip freeze | grep -E '(httpx|httpcore)'
httpcore==0.15.0
httpx==0.23.0
# prefect version
Version:             2.6.4
API version:         0.8.2
Python version:      3.9.14
Git commit:          51e92dda
Built:               Thu, Oct 20, 2022 3:11 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.34.1

Stack Trace

run_deployment(name=run_deployment_name(flow=**_flow, env=env), parameters=parameters)
 File "/usr/local/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 197, in coroutine_wrapper
   return run_async_from_worker_thread(async_fn, *args, **kwargs)
 File "/usr/local/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 148, in run_async_from_worker_thread
   return anyio.from_thread.run(call)
 File "/usr/local/lib/python3.9/site-packages/anyio/from_thread.py", line 49, in run
   return asynclib.run_async_from_thread(func, *args)
 File "/usr/local/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
   return f.result()
 File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 446, in result
   return self.__get_result()
 File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
   raise self._exception
 File "/usr/local/lib/python3.9/site-packages/prefect/client/utilities.py", line 47, in with_injected_client
   return await fn(*args, **kwargs)
 File "/usr/local/lib/python3.9/site-packages/prefect/deployments.py", line 127, in run_deployment
   flow_run = await client.read_flow_run(flow_run_id)
 File "/usr/local/lib/python3.9/site-packages/prefect/client/orion.py", line 1439, in read_flow_run
   response = await self._client.get(f"/flow_runs/{flow_run_id}")
 File "/usr/local/lib/python3.9/site-packages/httpx/_client.py", line 1751, in get
   return await self.request(
 File "/usr/local/lib/python3.9/site-packages/httpx/_client.py", line 1527, in request
   return await self.send(request, auth=auth, follow_redirects=follow_redirects)
 File "/usr/local/lib/python3.9/site-packages/prefect/client/base.py", line 159, in send
   await super().send(*args, **kwargs)
 File "/usr/local/lib/python3.9/site-packages/httpx/_client.py", line 1614, in send
   response = await self._send_handling_auth(
 File "/usr/local/lib/python3.9/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
   response = await self._send_handling_redirects(
 File "/usr/local/lib/python3.9/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
   response = await self._send_single_request(request)
 File "/usr/local/lib/python3.9/site-packages/httpx/_client.py", line 1716, in _send_single_request
   response = await transport.handle_async_request(request)
 File "/usr/local/lib/python3.9/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
   resp = await self._pool.handle_async_request(req)
 File "/usr/local/lib/python3.9/contextlib.py", line 137, in __exit__
   self.gen.throw(typ, value, traceback)
 File "/usr/local/lib/python3.9/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
   raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

@zanieb
Contributor Author

zanieb commented Nov 12, 2022

Possibly helpful context from an httpx discussion at encode/httpx#2056

@ikeepo

ikeepo commented Nov 14, 2022

Note
We observed this while using an agent to connect to a remote self-hosted server.
Server:
#prefect

Version:             2.6.5
API version:         0.8.3
Python version:      3.8.3
Git commit:          9fc2658f
Built:               Thu, Oct 27, 2022 2:24 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         hosted

#pip freeze | grep -E '(uvicorn|starlette)'

starlette==0.19.1
uvicorn==0.17.6

Client
#prefect

Version:             2.6.5
API version:         0.8.3
Python version:      3.8.10
Git commit:          9fc2658f
Built:               Thu, Oct 27, 2022 2:24 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         hosted

#pip freeze | grep -E '(httpx|httpcore)'

httpcore==0.15.0
httpx==0.23.0

Stack Trace:

Encountered exception during execution:
Traceback (most recent call last):
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpcore/_async/connection_pool.py", line 253, in handle_async_request
    raise exc
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpcore/_async/connection_pool.py", line 237, in handle_async_request
    response = await connection.handle_async_request(request)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpcore/_async/connection.py", line 90, in handle_async_request
    return await self._connection.handle_async_request(request)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpcore/_async/http11.py", line 105, in handle_async_request
    raise exc
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpcore/_async/http11.py", line 84, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpcore/_async/http11.py", line 148, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpcore/_async/http11.py", line 191, in _receive_event
    raise RemoteProtocolError(msg)
httpcore.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/engine.py", line 580, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 68, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/wind/miniconda3/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/wind/miniconda3/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/tmp/tmpd9i_ocaqprefect/src/prefect_deployment/win_server/__init__.py", line 255, in orche_oxen_settle_fields_to_xms
    res_rsync = run_deployment(name=f"manual-handler/oxen_settlefields_{file_type}", flow_run_name=f"{func_name}_2_2")
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 197, in coroutine_wrapper
    return run_async_from_worker_thread(async_fn, *args, **kwargs)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 148, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/home/wind/miniconda3/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/home/wind/miniconda3/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/client/utilities.py", line 47, in with_injected_client
    return await fn(*args, **kwargs)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/deployments.py", line 127, in run_deployment
    flow_run = await client.read_flow_run(flow_run_id)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/client/orion.py", line 1432, in read_flow_run
    response = await self._client.get(f"/flow_runs/{flow_run_id}")
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_client.py", line 1751, in get
    return await self.request(
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/prefect/client/base.py", line 159, in send
    await super().send(*args, **kwargs)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_client.py", line 1614, in send
    response = await self._send_handling_auth(
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_client.py", line 1716, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/home/wind/miniconda3/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/wind/miniconda3/lib/python3.8/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

@zanieb
Contributor Author

zanieb commented Nov 14, 2022

Thanks for the report! We currently suspect that this is a bug in uvicorn. We'd like to find the root cause, but may explore client-side solutions in the interim.

Note for other maintainers: This is a confirmed reproduction on 2.6.5.

@ikeepo does this reproduce on the latest versions of uvicorn/starlette? We've also heard users report that this does not occur on Prefect 2.6.4. Do you see it there?
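
To illustrate what a client-side mitigation could look like, here is a minimal retry sketch. It is an assumption for illustration only, not something proposed or shipped in this thread: the helper name retry_async and its parameters are hypothetical, and the backoff values are arbitrary.

```python
import asyncio


async def retry_async(call, retry_on, attempts=3, base_delay=0.5):
    """Await call() and retry on the given exception types with exponential backoff.

    call     -- zero-argument callable returning an awaitable
    retry_on -- exception type or tuple of types treated as transient
    """
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: surface the original error to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

With httpx, the transient errors reported in this issue would correspond to passing retry_on=(httpx.RemoteProtocolError, httpx.ReadError). Retrying is only safe for idempotent requests such as the GET in read_flow_run; blindly retrying a POST could create duplicate work.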

@eudyptula
Contributor

We have seen it on Prefect 2.6.3.

Server

Version:             2.6.3
API version:         0.8.2
Python version:      3.10.6
Git commit:          9e7da96e
Built:               Tue, Oct 18, 2022 1:55 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         <client error>
starlette==0.20.4
uvicorn==0.18.3

Agent

We are running a custom-built version, with our two pull requests:

I doubt they have anything to do with it, though, especially since other people are seeing the same thing.

Version:             2.6.3+3
API version:         0.8.2
Python version:      3.10.6
Git commit:          e7fd68e7
Built:               Tue, Nov 1, 2022 10:29 AM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.37.2
httpcore==0.15.0
httpx==0.23.0

Docker container

Custom built from prefecthq/prefect:2.6.3-python3.10.

Version:             2.6.3
API version:         0.8.2
Python version:      3.10.8
Git commit:          9e7da96e
Built:               Tue, Oct 18, 2022 1:55 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.34.1
httpcore==0.15.0
httpx==0.23.0

Stacktrace

Crash details:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 237, in handle_async_request
    response = await connection.handle_async_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection.py", line 88, in handle_async_request
    raise ConnectionNotAvailable()
httpcore.ConnectionNotAvailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1332, in report_task_run_crashes
    yield
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1068, in begin_task_run
    connect_error = await client.api_healthcheck()
  File "/usr/local/lib/python3.10/site-packages/prefect/client/orion.py", line 183, in api_healthcheck
    await self._client.get("/health")
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1751, in get
    return await self.request(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 159, in send
    await super().send(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1614, in send
    response = await self._send_handling_auth(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1716, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 246, in handle_async_request
    async with self._pool_lock:
  File "/usr/local/lib/python3.10/site-packages/anyio/_core/_synchronization.py", line 130, in acquire
    await event.wait()
  File "/usr/local/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError

Hope it helps.

@ikeepo

ikeepo commented Nov 14, 2022

@madkinsz Upgrading to the latest versions of uvicorn/starlette doesn't help; it still reproduces, and even more frequently.
I have never used 2.6.4 before; I'll try it today.

@MuFaheemkhan

MuFaheemkhan commented Nov 15, 2022

  • prefecthq/prefect:2.6.6-python3.10
  • Local server in Docker containers
  • Self-hosted Postgres server in a Docker container
  • prefect-dask = "0.2.1"
  • Dask task runner

Errors:
BrokenPipeError: [Errno 32] Broken pipe
ConnectionResetError: [Errno 104] Connection reset by peer

Even without running any flows, I get the above errors after some time. When the error appears while a flow is running, it crashes the flow.

Further details in this thread https://prefect-community.slack.com/archives/CL09KU1K7/p1667880546594239

Connection Reset error (agent logs)

prefect-agent_1   | 23:07:49.438 | DEBUG   | prefect.agent - Checking for flow runs...
prefect-agent_1   | 23:07:49.447 | ERROR   | prefect.agent -
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 854, in _read_ready__data_received
prefect-agent_1   |     data = self._sock.recv(self.max_size)
prefect-agent_1   | ConnectionResetError: [Errno 104] Connection reset by peer
prefect-agent_1   |
prefect-agent_1   | The above exception was the direct cause of the following exception:
prefect-agent_1   |
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
prefect-agent_1   |     yield
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/backends/asyncio.py", line 33, in read
prefect-agent_1   |     return await self._stream.receive(max_bytes=max_bytes)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1274, in receive
prefect-agent_1   |     raise self._protocol.exception
prefect-agent_1   | anyio.BrokenResourceError
prefect-agent_1   |
prefect-agent_1   | During handling of the above exception, another exception occurred:
prefect-agent_1   |
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
prefect-agent_1   |     yield
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
prefect-agent_1   |     resp = await self._pool.handle_async_request(req)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 253, in handle_async_request
prefect-agent_1   |     raise exc
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 237, in handle_async_request
prefect-agent_1   |     response = await connection.handle_async_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection.py", line 90, in handle_async_request
prefect-agent_1   |     return await self._connection.handle_async_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 105, in handle_async_request
prefect-agent_1   |     raise exc
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 84, in handle_async_request
prefect-agent_1   |     ) = await self._receive_response_headers(**kwargs)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 148, in _receive_response_headers
prefect-agent_1   |     event = await self._receive_event(timeout=timeout)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 177, in _receive_event
prefect-agent_1   |     data = await self._network_stream.read(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/backends/asyncio.py", line 30, in read
prefect-agent_1   |     with map_exceptions(exc_map):
prefect-agent_1   |   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
prefect-agent_1   |     self.gen.throw(typ, value, traceback)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
prefect-agent_1   |     raise to_exc(exc)
prefect-agent_1   | httpcore.ReadError
prefect-agent_1   |
prefect-agent_1   | The above exception was the direct cause of the following exception:
prefect-agent_1   |
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 154, in get_and_submit_flow_runs
prefect-agent_1   |     queue_runs = await self.client.get_runs_in_work_queue(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/prefect/client/orion.py", line 759, in get_runs_in_work_queue
prefect-agent_1   |     response = await self._client.post(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1842, in post
prefect-agent_1   |     return await self.request(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1527, in request
prefect-agent_1   |     return await self.send(request, auth=auth, follow_redirects=follow_redirects)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 159, in send
prefect-agent_1   |     await super().send(*args, **kwargs)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1614, in send
prefect-agent_1   |     response = await self._send_handling_auth(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
prefect-agent_1   |     response = await self._send_handling_redirects(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
prefect-agent_1   |     response = await self._send_single_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1716, in _send_single_request
prefect-agent_1   |     response = await transport.handle_async_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
prefect-agent_1   |     with map_httpcore_exceptions():
prefect-agent_1   |   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
prefect-agent_1   |     self.gen.throw(typ, value, traceback)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
prefect-agent_1   |     raise mapped_exc(message) from exc
prefect-agent_1   | httpx.ReadError
prefect-agent_1   | 23:07:54.465 | DEBUG   | prefect.agent - Checking for flow runs...

BROKEN PIPE ERROR(AGENT LOGS)

prefect-agent_1   | 23:07:49.438 | DEBUG   | prefect.agent - Checking for flow runs...
prefect-agent_1   | 23:07:49.447 | ERROR   | prefect.agent -
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 854, in _read_ready__data_received
prefect-agent_1   |     data = self._sock.recv(self.max_size)
prefect-agent_1   | ConnectionResetError: [Errno 104] Connection reset by peer
prefect-agent_1   |
prefect-agent_1   | The above exception was the direct cause of the following exception:
prefect-agent_1   |
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
prefect-agent_1   |     yield
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/backends/asyncio.py", line 33, in read
prefect-agent_1   |     return await self._stream.receive(max_bytes=max_bytes)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1274, in receive
prefect-agent_1   |     raise self._protocol.exception
prefect-agent_1   | anyio.BrokenResourceError
prefect-agent_1   |
prefect-agent_1   | During handling of the above exception, another exception occurred:
prefect-agent_1   |
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
prefect-agent_1   |     yield
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
prefect-agent_1   |     resp = await self._pool.handle_async_request(req)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 253, in handle_async_request
prefect-agent_1   |     raise exc
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 237, in handle_async_request
prefect-agent_1   |     response = await connection.handle_async_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection.py", line 90, in handle_async_request
prefect-agent_1   |     return await self._connection.handle_async_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 105, in handle_async_request
prefect-agent_1   |     raise exc
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 84, in handle_async_request
prefect-agent_1   |     ) = await self._receive_response_headers(**kwargs)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 148, in _receive_response_headers
prefect-agent_1   |     event = await self._receive_event(timeout=timeout)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 177, in _receive_event
prefect-agent_1   |     data = await self._network_stream.read(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/backends/asyncio.py", line 30, in read
prefect-agent_1   |     with map_exceptions(exc_map):
prefect-agent_1   |   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
prefect-agent_1   |     self.gen.throw(typ, value, traceback)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
prefect-agent_1   |     raise to_exc(exc)
prefect-agent_1   | httpcore.ReadError
prefect-agent_1   |
prefect-agent_1   | The above exception was the direct cause of the following exception:
prefect-agent_1   |
prefect-agent_1   | Traceback (most recent call last):
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 154, in get_and_submit_flow_runs
prefect-agent_1   |     queue_runs = await self.client.get_runs_in_work_queue(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/prefect/client/orion.py", line 759, in get_runs_in_work_queue
prefect-agent_1   |     response = await self._client.post(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1842, in post
prefect-agent_1   |     return await self.request(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1527, in request
prefect-agent_1   |     return await self.send(request, auth=auth, follow_redirects=follow_redirects)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 159, in send
prefect-agent_1   |     await super().send(*args, **kwargs)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1614, in send
prefect-agent_1   |     response = await self._send_handling_auth(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
prefect-agent_1   |     response = await self._send_handling_redirects(
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
prefect-agent_1   |     response = await self._send_single_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1716, in _send_single_request
prefect-agent_1   |     response = await transport.handle_async_request(request)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
prefect-agent_1   |     with map_httpcore_exceptions():
prefect-agent_1   |   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
prefect-agent_1   |     self.gen.throw(typ, value, traceback)
prefect-agent_1   |   File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
prefect-agent_1   |     raise mapped_exc(message) from exc
prefect-agent_1   | httpx.ReadError
prefect-agent_1   | 23:07:54.465 | DEBUG   | prefect.agent - Checking for flow runs...

@zanieb
Contributor Author

zanieb commented Nov 15, 2022

@MuFaheemkhan could you please edit your post to contain the full Docker image tag if you are using one of our official images or include the versions of the libraries as requested? A full traceback for the error would also be really helpful. Thanks!

@eudyptula

This comment was marked as off-topic.

@MuFaheemkhan

@eudyptula what's your Prefect version? That looks like a different issue. I experienced the same errors as you, but they were fixed after I upgraded to Prefect 2.6.6.
With 2.6.6 we are getting the following errors:
BrokenPipeError: [Errno 32] Broken pipe
ConnectionResetError: [Errno 104] Connection reset by peer

@zanieb
Contributor Author

zanieb commented Nov 16, 2022

@eudyptula as noted by MuFaheemkhan, that is not a connection / network failure but rather a 500 error from the server. Please open a separate issue for that and include the relevant versions. There should be server logs with the error if you are hosting your own server. If you're using Cloud, we will find the server error.

"It's a huge issue that prefect leaves docker containers up and running with crashed flows - eats up resources that are never released"

This should not be the case. When the flow run crashes, the process should exit. If you can reproduce this, please open an issue so we can address it.

@eudyptula
Contributor

eudyptula commented Nov 17, 2022

Well, in that case we have several issues.

We were on 2.6.3 and just upgraded to 2.6.4, as @madkinsz said that some users didn't experience the issue on that version. At the same time I upgraded all of Prefect's dependencies on both the agents and the server, but not in the Docker containers.

I was going to see whether 2.6.4 ran better before upgrading all the way to the newest version (and hopefully provide some helpful information to you at the same time). But thanks, @MuFaheemkhan, I will definitely check out 2.6.6 and see if it helps!

@voidel

voidel commented Nov 17, 2022

Getting this on 2.6.7. There is no information in the traceback on why the connection failed.
Can provide additional context if required but mostly echoes what others have said.
I am using run_deployment and the error sporadically occurs as the subflow is running.
Error messages are:
httpx.RemoteProtocolError: Server disconnected without sending a response
and at other times
httpx.ReadError caused by ConnectionResetError: [Errno 104] Connection reset by peer.

To confirm, I did not have these issues on 2.6.4.

@andreas-ntonas

andreas-ntonas commented Nov 17, 2022

I am getting the same errors.
I am running a custom Prefect 2.6.5 version (same for server and clients) with the agent-limit feature from this PR (but I do not think this is related). The logs are the same as what others mention like:
ConnectionResetError: [Errno 104] Connection reset by peer
and
anyio.BrokenResourceError
and
httpcore.ReadError, httpx.ReadError

It happens when trying to call run_deployment from inside a task in the parent flow

This issue makes Prefect 2 pretty unstable for production environments

@peytonrunyan peytonrunyan self-assigned this Nov 17, 2022
@andreas-ntonas

andreas-ntonas commented Nov 17, 2022

In my case I managed to stop getting these network-related errors by not using the run_deployment function (with a None timeout) inside tasks submitted to task runners. Instead, I created my own custom task that calls the Orion API directly through Orion's client, using create_flow_run_from_deployment to start runs from deployments and read_flow_run to regularly poll the state of the new flow run.

I make sure to create a new client instance instead of reusing the one from the flow context, in contrast to what run_deployment does when it uses @inject_client.

Somewhere there I believe lies the root cause of the issue: Using run_deployment in tasks submitted to a task_runner in parallel that also use orion's client from flow's context to poll the server and wait for the run to finish.
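
For readers who want to try the workaround described above, here is a minimal sketch of the pattern: open a fresh client for each call, create the run, then poll until it reaches a final state. `StubClient`, `FlowRun`, `FINAL_STATES`, and `run_and_wait` are hypothetical names made up for illustration. Only the method names `create_flow_run_from_deployment` and `read_flow_run` come from the comment, and their real signatures in Prefect may differ.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FlowRun:
    id: str
    state: str


class StubClient:
    """Hypothetical stand-in for an Orion client; not Prefect's real class."""

    def __init__(self, polls_until_done: int = 3):
        self._remaining = polls_until_done

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc_info):
        return False

    async def create_flow_run_from_deployment(self, deployment_id: str) -> FlowRun:
        return FlowRun(id=f"run-for-{deployment_id}", state="SCHEDULED")

    async def read_flow_run(self, flow_run_id: str) -> FlowRun:
        # Pretend the run finishes after a few polls.
        self._remaining -= 1
        state = "COMPLETED" if self._remaining <= 0 else "RUNNING"
        return FlowRun(id=flow_run_id, state=state)


# Assumed set of terminal state names, for illustration only.
FINAL_STATES = {"COMPLETED", "FAILED", "CRASHED", "CANCELLED"}


async def run_and_wait(client_factory, deployment_id, poll_interval=0.01):
    # Open a fresh client for this call rather than reusing a shared one.
    async with client_factory() as client:
        run = await client.create_flow_run_from_deployment(deployment_id)
        while run.state not in FINAL_STATES:
            await asyncio.sleep(poll_interval)
            run = await client.read_flow_run(run.id)
        return run


result = asyncio.run(run_and_wait(StubClient, "my-deployment"))
```

With a real server, the factory would return an actual client and the poll interval would be seconds rather than milliseconds.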

@carlo-catalyst

We are running run_deployment within tasks, so it's possible that is having an impact.

@zanieb
Contributor Author

zanieb commented Nov 17, 2022

I believe these issues are caused by a high volume of requests. We see them with the agent, which polls frequently, and now with run_deployment, which also polls frequently.

@MuFaheemkhan

In my case I stopped using run_deployment and my flows no longer crash. Connection reset and broken pipe errors still occur randomly in the agent logs, but they don't crash my flows anymore.

@eudyptula
Contributor

We are seeing our BrokenPipeError: [Errno 32] Broken pipe in flows that are starting like 800 tasks in one go with map.
The logs also include a RuntimeError: The connection pool was closed while 325 HTTP requests/responses were still in-flight.

Also, the reason behind the http 500 was a database timeout, so had to increase the query timeout from the default 1 second.

So, from our perspective, it could very well be an issue with high volume.

@MuFaheemkhan

MuFaheemkhan commented Nov 18, 2022

@eudyptula I am using a local Postgres server. One thing I did was increase the number of connections allowed on the Postgres server/container.
You are right about the tasks: running the same script on Prefect 2.0.4 vs. Prefect 2.6.6, Prefect 2.0.4 starts fewer tasks in one go.
I was wondering if we can reduce the number of tasks that start in one go.

@peytonrunyan
Contributor

peytonrunyan commented Nov 22, 2022

Howdy yall! Any chance anyone here would be interested in giving this branch a shot to see if it resolves the problems? #7593

@MuFaheemkhan, @eudyptula, @carlo-catalyst, @andreas-ntonas

@voidel

voidel commented Nov 28, 2022

Really looking forward to a fix for this. If there's a workaround we can use to stop the parent flows from failing, it would be much appreciated!

@eudyptula
Contributor

@peytonrunyan: Haven't tried the branch, but is retry the right solution here? Sounds a bit like playing the lottery to me, just keep resending the same request, hoping that eventually it would go through.

If the issue is caused by high volume of requests, as @madkinsz suggests - shouldn't a solution involve some way to throttle/queue requests on the agents? (just thoughts, not that familiar with how Prefect is designed)
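
To make the throttling idea above concrete, here is a minimal sketch that caps the number of requests in flight with `asyncio.Semaphore`. It is a generic illustration, not something Prefect is known to do internally. `run_throttled` and `fake_request` are hypothetical names, and the limit of 5 is arbitrary.

```python
import asyncio


async def run_throttled(jobs, max_concurrent):
    """Run the coroutine functions in `jobs`, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        # Wait for a free slot before starting the job.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


in_flight = 0
peak_in_flight = 0


async def fake_request():
    # Hypothetical stand-in for one API call; it only tracks concurrency.
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return "ok"


results = asyncio.run(run_throttled([fake_request] * 20, max_concurrent=5))
```

All 20 calls still complete, but never more than `max_concurrent` of them run at the same time, which trades some latency for a lower request rate against the server.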

@zanieb
Contributor Author

zanieb commented Nov 29, 2022

@eudyptula we believe this is an issue with the upstream HTTP libraries. We are implementing retries to solve the immediate issue for our users until we can work on fixing the bug upstream.
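
As a rough illustration of the retry approach mentioned above, here is a minimal sketch that retries a call on transient connection errors with exponential backoff. This is not the implementation from #7593. `call_with_retries`, `flaky_request`, the attempt count, and the delays are all assumptions made for the example.

```python
import time

# Built-in exception types matching the errors reported in this thread.
TRANSIENT_ERRORS = (ConnectionResetError, BrokenPipeError)


def call_with_retries(func, max_attempts=3, base_delay=0.01):
    """Call `func`, retrying on transient connection errors with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original error.
            # Double the wait after each failed attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


attempts = []


def flaky_request():
    # Hypothetical request that fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "response"


result = call_with_retries(flaky_request)
```

Retrying only a narrow set of connection errors keeps genuine failures, such as a 500 from the server, visible instead of masking them.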

@zanieb
Contributor Author

zanieb commented Dec 1, 2022

Note we believe this is addressed in 2.7.0 with #7593

@voidel

voidel commented Dec 2, 2022

We haven't had this error since we upgraded to 2.7.0 earlier today 👍

@zanieb zanieb added upstream dependency An upstream issue caused by a bug in one of our dependencies priority:low and removed priority:high labels Dec 2, 2022
@colindunn

I can also confirm that flows that were failing regularly for me with this issue now seem to be working fine

@peytonrunyan
Contributor

peytonrunyan commented Dec 5, 2022

@madkinsz it looks like we got it addressed, so I'm going to go ahead and close this issue. Feel free to reopen it if you think there's something else that needs handling.
