Connection Reset on prefect agent #7472

Closed
4 tasks done
tjgalvin opened this issue Nov 8, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@tjgalvin

tjgalvin commented Nov 8, 2022

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

I am running both orion and postgres as self-hosted services, and have created a deployed agent running on my compute infrastructure. My deployed agent is picking up a pushed flow run, which has completed successfully, as judged by my logs, the expected data products, and the orion web UI. However, the deployed agent seems to consistently produce an httpx.ReadError raised from a ConnectionResetError.

Reproduction

I have not been able to produce an MWE yet. :(

Error

11:05:16.247 | INFO    | Flow run 'quixotic-sponge' - Finished in state Completed('All states completed.')
11:05:20.310 | INFO    | Flow run 'sceptical-axolotl' - Finished in state Completed('All states completed.')
Successful readonly open of default-locked table /askapbuffer/payne/tgalvin/holography/prefect2_44641/44641/2022-10-07_184525_0.ms/FIELD: 9 columns, 1 rows
11:05:22.837 | INFO    | prefect.infrastructure.process - Process 'sceptical-axolotl' exited cleanly.
11:15:04.001 | ERROR   | prefect.agent -
Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/asyncio/selector_events.py", line 854, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/backends/asyncio.py", line 33, in read
    return await self._stream.receive(max_bytes=max_bytes)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 1274, in receive
    raise self._protocol.exception
anyio.BrokenResourceError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/connection_pool.py", line 253, in handle_async_request
    raise exc
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/connection_pool.py", line 237, in handle_async_request
    response = await connection.handle_async_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/connection.py", line 90, in handle_async_request
    return await self._connection.handle_async_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 105, in handle_async_request
    raise exc
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 84, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 148, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 177, in _receive_event
    data = await self._network_stream.read(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/backends/asyncio.py", line 35, in read
    return b""
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc)
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/agent.py", line 154, in get_and_submit_flow_runs
    queue_runs = await self.client.get_runs_in_work_queue(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/client/orion.py", line 759, in get_runs_in_work_queue
    response = await self._client.post(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1842, in post
    return await self.request(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/client/base.py", line 159, in send
    await super().send(*args, **kwargs)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1614, in send
    response = await self._send_handling_auth(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1716, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

Versions

Version:             2.6.6
API version:         0.8.3
Python version:      3.9.13
Git commit:          87767cda
Built:               Thu, Nov 3, 2022 1:15 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         hosted

Additional context

The above version is the environment running the deployed agent.

I have a virtual machine that is hosting a postgres server connected to an orion instance. My compute infrastructure is an HPC facility using a SLURM-based workflow. My processing script seems to work really well with prefect2 and this setup. The main flow will spin up a DaskTaskRunner running on some compute nodes from a SLURM request managed by dask_jobqueue.SLURMCluster.
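For reference, a minimal sketch of what such a flow might look like, assuming the prefect-dask package is installed; the flow/task names and SLURM resource values below are placeholders, not taken from my actual processing script:

    from prefect import flow, task
    from prefect_dask import DaskTaskRunner  # assumes prefect-dask is installed

    @task
    def process_beam(beam: int) -> int:
        # Placeholder for the real per-beam processing work.
        return beam

    @flow(
        task_runner=DaskTaskRunner(
            # DaskTaskRunner takes a dotted path to the cluster class plus
            # keyword arguments; these SLURM values are illustrative only.
            cluster_class="dask_jobqueue.SLURMCluster",
            cluster_kwargs={"cores": 8, "memory": "32GB", "walltime": "02:00:00"},
        )
    )
    def main_flow(beams: list[int]):
        for beam in beams:
            process_beam.submit(beam)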

I would like to demonstrate the ability to deploy agents to orchestrate workflows and remotely kick them off. This seems to be working well. I can successfully use prefect deployment build --apply to construct and register the workflow with my orion server. I can also use prefect deployment run to kick off this registered workflow, and the corresponding prefect agent I have started on my cluster picks up the job and successfully kicks it off.
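The same registration can also be done from Python with Deployment.build_from_flow; a rough sketch, where the module path, deployment name, and work queue name are placeholders rather than values from this issue:

    from prefect.deployments import Deployment
    from my_flows import main_flow  # hypothetical module containing the flow

    deployment = Deployment.build_from_flow(
        flow=main_flow,
        name="example-deployment",   # placeholder deployment name
        work_queue_name="default",   # queue the agent polls
    )
    deployment.apply()  # registers the deployment with the orion server
    # The run can then be triggered with `prefect deployment run`, as above.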

The logs that are recorded all look correct, the data products created are all correct, and the flow run states recorded by orion all indicate success. The prefect agent also outputs a note about the flow/subflow finishing successfully. However, just after those messages from the prefect agent (which I have included in the log above), I get this connection reset error. The agent still seems to be running fine and appears to accept new flow runs without issue.

I have not been able to create an MWE that replicates the issue.

@tjgalvin tjgalvin added bug Something isn't working status:triage labels Nov 8, 2022
@zanieb
Contributor

zanieb commented Nov 8, 2022

Thanks for opening the issue! This does not seem related to finishing a flow run; it's an error when attempting to get runs to submit (get_runs_in_work_queue). I'm not sure why the connection is being reset or why the connection pool is not resilient to this. Possible fixes include:

  1. Add request retries for httpx.ReadError
  2. Wrap this query in the agent with a try/catch and log errors without crashing the agent. Intermittent failures can be ignored.

I'm leaning towards the second as the simplest approach; it'd help with #7442 as well :)
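As a rough illustration of option 2 (not the actual Prefect implementation), the poll seen in the traceback could be guarded something like this, with client and work_queue_id standing in for whatever the agent holds internally:

    import httpx

    async def poll_work_queue(client, work_queue_id):
        # Sketch only: wrap the call from the traceback so transient network
        # failures are logged and skipped rather than crashing the agent loop.
        try:
            return await client.get_runs_in_work_queue(id=work_queue_id)
        except httpx.HTTPError as exc:  # httpx.ReadError is a subclass
            print(f"Work queue poll failed, retrying next interval: {exc!r}")
            return []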

@mathijscarlu
Contributor

mathijscarlu commented Nov 9, 2022

@madkinsz I don't think it's just this query that is being affected. Some others and I are having issues with this as well, albeit with the creation of task_runs. Discussion can be found in this slack thread. Since all queries probably use the same code underneath, it seems quite realistic that all API queries are affected...

@MuFaheemkhan

MuFaheemkhan commented Nov 10, 2022

Prefect 2.6.6, local server in Docker containers, self-hosted postgres server in a Docker container.
I am facing the same issue as well. Even without running any flows, I get the errors below after some time; when I run a flow and the error appears, it crashes the flow.
BrokenPipeError: [Errno 32] Broken pipe
ConnectionResetError: [Errno 104] Connection reset by peer

Further details in this thread https://prefect-community.slack.com/archives/CL09KU1K7/p1667880546594239

@zanieb zanieb changed the title Connection Reset when prefect agent has finished a flowrun Connection Reset on prefect agent Nov 11, 2022
@zanieb
Copy link
Contributor

zanieb commented May 25, 2023

I believe this should be resolved in the most recent versions by retries.
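For anyone pinned to an older version, a generic retry wrapper around the request can paper over the same failure; this is only an illustration of the idea, not the retry logic that shipped in Prefect:

    import asyncio
    import httpx

    async def post_with_retries(client: httpx.AsyncClient, url: str, retries: int = 3, **kwargs):
        # Re-send the request when the pooled connection was reset by the peer.
        for attempt in range(retries + 1):
            try:
                return await client.post(url, **kwargs)
            except httpx.ReadError:
                if attempt == retries:
                    raise
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff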
