Connection Reset on prefect agent #7472

Closed
4 tasks done
tjgalvin opened this issue Nov 8, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@tjgalvin

tjgalvin commented Nov 8, 2022

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

I am running both orion and postgres as self-hosted services, and have created a deployed agent running on my compute infrastructure. My deployed agent is picking up a pushed flow run, which has completed successfully, as judged by my logs, the expected data products, and the orion web UI. However, the deployed agent seems to consistently produce an httpx.ReadError raised from a ConnectionResetError.

Reproduction

I have not been able to produce an MWE yet. :(

Error

11:05:16.247 | INFO    | Flow run 'quixotic-sponge' - Finished in state Completed('All states completed.')
11:05:20.310 | INFO    | Flow run 'sceptical-axolotl' - Finished in state Completed('All states completed.')
Successful readonly open of default-locked table /askapbuffer/payne/tgalvin/holography/prefect2_44641/44641/2022-10-07_184525_0.ms/FIELD: 9 columns, 1 rows
11:05:22.837 | INFO    | prefect.infrastructure.process - Process 'sceptical-axolotl' exited cleanly.
11:15:04.001 | ERROR   | prefect.agent -
Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/asyncio/selector_events.py", line 854, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/backends/asyncio.py", line 33, in read
    return await self._stream.receive(max_bytes=max_bytes)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 1274, in receive
    raise self._protocol.exception
anyio.BrokenResourceError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/connection_pool.py", line 253, in handle_async_request
    raise exc
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/connection_pool.py", line 237, in handle_async_request
    response = await connection.handle_async_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/connection.py", line 90, in handle_async_request
    return await self._connection.handle_async_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 105, in handle_async_request
    raise exc
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 84, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 148, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_async/http11.py", line 177, in _receive_event
    data = await self._network_stream.read(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/backends/asyncio.py", line 35, in read
    return b""
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc)
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/agent.py", line 154, in get_and_submit_flow_runs
    queue_runs = await self.client.get_runs_in_work_queue(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/client/orion.py", line 759, in get_runs_in_work_queue
    response = await self._client.post(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1842, in post
    return await self.request(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/client/base.py", line 159, in send
    await super().send(*args, **kwargs)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1614, in send
    response = await self._send_handling_auth(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1716, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

Versions

Version:             2.6.6
API version:         0.8.3
Python version:      3.9.13
Git commit:          87767cda
Built:               Thu, Nov 3, 2022 1:15 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         hosted

Additional context

The above version is the environment running the deployed agent.

I have a virtual machine that is hosting a postgres server connected to an orion instance. My compute infrastructure is an HPC facility using a SLURM-based workflow. My processing script seems to work really well with prefect2 and this setup. The main flow will spin up a DaskTaskRunner running on some compute nodes from a SLURM request managed by dask_jobqueue.SLURMCluster.
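For reference, a minimal sketch of what such a flow might look like, assuming the prefect-dask package is installed; the flow/task names and SLURM resource values below are placeholders, not taken from my actual processing script:

    from prefect import flow, task
    from prefect_dask import DaskTaskRunner  # assumes prefect-dask is installed

    @task
    def process_beam(beam: int) -> int:
        # Placeholder for the real per-beam processing work.
        return beam

    @flow(
        task_runner=DaskTaskRunner(
            # DaskTaskRunner takes a dotted path to the cluster class plus
            # keyword arguments; these SLURM values are illustrative only.
            cluster_class="dask_jobqueue.SLURMCluster",
            cluster_kwargs={"cores": 8, "memory": "32GB", "walltime": "02:00:00"},
        )
    )
    def main_flow(beams: list[int]):
        for beam in beams:
            process_beam.submit(beam)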

I would like to demonstrate the ability to deploy agents to orchestrate workflows and remotely kick them off. This seems to be working well. I can successfully use prefect deployment build --apply to construct and register the workflow with my orion server. I can also use prefect deployment run to kick off this registered workflow, and the corresponding prefect agent I have started on my cluster picks up the job and successfully kicks it off.
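The same registration can also be done from Python with Deployment.build_from_flow; a rough sketch, where the module path, deployment name, and work queue name are placeholders rather than values from this issue:

    from prefect.deployments import Deployment
    from my_flows import main_flow  # hypothetical module containing the flow

    deployment = Deployment.build_from_flow(
        flow=main_flow,
        name="example-deployment",   # placeholder deployment name
        work_queue_name="default",   # queue the agent polls
    )
    deployment.apply()  # registers the deployment with the orion server
    # The run can then be triggered with `prefect deployment run`, as above.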

The logs that are recorded all look correct, the data products created are all correct, and the flow run states recorded by orion all indicate success. The prefect agent also outputs a note about the flow/subflow finishing successfully. However, just after those messages from the prefect agent (which I have included in the log above), I get this connection reset error. The agent still seems to be running fine and appears to accept new flow runs without issue.

I have not been able to create an MWE that replicates the issue.

@tjgalvin tjgalvin added bug Something isn't working status:triage labels Nov 8, 2022
@zanieb
Contributor

zanieb commented Nov 8, 2022

Thanks for opening the issue! This does not seem related to finishing a flow run; it's an error when attempting to get runs to submit (get_runs_in_work_queue). I'm not sure why the connection is being reset or why the connection pool is not resilient to this. Possible fixes include:

  1. Add request retries for httpx.ReadError
  2. Wrap this query in the agent with a try/catch and log errors without crashing the agent. Intermittent failures can be ignored.

I'm leaning towards the second as the simplest approach; it'd help with #7442 as well :)
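As a rough illustration of option 2 (not the actual Prefect implementation), the poll seen in the traceback could be guarded something like this, with client and work_queue_id standing in for whatever the agent holds internally:

    import httpx

    async def poll_work_queue(client, work_queue_id):
        # Sketch only: wrap the call from the traceback so transient network
        # failures are logged and skipped rather than crashing the agent loop.
        try:
            return await client.get_runs_in_work_queue(id=work_queue_id)
        except httpx.HTTPError as exc:  # httpx.ReadError is a subclass
            print(f"Work queue poll failed, retrying next interval: {exc!r}")
            return []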

@mathijscarlu
Contributor

mathijscarlu commented Nov 9, 2022

@madkinsz I don't think it's just this query that is being affected. Some others and I are having issues with this as well, albeit with the creation of task_runs. Discussion can be found in this slack thread. Since all queries probably use the same code underneath, it seems quite realistic that all API queries are affected...

@MuFaheemkhan

MuFaheemkhan commented Nov 10, 2022

Prefect 2.6.6, local server in Docker containers, self-hosted postgres server in a Docker container.
I am facing the same issue as well. Even without running any flows, I get the errors below after some time; when I run a flow and the error appears, it crashes the flow.
BrokenPipeError: [Errno 32] Broken pipe
ConnectionResetError: [Errno 104] Connection reset by peer

Further details in this thread https://prefect-community.slack.com/archives/CL09KU1K7/p1667880546594239

@zanieb zanieb changed the title Connection Reset when prefect agent has finished a flowrun Connection Reset on prefect agent Nov 11, 2022
@zanieb
Copy link
Contributor

zanieb commented May 25, 2023

I believe this should be resolved in the most recent versions by retries.
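For anyone pinned to an older version, a generic retry wrapper around the request can paper over the same failure; this is only an illustration of the idea, not the retry logic that shipped in Prefect:

    import asyncio
    import httpx

    async def post_with_retries(client: httpx.AsyncClient, url: str, retries: int = 3, **kwargs):
        # Re-send the request when the pooled connection was reset by the peer.
        for attempt in range(retries + 1):
            try:
                return await client.post(url, **kwargs)
            except httpx.ReadError:
                if attempt == retries:
                    raise
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff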
