Connection Reset on prefect agent #7472
Comments
Thanks for opening the issue! This does not seem related to finishing a flow run, it's an error when attempting to get runs to submit (
I'm leaning towards the second as the simplest approach; it'd help with #7442 as well :)
@madkinsz I don't think it's just this query that is being affected. Some others and I are having issues with this as well, albeit with the creation of task_runs. Discussion can be found in this slack thread. Since all queries are probably using the same code underneath, it seems quite realistic that all API queries are affected...
prefect 2.6.6, local server in docker containers. Self-hosted postgres server in a docker container. Further details in this thread: https://prefect-community.slack.com/archives/CL09KU1K7/p1667880546594239
I believe this should be resolved in the most recent versions by retries.
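For anyone pinned to an older version, the retry idea mentioned above can be sketched roughly as follows. This is not Prefect's actual implementation; the helper name `call_with_retries` and its parameters are hypothetical, and it only uses the standard library. It retries a callable when a transient connection error is raised, backing off exponentially between attempts.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionResetError,),
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exception types (hypothetical helper).

    The delay doubles after each failed attempt, with up to 10% jitter added.
    The last failure is re-raised unchanged once max_attempts is reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

Note that in this issue the `ConnectionResetError` is wrapped by `httpx.ReadError`, so when wrapping an httpx call you would likely need to pass something like `retry_on=(httpx.ReadError,)` rather than rely on the default.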
First check
Bug summary
I am running both orion and postgres as self-hosted services, and have created a deployed agent running on my compute. My deployed agent is picking up a pushed flow run, which has completed successfully, as judged by my logs, expected data products, and the orion web UI. However, the deployed agent seems to consistently produce an `httpx.ReadError` error raised by a `ConnectionResetError` error.
Reproduction
Error
Versions
Additional context
The above version is the environment running the deployed agent.
I have a virtual machine that is hosting a postgres server connected to an orion instance. My compute infrastructure is an HPC facility using a SLURM-based workflow. My processing script seems to work really well with prefect2 and this setup. The main flow will spin up a `DaskTaskRunner` running on some compute nodes from a SLURM request managed by `dask_jobqueue.SLURMCluster`.
I would like to demonstrate the ability to deploy agents to orchestrate workflows, and remotely kick them off. This seems to be working well. I can successfully use `prefect deployment build --apply` to construct and register the workflow with my orion server. I can also use `prefect deployment run` to kick off this registered workflow, and the corresponding `prefect agent` I have started on my cluster picks up the job and successfully kicks it off.
The logs that are recorded all look correct, the data products created are all correct, and the flow run states recorded by orion all indicate success. The `prefect agent` also outputs a note about the flow/subflow successfully finishing. However, just after those messages from the `prefect agent` (that I have included in the log) I get this connection reset error. The agent still seems to be running fine, and seems to be accepting new flow runs without issue.
I have not been able to create a MWE that replicates the issue.