Occasional http2 connection errors (KeyError) #11660
Comments
Thanks for the issue @j-tr! Do you have an example of a flow we can use to reproduce this issue? Also, can you share the version of
@desertaxle thank you for looking into this. I'm using

So far we haven't come up with an MRE yet, as this is very flaky and seems to happen only for relatively long-running flows. I tried to make sense of the stack trace and found that the stream_id that cannot be found in self.streams of the h2 connection is provided by get_next_available_stream_id (https://github.com/python-hyper/h2/blob/bc005afad8302549facf5afde389a16759b2ccdb/src/h2/connection.py#L625C17-L625C17). The docstring of that method warns that the returned ID does not change until headers have actually been sent on that stream.
As this is all async code, could it be possible that under high load multiple requests on the same connection get the same stream_id, and consequently ending the stream a second time fails because the stream has already been removed? In that case, would this rather be an h2 problem?
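To illustrate that hypothesis, here is a minimal sketch using h2 directly (an editor's sketch, not Prefect or httpcore code): get_next_available_stream_id() keeps returning the same ID until headers are actually sent on that stream, so two concurrent callers that both fetch an ID before sending could end up reusing the same one.

```python
# Minimal sketch of the suspected collision, using h2 directly.
import h2.connection

conn = h2.connection.H2Connection()  # client-side by default
conn.initiate_connection()

stream_a = conn.get_next_available_stream_id()
stream_b = conn.get_next_available_stream_id()

# Both callers see the same ID, because neither has sent headers yet.
assert stream_a == stream_b == 1

conn.send_headers(stream_a, [
    (":method", "GET"),
    (":path", "/"),
    (":scheme", "https"),
    (":authority", "example.com"),
])

# Only now does the next available stream ID advance.
assert conn.get_next_available_stream_id() == 3
```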
Over the past two weeks, we have consistently encountered similar issues. Our implementation primarily uses asynchronous code, and I've added retry mechanisms to all relevant functions for extra reliability. Our system runs in a Docker work pool environment. The error predominantly arises during our extended workflows, which typically run for 30 to 50 minutes and are scheduled hourly, but it also occurs in our shorter workflows, which execute every 15 minutes. Although the issue is sporadic, it has been happening frequently throughout the day. As a potential solution, I am currently setting PREFECT_API_ENABLE_HTTP2=False to evaluate whether it resolves these issues. Here is the list of packages installed, and attached below is the stack trace from a recent incident for further analysis:
Update: after adding PREFECT_API_ENABLE_HTTP2=False, we are now seeing more errors on different pipelines.
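For reference, a minimal sketch of how that setting can be applied; the environment-variable route here is an assumption, and the same value can be set through a Prefect profile (`prefect config set PREFECT_API_ENABLE_HTTP2=false`) instead.

```python
# Disable HTTP/2 for the Prefect API client by setting the environment
# variable before Prefect is imported / any client is created.
import os

os.environ["PREFECT_API_ENABLE_HTTP2"] = "false"

from prefect.settings import PREFECT_API_ENABLE_HTTP2

# Should report False once the environment variable is picked up.
print(PREFECT_API_ENABLE_HTTP2.value())
```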
Hey, it's really difficult to say what is going on without a reproduction.
Yes, since it happens randomly it is very hard to troubleshoot. In my requirements.txt file I am still pinning anyio<4.0.0. Additionally, I've attempted downgrading Prefect to prefect==2.14.10, but unfortunately that didn't resolve the issue. I have re-pulled our pip dependencies and confirmed that we are on the latest versions, as detailed below. Notably, h2 has not had an update since October 2021. Here is the current status of the related packages: aiohttp: 3.9.1 (last updated on Nov 26, 2023)
I unfortunately clicked on the wrong h2 repo and gave you some bad info, my apologies! Prefect still has anyio<4.0.0 pinned at the moment. Was everything working on a different Prefect version before, and did the errors only start after an update?
No worries! Here is the stack trace from the 25th. If I were able to handle the exception, that would be better, since the rest of the script could still finish. But I have added try/except blocks to every function and the run is still being cancelled, so I am not able to handle it.
Do you build images for your deployments? Any way to find out exactly what your dependencies looked like before this started and roll back to that?
Unfortunately, no, I do not. I am looking at migrating us from pip to Poetry to help with dependency management going forward. Thank you for your support!
Update:
Not a new issue for sure, but possibly related: encode/httpcore#808
@jakekaplan, that does look like the same stack trace we are seeing. Thank you!!
Hello, here is another stack trace, from the one this morning. Looking at it, it appears to be failing while pushing logs and states to Prefect Cloud. Would it be helpful to add some retry logic to these Prefect engine functions?
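For illustration, a hypothetical retry helper of the kind being suggested here; this is not part of Prefect's API, just a sketch of retrying a transient client error with backoff.

```python
# Hypothetical helper (not Prefect code): retry an async operation a few times
# with a growing delay, re-raising the last error if every attempt fails.
import asyncio


async def with_retries(make_call, attempts=3, base_delay=1.0):
    last_exc = None
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception as exc:  # in practice, narrow this to transport errors
            last_exc = exc
            await asyncio.sleep(base_delay * (attempt + 1))
    raise last_exc
```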
Hello, I wanted to give an update. I have removed all the Prefect tasks and moved all the logic into one Prefect flow, and we are no longer seeing the issue on that specific Prefect deployment/flow. For the other flows where we are still seeing the issue, we are going to rewrite them to be a single flow as well. Thanks!
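A minimal sketch of that restructuring, with hypothetical function names: the same work runs as plain function calls inside a single flow instead of many submitted tasks, which presumably means far fewer concurrent task-run API calls (state transitions, log batches).

```python
# Hypothetical before/after: logic inlined into one flow instead of submitted tasks.
from prefect import flow


def extract():
    return [1, 2, 3]


def transform(rows):
    return [r * 2 for r in rows]


@flow
def combined_flow():
    rows = extract()        # previously: extract_task.submit()
    return transform(rows)  # previously: transform_task.submit(...)
```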
Hi @bnewman-tech, sorry to hear you're still seeing the issue. The issue I linked above (encode/httpcore#808), which I believe to be the cause, seems to have been fixed ~5 days ago. It will still be a little bit before it gets into their next release, but I will try to respond here once I see it merged and we can pin httpcore.
@jakekaplan looks like they released a new version. We'll be on the lookout for when it gets pinned in Prefect.
We have seen a similar error in our Prefect 2 agent in GCP Cloud Run (it launches Vertex jobs). The agent falls over at the start of the hour, when ~50 flow runs are scheduled. We recently scaled up the number of instances and don't seem to be hitting resource limits on them.

Prefect version:

Error:
The upstream fix should be merged and released at this point, so you should be able to upgrade.
@zzstoatzz We've upgraded our httpcore library to 1.0.5, but we're still seeing this. We are on Prefect 2.16.4 - was anything changed on the Prefect side in tandem with this fix? I'm not sure if we need to upgrade to 2.16.9 to see improvement, or if it should be resolved with just the httpcore update.
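One way to sanity-check an upgrade like that (a sketch, not an official diagnostic) is to print the versions actually installed in the environment where the flows run, since a containerized worker can differ from the machine where pip was run:

```python
# Print the installed versions of the relevant packages inside the runtime
# environment (e.g. at the top of a flow) to confirm the upgrade took effect.
from importlib.metadata import version

for pkg in ("prefect", "httpx", "httpcore", "h2", "anyio"):
    print(pkg, version(pkg))
```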
I've been able to reproduce this bug by rapidly running the following flow, which submits many short tasks in quick succession:

```python
import logging
import time

from prefect import flow, task
from prefect.futures import PrefectFuture
from prefect.task_runners import ConcurrentTaskRunner

slogger = logging.getLogger(__name__)


@task()
def one_second_task():
    time.sleep(1)
    return 1


@flow(
    log_prints=True,
    task_runner=ConcurrentTaskRunner(),
)
def no_op_flow(
    num_tasks: int,
    task_submit_delay: float,
):
    no_op_futures: list[PrefectFuture] = []
    for _ in range(num_tasks):
        no_op_futures.append(one_second_task.submit())
        time.sleep(task_submit_delay)  # throttle between submissions (see workaround below)
    print("Submitted all tasks, now waiting for futures")
    [future.result(timeout=240) for future in no_op_futures]
    slogger.info("Finished test")
```

Deploying, then repeatedly running the deployment, reproduces the error.

Workaround! Good news is, throttling task submission with a `time.sleep(0.25)` completely eliminated the error during my testing. If anyone else is struggling with this bug, consider adding a little throttling to your task submissions!
First check
Bug summary
Occasionally, flows crash with a connection-related exception that seems to originate from h2.
So far this has only been observed in longer flow runs (>2h) and does not seem to be related to any specific workload.
Possibly related to #7442, #9429
Reproduction
Error
Versions
Additional context
The stream_id from the final KeyError is different for each crash.