This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

Connection issue with Temporal 1.9.x #14

Open
dkryptr opened this issue May 10, 2021 · 6 comments

dkryptr (Contributor) commented May 10, 2021

I get a connection issue when running my worker against the latest version of Temporal:

2021-05-10 08:37:04,083 | DEBUG | protocol.py:data_received:714 | Protocol error
Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 224, in process_input
    func, target_state = self._transitions[(self.state, input_)]
KeyError: (<ConnectionState.CLOSED: 3>, <ConnectionInputs.RECV_PING: 14>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 712, in data_received
    events = self.connection.feed(data)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 189, in feed
    return self._connection.receive_data(data)  # type: ignore
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1463, in receive_data
    events.extend(self._receive_frame(frame))
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1486, in _receive_frame
    frames, events = self._frame_dispatch_table[frame.__class__](frame)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1759, in _receive_ping_frame
    events = self.state_machine.process_input(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 228, in process_input
    raise ProtocolError(
h2.exceptions.ProtocolError: Invalid input ConnectionInputs.RECV_PING in state ConnectionState.CLOSED
2021-05-10 08:37:04,103 | ERROR | retry.py:retry_loop:29 | run failed: Connection lost, retrying in 6 seconds
Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 360, in recv_initial_metadata
    headers = await self._stream.recv_headers()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 342, in recv_headers
    await self.headers_received.wait()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/asyncio/locks.py", line 309, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/retry.py", line 17, in retry_loop
    await fp(*args, **kwargs)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 934, in run
    decision_task: PollWorkflowTaskQueueResponse = await self.poll()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 983, in poll
    task = await self.service.poll_workflow_task_queue(request=poll_decision_request)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/api/workflowservice/v1.py", line 844, in poll_workflow_task_queue
    return await self._unary_unary(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/betterproto/__init__.py", line 1133, in _unary_unary
    response = await stream.recv_message()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 408, in recv_message
    await self.recv_initial_metadata()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 380, in recv_initial_metadata
    self.initial_metadata = im
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/utils.py", line 70, in __exit__
    raise self._error
grpclib.exceptions.StreamTerminatedError: Connection lost

dkryptr (Contributor, Author) commented May 10, 2021

Here's what I've tracked down:

WORKAROUND

Set frontend.keepAliveMaxConnectionAge in the dynamic config to a large value (I'm not sure how to set it to infinity, which was the default before the 1.9.0 release):

frontend.keepAliveMaxConnectionAge:
  - value: 87600h # 10 years
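
If you run the server via the temporalio/auto-setup Docker image, the dynamic config file can be mounted and selected with the DYNAMIC_CONFIG_FILE_PATH environment variable. A sketch, assuming that image (paths, tag, and service name are illustrative, not from this thread):

# docker-compose sketch; paths and the image tag are illustrative
services:
  temporal:
    image: temporalio/auto-setup:1.9.2
    environment:
      - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development.yaml
    volumes:
      - ./dynamicconfig:/etc/temporal/config/dynamicconfig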

vitarb commented May 10, 2021

Chad, you are correct in your investigation; however, the fact that you are seeing errors on the client side is unexpected. We added this configuration to initiate healthy rotation of connections and to better re-balance them over time.

According to the gRPC documentation for keep-alive, when MaxConnectionAge is reached, a GOAWAY signal is sent to the client, and only if the client doesn't close the connection within the MaxConnectionAgeGrace period is the connection forcibly closed. The idea is that the grace period should be longer than the long-poll request time, so the connection can be closed cleanly.

Is it possible that the Python gRPC implementation you are using is not handling the GOAWAY signal properly? (this issue?) My expectation is that the client should finish ongoing requests on the old connection and route new requests to sub-channels that use the new connection.
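
For reference, a minimal sketch of what a GOAWAY looks like at the h2 layer; this is not grpclib's actual code, and handle_data and its wiring are illustrative:

import h2.connection
import h2.events

conn = h2.connection.H2Connection()
conn.initiate_connection()

def handle_data(data: bytes) -> None:
    # h2 surfaces a received GOAWAY frame as a ConnectionTerminated event.
    for event in conn.receive_data(data):
        if isinstance(event, h2.events.ConnectionTerminated):
            # Streams with IDs above event.last_stream_id will never complete.
            # A graceful client lets lower-numbered in-flight streams finish
            # and opens a fresh connection for new requests, rather than
            # failing them with "Connection lost".
            print("GOAWAY:", event.error_code, event.last_stream_id)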

pypt added a commit to mediacloud/backend-temporal-server-config that referenced this issue May 11, 2021

dkryptr (Contributor, Author) commented May 11, 2021

@vitarb this project uses grpclib, which uses h2. You're spot on that it isn't handling the GOAWAY signal properly (issue).

vitarb commented May 11, 2021

What do you think is the right thing to do here? My opinion is that we should aim for an actual fix in Python, because the server defaults seem reasonable given our usage profile and the gRPC spec, and I don't think we should change them. Meanwhile, overriding MaxConnectionAge on the server to a large or unlimited value sounds like a good mitigation strategy.

pypt added a commit to mediacloud/backend that referenced this issue May 12, 2021

dkryptr (Contributor, Author) commented May 15, 2021

I agree with you. A couple of options are:

  • Implement GOAWAY handling in the h2 library
  • Switch from the grpclib channel to a grpcio channel; maybe grpcio's channel handles GOAWAY signals? (see the sketch below)
    • This one might require a huge refactor, because the protocol buffers are compiled with python-betterproto, and it looks like it uses grpclib underneath.
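
On the second option, a rough sketch of what a grpc.aio channel would look like (the endpoint and function name are illustrative; the hard part is that betterproto's generated stubs expect a grpclib Channel, not this):

import asyncio
import grpc

async def check_grpcio_channel() -> None:
    # grpc.aio is expected to react to a server GOAWAY by reconnecting
    # transparently, which is the behavior grpclib is missing here.
    channel = grpc.aio.insecure_channel("localhost:7233")
    await channel.channel_ready()  # waits until the channel is connected
    await channel.close()

# Requires a server listening on localhost:7233.
asyncio.run(check_grpcio_channel())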

dkryptr (Contributor, Author) commented May 15, 2021

As we've continued migrating existing applications to Temporal using the python-sdk, it looks like we also need to override the config option frontend.keepAliveTime. The Temporal server sets it to 1 minute; some of our activities take longer than 60 seconds, and we get the same Connection lost error. Setting this to a longer time fixes the issue.
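
For example, in the same dynamic config file as above (the value is illustrative; it should exceed your longest activity):

frontend.keepAliveTime:
  - value: 30m # illustrative; pick something longer than your longest activity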

pypt added a commit to mediacloud/backend-temporal-server-config that referenced this issue Jan 4, 2022