Improve Coder Connect tunnel reconnect handling #563


Open
ethanndickson opened this issue Apr 8, 2025 · 5 comments

@ethanndickson
Member

ethanndickson commented Apr 8, 2025

The Coder Connect tunnel receives workspace state from the Coder server over a dRPC stream. When first connecting to this stream, the current state of the user's workspaces is received, with subsequent messages being diffs on top of that state.

However, if the client disconnects from this stream (for example, when the user's device is suspended) and later reconnects, the tunnel has no way to distinguish the reconnect message containing the entire initial state from an ordinary diff, so that full state is incorrectly applied as a diff.

In practice:

  • Tunnel connects, receives a workspace update containing all the existing workspaces & agents.
  • Tunnel loses connection, but isn't completely stopped.
  • All the user's workspaces are restarted, producing a new set of agents.
  • Tunnel regains connection, and receives a workspace update containing all the existing workspaces & agents.
  • This initial update is incorrectly applied as a diff, with the Tunnel's state containing both the old & new agents.
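The steps above can be reproduced with a minimal sketch. The types and field names here are illustrative only, not the real tailnet protocol structs; the point is that applying a post-reconnect snapshot with diff semantics never removes agents the snapshot doesn't mention:

```go
package main

import "fmt"

// Update mimics a workspace update: upserted agents plus explicit deletions.
// These types are illustrative, not the real tailnet protocol structs.
type Update struct {
	UpsertedAgents []string
	DeletedAgents  []string
}

// apply treats every update as a diff, the buggy behaviour described above:
// a post-reconnect full snapshot never deletes agents it doesn't mention.
func apply(state map[string]bool, u Update) {
	for _, id := range u.DeletedAgents {
		delete(state, id)
	}
	for _, id := range u.UpsertedAgents {
		state[id] = true
	}
}

// scenario replays the steps in the list above and returns the final state.
func scenario() map[string]bool {
	state := map[string]bool{}
	// 1. Tunnel connects; the initial snapshot contains agent "old".
	apply(state, Update{UpsertedAgents: []string{"old"}})
	// 2-3. Tunnel loses connection; workspaces restart, producing agent "new".
	// 4. On reconnect the server resends the full state (now just "new"),
	//    but the tunnel applies it as a diff.
	apply(state, Update{UpsertedAgents: []string{"new"}})
	return state
}

func main() {
	fmt.Println(scenario()) // state wrongly contains both old and new agents
}
```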

On Coder Desktop macOS, we've attempted to mitigate this issue by stopping the tunnel completely when the user device is suspended, and starting it back up when the device wakes. This is annoying for the user, and we don't want to do this long term.
If we fix this issue, we won't need to stop Coder Connect on device sleep anymore.

@ibetitsmike

An important caveat, discussed with @ethanndickson: there might be a scenario in which our keep-alive will not allow the workspace to shut down and update on schedule.

@spikecurtis

> no mechanism exists for the tunnel to differentiate that message containing the entire initial state from another diff, and so that state is incorrectly applied as a diff.

Can we solve this problem by adding an explicit field or field(s) to the update to tell whether you're getting a snapshot or diff?

@ethanndickson
Member Author

ethanndickson commented Apr 15, 2025

> Can we solve this problem by adding an explicit field or field(s) to the update to tell whether you're getting a snapshot or diff?

That makes sense to me, but unless I'm mistaken I think we could also plumb an isState bool through from the tunnelUpdater to the Tunnel? That way we'd avoid touching the dRPC protocol. I think we could just set it to true each time recvLoop is called.
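A hedged sketch of what that plumbing might look like. The signatures here are assumptions, not the real coder/coder code; the idea is just that the receive loop flags only the first message of each stream as the full state:

```go
package main

import "fmt"

// update stands in for a workspace update message; illustrative only.
type update struct{ seq int }

// tunnelUpdater is a simplified stand-in for the real updater: recv yields
// the next message from the stream, and handler receives each update along
// with an isState flag. Both signatures are assumptions for this sketch.
type tunnelUpdater struct {
	recv    func() (update, bool) // ok=false when the stream ends
	handler func(u update, isState bool)
}

// recvLoop marks only the first message of each stream as the full state;
// every later message is handled as a diff.
func (t *tunnelUpdater) recvLoop() {
	first := true
	for {
		u, ok := t.recv()
		if !ok {
			return
		}
		t.handler(u, first)
		first = false
	}
}

// collectFlags drives a recvLoop over n synthetic updates and records the
// isState flag each one was handled with.
func collectFlags(n int) []bool {
	var flags []bool
	i := 0
	tu := &tunnelUpdater{
		recv: func() (update, bool) {
			if i >= n {
				return update{}, false
			}
			i++
			return update{seq: i}, true
		},
		handler: func(_ update, isState bool) {
			flags = append(flags, isState)
		},
	}
	tu.recvLoop()
	return flags
}

func main() {
	fmt.Println(collectFlags(3)) // only the first update is flagged as state
}
```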

@deansheather
Member

@spikecurtis @ethanndickson

Potential plan:

  1. In the tunnel, set a flag needFreshWorkspaceUpdate (or some other name) immediately after reconnecting to the tailnet protocol endpoint on the server
  2. In the tunnel's workspace update handler, check this flag near the top and if true:
    1. Get a list of workspaces/agents that are in the tunnel's current state but not in the new update
    2. Add these workspaces/agents to the deleted_ fields on the update
    3. Set the flag to false
    4. Process the update as usual, the missing objects should be deleted and the update should be forwarded properly
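Step 2 of the plan above could be sketched roughly as follows. Types, names, and the agent-ID representation are assumptions for illustration, not the actual tunnel code:

```go
package main

import "fmt"

// Update stands in for a workspace update with upserted and deleted agents;
// the real message also carries workspaces, which the same logic would cover.
type Update struct {
	UpsertedAgents []string
	DeletedAgents  []string
}

// reconcile implements the plan's step 2 on a fresh snapshot: every agent in
// the tunnel's current state that is absent from the snapshot is appended to
// the update's deleted set, so normal diff processing removes stale entries.
func reconcile(current map[string]bool, u *Update) {
	fresh := map[string]bool{}
	for _, id := range u.UpsertedAgents {
		fresh[id] = true
	}
	for id := range current {
		if !fresh[id] {
			u.DeletedAgents = append(u.DeletedAgents, id)
		}
	}
}

func main() {
	// Current state still holds the pre-restart agent "old"; the fresh
	// snapshot after reconnect contains only "new".
	current := map[string]bool{"old": true}
	u := Update{UpsertedAgents: []string{"new"}}
	reconcile(current, &u)
	fmt.Println(u.DeletedAgents) // the stale agent is now scheduled for deletion
}
```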

@spikecurtis

Need to be careful of data races:

  • what happens if we reconnect immediately before or after setting this field?
  • what happens if we reconnect during processing of updates?

What about putting the flag into the WorkspaceUpdates struct? The tunnelUpdater is recreated each reconnection, so it can set the flag on the first update and not subsequent ones. That keeps the flag colocated with the data it applies to and makes it less likely that we can have a race.
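A minimal sketch of the colocated-flag idea, under the assumption that the flag rides inside the update struct and that a fresh tunnelUpdater is created per connection. Field and type names here are hypothetical, not the real tailnet code:

```go
package main

import "fmt"

// WorkspaceUpdates stands in for the update struct; the FreshState field
// name is an assumption for this sketch.
type WorkspaceUpdates struct {
	FreshState bool // true only for the first update of a (re)connection
	// ... upserted/deleted workspaces and agents would live here
}

// tunnelUpdater is recreated on every reconnection, so sentFirst starts
// false exactly once per stream; no shared cross-connection flag exists,
// which is what avoids the races described above.
type tunnelUpdater struct {
	sentFirst bool
}

// nextUpdate wraps a received update, flagging the first one as fresh state.
func (t *tunnelUpdater) nextUpdate() WorkspaceUpdates {
	u := WorkspaceUpdates{FreshState: !t.sentFirst}
	t.sentFirst = true
	return u
}

func main() {
	// Two connections, each with its own updater: the first update of each
	// stream is flagged, later ones are plain diffs.
	for conn := 0; conn < 2; conn++ {
		up := &tunnelUpdater{}
		first := up.nextUpdate()
		second := up.nextUpdate()
		fmt.Println(first.FreshState, second.FreshState)
	}
}
```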
