
Retrying on connection closed #104

Open
jbransen opened this issue Jan 6, 2025 · 8 comments

Comments

@jbransen

jbransen commented Jan 6, 2025

I am running this library in a production setting, and on a regular basis I observe the following error:

RpcError(Status { code: Cancelled, message: "operation was canceled", source: Some(tonic::transport::Error(Transport, hyper::Error(Canceled, "connection closed"))) })

When reading the BigTable connection pool documentation this seems to be expected, as connections are refreshed every hour. However, as a client user I am surprised: why does this error make its way to the end user? Is that a conscious choice or is this an oversight? I would expect this library to handle this gracefully by retrying (which should always work fine), so that clients never end up with such errors.

@liufuyang
Owner

liufuyang commented Jan 6, 2025

Hey @jbransen, thanks for raising this issue. You might be the first person to use this crate in a production environment. I created this repo/crate while I worked at Spotify; at the time I thought it might come in handy if we wanted to try Rust in some of our backend systems, as many of them needed to talk to Bigtable (however, that never really happened before I left Spotify). I do remember that in those days, Spotify engineers had to add Bigtable channel refreshing (client swapping) manually in their code bases to deal with the hourly spikes. Perhaps these days that "auto-refresh" feature has been merged into the standard Bigtable Java client (as described in the doc you linked); on the Golang side, the doc points to some simple example code that does the same thing, but users have to implement it themselves, I suppose.

So basically, without looking into all the details, if one wishes to have a similar feature (client refreshing), one might just need to implement it in a similar fashion in Rust. On your side, you could make it work by periodically creating a new BigTableConnection and replacing the old one. Or, as you said, this could be done inside BigTableConnection, with some settings/flags to turn the feature on and configure it.
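For illustration, here is a minimal sketch of what such periodic client refreshing could look like on the user side. The type and helper names are placeholders (not part of this crate), and the example only assumes a tokio runtime:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::RwLock;

// Placeholder for whatever client/connection type you actually hold -- not a type
// provided by this crate.
struct BigtableClient;

// Hypothetical helper: build (and warm up) a fresh client the same way your app does today.
async fn build_client() -> Arc<BigtableClient> {
    Arc::new(BigtableClient)
}

// Swap in a fresh client on a fixed period so requests never run on a channel that the
// Bigtable server is about to close (it closes channels roughly every hour).
fn spawn_client_refresher(shared: Arc<RwLock<Arc<BigtableClient>>>) {
    tokio::spawn(async move {
        // Refresh a bit more often than hourly so there is always a warm replacement ready.
        let mut ticker = tokio::time::interval(Duration::from_secs(45 * 60));
        loop {
            ticker.tick().await;
            let fresh = build_client().await;
            *shared.write().await = fresh; // readers pick up the new client on their next call
        }
    });
}
```

Request code would then grab `shared.read().await.clone()` before each call, so in-flight requests keep their old client while new requests pick up the fresh connection.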

Unfortunately, I no longer use this crate in my work, nor do I work with Bigtable anymore. So even if I tried to implement a feature like this, it might be a bit hard for me to test it out (not sure if the Bigtable emulator can simulate a connection close, but perhaps it is possible?).

But if you want, you are very welcome to contribute to this crate, add such a feature, and test it out in your environment before merging.

Let me know if this helps :)

@jbransen
Author

jbransen commented Jan 6, 2025

Thank you for the fast reply, this helps. I am happy to craft a solution, try it out, and make a PR.

@liufuyang
Owner

Oh cool, thanks a lot. I haven't been following Rust gRPC development closely, so I am not sure whether there is some off-the-shelf implementation on the tonic or tokio side that could help in this case. You might want to check the API there a bit, or perhaps ask in the tonic Discord channel; people are helpful and I got a lot of good help there while implementing this crate.

@liufuyang
Owner

liufuyang commented Jan 7, 2025

Thanks. And in this case, I am not 100% sure that "retry if you hit a channel close" is really what we need, or that it can solve your issue gracefully. From the linked Google doc, and from public info such as this Stack Overflow page, what I know is that reconnecting over gRPC and opening warm channels is not cheap, so relying on any "retry" mechanism to fix your issue will cause hourly latency spikes towards Bigtable.

What I mentioned above, and what the Golang example in the linked Google doc suggests, is not a "retry" but rather "keep creating a new client (opening new channels) and periodically replace the old client with the new one", so it is more about "auto client/channel refreshing" than any sort of retry.

@liufuyang
Owner

And you might want to update the issue title from "Retrying on connection closed" to something like "Auto-refreshing the client to deal with Bigtable's hourly gRPC channel close".

@jbransen
Author

jbransen commented Jan 13, 2025

Thank you. I understand your point, and I agree that some process to keep connections warm is a good idea. However, for now I am still looking at a retry because:

  • the implementation is much simpler
  • in my application latency spikes are not really an issue
  • it solves a much wider range of issues. Even with auto channel refreshing there will be connectivity errors, so I am convinced there also has to be some retry mechanism on top of it. Ideally that would be taken care of by tonic, e.g. this, but as long as that is not in place, going for the simple "retry once" approach is the pragmatic way of dealing with this.

I've implemented something here; I will first evaluate it in our environment before creating a PR. I am very aware that this is not the best solution to the above problem, and that ideally you'd have a configurable backoff retry strategy, but I think this is enough for us for the foreseeable future. Once I am satisfied with it I'll create a PR and leave it to you whether or not you'd like to merge it.
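For context, the general shape of a retry-once wrapper could look roughly like the sketch below. This is not the linked implementation, just an illustration that assumes `tonic::Status` is the error type surfaced to the caller:

```rust
use tonic::{Code, Status};

// Illustrative only: run the call one extra time, and only when the failure is the
// transport-level cancellation seen when Bigtable closes an hourly channel.
async fn retry_once_on_cancelled<T, F, Fut>(mut call: F) -> Result<T, Status>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, Status>>,
{
    match call().await {
        Err(status) if status.code() == Code::Cancelled => call().await,
        other => other,
    }
}
```

A caller would then wrap each request, e.g. `retry_once_on_cancelled(|| client.read_rows(request.clone())).await` (the method name here is a stand-in for whatever request you are making).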

@liufuyang
Owner

liufuyang commented Jan 13, 2025

Hi @jbransen, thanks for the good effort to move this forward. At the same time, I have a few concerns about adding a retry-once feature to this repo; let me lay out a few points and see if you agree:

  1. The retry feature might not be needed in many of the use cases out there; adding it might only benefit a small group of end users, at the cost of making this relatively simple repo more complicated than necessary.

  2. For the use cases that do need a retry feature, as you probably also noticed when reading the gRPC retry design, it would be quite hard to do well, because:

  • The number of retries would need to be configurable; would just one retry be enough?
  • The retry context needs to be configurable too. What if one wants to retry on writes but fail fast on reads?
  • Similarly, a retry might only be needed for certain types of errors; for others (e.g. bad request, 404, unauthorized, and so on) one might not want to retry at all.

Even with a relatively good retry mechanism implemented, there are still other important things to consider, especially in high-load situations (10K+ requests per second per Bigtable instance, multiple instances running with auto-scaling). In those situations a glitch in the system might cause a huge amount of unnecessary or impossible-to-recover retries, which could end up in a chain reaction where servers queue up retry requests, overloading themselves or overloading the BT instances with a huge number of retries while they are trying to recover.

  3. If the retry feature is implemented on the tonic side, as you pointed out, it would not be useful to implement it a second time here.

  4. If your end goal is to deal with Bigtable's hourly connection close, a client-refresh implementation will fix the issue gracefully (no latency spikes), making a retry feature unnecessary.

  5. It is still simple for any user to implement a retry in their own code base (and easily define when and how they want to retry); see the sketch after this list. And I believe if you put the retry-once-for-anything code on your end, it probably only adds 3 to 5 lines of code to your codebase?

  6. One more reason: I think your current implementation won't work well in the case where the end user is using the Bigtable "raw" APIs (example here), as in this repo we are not wrapping all the possible Bigtable APIs.
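To illustrate point 5, here is a hedged sketch of what such a call-site retry could look like in a user's code base, deciding exactly which errors are worth one more attempt. The client type and method below are stand-ins, not this crate's API:

```rust
use tonic::{Code, Status};

// Stand-in client so the snippet is self-contained; in practice this would be your
// own wrapper around the Bigtable client.
struct MyClient;
impl MyClient {
    async fn read_rows(&self, _key: &str) -> Result<Vec<String>, Status> {
        Ok(Vec::new())
    }
}

// Only transient transport errors get a second attempt; bad requests, auth failures,
// and other client errors still fail fast.
async fn read_with_one_retry(client: &MyClient, key: &str) -> Result<Vec<String>, Status> {
    match client.read_rows(key).await {
        Err(s) if matches!(s.code(), Code::Cancelled | Code::Unavailable) => {
            client.read_rows(key).await
        }
        other => other,
    }
}
```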

With all the points above, plus the fact that I would very much like this repo to follow the KISS principle, I think the best solution for now is to leave any retry-related implementation to the user side or to the tonic side.

Let me know if you agree. Thanks.
