Timeouts in async-nats #609

Jarema · 2022-08-15T11:23:57Z

Jarema
Aug 15, 2022
Maintainer

Context

As the new async_nats crate matures, we're working on making it resilient against all kinds of network issues.

One thing we found is that jetstream::Context::publish() needs a timeout. The library waits for a response, and that response can be lost if a reconnection happens after the request is published but before the response is received, resulting in a deadlock.

That was fixed with a default timeout and a set_timeout() method on jetstream::Context.

Open topic

That leads us to the question: how would we like to handle timeouts in Core NATS requests?

An identical scenario can happen in async_nats::Client::request().

Usually we tell users that in the Rust async ecosystem the caller is expected to wrap the call in tokio::time::timeout, and it works fine, as it cancels the future and cleans up. With the JetStream change, though, the API has become somewhat inconsistent.

For now, we left the behaviour as is, with this rationale:

jetstream::Context::publish() calls request internally with nats-server as the responder, so it is an internal mechanism with a predictable response time. Having a default timeout seems to be the "path of least surprise".
Client::request() does not have a default timeout, as it's up to the user to decide whether to time it out and how long the other service may take to respond. Tolerance can be a few hundred milliseconds or hours.

We might take a middle ground: by default do not time out Client::request(), but add an option to set a timeout on Client.
The user can then either set that option or directly wrap request calls in a timeout.
Overloading request with request_with_timeout does not sound like a good idea, as we would end up with a bunch of those, one for each kind of method call.

Would love to hear the community's and contributors' thoughts on this one!
Pinging some active and recent contributors/users:
@stevelr @MattesWhite @thed0ct0r @neogenie @brooksmtownsend @paulgb @autodidaddict
Everybody's opinion is welcome, though!

How to handle Core NATS request timeouts

Users should wrap request in `tokio::time::timeout` themselves, no default timeout

16%

add method on `Client` to set the timeout (optional, user can still wrap requests), no default timeout

16%

add method on `Client` to set the timeout or disable it, with default timeout

50%

other solution (let us know what it is!)

16%

6 votes

brooksmtownsend · 2022-08-15T11:59:02Z

brooksmtownsend
Aug 15, 2022

add method on Client to set the timeout or disable it, with default timeout

This was my vote, as it gives users some flexibility to configure the timeout or disable it if need be. I like having a default timeout for a couple of the reasons you mention (the JetStream deadlock especially), but particularly for predictability. As far as I know, unless you're doing a NATS request that does a lot of work on the other end (big KV operation?), no NATS operation is going to take longer than a couple of seconds.

Users wrapping the request in a tokio::time::timeout is okay, but I commonly see questions in Slack about timing a request out, so I feel this leads to some initial confusion.

0 replies

paulgb · 2022-08-15T15:20:20Z

paulgb
Aug 15, 2022

+1 to default timeout with the ability to disable/change it. Among the advantages I see over expecting developers to wrap it in a timeout on their own:

The type of the future is the same regardless of whether it has a timeout, so you can write higher-order functions that wrap NATS calls without polluting things with lots of generics/boxing.
Handling errors is cleaner when you want timeout errors and connection errors to follow the same code path.

The advantages over a Client option with no default are harder to articulate, but it just “feels right” to me that a future that is expected to terminate should not hang indefinitely unless the developer has explicitly opted in to that behavior.

One disadvantage I see is that because Error is currently a Box-wrapped dyn Error, in cases where the developer wants to differentiate between a timeout error and a connection error, it's not ergonomic to do.

0 replies

autodidaddict · 2022-08-15T15:22:52Z

autodidaddict
Aug 15, 2022
Collaborator

Not sure if this makes things more or less clear, but the Go client (which I consider to be the ultimate source of truth since that's the one maintained as first priority) expects a timeout value as a parameter on all requests:

msg, err := nc.Request("help", []byte("help me"), 10*time.Millisecond)

^^ the above is from the README of the Go client.

0 replies

abalmos · 2022-08-15T16:05:24Z

abalmos
Aug 15, 2022

Interesting take that timeouts create ergonomic issues—I tend to hold the inverse opinion. That is, I feel like it is an ergonomic issue that async_nats does not inform me/force me to deal with the timeouts that are needed to ensure that the final program remains available. Currently, it is easy to build async_nats programs that randomly hang or otherwise remain disconnected forever.

Given that I don't see how any production program written with async_nats could exist without proper handling of timeout and connection errors, I personally prefer:

Client::request() (and friends) with a built-in reasonable timeout, and
Client::request_with_timeout(..., timeout: Duration) (and friends)

Client::request() could be a simple wrapper over Client::request_with_timeout(), using a default value (or a configurable one stored on Client) for the timeout.

This would make it hard to write broken programs, while timeouts also remain quite discoverable in docs.

All that said, I also feel like this change would need to be paired with concrete Error types. While an ergonomic issue by itself, I would say that timeout errors would commonly be dealt with differently than system errors.
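
Editor's sketch of the shape proposed above, using only the standard library: request() delegates to request_with_timeout() with a default stored on the client, and failures are reported through a concrete error enum so that timeouts can be matched separately. Everything here is hypothetical (the Client struct, RequestError, and the std channel standing in for the reply subscription); it is not the async_nats API.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Concrete error type, so callers can match on the timeout case directly.
#[derive(Debug, PartialEq)]
pub enum RequestError {
    TimedOut,
    ConnectionClosed,
}

/// Hypothetical client: `replies` stands in for the reply subscription.
pub struct Client {
    default_timeout: Duration,
    replies: Receiver<String>,
}

impl Client {
    pub fn new(default_timeout: Duration, replies: Receiver<String>) -> Self {
        Client { default_timeout, replies }
    }

    /// Simple wrapper: same call, using the client's default timeout.
    pub fn request(&self) -> Result<String, RequestError> {
        self.request_with_timeout(self.default_timeout)
    }

    /// Waits for one reply, giving up after `timeout`.
    pub fn request_with_timeout(&self, timeout: Duration) -> Result<String, RequestError> {
        self.replies.recv_timeout(timeout).map_err(|e| match e {
            RecvTimeoutError::Timeout => RequestError::TimedOut,
            RecvTimeoutError::Disconnected => RequestError::ConnectionClosed,
        })
    }
}
```

With this shape a caller can write `match client.request() { Err(RequestError::TimedOut) => ..., ... }` without any downcasting.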

2 replies

brooksmtownsend Aug 15, 2022

Agreed that a request_with_timeout is a valuable addition

paulgb Aug 15, 2022

To clarify re. the ergonomic issues, I do agree that a default timeout is more ergonomic overall, just that handling a timeout error separately would not be ergonomic when errors are boxed and dynamic. If paired with concrete error types as you suggest, that problem goes away entirely.

stevelr · 2022-08-15T16:33:36Z

stevelr
Aug 15, 2022

request_with_timeout is a common use case in wasmcloud. It's far more useful than a single timeout for Client because it's not uncommon to use different timeouts for different calls from the same Client. For that reason, if there were a default timeout for Client, there would have to be a way to set it to None or a very large value, so that the caller can apply a call-specific timeout with either request_with_timeout or a tokio timeout.

A while ago, I submitted a PR to async_nats to add request_with_timeout and it was rejected at the time. Working around it by adding a tokio timeout was only a minor inconvenience. I don't think it automatically extrapolates to all calls requiring timeouts. @autodidaddict's point about parity with the Go client is as good an argument as the minor convenience, even if following idiomatic Rust is also a goal.

Two other cases that come to mind: publish_and_flush_with_timeout (or a more concise name :-) ) because that's another operation you don't want to hang indefinitely, and collect subscription results until a timeout - used for broadcasting a message and collecting responses. For example:
https://github.com/wasmCloud/control-interface-client/blob/main/src/sub_stream.rs
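
Editor's sketch of the "collect responses until a timeout" pattern mentioned above, written against a std channel rather than a NATS subscription. The function name collect_until is hypothetical, and the linked wasmCloud code may be structured quite differently.

```rust
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Collects every message that arrives on `rx` within `window`.
/// Returns early if all senders have been dropped.
pub fn collect_until<T>(rx: &Receiver<T>, window: Duration) -> Vec<T> {
    let deadline = Instant::now() + window;
    let mut out = Vec::new();
    loop {
        // Time left in the window; zero once the deadline has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            break;
        }
        match rx.recv_timeout(remaining) {
            Ok(msg) => out.push(msg),
            // Either the window elapsed or no sender is left.
            Err(_) => break,
        }
    }
    out
}
```

The key point is that the deadline is fixed once up front, so a steady trickle of responses cannot extend the collection window.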

2 replies

stevelr Aug 15, 2022

I don't think changing the timeout after Client has been constructed would be that useful except for fringe cases (that I can think of). Because clone()-ing a Client is such a common operation, if there ever were an API to change the timeout, it couldn't require &mut self. It would have to use atomics or interior mutability.
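
Editor's sketch of the atomics option described above: a cloneable handle whose timeout lives in a shared atomic cell, so the setter only needs &self and every clone observes the change. The Client type and its methods are hypothetical, and encoding "no timeout" as 0 milliseconds is an assumption made for brevity.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical client handle; cloning shares the same timeout cell.
#[derive(Clone)]
pub struct Client {
    // Timeout in milliseconds; 0 is used here to mean "no timeout".
    timeout_ms: Arc<AtomicU64>,
}

impl Client {
    pub fn new(timeout: Option<Duration>) -> Self {
        Client { timeout_ms: Arc::new(AtomicU64::new(encode(timeout))) }
    }

    /// Takes `&self`, so it can be called through any clone.
    pub fn set_timeout(&self, timeout: Option<Duration>) {
        self.timeout_ms.store(encode(timeout), Ordering::Relaxed);
    }

    pub fn timeout(&self) -> Option<Duration> {
        match self.timeout_ms.load(Ordering::Relaxed) {
            0 => None,
            ms => Some(Duration::from_millis(ms)),
        }
    }
}

fn encode(timeout: Option<Duration>) -> u64 {
    // Clamp to at least 1 ms so a tiny non-zero timeout is not read back as "none".
    timeout.map_or(0, |d| (d.as_millis().min(u64::MAX as u128) as u64).max(1))
}
```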

abalmos Aug 15, 2022

Totally agree with not changing timeout after Client is constructed.

I updated the wording in my post to better reflect this as well.

Jarema · 2022-08-15T18:48:42Z

Jarema
Aug 15, 2022
Maintainer Author

First of all, thank you all for the great feedback!

@brooksmtownsend I consider request as part of JetStream operations a separate case from request as a user operation.

We can assume proper timeouts for JetStream-related requests, but we have no idea how users' Core NATS requests should behave.
Some, for example, might send a request for heavy data processing and calmly await the response for a few minutes.
Some might send an HTTP-like request, where the response should come back in milliseconds or seconds.

That's why I didn't even ask about a default timeout for requests sent as part of the JetStream internal API. Timeouts are there. That's also why I asked about Core NATS requests. That part is not as simple :).

Most operations in Core NATS cannot hang: they're dispatched to TcpStream. Request is an exception, as it's a subscribe paired with a publish.

After what you said so far, it seems that the most sensible API would be:

1. Build Client with a specific timeout for requests (not a setter).
2. Have an additional request_with_timeout method that can override that timeout (wrapping with tokio::time::timeout would only work to shorten the specified timeout, so that's not enough).

The only disadvantage of the per-request timeout method is that we have request and request_with_headers already.
That would mean we're ending up with:

request()
request_with_timeout()
request_with_headers()
request_with_headers_and_timeout()

It's not bad, but not great either.
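
Editor's note on why an outer wrapper can only shorten a built-in timeout: when a call that already has an inner limit is wrapped in an outer one, whichever fires first ends the call. A small sketch of that arithmetic follows; effective_timeout is a hypothetical helper for illustration, not part of any crate.

```rust
use std::time::Duration;

/// Limit that actually applies when a call with an inner (client-level)
/// timeout is wrapped in an outer one: the shorter of the two wins.
/// `None` means "no limit" on that level.
pub fn effective_timeout(inner: Option<Duration>, outer: Option<Duration>) -> Option<Duration> {
    match (inner, outer) {
        (Some(i), Some(o)) => Some(i.min(o)),
        (Some(d), None) | (None, Some(d)) => Some(d),
        (None, None) => None,
    }
}
```

So with a 10-second client timeout, an outer 60-second wrapper still yields 10 seconds; extending the limit needs a per-request override or a client without a default.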

@abalmos concrete error types will come soon, but please keep in mind that you can access the underlying error pretty easily with
err.downcast::<std::io::Error>().unwrap().kind() which in this case would yield ErrorKind::TimedOut.
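
Editor's sketch of a non-panicking variant of the downcast above. It assumes the boxed error has the shape Box<dyn Error + Send + Sync> and that a timeout surfaces as a std::io::Error of kind TimedOut, as described in this comment; the alias BoxError and the helper is_timeout are hypothetical names.

```rust
use std::error::Error;
use std::io;

/// Boxed dynamic error (assumed shape of the crate's error type at the time).
pub type BoxError = Box<dyn Error + Send + Sync + 'static>;

/// Returns true if the boxed error is an `io::Error` of kind `TimedOut`.
/// Uses `downcast_ref`, so the error is not consumed and nothing is unwrapped.
pub fn is_timeout(err: &BoxError) -> bool {
    err.downcast_ref::<io::Error>()
        .map_or(false, |e| e.kind() == io::ErrorKind::TimedOut)
}
```

Errors that are not an io::Error at all simply yield false instead of panicking on unwrap().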

Or we can follow the Go client's path and make a breaking change, forcing the user to always specify the timeout. It sounds justified since, as I said, request is a specific NATS operation that is actually two separate ones: pub and sub.

4 replies

autodidaddict Aug 15, 2022
Collaborator

What about a request function that takes an Option<Duration> for timeout and an Option<Headers>? This gives you one function call and the worst case scenario is people pass 2 Nones for the easy case.

thed0ct0r Aug 15, 2022

What about a request function that takes an Option<Duration> for timeout and an Option<Headers>? This gives you one function call and the worst case scenario is people pass 2 Nones for the easy case.

I think that having Option<Headers> is a really good idea - because it simplifies usage.
On the other hand - i think that having Option<Duration> will only lead to most people using it with None and thus shooting themselves in the foot.

abalmos Aug 15, 2022

It's not bad, but not great either.

To each their own, but I think this is good. Perhaps using the Builder pattern would be cleaner?

brooksmtownsend Aug 16, 2022

I don't mind having multiple function heads or taking Options for duration/headers. If I had to pick, I would go with a request function that optionally takes a duration and a request_with_headers that optionally takes a duration.

thed0ct0r · 2022-08-15T21:33:26Z

thed0ct0r
Aug 15, 2022

Apologies for jumping in a little late in the discussion.

A bit philosophical - but i believe that timing out on any i/o bound API call is just a statistical encapsulation of an error that has already occurred. in a perfect world we would get a response saying - "this ain't happening", but that's not the case... one of the things i like about Rust and its like is the way they "force/push" you to handle errors and faults in an ergonomic way - i think this is just an extension of that.

My current nats experimental code (which led to pr #607) has lots of helper methods which encapsulate a few different edge cases and situations with a tokio::time::timeout call, it's tedious and repetitive - as i would have to have it in any app that uses nats, ending up with writing a wrapper library.

I think that Client should have a default timeout, held internally by an Option<Duration>. then, a with_timeout(Option<Duration>) can be used to create the client with a different timeout (or None = disabled).

Building on top of that i would have 2 versions of each method:

request() which uses the internal Client timeout mentioned above, this is what most users would use. it's simple, ergonomic, and we protect them from what they're not thinking about.
request_for() or request_with_timeout() (i'm a fan of the former) which allows overriding each call with its own timeout. this could then be used in conjunction with a Client having None as a default timeout for advanced users to have complete control over each call, while maintaining a sane and simple API.

1 reply

thed0ct0r Aug 15, 2022

I think that Client should have a default timeout, held internally by an Option<Duration>. then, a with_timeout(Option<Duration>) can be used to create the client with a different timeout (or None = disabled).
Building on top of that i would have 2 versions of each method:

request() which uses the internal Client timeout mentioned above, this is what most users would use. it's simple, ergonomic, and we protect them from what they're not thinking about.

request_for() or request_with_timeout() (i'm a fan of the former) which allows overriding each call with its own timeout. this could then be used in conjunction with a Client having None as a default timeout for advanced users to have complete control over each call, while maintaining a sane and simple API.

After having given it some more thought, i think an even better solution would be to have the current methods (request(), request_with_headers(), etc...) that have the only change of using the Client's timeout.

Then, instead of having a _with_timeout() method for each use case, just have a RequestBuilder - which can then have a .with_timeout(Option<Duration>) modifier, a .with_headers(Headers) modifier, .with_inbox(String), .with_subscriber(&mut Subscriber) (as i have found that generating an inbox and subscriber for each request has some overhead for high throughput systems), etc...
That is actually the norm for HTTP request libraries, and can be more future proof for additions/modifications without breaking users code or expectations.
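
Editor's sketch of one possible shape for such a builder, kept to the standard library. It tracks three states for the timeout: not set (fall back to the client default), explicitly disabled, or an explicit duration. All names here (RequestBuilder, with_header, effective_timeout) are hypothetical and are not taken from async_nats.

```rust
use std::time::Duration;

/// Hypothetical per-request builder; not the async_nats API.
#[derive(Debug, Default)]
pub struct RequestBuilder {
    // Outer `None`: not set, fall back to the client default.
    // `Some(None)`: timeout explicitly disabled for this request.
    timeout: Option<Option<Duration>>,
    headers: Vec<(String, String)>,
}

impl RequestBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Overrides the client default; `None` disables the timeout entirely.
    pub fn with_timeout(mut self, timeout: Option<Duration>) -> Self {
        self.timeout = Some(timeout);
        self
    }

    pub fn with_header(mut self, name: &str, value: &str) -> Self {
        self.headers.push((name.to_string(), value.to_string()));
        self
    }

    pub fn headers(&self) -> &[(String, String)] {
        &self.headers
    }

    /// Timeout that would apply, given the client's default.
    pub fn effective_timeout(&self, client_default: Option<Duration>) -> Option<Duration> {
        self.timeout.unwrap_or(client_default)
    }
}
```

Because an untouched builder falls back to the client default, the plain request() path and the builder path can share one timeout rule.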

Jarema · 2022-08-23T09:53:05Z

Jarema
Aug 23, 2022
Maintainer Author

So, let's start with having a default timeout on Client for requests, with the possibility to customise it when creating a Client. That would fix the issue of users blocking their code.

As for requests with a different timeout for each call, as @stevelr uses that:
Having an optional Duration on each request while we also have a default timeout on Client: IMO those two are mutually exclusive, as there would be no way to use the default value if you always pass Option<Duration>. The None variant should not fall back to the default timeout but disable it entirely; otherwise it's really counterintuitive.

@stevelr maybe a separate function client.request_builder() with timeout and headers options would be the best one?

1 reply

stevelr Aug 25, 2022

request_builder would satisfy the need to set timeout on a request that is different from the Client's default.

Jarema · 2022-08-29T07:48:09Z

Jarema
Aug 29, 2022
Maintainer Author

PR available here.
#616

0 replies
