NBFT reliability #669

Open
milyin opened this issue Jan 25, 2024 · 2 comments
Assignees
Labels
enhancement Existing things could work better

Comments

@milyin
Contributor

milyin commented Jan 25, 2024

Describe the feature

NBFT stands for Non-Blocking Fault-Tolerance.

It is based on PublicationCache and QueryingSubscriber. A newer implementation is in branch #401 (@OlivierHecart, is that the correct one?)
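For context, a rough sketch of how these two existing building blocks are used together (zenoh 0.x-style API; the exact builder and extension-trait names may differ between versions, so treat this as illustrative only):

```rust
use zenoh::prelude::r#async::*;
use zenoh_ext::*; // SessionExt / ArcSessionExt extensions

#[tokio::main]
async fn main() {
    // `into_arc()` so the zenoh-ext extensions for `Arc<Session>` can be used.
    let session = zenoh::open(config::default()).res().await.unwrap().into_arc();

    // Publisher side: cache recent samples so they can be served to history queries.
    let _pub_cache = session
        .declare_publication_cache("demo/example/**")
        .history(10)
        .res()
        .await
        .unwrap();

    // Subscriber side: query the cached history first, then receive live samples.
    let subscriber = session
        .declare_querying_subscriber("demo/example/**")
        .res()
        .await
        .unwrap();

    while let Ok(sample) = subscriber.recv_async().await {
        println!("received {}", sample.key_expr);
    }
}
```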

Goals:

  • When started, the subscriber pulls the history of packets from the publisher.

  • When the subscriber detects missing packets, it queries the publisher to restore them (by sequence number).

  • The subscriber delivers packets to the user in order (it holds back packets that keep arriving until the query is done); see the sketch after this list.

  • The subscriber detects new publishers (through liveliness) and pulls history from them.

  • The subscriber makes sure packets are not repeated (the existing implementation has this problem: the same packet can arrive from both the query and the subscription).

  • History pulling, missing-packet detection, and late-comer publisher detection are orthogonal and can be enabled/disabled when constructing the subscriber.
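A minimal, dependency-free sketch of the gap-detection and reordering logic described above; `ReorderBuffer` and its methods are hypothetical names used only for illustration, not part of zenoh:

```rust
use std::collections::BTreeMap;

/// Holds out-of-order samples until the missing ones are recovered by a query.
struct ReorderBuffer {
    next_seq: u64,                   // next sequence number expected by the user
    pending: BTreeMap<u64, Vec<u8>>, // samples held back, keyed by sequence number
}

impl ReorderBuffer {
    fn new(first_seq: u64) -> Self {
        Self { next_seq: first_seq, pending: BTreeMap::new() }
    }

    /// Feed a live sample; returns the samples that can now be delivered in order,
    /// plus the range of sequence numbers to query from the publisher, if a gap appeared.
    fn on_sample(&mut self, seq: u64, payload: Vec<u8>) -> (Vec<(u64, Vec<u8>)>, Option<std::ops::Range<u64>>) {
        if seq < self.next_seq || self.pending.contains_key(&seq) {
            return (Vec::new(), None); // duplicate: already delivered or already pending
        }
        let gap = (seq > self.next_seq).then(|| self.next_seq..seq);
        self.pending.insert(seq, payload);
        (self.drain_in_order(), gap)
    }

    /// Samples recovered by the query go through the same path.
    fn on_recovered(&mut self, seq: u64, payload: Vec<u8>) -> Vec<(u64, Vec<u8>)> {
        self.on_sample(seq, payload).0
    }

    fn drain_in_order(&mut self) -> Vec<(u64, Vec<u8>)> {
        let mut out = Vec::new();
        while let Some(payload) = self.pending.remove(&self.next_seq) {
            out.push((self.next_seq, payload));
            self.next_seq += 1;
        }
        out
    }
}

fn main() {
    let mut buf = ReorderBuffer::new(0);
    assert_eq!(buf.on_sample(0, b"a".to_vec()).0.len(), 1); // in order: delivered at once
    let (delivered, gap) = buf.on_sample(2, b"c".to_vec()); // gap: seq 1 is missing
    assert!(delivered.is_empty());
    assert_eq!(gap, Some(1..2)); // the range to request from the publisher
    assert_eq!(buf.on_recovered(1, b"b".to_vec()).len(), 2); // 1 and 2 delivered in order
}
```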

Additional requirements:

  • allow using an unbounded channel in PublicationCache

  • allow using liveliness support without caching, just for monitoring

@milyin milyin self-assigned this Jan 25, 2024
@milyin milyin added the enhancement Existing things could work better label Jan 25, 2024
@milyin milyin moved this to Backlog in Zenoh 1.0.0 release Jan 25, 2024
@milyin
Contributor Author

milyin commented Jan 31, 2024

New requirement added: allow using an unbounded channel in PublicationCache. For context, the conversation that motivated it:

lomac — Yesterday at 8:57 AM
Hello,
I am getting some deadlock on publication. Is this an expected behavior?
Here is my setup:
1°) one zenoh session in one process (so, local only)
2°) declaring some publishers and subscribers
3°) async
I have some tokio tasks and they are doing some subscriber.recv_async() and some session.put(xxx).await.
If I burst the system too much, I get a deadlock: tokio-console shows some tasks always busy (which should not be the case since everything is async).
I don't personally use any Mutex or locking mechanism.
If I pause the debugger and look at the threads to find the root cause, I see this:
- the flume channel is probably full and blocked
- I can never recover because all my threads are doing this, so the executor has no threads left to empty the flume channels

Basically, I am using session.put(xxx) in an async way, but it blocks nevertheless (because of the local subscriber, zenoh tries to put the sample in its local flume channel).
It seems that declaring my subscribers with an unbounded flume channel solves the problem (I cannot reproduce the deadlock anymore, but I am not 100% sure this is the cause).
Does my analysis seem right? What would be the right way to solve the deadlock? Use flume unbounded channels? Something else?

Pierre 'Fox' Avital — Yesterday at 9:30 AM
Hi there,

Your analysis is correct. Specifically, the core of the issue is that Zenoh doesn't have a way to run asynchronous callbacks, so pushing onto the channel can only be done synchronously.

With a bounded channel, if the channel is full, the synchronous implementation has no choice but to block the thread (either parking it or looping on some condition), since it can't return before the operation succeeds or the receiver dies out; in both cases, another task has to run before it can return (either reading from the channel or dropping its receiver end).
With an unbounded one, the deadlock is prevented because the sender task can resolve the issue all on its own: just grow the channel, push the message, and return.

While we are working on supporting asynchronous callbacks, I think using unbounded channels is the correct approach here regardless. Since we run callbacks on the read task to minimize latency, the trade-off is that a callback (even an asynchronous one) that doesn't return will block the reading task, preventing other messages on the same link from being read until the callback returns. Channels let you make the inverse trade-off, allowing parallel processing of messages at the cost of a bit of latency, so we expect them to be commonly used by default.

Conclusion: yes, unbounded channels will solve your issue, and if you expect bursty traffic, are probably the most well-suited channels :)

My personal opinion on bounded channels is that they tend to be overused: rather than being used to apply backpressure (the reason they're Zenoh's default, and IMO the only valid use for them), they're often used as a premature optimization, or to chase the delusion that they give finer control over how much memory a program uses (usually while pushing into an ever-growing hashmap somewhere else used as a cache :p).
TL;DR: your analysis is right and unbounded channels are the solution to everything :)
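For reference, a minimal sketch of the unbounded-channel workaround discussed above, assuming the zenoh 0.x builder API where a subscriber handler can be swapped in with `.with(...)` (the flume crate must be a direct dependency; exact method names may differ between versions):

```rust
use zenoh::prelude::r#async::*;

#[tokio::main]
async fn main() {
    let session = zenoh::open(config::default()).res().await.unwrap();

    // Use an unbounded flume channel instead of the default bounded handler,
    // so a local `put` never blocks on a full channel.
    let subscriber = session
        .declare_subscriber("demo/example/**")
        .with(flume::unbounded())
        .res()
        .await
        .unwrap();

    while let Ok(sample) = subscriber.recv_async().await {
        println!("received {}", sample.key_expr);
    }
}
```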

lomac — Today at 10:09 AM
hi @pierre 'Fox' Avital
new day, new problem 😉
For my subscribers declaration, I now use flume::unbounded, no problem here.
But, I am also using PublicationCache:
zenoh_ext::ArcSessionExt::declare_publication_cache(&session, &key_expr)
.history(1)
.res()
.await
.unwrap();

It seems I have no control over the internal flume channel, and the publication cache uses the bounded version from the DefaultHandler.
Would it be possible for the Zenoh team to add a way to use unbounded channels there as well?

@OlivierHecart
Contributor

Regarding the last comment, I think it would be even better to implement the publication cache with a callback subscriber rather than a flume subscriber. It would imply some mutex but could be better overall.
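A rough illustration of that idea (not the actual implementation): a callback subscriber pushing samples straight into a mutex-protected history map, so no channel capacity is involved. All names below are placeholders, and the zenoh 0.x-style API is assumed:

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};

use zenoh::prelude::r#async::*;

#[tokio::main]
async fn main() {
    let session = zenoh::open(config::default()).res().await.unwrap();

    let history = 1usize;
    // key expression -> last `history` samples, guarded by a mutex instead of a channel
    let cache: Arc<Mutex<HashMap<String, VecDeque<Sample>>>> = Arc::new(Mutex::new(HashMap::new()));

    let cache_for_cb = cache.clone();
    let _subscriber = session
        .declare_subscriber("demo/example/**")
        .callback(move |sample| {
            // Runs synchronously on the read task: only a short critical section, no channel.
            let mut cache = cache_for_cb.lock().unwrap();
            let entry = cache.entry(sample.key_expr.to_string()).or_default();
            if entry.len() == history {
                entry.pop_front();
            }
            entry.push_back(sample);
        })
        .res()
        .await
        .unwrap();

    // A queryable declared on the same key expression would read from `cache`
    // to answer history queries, which is where the mutex contention comes in.
}
```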
