Issues with storing messages locally when the connection to the cloud is lost #28
Comments
@pricelessrabbit I must say that I do not understand this problem well. If the Export service is started and initially connected to the broker - this is part of the obligatory gateway provisioning process, without which there would be no sensors connected to the gateway and no data coming to it - then a QoS2 session is enabled immediately. If Export comes back up, it will try to re-establish the session with the broker, but if the broker is not there, Export can just retry or exit. If it exits, it should be restarted by a supervisor (orchestrator, K3s, systemd, ...) and then it tries to connect again. But I do not see how messages can be lost here - unless you mean the time while Export restarts? Do I understand this correctly? An additional question for the case when there is a network issue:
Why do you restart the Export service in this case? Isn't there a way to loop-reconnect without a restart?
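For reference, a minimal sketch of what such a loop-reconnect could look like with the Eclipse Paho Go client (assuming eclipse/paho.mqtt.golang; the broker URL and client ID are placeholders):

```go
package main

import (
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://cloud.example.com:1883"). // placeholder broker URL
		SetClientID("export-gw-1").                // placeholder client ID
		SetCleanSession(false).                    // keep the QoS2 session across reconnects
		SetAutoReconnect(true).                    // reconnect after a lost connection
		SetConnectRetry(true).                     // also retry the initial connect
		SetConnectRetryInterval(10 * time.Second).
		SetConnectionLostHandler(func(_ mqtt.Client, err error) {
			log.Printf("connection lost, Paho will retry: %v", err)
		})

	client := mqtt.NewClient(opts)
	// With ConnectRetry enabled, Connect keeps retrying instead of
	// failing fast, so the process never needs a supervisor restart.
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Printf("connect error: %v", token.Error())
	}
	select {} // stand-in for the actual export pipeline
}
```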
@dusanb94 @mteodor @nmarcetic can you please take a closer look at this?
@drasko this is the tricky part imho. Let me try to explain the issue with an (exaggerated) example:
day 1: provisioning and bootstrap of the edge gateway, with Export starting to deliver data to the cloud
result:
But if you change the example and keep Export up (i.e. exclude it from the power failure), in that case the data will be delivered. In other implementations (for example https://docs.edgexfoundry.org/1.3/microservices/core/data/Ch-CoreData/) there is no such issue: the edge data is always collected, and sent to the cloud when it becomes available.
@pricelessrabbit I see - the idea here would be that Export does not restart on an unreachable cloud, but just stays up without the connection, retrying MQTT. This is actually how it should work, and then none of the data will be lost. Apart from this, we should consider NATS 2.0 JetStream, or probably even better add Liftbridge, to enable durable replicated logs for NATS; then any subscriber (whether it is Export, Writer or any client's on-gateway app) will not be losing messages, as Liftbridge would preserve them.
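To make the JetStream idea concrete, a minimal sketch (assuming the nats.go client; the stream name, subject space, and forwarding function are made up):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

// forwardToCloud is a hypothetical stand-in for publishing to the remote MQTT broker.
func forwardToCloud(payload []byte) bool { return false }

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable, file-backed stream: messages survive subscriber downtime.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "EXPORT",               // hypothetical stream name
		Subjects: []string{"channels.>"}, // hypothetical subject space
		Storage:  nats.FileStorage,
	}); err != nil {
		log.Fatal(err)
	}

	// A durable consumer resumes from the last acknowledged message, so a
	// message is only marked consumed once it was actually forwarded.
	if _, err := js.Subscribe("channels.>", func(m *nats.Msg) {
		if forwardToCloud(m.Data) {
			m.Ack()
		} // no Ack on failure: JetStream redelivers
	}, nats.Durable("export"), nats.ManualAck()); err != nil {
		log.Fatal(err)
	}
	select {}
}
```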
Maybe mine is a special use case, but in my case it can happen that, for external reasons, the exporter restarts when there is no internet connection, and in that case important measurements and events are lost. So, if you make a choice between JetStream and Liftbridge (or maybe also NATS Streaming Server), I can try to provide an implementation based on it.
This is not your special case; I think this is a bug in our current implementation. I suggested that the Export service not be restarted in the case of remote MQTT connection loss, but stay up and keep retrying. When it is up, it will get the messages, try to forward them via MQTT, and because the remote connection is broken they will be stored locally. I agree that Badger will do the job, but there are drawbacks:
Redis helped with this "statelessness"; QoS2 probably does not, but rather falls back to a Badger-like solution, where each instance of Export keeps its own local storage. So, for me there are 2 correct solutions to this:
Let's contemplate on this a bit with @mteodor and make some decisions by the end of the week. In the meantime we need to do a PoC with Liftbridge and explore its scalability and how it extends NATS. TBH, I would prefer the Liftbridge solution, because it would help other services as well (not only Export) to prevent message loss, as they would mark a message consumed in Liftbridge only once it has really been used (for example, sent to the remote MQTT broker in the case of Export).
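For the PoC, the consumer pattern could look roughly like this (a sketch assuming the go-liftbridge client; the address, subject, and stream names are illustrative):

```go
package main

import (
	"context"
	"log"

	lift "github.com/liftbridge-io/go-liftbridge/v2"
)

func main() {
	client, err := lift.Connect([]string{"localhost:9292"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	// A replicated, durable log attached to a NATS subject.
	if err := client.CreateStream(ctx, "channels.export", "export-stream"); err != nil &&
		err != lift.ErrStreamExists {
		log.Fatal(err)
	}

	// Resume from the earliest retained offset: nothing published while
	// Export was down or offline is lost.
	err = client.Subscribe(ctx, "export-stream", func(msg *lift.Message, err error) {
		if err != nil {
			log.Printf("subscription error: %v", err)
			return
		}
		// Forward msg.Value() to the remote MQTT broker and record
		// msg.Offset() as consumed only after the forward succeeds.
		log.Printf("offset %d: %s", msg.Offset(), msg.Value())
	}, lift.StartAtEarliestReceived())
	if err != nil {
		log.Fatal(err)
	}
	select {}
}
```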
For an edge service operating in an industrial context, I expect the data storage and retention implementation to be solid and to protect against data loss in common system-failure events.
As of now, after the Redis Streams removal, the current implementation, which delegates message persistence to Paho, fails in common use cases. I opened a PR that fixes some of them, but there is still a major flaw that can cause a large amount of data loss. It is related to a limitation of the Paho client during the first MQTT connection, and afaik it cannot be easily solved without a custom storage mechanism like the one based on Redis Streams; a sketch of one possible guard follows, before the test report.
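A rough sketch of such a guard, buffering publishes whenever the client has no open connection (assuming eclipse/paho.mqtt.golang and dgraph-io/badger; the helper name and key scheme are hypothetical):

```go
package export

import (
	"fmt"
	"time"

	badger "github.com/dgraph-io/badger/v3"
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// publishOrBuffer publishes when connected, otherwise persists the message
// to a local Badger buffer, so data produced before the first successful
// connection (Paho's blind spot) is not dropped.
func publishOrBuffer(client mqtt.Client, db *badger.DB, topic string, payload []byte) error {
	if client.IsConnectionOpen() {
		token := client.Publish(topic, 2, false, payload) // QoS 2
		token.Wait()
		return token.Error()
	}
	// Timestamped keys keep buffered messages in arrival order (illustrative scheme).
	key := []byte(fmt.Sprintf("%s/%d", topic, time.Now().UnixNano()))
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set(key, payload)
	})
}
```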
TEST ENVIRONMENT:
TESTED USE CASES:
results:
result:
fix:
results:
fix:
OR
results:
export service shutdown, all messages are lost
fix:
This imho is a major flaw of the export service, which becomes essentially useless in contexts that require reliability even when the service is started without an internet connection.
Imagine a situation in which the edge hardware is restarted on a day when there is no internet connection: all the messages from that day are lost.
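The counterpart to buffering is draining the buffer once the cloud becomes reachable again, e.g. from Paho's OnConnect handler (same assumptions as the sketch above: eclipse/paho.mqtt.golang and dgraph-io/badger; a single topic is assumed for brevity):

```go
package export

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// flushBuffer replays locally buffered messages after a (re)connect and
// deletes each entry only once the broker has acknowledged it.
func flushBuffer(client mqtt.Client, db *badger.DB, topic string) {
	// Collect entries in a read-only view first, then publish and delete
	// them one by one, so a failure leaves the rest of the buffer intact.
	type entry struct{ key, payload []byte }
	var entries []entry
	_ = db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			item := it.Item()
			payload, err := item.ValueCopy(nil)
			if err != nil {
				return err
			}
			entries = append(entries, entry{item.KeyCopy(nil), payload})
		}
		return nil
	})
	for _, e := range entries {
		token := client.Publish(topic, 2, false, e.payload) // QoS 2
		if token.Wait(); token.Error() != nil {
			log.Printf("flush interrupted, remaining entries stay buffered: %v", token.Error())
			return
		}
		_ = db.Update(func(txn *badger.Txn) error { return txn.Delete(e.key) })
	}
}
```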
Note about #19
How can the export service guarantee message reliability in the case of a future HTTP publisher implementation, if that reliability is delegated to the MQTT library?
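One way to frame this: delivery guarantees could live behind a transport-agnostic interface instead of inside the MQTT library. A sketch with purely illustrative names, not taken from the actual codebase:

```go
package export

// Publisher abstracts the outbound transport (MQTT today, HTTP tomorrow).
type Publisher interface {
	// Publish forwards one message; an error means "not delivered".
	Publish(topic string, payload []byte) error
}

// Queue is a durable local store (Badger, Liftbridge, JetStream, ...).
// An entry is removed only after the wrapped Publish succeeds.
type Queue interface {
	Enqueue(topic string, payload []byte) error
}

// ReliablePublisher wraps any Publisher with the durable queue, so the
// guarantee no longer depends on which transport library is underneath.
type ReliablePublisher struct {
	next  Publisher
	queue Queue
}

func (r *ReliablePublisher) Publish(topic string, payload []byte) error {
	if err := r.next.Publish(topic, payload); err != nil {
		return r.queue.Enqueue(topic, payload) // buffer instead of dropping
	}
	return nil
}
```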