
Issues with storing messages locally when the connection to the cloud is lost #28

Open
pricelessrabbit opened this issue Nov 8, 2020 · 6 comments

@pricelessrabbit
Contributor

As an edge service operating in an industrial context, I expect the data storage and retention implementation to be solid and to protect against data loss in common system-failure events.

As of now, after the Redis stream removal, the current implementation, which delegates message persistence to Paho, fails in common use cases. I opened a PR that fixes some of them, but there is still a major flaw that can cause a large amount of data loss. It is related to a limitation of the Paho client during the first MQTT connection, and AFAIK it cannot be easily solved without a custom storage mechanism like the one that was based on Redis streams.

TEST ENVIRONMENT:

  • local NATS publisher producing messages
  • local export service running
  • export service set up to publish QoS 2 messages to the cloud
  • cloud instance of Mainflux with the export channel configured
  • external MQTT client (mosquitto_sub) connected to the Mainflux channel to check for data loss

TESTED USE CASES:


  • connection lost after successful connection
  1. export service starts and connects to broker
  2. export disconnects from the broker (simulated network issue)
  3. export service reconnects to broker

results:

  • all messages are delivered (correctly managed by paho)

  • restart of export service
  1. export service starts and connects to broker
  2. force restart of export service

result:

  • sometimes some messages are lost

fix:

  • fixed by a PR that persists messages to a file

  • connection lost after successful connection, then restart
  1. export service starts and connects to broker
  2. export disconnects from the broker (simulated network issue)
  3. force restart of export service
  4. export service reconnects to broker

results:

  • all pending messages (produced during the disconnection timeframe) are lost

fix:

  • fixed by a PR that persists messages to a file

  • broker unreachable when service starts
  1. export service starts but cannot connect to broker

OR

  1. export service starts and connects to broker
  2. export service force restarts
  3. export service starts but the broker now is unreachable

results:
the export service shuts down and all messages are lost

fix:

NO FIX. I tried to fix it by disabling the initial Paho token wait that causes the shutdown of the service and letting Paho manage the auto-reconnect, but there is a known limitation (eclipse-paho/paho.mqtt.golang#77): Paho seems to store messages locally and manage reconnects only if the first connection was successful.

This, IMHO, is a major flaw of the export service, which becomes essentially useless in contexts that require reliability even when the service is started without an internet connection.

Imagine the situation in which the edge hardware is restarted but there is no internet connection that day. All of that day's messages are lost.


Note about #19

How can the export service guarantee message reliability in the case of a future HTTP publisher implementation, if reliability is delegated to the MQTT library?

@drasko
Contributor

drasko commented Nov 8, 2020

broker unreachable when service starts
export service starts but cannot connect to broker
OR

export service starts and connects to broker
export service force restarts
export service starts but the broker now is unreachable
results:
export service shutdown, all messages are lost

fix:

NO FIX. I tried to fix it by disabling the initial Paho token wait that causes the shutdown of the service and letting Paho manage the auto-reconnect, but there is a known limitation (eclipse-paho/paho.mqtt.golang#77): Paho seems to store messages locally and manage reconnects only if the first connection was successful.
This, IMHO, is a major flaw of the export service, which becomes essentially useless in contexts that require reliability even when the service is started without an internet connection.

Imagine the situation in which the edge hardware is restarted but there is no internet connection that day. All of that day's messages are lost.

@pricelessrabbit I must say that I do not understand this problem well. If the export service is started and initially connects to the broker (this is part of the obligatory gateway provisioning process, without which no sensors are connected to the gateway and no data comes to it), then a QoS 2 session is enabled immediately.

If Export comes back up, it will try to re-establish the session with the broker, but if the broker is not there, Export can just retry or exit. If it exits, it should be restarted by a supervisor (orchestrator, K3s, systemd, ...) and then it tries to connect again.
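The supervisor restart described here could be expressed, for example, as a minimal systemd unit (a sketch; the unit name and binary path are assumptions, not the project's actual packaging):

```
[Unit]
Description=Mainflux Export service
After=network-online.target
Wants=network-online.target

[Service]
# Path is an assumption; adjust to the actual install location.
ExecStart=/usr/local/bin/mainflux-export
# Restart unconditionally so the service keeps retrying the broker.
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Note that this only restarts the process; it does nothing about messages produced while the process is down, which is exactly the gap being discussed.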

But I do not see how messages can be lost here, unless you mean the time while Export restarts?

Do I understand this correctly?

Additional question, for the case when there is a network issue:

connection lost after successful connection, then restart
export service starts and connects to broker
export disconnects from the broker (simulated network issue)
force restart of export service
export service reconnects to broker

Why do you restart the Export service in this case? Isn't there a way to loop-reconnect without restart?

@drasko
Contributor

drasko commented Nov 8, 2020

@dusanb94 @mteodor @nmarcetic can you please take a closer look into this?

@pricelessrabbit
Contributor Author

pricelessrabbit commented Nov 9, 2020

If Export comes back up, it will try to re-establish the session with the broker, but if the broker is not there, Export can just retry or exit. If it exits, it should be restarted by a supervisor (orchestrator, K3s, systemd, ...) and then it tries to connect again.

@drasko this is the tricky part, IMHO.

Let me try to explain the issue with an (exaggerated) example:

day 1: provisioning and bootstrap of the edge gateway; Export starts delivering data to the cloud
day 2: all OK
day 3: a power failure brings the edge system down. Export restarts, but (OMG) the internet router no longer works after the power loss. So for all of day 3 the export service starts and exits (and gets restarted by the orchestrator)
day 4: a technician fixes the router issue

result:
all of day 3's data is lost.

But if you change the example and keep Export up (excluding it from the power failure), the data will be delivered.
Maybe I'm missing something, but this is at least strange behaviour.

In other implementations (for example https://docs.edgexfoundry.org/1.3/microservices/core/data/Ch-CoreData/) there is no such issue: the edge data is always collected, and it is sent to the cloud when the connection becomes available.

@drasko
Contributor

drasko commented Nov 14, 2020

@pricelessrabbit I see. The idea here would be that Export does not restart when the cloud is unreachable, but stays up without a connection, retrying MQTT CONNECT and storing QoS data locally.

This is actually how it should work, and then none of the data will be lost.
And then we have this nice functionality even without Redis (whose removal was the goal: simplification).

Apart from this, we should consider NATS 2.0 JetStream, or probably even better add Liftbridge to enable durable replicated logs for NATS; then any subscriber (whether it is Export, a Writer, or any client's on-gateway app) will not lose messages, as Liftbridge would preserve them.

@pricelessrabbit
Contributor Author

pricelessrabbit commented Nov 25, 2020

Maybe mine is a special use case, but in my case it can happen that, for external reasons, the exporter restarts when there is no internet connection, and then important measurements and events are lost. So, if you make a choice between JetStream and Liftbridge (or maybe also NATS Streaming Server), I can try to provide an implementation based on it.
As a simpler solution, I had a look at key-value stores like https://github.com/dgraph-io/badger, which can be embedded into the exporter without external services/servers.

@drasko
Contributor

drasko commented Nov 25, 2020

Maybe mine is a special use case, but in my case it can happen that, for external reasons, the exporter restarts when there is no internet connection, and then important measurements and events are lost.

This is not your special case; I think this is a bug in our current implementation. I suggested that the Export service not be restarted in the case of remote MQTT connection loss, but stay up and keep retrying. While it is up, it will receive the messages, try to forward them via MQTT, and, because the remote connection is broken, they will be stored locally.

I agree that Badger would do the job, but there are drawbacks:

  • Badger is only 64-bit compatible, so it will not run on 32-bit gateways
  • I would like Export to be easily scalable, and for this it must be stateless

Redis helped with this "statelessness"; QoS 2 probably does not (but rather falls back to a Badger-like solution, where each instance of Export keeps its own local storage).

So, for me there are 2 correct solutions to this:

  1. Bring back Redis
  2. Add a queue at the level of NATS, as I explained; I lean towards Liftbridge here

Let's contemplate this a bit with @mteodor and make some decisions by the end of the week. In the meantime we need to do a PoC with Liftbridge and explore its scalability and how it extends NATS.

TBH, I would prefer the Liftbridge solution, because it would help other services as well (not only Export) to prevent message loss: they would mark a message as consumed in Liftbridge only once it has really been used (for example, sent to the remote MQTT broker in the case of Export).
