Handle Action Error #271

Open
pneuschwander opened this issue Jun 17, 2018 · 2 comments

@pneuschwander

Hello guys,
how can errors be handled when using messageHubFeed as a trigger for an OpenWhisk action?

Let's take the following example scenario:
TopicA contains the messages: M1, M2, M3, M4, M5

The openwhisk action Action1 is bound to a trigger for TopicA.

Action1 persists messages in Cloudant.

The trigger is successfully fired with {"messages": [M1, M2, M3]}.
Now assume that Cloudant is unavailable or the action crashes/fails.
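
To make the failure mode concrete, here is a minimal sketch of what Action1 might look like as a Python action. The Cloudant endpoint, credentials and exact message field names are placeholders I am assuming for illustration, not taken from the package docs:

```python
# Hypothetical Action1: persist each message from the trigger payload into Cloudant.
# The messageHubFeed trigger delivers the batch as params["messages"]; the field
# names used below ("topic", "offset", "value") are assumptions for this sketch.
import requests

CLOUDANT_DB_URL = "https://ACCOUNT.cloudant.com/mydb"   # placeholder endpoint
CLOUDANT_AUTH = ("apikey", "SECRET")                     # placeholder credentials

def main(params):
    messages = params.get("messages", [])
    for msg in messages:
        doc = {
            "topic": msg.get("topic"),
            "offset": msg.get("offset"),
            "payload": msg.get("value"),
        }
        # If Cloudant is unreachable, this raises and the whole activation fails,
        # but the Kafka offset has already been committed by the feed.
        resp = requests.post(CLOUDANT_DB_URL, json=doc, auth=CLOUDANT_AUTH, timeout=10)
        resp.raise_for_status()
    return {"persisted": len(messages)}
```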

As far as I know, the offset has already been committed, so these messages won't ever be redelivered/retried.
And subsequent trigger/action invocations (in case of Cloudant being down for, let's say, 5 minutes) may end the same way.

So to sum up: if the action fails, messageHubFeed ignores that and fires the trigger for the next messages, whether or not they can be processed. In the worst case, all messages get delivered but none are ever successfully processed.
In such a case it would be nice to pause the delivery until the action can process the messages again.

"Messages can't currently be processed, it is not good to deliver more of them, let's queue them up (kafka can do this) and try to continue delivery in 5 Minutes".

Of course I understand that a poison message should not halt processing and may be skipped. But what can we do in such a "database is down" scenario?

Can/Should the processing be paused?

Do we need to monitor the activation records and manually resolve all the failed ones?

Should all affected messages be sent to a Dead-Letter-Queue/Topic?
And what if that fails, too (timeout, network partitioning, ...)?

Does anyone have some ideas or experience on how to deal with that kind of scenarios?

@jberstler

@regmebaby Right now, whoever fires the trigger actually gets no feedback at all about whether any connected actions even run, let alone whether those actions succeed. This is as designed, to keep trigger firing as lightweight and quick as possible. As such, the Kafka/Message Hub event provider can't automatically know that your actions failed and skip backwards to re-fire triggers for those messages.

On top of that, as far as I know, it is currently not possible to pause an event provider to stop it from firing triggers. But even if that were possible, there is also no way to tell the Kafka/Message Hub trigger to rewind and re-fire for a specific message or offset.

So... what to do in this situation? If you need to guarantee that every message is processed, I think you will need to handle it in your action.

One way to handle this situation is to persist somewhere (Cloudant? Redis?) information about the messages that failed processing. You could persist either the entire message contents or, perhaps, just the topic and offset for the message, as this is contained in the trigger payload sent to the action handling the messages. In either case, you could then have a periodic trigger that fires every so often to retry processing on messages that need it. This trigger would fire an action (sketched after the list below) that:

  1. Examines your persisted store of messages that failed processing
  2. Attempts to process them by invoking the right action(s)
  3. If successful, removes the message from the store (or marks it as being successfully processed)
  4. If processing fails, leaves those messages in place for another retry the next time the periodic trigger fires
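
A rough sketch of such a retry action in Python, assuming the failed messages were saved to a Cloudant database and that the processing action is re-invoked through the OpenWhisk REST API; all URLs, credentials, database and action names below are placeholders, not part of this package:

```python
# Hypothetical retry action, fired by a periodic (alarms) trigger.
# Assumes the processing action saved failed messages into a Cloudant database
# named "failed_messages"; every endpoint and credential here is a placeholder.
import requests

CLOUDANT_DB = "https://ACCOUNT.cloudant.com/failed_messages"
CLOUDANT_AUTH = ("apikey", "SECRET")
OW_ACTION_URL = ("https://openwhisk.example.com/api/v1/namespaces/"
                 "my_namespace/actions/process-message?blocking=true")
OW_AUTH = ("ow_user", "ow_key")

def main(params):
    retried, still_failing = 0, 0
    # 1. Examine the persisted store of messages that failed processing.
    rows = requests.get(CLOUDANT_DB + "/_all_docs?include_docs=true",
                        auth=CLOUDANT_AUTH, timeout=10).json().get("rows", [])
    for row in rows:
        doc = row["doc"]
        # 2. Attempt to process the message again by invoking the right action.
        resp = requests.post(OW_ACTION_URL, json={"messages": [doc["message"]]},
                             auth=OW_AUTH, timeout=60)
        if resp.ok:
            # 3. On success, remove the record from the store.
            requests.delete(f"{CLOUDANT_DB}/{doc['_id']}?rev={doc['_rev']}",
                            auth=CLOUDANT_AUTH, timeout=10)
            retried += 1
        else:
            # 4. On failure, leave the record for the next periodic run.
            still_failing += 1
    return {"retried": retried, "still_failing": still_failing}
```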

Should all affected messages be sent to a Dead-Letter-Queue/Topic?

I believe this has been discussed at some point, but only for scenarios where the trigger fails to fire. There is no way to make the event provider do this for you when the trigger successfully fires but the triggered actions fail.
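
If you do want a dead-letter topic despite that, the publishing would have to happen from your own action when it detects a failure. A rough sketch using the kafka-python client, with placeholder broker and topic names (Message Hub would additionally need SASL/SSL credentials):

```python
# Hypothetical dead-letter publish, done from inside the processing action itself,
# using the kafka-python client. Broker address and topic name are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker.example.com:9093"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def send_to_dead_letter(message, error):
    # Publish the failed message plus the error reason to a dead-letter topic.
    # This can of course fail as well (timeout, network partition), exactly as the
    # original question points out, so it is only a best-effort fallback.
    producer.send("TopicA.DLQ", {"message": message, "error": str(error)})
    producer.flush(timeout=10)
```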

I hope this helps.

@HunderlineK commented Oct 23, 2019

@jberstler There are many scenarios in which it will not be possible for the actions to log the failed messages, e.g. an out-of-memory exception will terminate the action process before it even starts, a network connection issue will prevent the action from initiating, etc.

Without support for at-least-once delivery, this package cannot be used for any use case where data integrity is critical.

Even a mediator that receives all the messages from the event source and then compares them against the messages successfully processed by the actions is not completely reliable, as messages might fail to reach the mediator's comparison queue for the same reasons that they fail to be recorded by the actions.

Basically, an at-least-once delivery mechanism is necessary to use the package for any use case that requires data integrity.
