Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Message Tracing by metric log #34

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions designs/message-trace/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
---
state: draft
---

# Message Tracing by metric log

## Summary

To enable precise tracing of each message, we propose to make the runtime emit
log message whenever the message processing is finished. For more complex
tracing for nodes which splitting or joining the messages, these nodes may
emit a *correlation* log to provide information of
which message is depend on another message.

## Authors

- @k-toumura

## Details

### Use case

In debugging, we want to trace the messages. Currently, we use Debug nodes to confirm that a process in some node is finished. But:

- we can't know whether the message is received and start processing in some node.
- inserting Debug nodes is cumbersome process, and readability become worse.

By enabling metric log, we can see message IDs that is sent/received.
But, if the node doesn't send a message, we can't confirm that
the processing in the node is finished.

Moreover, if the node send a message asynchronously, it is difficult
to analyze a causal relationship between messages.

For example, there is a flow configured like following diagram:
![sample flow](sampleflow.png)
and the runtime emittted the following metric logs.
![sample log](samplelog.png)

Node "D" receives message "X", run join process, and then emit message to node "E",
but, we can't confirm that messages "Y" and "Z" been processed, or still queued in node "D".

### Requirement

To trace a message, we should know:
- when the processing of the message is finished.
- which messages are related to the message.

### Proposed changes

#### Metric log for message process completion in a node

Emit metric log when nodes calls `done()` callback, such like:
```
26 Feb 09:33:26 - [metric] {"level":99,"nodeid":"8b1f2328.8e972","event":"node.join.done","msgid":"2522bce.e667944","timestamp":1582677206863}

```
where:
- nodeid: id of a node
- event: `node.NODETYPE.done`
- msgid: id of a message that the node has finished processing
- timestamp: Unix time in millisecond

#### Metric log to correlate between messages

Emit metric log when sent a message which is related to other messages using:
```
node.metric("correlate", msg, relatedMsgIds)
```
where:
- msg: message in process. For example, when join node send a concatenated message, use the message as an argument of `node.metric()`.
- relatedMsgIds: array of ids of related messages.

Note: we can put more complex information on the third argument of `node.metric()`. For example, we can make it a object which contains a array of message ids and a relation type such as `"isPartOf"`, `"splitInto"`, `"isDependsOn"`, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to decide if it's going to be a more complex object from the start - it isn't something we can change later.

In the Join node scenario, which is the main scenario that requires this correlate metric, I would suggest it looks like:

{ "join":["A", "B", "C" ...]}

That gives us room to introduce other correlate messages if they prove necessary.


### Usage example

For the example shown in Use cases, the proposed metric logging makes message traceable.

![Correlation](correlationlog.png)

- by `node.join.correlate` event, we can assume that message X is related to message Y and Z.
- by `node.join.done` event, we can assume that processing of message X, Y and Z are completed.

## Related topics

- [Metrics Usage design note](../metrics-usage.md)
- [Graceful Shutdown design note](../gracerul-shutdown/README.md)
- [Pluggable Message Routing](../pluggable-message-routing.md)

## History

- 2020-07-14 - Initial proposal submitted

Binary file added designs/message-trace/correlationlog.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added designs/message-trace/sampleflow.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added designs/message-trace/samplelog.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.