Experiments about node (ELU) profiling #71
bartbutenaers
started this conversation in
Design Proposals
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Hey guys,
I would like to share here some experiments about building a Node-RED node profiler, to allow users to determine quickly which nodes in their flows are causing a performance drop.
However a few evenings of spare time was unfortunately NOT enough to find a working solution, but I wanted to share it here anyway so everybody can share ideas.
Use case
Suppose you have a flow with a lot of nodes, and it is becoming horrible slow. That could have multiple causes:
In the latter case it would be nice if there was a way to determine easily which nodes are consuming most CPU.
ELU vs CPU
From what I have read, it seems not possible to measure the CPU that is being used by NodeJs/V8 to execute a function call. On the other hand it is possible to measure the time it takes to execute a single function.
The NodeJs performance API allows us to measure the ELU (Event Loop Utilization), i.e. measure how much time it took for the event loop to execute a function:
I have not analyzed in depth the difference between ELU and CPU, but e.g. in this article you can see the the chart of the CPU and ELU are sometimes completely identical. But sometimes not. For example the garbage collection by the V8 engine consumes CPU, which means there is less processor power available to execute the application which runs on the V8. In this case the application is NodeJs, whose event loop will not get much processor resources to execute function calls. So in that case the ELU will be lower than the CPU.
Summarized: the ELU shows how much your application is doing, while the CPU shows how much the entire system is doing (inclusive garbage collections, and so on...). But if we can't measure the CPU of a node, then ELU is the (only) way to go...
Difficulty measuring ELU of a node
It seems easy to measure the ELU of a node, by measuring - via the above code snippet - how long it takes to process a message inside a node. However you only measure all the code that is being called synchronous within that node's function.
But that node can also do a lot of processing asynchronous (via timers, promises), which we cannot measure. And those asynchronous function calls can start again their own asynchronous processing, resulting in an entire subtree of background processing:
And we really need to measure the asynchronouse: because asynchronous processing is used a lot in nodes that do a lot of heavy work (e.g. image/video/audio processing). And that are the most interesting nodes to analyze in a performance study...
ClinicJs as starting point
As described in this discussion I was rather fascinated by ClinicJs, because they somehow managed to show my Node-REF flow as "usage bubbles":
And those bubbles even seemed to take into account usage caused by asynchronous processing.
So I had a quick look at their code, and it works like this:
Due to time limit I have not analyzed their node-clinic-bubbleprof repo further in depth. Moreover I wanted to have a live profiling, instead of writing to a file which is being analyzed afterwards. But I think they have very good reasons why they do it afterwards, as I will explain further on ...
Basics of Async hooks
The NodeJs Async hook API allows us to keep track of the lifecycle of asynchronous resources, because it provides hooks that are called for the following events on every async resource:
The difference between init and before/after: init is called once, while before/after could be called never or multiple times. For example when you create a timer (via setInterval), then the before/after will be called every timer the callback function is being called.
Such an hook can be implemented easily:
You only need to be aware not to call an asynchronous feature (e.g. console.log) inside these functions, otherwise you get into an endless loop causing a stack overflow...
Moreover in every NodeJs code snippet, you can get the async id of the current async resource (in which your current code snippet is being executed):
This async id tells you in which execution context this code snippet is being executed.
The async hook
init
method contains a very interesting parameter "triggerAsyncId, which is the id of the parent async resource (in whose execution context the current async resource was created).Now we need to find a way to start tracing async calls when a message arrives in a node, and link all nested async calls to that node...
Linking async resources to a node
It was not clear to me how to link all the nested async resources to a node in my flow. Until I found some examples (e.g. here) of how async hooks are being called to do profiling of http requests inside an ExpressJs webserver. Summarized: as soon as a http request arrives, they get the current executionAsyncId and store that in a map together with the request id. Every time the init function is called, they lookup the parent id in that map and link everything together...
So we need to do something similar:
We need to find the root async resource, which is dedicated to a single a node and in which the processing of a message is being executed. Seems that (in the handlePredeliver) a new async resource is started via an setImmediate call:
That async context seems to me very tightly connected to the next node where the message will arrive, so I will use this async resource as my root async resource (that needs to be linked to the node).
When a message arrives in that async context, we need to start the tracing and link it to the node (and to the message). And we need to do that just before the message arrives at the node, to avoid that we start profiling code that is not related to the node itself (e.g. code of the Node-RED message hook mechanism). To do that, I have used a Node-RED
onReceive
message hook (which is triggered after the preDeliver but before the on-input of the node):Now that we have a starting point in our map, we need to watch every async resource being created from here on:
I had created and enabled this hook from within a test node, but it should be done somewhere in the Node-RED core code. You should create the hook ONCE, because it will handle the events of ALL the async resources. The hook should also be disabled by default, and only enabled in case of troubleshooting. Because async hooks will result in a lot of extra processing!
AGGREGATION ISSUE
This is the part where I was completely stuck.
My original idea was to show live the ELU of selected nodes, for example on top of the node. However that means that you need to know e.g. every second how much ELU has been used during the previous second.
However it seems to me that we can only calculate the ELU of an async resource, inside the
after
event. But that means that we can only calculate a long duration processing, when the entire processing is finalized.That is too late if you want to show the ELU live.
OTHER LIMITATIONS
As we have discussed above, you can only link async resources to a specific node if there exists a root async resource that is linked tightly to that node. In that root async resource the syncrhonous message processing code of that node should be executed, and that root async resource should only be used by this particular node.
This will result in two extra limitations, due to the way the Node-RED core currently creates its async resources:
The startup code of a node might also cause performance issues. A node can do a lot of processing at startup (or after a deploy) both synchronous and asynchronous. A node could for example start a timer, that executes heavy code continiously. However I don't think this can be profiled, because Node-RED starts an async resource that will call the synchronous startup code of ALL the nodes in one big sequence. However that means that there is only ONE root async id for all the async resources:
As a result there is no way you can link (in the
init
function) an async resource to a specific node. Because an async resource can start a nested async resource when the sequential startup code of a NEXT node in the flow is already being executed. At this point in time you can never have a reference somewhere to the original node...Imho I think startup profiling is only possible if the startup code of each node would be executed in a separate async resource. But not sure what the other implications of such a design change would be, so perhaps this is not possible ...
When the user disables asynchronous processing in his settings, then the
setImmediate
will be skipped and the node message processing will be called synchronous:As a result there will be (similar to the startup code) one single root async resource (id) for ALL nodes. I have not tested this, but I am pretty sure that in this case profiling will also not possible...
CONCLUSION
I don't think it is possible to show the ELU of a node live.
What might be possible using the above mechanism: start a profiling, and end it after some time. At that moment you might be able to aggregate all the results. That way you can calculate:
Although I have not a ready-to-go solution, hopefully you find this useful information to get insights into the profiling.
Of course there might be better ways to solve it, which I am not aware of...
Bart
Beta Was this translation helpful? Give feedback.
All reactions