Experiments about node (ELU) profiling #71

bartbutenaers · 2022-02-24T23:37:12Z

bartbutenaers
Feb 24, 2022

Hey guys,

I would like to share here some experiments about building a Node-RED node profiler, to allow users to determine quickly which nodes in their flows are causing a performance drop.

However a few evenings of spare time was unfortunately NOT enough to find a working solution, but I wanted to share it here anyway so everybody can share ideas.

Use case

Suppose you have a flow with a lot of nodes, and it is becoming horrible slow. That could have multiple causes:

It might be that the load is spreaded rather equally across a lot of nodes, which means that the system is simply overloaded.
But it might also be that one or a couple of nodes are consuming most of the CPU.

In the latter case it would be nice if there was a way to determine easily which nodes are consuming most CPU.

ELU vs CPU

From what I have read, it seems not possible to measure the CPU that is being used by NodeJs/V8 to execute a function call. On the other hand it is possible to measure the time it takes to execute a single function.

The NodeJs performance API allows us to measure the ELU (Event Loop Utilization), i.e. measure how much time it took for the event loop to execute a function:

const elu = eventLoopUtilization();
doSomething();
console.log(eventLoopUtilization(elu).utilization);

I have not analyzed in depth the difference between ELU and CPU, but e.g. in this article you can see the the chart of the CPU and ELU are sometimes completely identical. But sometimes not. For example the garbage collection by the V8 engine consumes CPU, which means there is less processor power available to execute the application which runs on the V8. In this case the application is NodeJs, whose event loop will not get much processor resources to execute function calls. So in that case the ELU will be lower than the CPU.

Summarized: the ELU shows how much your application is doing, while the CPU shows how much the entire system is doing (inclusive garbage collections, and so on...). But if we can't measure the CPU of a node, then ELU is the (only) way to go...

Difficulty measuring ELU of a node

It seems easy to measure the ELU of a node, by measuring - via the above code snippet - how long it takes to process a message inside a node. However you only measure all the code that is being called synchronous within that node's function.

But that node can also do a lot of processing asynchronous (via timers, promises), which we cannot measure. And those asynchronous function calls can start again their own asynchronous processing, resulting in an entire subtree of background processing:

And we really need to measure the asynchronouse: because asynchronous processing is used a lot in nodes that do a lot of heavy work (e.g. image/video/audio processing). And that are the most interesting nodes to analyze in a performance study...

ClinicJs as starting point

As described in this discussion I was rather fascinated by ClinicJs, because they somehow managed to show my Node-REF flow as "usage bubbles":

And those bubbles even seemed to take into account usage caused by asynchronous processing.
So I had a quick look at their code, and it works like this:

Start ClinicJs, who starts Node-RED in turn.
They use the NodeJS Async hooks API to find the links between individual asychronous function calls.
They stream that information to a profile file.
When Node-RED is closed, they start analyzing the file.
During the analysis they aggregate the usage of the asynchronous function calls, based on the information recorded in step 2.

Due to time limit I have not analyzed their node-clinic-bubbleprof repo further in depth. Moreover I wanted to have a live profiling, instead of writing to a file which is being analyzed afterwards. But I think they have very good reasons why they do it afterwards, as I will explain further on ...

Basics of Async hooks

The NodeJs Async hook API allows us to keep track of the lifecycle of asynchronous resources, because it provides hooks that are called for the following events on every async resource:

init: called when an async resource is initialized.
before: called just before the callback associated with the async resource is executed.
after: called just after the callback associated with the async resource has been executed.
destroy: called after the resource is destroyed.
promiseResolve: called when the resolve function of the Promise constructor is invoked.

The difference between init and before/after: init is called once, while before/after could be called never or multiple times. For example when you create a timer (via setInterval), then the before/after will be called every timer the callback function is being called.

Such an hook can be implemented easily:

const timeoutHook = async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
      ...
  },
  before(asyncId) {
     ...
  },
  destroy(asyncId) {
      ...
  },
  after(asyncId) {
      ...
  },
});
timeoutHook.enable();

You only need to be aware not to call an asynchronous feature (e.g. console.log) inside these functions, otherwise you get into an endless loop causing a stack overflow...

Moreover in every NodeJs code snippet, you can get the async id of the current async resource (in which your current code snippet is being executed):

const asyncHooks = require('async_hooks');
var currentAsyncIdentifier = asyncHooks.executionAsyncId()

This async id tells you in which execution context this code snippet is being executed.

The async hook init method contains a very interesting parameter "triggerAsyncId, which is the id of the parent async resource (in whose execution context the current async resource was created).

Now we need to find a way to start tracing async calls when a message arrives in a node, and link all nested async calls to that node...

Linking async resources to a node

It was not clear to me how to link all the nested async resources to a node in my flow. Until I found some examples (e.g. here) of how async hooks are being called to do profiling of http requests inside an ExpressJs webserver. Summarized: as soon as a http request arrives, they get the current executionAsyncId and store that in a map together with the request id. Every time the init function is called, they lookup the parent id in that map and link everything together...

So we need to do something similar:

We need to find the root async resource, which is dedicated to a single a node and in which the processing of a message is being executed. Seems that (in the handlePredeliver) a new async resource is started via an setImmediate call:

That async context seems to me very tightly connected to the next node where the message will arrive, so I will use this async resource as my root async resource (that needs to be linked to the node).

When a message arrives in that async context, we need to start the tracing and link it to the node (and to the message). And we need to do that just before the message arrives at the node, to avoid that we start profiling code that is not related to the node itself (e.g. code of the Node-RED message hook mechanism). To do that, I have used a Node-RED onReceive message hook (which is triggered after the preDeliver but before the on-input of the node):

RED.hooks.add("onReceive.asyncProfiling", function(receiveEvent) {
   // Get the current async execution context id
   var currentExecutionContextId = asyncHooks.executionAsyncId();
             
   // Collect information about the current execution context
   var asyncTriggerInfo = {
      msgId: receiveEvent.msg._msgid,
      nodeId: receiveEvent.destination.node.id,
      elu: eventLoopUtilization(), // Start tracing
      type: 'synchronous'
   }
                             
   // This call will be considered the 'trigger' for all the code executed inside the node (where the message will arrive soon afterwards).
   // Note that it has NO use to capture the callstack here, because it contains only frames starting from the setImmediate (timer) and Node-RED hook handling...
    myProfilerMap.set(currentExecutionContextId, asyncTriggerInfo);
});

Now that we have a starting point in our map, we need to watch every async resource being created from here on:

myAsyncHook = asyncHooks.createHook({
   init: function(asyncId, type, triggerAsyncId, resource) {
      // Try to find the parent async id in the map                  
      var asyncTrigger = myProfilerMap.get(triggerAsyncId);
                     
      // Ignore all async resource that have not been triggered initially by a message (in our Node-RED hook), to avoid performance troubles.
      // This way we ignore e.g. all async resources that are being created by the Node-RED framework...
      if(asyncTrigger) {
         var asyncInfo = {
            nodeId: asyncTrigger.nodeId,
            type: type // The type can be 'Promise', 'Timeout', 'Immediate', ...
         }

         // Store the current async id also in the list, together with all required information context
         node.profilerMap.set(asyncId, asyncInfo);
      }
   },
   before: function (asyncId) { 
      // Try to find the current async id in the map    
      var asyncInfo = myProfilerMap.get(asyncId);
                     
      if(asyncInfo) {
         // Start tracing
         asyncInfo.elu: eventLoopUtilization(),
      }
   },
   after: function (asyncId) { 
      // Try to find the current async id in the map  
      var asyncInfo = myProfilerMap.get(asyncId);
                     
      if(asyncInfo) {
          // Calculate the duration of the processing in this async resource
          var elu = eventLoopUtilization(asyncInfo.elu).utilization
          
          // Do something with this elu ............
      }
   },
   destroy: function(asyncId) {                         
      // Decouple the async resource from it's context.
      // I.e. remove the async id from the map , to avoid memory leaking
      myProfilerMap.delete(asyncId);
   });
}

node.asyncHook.enable();

I had created and enabled this hook from within a test node, but it should be done somewhere in the Node-RED core code. You should create the hook ONCE, because it will handle the events of ALL the async resources. The hook should also be disabled by default, and only enabled in case of troubleshooting. Because async hooks will result in a lot of extra processing!

AGGREGATION ISSUE

This is the part where I was completely stuck.

My original idea was to show live the ELU of selected nodes, for example on top of the node. However that means that you need to know e.g. every second how much ELU has been used during the previous second.

However it seems to me that we can only calculate the ELU of an async resource, inside the after event. But that means that we can only calculate a long duration processing, when the entire processing is finalized.

That is too late if you want to show the ELU live.

OTHER LIMITATIONS

As we have discussed above, you can only link async resources to a specific node if there exists a root async resource that is linked tightly to that node. In that root async resource the syncrhonous message processing code of that node should be executed, and that root async resource should only be used by this particular node.

This will result in two extra limitations, due to the way the Node-RED core currently creates its async resources:

The startup code of a node might also cause performance issues. A node can do a lot of processing at startup (or after a deploy) both synchronous and asynchronous. A node could for example start a timer, that executes heavy code continiously. However I don't think this can be profiled, because Node-RED starts an async resource that will call the synchronous startup code of ALL the nodes in one big sequence. However that means that there is only ONE root async id for all the async resources:

As a result there is no way you can link (in the init function) an async resource to a specific node. Because an async resource can start a nested async resource when the sequential startup code of a NEXT node in the flow is already being executed. At this point in time you can never have a reference somewhere to the original node...

Imho I think startup profiling is only possible if the startup code of each node would be executed in a separate async resource. But not sure what the other implications of such a design change would be, so perhaps this is not possible ...
When the user disables asynchronous processing in his settings, then the setImmediate will be skipped and the node message processing will be called synchronous:

As a result there will be (similar to the startup code) one single root async resource (id) for ALL nodes. I have not tested this, but I am pretty sure that in this case profiling will also not possible...

CONCLUSION

I don't think it is possible to show the ELU of a node live.

What might be possible using the above mechanism: start a profiling, and end it after some time. At that moment you might be able to aggregate all the results. That way you can calculate:

The total ELU of a node (based on the nodeId): how much ELU a node has been using in total.
The ELU of messages (based on the _msgId): how much ELO messages have been using.
The ELU of a node per message (based on the combination of nodeId and _msgId): how much ELO a node has been used per message, or for the worst message, ...

Although I have not a ready-to-go solution, hopefully you find this useful information to get insights into the profiling.

Of course there might be better ways to solve it, which I am not aware of...

Bart

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiments about node (ELU) profiling #71

{{title}}

Replies: 0 comments

Select a reply

Experiments about node (ELU) profiling #71

bartbutenaers Feb 24, 2022

Use case

ELU vs CPU

Difficulty measuring ELU of a node

ClinicJs as starting point

Basics of Async hooks

Linking async resources to a node

AGGREGATION ISSUE

OTHER LIMITATIONS

CONCLUSION

Replies: 0 comments

bartbutenaers
Feb 24, 2022