-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: calculation of node uptimes by looking at recentUptime
#414
Conversation
Fix accumulated uptime calculation in NodeData This commit addresses issues in the accumulated uptime calculation for nodes in the network. Previously, the calculation could lead to inflated uptime values due to repeated additions of time since the last known activity markers, regardless of the node's actual activity status. The updated logic now accurately calculates the uptime based on the node's activity periods, ensuring that only the duration of actual activity is added to the accumulated uptime. For nodes that have left the network, the uptime is calculated as the difference between the last left and last joined timestamps. For active nodes, the uptime since the first join is considered, preventing overestimation. This change ensures a more accurate representation of node uptime, enhancing the reliability of network monitoring metrics. - Add precise calculation for nodes that have left the network - Implement dynamic uptime update for active nodes - Prevent repeated addition of the same uptime duration - Ensure accurate tracking of node activity periods
recentUptime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out that the UpdateAccumulatedUptime method is never called in the code. It is calling CalculateAccumulatedUptime() instead. The issue still exists. I agree with enforcing at least the limit that the uptime cannot be greater than now - first joined. This will at least keep it from getting overinflated.
The fundamental problem is that the Join event happens at the moment of connection and is very reliable; however, the Leave event occurs after the node polls the PeerStore and cannot make a connection to the node. At that time it will fire a leave event. This could be 20-30 seconds after the node actually left and it may have then joined once again which disrupts this calculation badly.
At one point I had written code to account for this. I don't think it is still there but as I recall it would check on the Join events if the node was still in an active status (meaning it joined again before a leave event was captured)
I will track this down and make sure it is incorporated the the changes you have made and is in code that is being called.
The calculation methods for current uptime and accumulated uptime have been commented out and their calls removed from the join and leave event methods. Instead, they are replaced with an update method called when a node leaves. Additionally, before processing an expired connect event, the node is now forced to leave to ensure correct timestamp updating.
@teslashibe @jdutchak let me know if you have any issues with my logic here. Calling CalculateCurrentUptime on a join event makes no sense as it is being done. It should be zero at that point, so I'm removing it. Using the same logic, there should be no change in the Accumulated uptime on a join event, so it should not be called from the Join method. Uptime in both cases should be calculated in real time while the node is active and only the accumulated uptime should only be updated when there is a Leave event, otherwise it is "last known accumulated uptime + current uptime". It is possible there is some kind of issue still outstanding with multiple leave/join events that throws off the calculation. I am unable to reproduce that situation. A check to ensure that the value of accumulated uptime is not greater than now - first joined will mitigate highly inflated numbers, but this situation reflects an outstanding bug in the system. In our logic to handle the multiple join/leave events, adding a leave event before allowing an expired join event should ensure that the calculation is maintained correctly. I am commenting out the Calculate function, it doesn't seem correct and should only be called from one place. |
It took nearly all day, but I found the source of the bug and captured a screenshot. The situation is when the node leaves and joins again, it will "self identify" to the boot node. In the self identification all of the time values are zero, but a few are set on the Join method for the node data. The problem was that the update to the underlying SafeMap of the NodeDataTracker is setting the value to the incoming nodeData after it has updated all of the fields on the existing nd (nodeData) pointer. This was the mistake, it should always send the nd (nodeData) pointer. |
The previous version of the code attempted to persist a wrong instance into the `nodeData` map. The update corrects this issue by ensuring that `nodeData` is updated with the new instance `nd`, which contains all recent modifications.
The IsActive attribute in the NodeEventTracker has been removed because it should not be set by other nodes. It should only be manipulated by the join/leave events tracked by this node.
@restevens402 thanks - looking forward to getting this into test and testing. We should clear the current nodeData file on the network and reset to a blank slate when this gets deployed so we can test with new data. cc @mudler @jdutchak @j2d3 to make sure with this deployment we have the "clean slate" of the database :) |
@mudler @jdutchak I do not have time to review when I am OOO this week would you mind 🍐 'ing with @restevens402 to get this merged and then deployed with a clean database so we can validate the data? cc @lacyg4 @Luka-Loncar @giovaroma - we will need to communicate this update with the community so they can make sure the version of their node software is updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excellent @restevens402 - this has been a tough one tysm! I tested this on my local and dev network and it appears to be solved. will monitor in any case!
@restevens402 on test still seeing this, I wonder if the nodes being on different versions is the issue here, you can use the following script to test against: https://github.com/masa-finance/awesome-masa/tree/main/tests/protocol
|
@teslashibe thank you for the list. Discouraging. I will look at the individual records and see if I can diagnose. This isn't the egregious problem that was solved, but there is something quite off here. |
Problem
This addresses and closes #413
The revised
UpdateAccumulatedUptime
function addresses several key issues present in the original implementation, offering a more accurate calculation of a node's accumulated uptime. Here's how it addresses specific concerns:When the Node Has Left
ActivityLeft
recentUptime
as the difference betweenLastLeftUnix
andLastJoinedUnix
, the function now accurately adds only the duration of the last session to the accumulated uptime. This prevents the overestimation seen when the function was previously adding the time sinceLastLeftUnix
on every call.When the Node Is Still Active
ActivityJoined
FirstJoinedUnix
. This approach ensures that the accumulated uptime reflects all the active time up to the current moment.currentUptime
is greater than the storedAccumulatedUptime
, it prevents the repeated addition of the same uptime duration. This logic helps in maintaining an accurate count of the uptime, especially in scenarios where the node remains active without rejoining or leaving.General Improvements
Remaining Considerations
ActivityJoined
orActivityLeft
and the corresponding timestampsLastJoinedUnix
,LastLeftUnix
. Ensuring these values are accurately maintained is crucial for the function to work as intended.Deployment Notes
In order to test this we will need to reset the entire networks version of nodeData and start from scratch @Luka-Loncar @mudler @jdutchak @j2d3 in the next release. We should just test this live in
test
it can't get anymore broken.