fix: calculation of node uptimes by looking at `recentUptime` #414

teslashibe · 2024-07-11T23:41:08Z

Problem

This addresses and closes #413

The revised UpdateAccumulatedUptime function addresses several key issues present in the original implementation, offering a more accurate calculation of a node's accumulated uptime. Here's how it addresses specific concerns:

When the Node Has Left `ActivityLeft`

Improvement: By calculating the most recent active period recentUptime as the difference between LastLeftUnix and LastJoinedUnix, the function now accurately adds only the duration of the last session to the accumulated uptime. This prevents the overestimation seen when the function was previously adding the time since LastLeftUnix on every call.
Accuracy: This method ensures that each period of activity is only counted once, regardless of how many times the function is called after the node has left.

When the Node Is Still Active `ActivityJoined`

Dynamic Calculation: For active nodes, the function calculates the uptime since the node first joined FirstJoinedUnix. This approach ensures that the accumulated uptime reflects all the active time up to the current moment.
Prevention of Overestimation: By updating the accumulated uptime only if the calculated currentUptime is greater than the stored AccumulatedUptime, it prevents the repeated addition of the same uptime duration. This logic helps in maintaining an accurate count of the uptime, especially in scenarios where the node remains active without rejoining or leaving.

General Improvements

Accurate Tracking: The function now more accurately tracks the uptime by considering the entire duration of the node's activity from its initial join to the current time (for active nodes) or until it leaves the network.
Prevents Inflation: By adding checks to ensure uptime is only added based on actual activity periods and by preventing repeated additions of the same duration, the revised function mitigates the risk of inflated uptime values.

Remaining Considerations

Periodic Updates for Active Nodes: While the function improves accuracy, I have not considered how AccumulatedUptime is invoked for long running active nodes. This is particularly relevant for long-running nodes that don't frequently change their activity status.
Synchronization with State Changes: The function's accuracy depends on the timely update of the node's activity status ActivityJoined or ActivityLeft and the corresponding timestamps LastJoinedUnix, LastLeftUnix. Ensuring these values are accurately maintained is crucial for the function to work as intended.

Deployment Notes

In order to test this we will need to reset the entire networks version of nodeData and start from scratch @Luka-Loncar @mudler @jdutchak @j2d3 in the next release. We should just test this live in test it can't get anymore broken.

Fix accumulated uptime calculation in NodeData This commit addresses issues in the accumulated uptime calculation for nodes in the network. Previously, the calculation could lead to inflated uptime values due to repeated additions of time since the last known activity markers, regardless of the node's actual activity status. The updated logic now accurately calculates the uptime based on the node's activity periods, ensuring that only the duration of actual activity is added to the accumulated uptime. For nodes that have left the network, the uptime is calculated as the difference between the last left and last joined timestamps. For active nodes, the uptime since the first join is considered, preventing overestimation. This change ensures a more accurate representation of node uptime, enhancing the reliability of network monitoring metrics. - Add precise calculation for nodes that have left the network - Implement dynamic uptime update for active nodes - Prevent repeated addition of the same uptime duration - Ensure accurate tracking of node activity periods

pkg/pubsub/node_data.go

restevens402

It turns out that the UpdateAccumulatedUptime method is never called in the code. It is calling CalculateAccumulatedUptime() instead. The issue still exists. I agree with enforcing at least the limit that the uptime cannot be greater than now - first joined. This will at least keep it from getting overinflated.

The fundamental problem is that the Join event happens at the moment of connection and is very reliable; however, the Leave event occurs after the node polls the PeerStore and cannot make a connection to the node. At that time it will fire a leave event. This could be 20-30 seconds after the node actually left and it may have then joined once again which disrupts this calculation badly.

At one point I had written code to account for this. I don't think it is still there but as I recall it would check on the Join events if the node was still in an active status (meaning it joined again before a leave event was captured)

I will track this down and make sure it is incorporated the the changes you have made and is in code that is being called.

The calculation methods for current uptime and accumulated uptime have been commented out and their calls removed from the join and leave event methods. Instead, they are replaced with an update method called when a node leaves. Additionally, before processing an expired connect event, the node is now forced to leave to ensure correct timestamp updating.

restevens402 · 2024-07-14T21:00:27Z

@teslashibe @jdutchak let me know if you have any issues with my logic here.

Calling CalculateCurrentUptime on a join event makes no sense as it is being done. It should be zero at that point, so I'm removing it. Using the same logic, there should be no change in the Accumulated uptime on a join event, so it should not be called from the Join method.

Uptime in both cases should be calculated in real time while the node is active and only the accumulated uptime should only be updated when there is a Leave event, otherwise it is "last known accumulated uptime + current uptime".

It is possible there is some kind of issue still outstanding with multiple leave/join events that throws off the calculation. I am unable to reproduce that situation. A check to ensure that the value of accumulated uptime is not greater than now - first joined will mitigate highly inflated numbers, but this situation reflects an outstanding bug in the system.

In our logic to handle the multiple join/leave events, adding a leave event before allowing an expired join event should ensure that the calculation is maintained correctly.

I am commenting out the Calculate function, it doesn't seem correct and should only be called from one place.
I have updated the code to reflect the above description and appreciate another set of eyes on it.

restevens402 · 2024-07-16T23:01:50Z

It took nearly all day, but I found the source of the bug and captured a screenshot. The situation is when the node leaves and joins again, it will "self identify" to the boot node. In the self identification all of the time values are zero, but a few are set on the Join method for the node data.

The problem was that the update to the underlying SafeMap of the NodeDataTracker is setting the value to the incoming nodeData after it has updated all of the fields on the existing nd (nodeData) pointer. This was the mistake, it should always send the nd (nodeData) pointer.

The change is as simple as this:

Error situation found:

The previous version of the code attempted to persist a wrong instance into the `nodeData` map. The update corrects this issue by ensuring that `nodeData` is updated with the new instance `nd`, which contains all recent modifications.

The IsActive attribute in the NodeEventTracker has been removed because it should not be set by other nodes. It should only be manipulated by the join/leave events tracked by this node.

teslashibe · 2024-07-17T10:02:19Z

@restevens402 thanks - looking forward to getting this into test and testing. We should clear the current nodeData file on the network and reset to a blank slate when this gets deployed so we can test with new data.

cc @mudler @jdutchak @j2d3 to make sure with this deployment we have the "clean slate" of the database :)

teslashibe · 2024-07-17T10:03:58Z

@mudler @jdutchak I do not have time to review when I am OOO this week would you mind 🍐 'ing with @restevens402 to get this merged and then deployed with a clean database so we can validate the data?

cc @lacyg4 @Luka-Loncar @giovaroma - we will need to communicate this update with the community so they can make sure the version of their node software is updated.

jdutchak

excellent @restevens402 - this has been a tough one tysm! I tested this on my local and dev network and it appears to be solved. will monitor in any case!

teslashibe · 2024-07-24T23:17:15Z

@restevens402 on test still seeing this, I wonder if the nodes being on different versions is the issue here, you can use the following script to test against: https://github.com/masa-finance/awesome-masa/tree/main/tests/protocol

(awesome-masa) (base) brendanplayford@Brendans-MBP awesome-masa % python tests/protocol/node_data_uptime.py
Test Results:
Success - 6 nodes passed the uptime less than elapsed time test.
Peer ID: 16Uiu2HAmP3LjZc2m4AVcbf32ttyuLQ38rUyvhP8zAT5yTwRD8Lk7 is within the correct bounds.
Peer ID: 16Uiu2HAmVsd1WxZZj6pLMoG9cjVgKmdcfH8RNsETktsSBhPHBZ4g is within the correct bounds.
Peer ID: 16Uiu2HAm1efpreLqXkirZY2XmEkNZmy48RXbhJniNevCsyzLfino is within the correct bounds.
Peer ID: 16Uiu2HAmBPAV8GRe2RZADzvwAsVjQtYhBAqk9r46pkDP4YcSnVCu is within the correct bounds.
Peer ID: 16Uiu2HAmQ7HkzrjJbVkRMbHGe9tfoTDrqHtLZ1J2fsPcsZvqrRMC is within the correct bounds.
Peer ID: 16Uiu2HAmGyu1Vu4QX9axCEacWJotzXzXGzoJ6bSAhpa2aJuNRcUP is within the correct bounds.
Failures - 32 nodes did not pass the uptime less than elapsed time test.
Failure - Peer ID: 16Uiu2HAm7kV7i3JspkSkYX9tqCqy97cYueNZ3gLRZBGXBSms4Mmx, Error: Accumulated uptime 45 days, 10:48:00 is not less than elapsed time 14:03:01.787544
Failure - Peer ID: 16Uiu2HAmKsyNVMXagpV6dUEJSoRPYHdjSRrCXst4RexNJ6wrLZxD, Error: Accumulated uptime 40 days, 12:46:00 is not less than elapsed time 13:14:38.787550
Failure - Peer ID: 16Uiu2HAmKesNMGFvsR3idgvYbNo638aacKdNY3W9hZ6LdhJgRico, Error: Accumulated uptime 39 days, 22:54:00 is not less than elapsed time 13:06:32.787554
Failure - Peer ID: 16Uiu2HAm9tzx58JS8EzrtoJveVf3y4gMjUUded5JN3GJTrH7Jg3R, Error: Accumulated uptime 43 days, 19:37:00 is not less than elapsed time 13:51:06.787558
Failure - Peer ID: 16Uiu2HAmLmPtiPZU1Y6rLQxTHj2eoTaxF4ugszr9zgeyiSmfpkEk, Error: Accumulated uptime 41 days, 15:22:00 is not less than elapsed time 13:34:16.787561
Failure - Peer ID: 16Uiu2HAmS8TtaSwghkPHuiGWL2uzzidCfscjfQFSTLc9E5xoBw5a, Error: Accumulated uptime 80 days, 7:09:00 is not less than elapsed time 19:53:23.787564
Failure - Peer ID: 16Uiu2HAmQPrR2jdnaC3kFZgQXUgyG6e8XF3BvNH66ev5bEebf4bW, Error: Accumulated uptime 13 days, 14:11:00 is not less than elapsed time 10:01:47.787567
Failure - Peer ID: 16Uiu2HAkx1SVDmGdxafeKeSRGMimknbN4wCQ7XPmgZw8ggc7wNxs, Error: Accumulated uptime 51 days, 5:00:00 is not less than elapsed time 15:02:34.787570
Failure - Peer ID: 16Uiu2HAkvKys9BoqdBBEkNytwkJWsPAkJ95psmVJoowWqxXCJbEf, Error: Accumulated uptime 37 days, 11:33:00 is not less than elapsed time 12:51:11.787573
Failure - Peer ID: 16Uiu2HAmJc8qanqkq6j9npX58fDNnR5biB5XbVT5uKaQoh8m2Bi4, Error: Accumulated uptime 9 days, 11:55:00 is not less than elapsed time 8:27:00.787576
Failure - Peer ID: 16Uiu2HAmRsNKycBgvqwPt1AjCwp3pUW55TSoZsvmnQtgkKZTSct5, Error: Accumulated uptime 41 days, 5:18:00 is not less than elapsed time 15:25:55.787582
Failure - Peer ID: 16Uiu2HAmLGgZPXmux6ULeP6H9oitQ4nUkRaAAhJsVMjzPjZznrW1, Error: Accumulated uptime 4 days, 13:38:00 is not less than elapsed time 4:46:37.787585
Failure - Peer ID: 16Uiu2HAmUgcLEkRtDPoHve4v5XAWE2xTxjQLQPpa7DVm2gzpsVpL, Error: Accumulated uptime 4 days, 3:34:00 is not less than elapsed time 10:51:24.787588
Failure - Peer ID: 16Uiu2HAmJSvd8S2usTvYLQgGCmeavtCxxBEd9VxKC6nphLDeLgBf, Error: Accumulated uptime 2 days, 0:52:00 is not less than elapsed time 3:04:51.787591
Failure - Peer ID: 16Uiu2HAm3HswgNGUPqDT1FbX49tTSxXgByZ8c4gAYDFhYzgmNuFR, Error: Accumulated uptime 5 days, 7:55:00 is not less than elapsed time 11:22:01.787596
Failure - Peer ID: 16Uiu2HAmV7reMF6Z1N5uCDc2RiqvTFmX82umXMLSNKBTBA2RbBsk, Error: Accumulated uptime 44 days, 9:00:00 is not less than elapsed time 15:08:03.787599
Failure - Peer ID: 16Uiu2HAkysM8hGmg23v7TwCkemet9bwvg7SKPS4SQXjmUjFc8ofn, Error: Accumulated uptime 2 days, 4:43:00 is not less than elapsed time 3:15:39.787602
Failure - Peer ID: 16Uiu2HAm3tDZvT6xmwqTwxwKDZ6fGXbaiwr26kaVN6HqHacnbLFC, Error: Accumulated uptime 2 days, 16:10:00 is not less than elapsed time 17:19:00.787607
Failure - Peer ID: 16Uiu2HAmJak8NXQaQG3zk5RYLZRtZm4rfqVRGSn6nYAFKDX7DX7b, Error: Accumulated uptime 3:46:00 is not less than elapsed time 0:29:49.787610
Failure - Peer ID: 16Uiu2HAmGLRJsEidseSPYqKYXV67kDdEm6rbo8Vx2Y9AgwXbCJCM, Error: Accumulated uptime 12 days, 15:30:00 is not less than elapsed time 19:17:10.787613
Failure - Peer ID: 16Uiu2HAkx7ctGnTuoT1c2ZBvGFhQrth94CHJmxntCsjaqpkDuG2n, Error: Accumulated uptime 5 days, 5:13:00 is not less than elapsed time 8:24:01.787618
Failure - Peer ID: 16Uiu2HAm1TLgFmY3WVmscWYn7iUmcBAARyCQUDFF5DDvWcXdxVD8, Error: Accumulated uptime 5 days, 4:35:00 is not less than elapsed time 9:27:52.787621
Failure - Peer ID: 16Uiu2HAmH5zRXB1DKzppjHFsZgsbNFW6Kv2WmajaJbtJNNaCrQQh, Error: Accumulated uptime 11 days, 7:39:00 is not less than elapsed time 10:27:36.787624
Failure - Peer ID: 16Uiu2HAm5uCWFT129tbXBNbxMHbZ7r7D83bZCGrp2dyXwdiah5cN, Error: Accumulated uptime 15:43:00 is not less than elapsed time 6:48:06.787627
Failure - Peer ID: 16Uiu2HAmM9aK4befL2t62JjMgfCmAHiSFdNZwpmFqTGnokhwmeqj, Error: Accumulated uptime 1 day, 21:54:00 is not less than elapsed time 8:45:24.787630
Failure - Peer ID: 16Uiu2HAkytpME6VgsrowYbfBP8MMWo3SgWvcvT281GpeovoJHfzK, Error: Accumulated uptime 15:16:00 is not less than elapsed time 14:55:02.787633
Failure - Peer ID: 16Uiu2HAm1Ypq3ELdMgwdWQujxphrtoF9whaUQRFBueUobdtvk6Zu, Error: Accumulated uptime 4 days, 17:22:00 is not less than elapsed time 9:31:12.787636
Failure - Peer ID: 16Uiu2HAm8bW485axj7SPA3rPWLhbNNFnkyGDuDsPdF67g8x5wpzV, Error: Accumulated uptime 1 day, 8:55:00 is not less than elapsed time 4:43:48.787639
Failure - Peer ID: 16Uiu2HAmHtehnbuDeXfivvLPiw3YD4JDyLVwRdtGiedShNimwQtK, Error: Accumulated uptime 11 days, 1:49:00 is not less than elapsed time 18:23:58.787642
Failure - Peer ID: 16Uiu2HAmCSkus3wP9Tk7EgXguiDjgsMheuXNhvhnd4CgBDKdMZiJ, Error: Accumulated uptime 10 days, 19:03:00 is not less than elapsed time 8:08:53.787645
Failure - Peer ID: 16Uiu2HAmTi9VdFnQ5t69nAmkCMueekhe67Goud9uYAVBviPTXzCC, Error: Accumulated uptime 3 days, 9:01:00 is not less than elapsed time 7:31:41.787648
Failure - Peer ID: 16Uiu2HAmEVHcvwjDE1qyHagdmEjoABAk9v4MXKYYU7YdEzwhgU9Y, Error: Accumulated uptime 14 days, 5:15:00 is not less than elapsed time 9:59:52.787653

restevens402 · 2024-07-24T23:54:45Z

@teslashibe thank you for the list. Discouraging. I will look at the individual records and see if I can diagnose. This isn't the egregious problem that was solved, but there is something quite off here.

teslashibe added 5 commits July 11, 2024 18:44

Update workers.go to handle nil pointers

06732ca

handle errors from db and update breaking changes

59432cd

Update monitorworkers and refactor code

75ffbee

Fix sendWork nill pointer

78f7b48

teslashibe added the bug Something isn't working label Jul 11, 2024

teslashibe self-assigned this Jul 11, 2024

teslashibe changed the title ~~Fix Accumulated Uptime Bug #413~~ bug: Fix Accumulated Uptime Bug #413 Jul 11, 2024

teslashibe requested review from mudler and restevens402 July 11, 2024 23:41

mudler changed the title ~~bug: Fix Accumulated Uptime Bug #413~~ fix: calculation of node uptimes by looking at recentUptime Jul 12, 2024

mudler added 3 commits July 12, 2024 09:52

Merge branch 'test' into teslashibe/#413

56561ff

Merge branch 'test' into teslashibe/#413

2b2852e

Merge branch 'test' into teslashibe/#413

3d09bb7

mudler reviewed Jul 12, 2024

View reviewed changes

pkg/pubsub/node_data.go Show resolved Hide resolved

mudler reviewed Jul 12, 2024

View reviewed changes

pkg/pubsub/node_data.go Show resolved Hide resolved

mudler reviewed Jul 12, 2024

View reviewed changes

pkg/pubsub/node_data.go Show resolved Hide resolved

restevens402 reviewed Jul 13, 2024

View reviewed changes

restevens402 added 2 commits July 16, 2024 16:03

Update nodeData with new data

aeedab9

The previous version of the code attempted to persist a wrong instance into the `nodeData` map. The update corrects this issue by ensuring that `nodeData` is updated with the new instance `nd`, which contains all recent modifications.

Remove IsActive attribute from NodeEventTracker

dca6f96

The IsActive attribute in the NodeEventTracker has been removed because it should not be set by other nodes. It should only be manipulated by the join/leave events tracked by this node.

teslashibe requested a review from jdutchak July 17, 2024 10:04

Luka-Loncar mentioned this pull request Jul 17, 2024

QA: Test Node Uptime Calculation Fix #426

Closed

15 tasks

Merge branch 'test' into teslashibe/#413

ea02129

jdutchak approved these changes Jul 17, 2024

View reviewed changes

jdutchak merged commit c226563 into test Jul 17, 2024
7 checks passed

jdutchak deleted the teslashibe/#413 branch July 17, 2024 13:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: calculation of node uptimes by looking at `recentUptime` #414

fix: calculation of node uptimes by looking at `recentUptime` #414

teslashibe commented Jul 11, 2024

restevens402 left a comment

restevens402 commented Jul 14, 2024

restevens402 commented Jul 16, 2024 •

edited

Loading

teslashibe commented Jul 17, 2024

teslashibe commented Jul 17, 2024

jdutchak left a comment

teslashibe commented Jul 24, 2024

restevens402 commented Jul 24, 2024

fix: calculation of node uptimes by looking at recentUptime #414

fix: calculation of node uptimes by looking at recentUptime #414

Conversation

teslashibe commented Jul 11, 2024

Problem

When the Node Has Left ActivityLeft

When the Node Is Still Active ActivityJoined

General Improvements

Remaining Considerations

Deployment Notes

restevens402 left a comment

Choose a reason for hiding this comment

restevens402 commented Jul 14, 2024

restevens402 commented Jul 16, 2024 • edited Loading

teslashibe commented Jul 17, 2024

teslashibe commented Jul 17, 2024

jdutchak left a comment

Choose a reason for hiding this comment

teslashibe commented Jul 24, 2024

restevens402 commented Jul 24, 2024

fix: calculation of node uptimes by looking at `recentUptime` #414

fix: calculation of node uptimes by looking at `recentUptime` #414

When the Node Has Left `ActivityLeft`

When the Node Is Still Active `ActivityJoined`

restevens402 commented Jul 16, 2024 •

edited

Loading