Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: calculation of node uptimes by looking at recentUptime #414

Merged
merged 12 commits into from
Jul 17, 2024

Conversation

teslashibe
Copy link
Contributor

Problem

This addresses and closes #413

The revised UpdateAccumulatedUptime function addresses several key issues present in the original implementation, offering a more accurate calculation of a node's accumulated uptime. Here's how it addresses specific concerns:

When the Node Has Left ActivityLeft

  • Improvement: By calculating the most recent active period recentUptime as the difference between LastLeftUnix and LastJoinedUnix, the function now accurately adds only the duration of the last session to the accumulated uptime. This prevents the overestimation seen when the function was previously adding the time since LastLeftUnix on every call.
  • Accuracy: This method ensures that each period of activity is only counted once, regardless of how many times the function is called after the node has left.

When the Node Is Still Active ActivityJoined

  • Dynamic Calculation: For active nodes, the function calculates the uptime since the node first joined FirstJoinedUnix. This approach ensures that the accumulated uptime reflects all the active time up to the current moment.
  • Prevention of Overestimation: By updating the accumulated uptime only if the calculated currentUptime is greater than the stored AccumulatedUptime, it prevents the repeated addition of the same uptime duration. This logic helps in maintaining an accurate count of the uptime, especially in scenarios where the node remains active without rejoining or leaving.

General Improvements

  • Accurate Tracking: The function now more accurately tracks the uptime by considering the entire duration of the node's activity from its initial join to the current time (for active nodes) or until it leaves the network.
  • Prevents Inflation: By adding checks to ensure uptime is only added based on actual activity periods and by preventing repeated additions of the same duration, the revised function mitigates the risk of inflated uptime values.

Remaining Considerations

  • Periodic Updates for Active Nodes: While the function improves accuracy, I have not considered how AccumulatedUptime is invoked for long running active nodes. This is particularly relevant for long-running nodes that don't frequently change their activity status.
  • Synchronization with State Changes: The function's accuracy depends on the timely update of the node's activity status ActivityJoined or ActivityLeft and the corresponding timestamps LastJoinedUnix, LastLeftUnix. Ensuring these values are accurately maintained is crucial for the function to work as intended.

Deployment Notes

In order to test this we will need to reset the entire networks version of nodeData and start from scratch @Luka-Loncar @mudler @jdutchak @j2d3 in the next release. We should just test this live in test it can't get anymore broken.

Fix accumulated uptime calculation in NodeData

This commit addresses issues in the accumulated uptime calculation for nodes in the network. Previously, the calculation could lead to inflated uptime values due to repeated additions of time since the last known activity markers, regardless of the node's actual activity status. The updated logic now accurately calculates the uptime based on the node's activity periods, ensuring that only the duration of actual activity is added to the accumulated uptime. For nodes that have left the network, the uptime is calculated as the difference between the last left and last joined timestamps. For active nodes, the uptime since the first join is considered, preventing overestimation. This change ensures a more accurate representation of node uptime, enhancing the reliability of network monitoring metrics.

- Add precise calculation for nodes that have left the network
- Implement dynamic uptime update for active nodes
- Prevent repeated addition of the same uptime duration
- Ensure accurate tracking of node activity periods
@teslashibe teslashibe added the bug Something isn't working label Jul 11, 2024
@teslashibe teslashibe self-assigned this Jul 11, 2024
@teslashibe teslashibe changed the title Fix Accumulated Uptime Bug #413 bug: Fix Accumulated Uptime Bug #413 Jul 11, 2024
@mudler mudler changed the title bug: Fix Accumulated Uptime Bug #413 fix: calculation of node uptimes by looking at recentUptime Jul 12, 2024
Copy link
Contributor

@restevens402 restevens402 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that the UpdateAccumulatedUptime method is never called in the code. It is calling CalculateAccumulatedUptime() instead. The issue still exists. I agree with enforcing at least the limit that the uptime cannot be greater than now - first joined. This will at least keep it from getting overinflated.

The fundamental problem is that the Join event happens at the moment of connection and is very reliable; however, the Leave event occurs after the node polls the PeerStore and cannot make a connection to the node. At that time it will fire a leave event. This could be 20-30 seconds after the node actually left and it may have then joined once again which disrupts this calculation badly.

At one point I had written code to account for this. I don't think it is still there but as I recall it would check on the Join events if the node was still in an active status (meaning it joined again before a leave event was captured)

I will track this down and make sure it is incorporated the the changes you have made and is in code that is being called.

The calculation methods for current uptime and accumulated uptime have been commented out and their calls removed from the join and leave event methods. Instead, they are replaced with an update method called when a node leaves. Additionally, before processing an expired connect event, the node is now forced to leave to ensure correct timestamp updating.
@restevens402
Copy link
Contributor

@teslashibe @jdutchak let me know if you have any issues with my logic here.

Calling CalculateCurrentUptime on a join event makes no sense as it is being done. It should be zero at that point, so I'm removing it. Using the same logic, there should be no change in the Accumulated uptime on a join event, so it should not be called from the Join method.

Uptime in both cases should be calculated in real time while the node is active and only the accumulated uptime should only be updated when there is a Leave event, otherwise it is "last known accumulated uptime + current uptime".

It is possible there is some kind of issue still outstanding with multiple leave/join events that throws off the calculation. I am unable to reproduce that situation. A check to ensure that the value of accumulated uptime is not greater than now - first joined will mitigate highly inflated numbers, but this situation reflects an outstanding bug in the system.

In our logic to handle the multiple join/leave events, adding a leave event before allowing an expired join event should ensure that the calculation is maintained correctly.

I am commenting out the Calculate function, it doesn't seem correct and should only be called from one place.
I have updated the code to reflect the above description and appreciate another set of eyes on it.

@restevens402
Copy link
Contributor

restevens402 commented Jul 16, 2024

It took nearly all day, but I found the source of the bug and captured a screenshot. The situation is when the node leaves and joins again, it will "self identify" to the boot node. In the self identification all of the time values are zero, but a few are set on the Join method for the node data.

The problem was that the update to the underlying SafeMap of the NodeDataTracker is setting the value to the incoming nodeData after it has updated all of the fields on the existing nd (nodeData) pointer. This was the mistake, it should always send the nd (nodeData) pointer.

The change is as simple as this:
image

Error situation found:
image

The previous version of the code attempted to persist a wrong instance into the `nodeData` map. The update corrects this issue by ensuring that `nodeData` is updated with the new instance `nd`, which contains all recent modifications.
The IsActive attribute in the NodeEventTracker has been removed because it should not be set by other nodes. It should only be manipulated by the join/leave events tracked by this node.
@teslashibe
Copy link
Contributor Author

@restevens402 thanks - looking forward to getting this into test and testing. We should clear the current nodeData file on the network and reset to a blank slate when this gets deployed so we can test with new data.

cc @mudler @jdutchak @j2d3 to make sure with this deployment we have the "clean slate" of the database :)

@teslashibe
Copy link
Contributor Author

@mudler @jdutchak I do not have time to review when I am OOO this week would you mind 🍐 'ing with @restevens402 to get this merged and then deployed with a clean database so we can validate the data?

cc @lacyg4 @Luka-Loncar @giovaroma - we will need to communicate this update with the community so they can make sure the version of their node software is updated.

@teslashibe teslashibe requested a review from jdutchak July 17, 2024 10:04
Copy link
Contributor

@jdutchak jdutchak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent @restevens402 - this has been a tough one tysm! I tested this on my local and dev network and it appears to be solved. will monitor in any case!

@jdutchak jdutchak merged commit c226563 into test Jul 17, 2024
7 checks passed
@jdutchak jdutchak deleted the teslashibe/#413 branch July 17, 2024 13:07
@teslashibe
Copy link
Contributor Author

@restevens402 on test still seeing this, I wonder if the nodes being on different versions is the issue here, you can use the following script to test against: https://github.com/masa-finance/awesome-masa/tree/main/tests/protocol

(awesome-masa) (base) brendanplayford@Brendans-MBP awesome-masa % python tests/protocol/node_data_uptime.py
Test Results:
Success - 6 nodes passed the uptime less than elapsed time test.
Peer ID: 16Uiu2HAmP3LjZc2m4AVcbf32ttyuLQ38rUyvhP8zAT5yTwRD8Lk7 is within the correct bounds.
Peer ID: 16Uiu2HAmVsd1WxZZj6pLMoG9cjVgKmdcfH8RNsETktsSBhPHBZ4g is within the correct bounds.
Peer ID: 16Uiu2HAm1efpreLqXkirZY2XmEkNZmy48RXbhJniNevCsyzLfino is within the correct bounds.
Peer ID: 16Uiu2HAmBPAV8GRe2RZADzvwAsVjQtYhBAqk9r46pkDP4YcSnVCu is within the correct bounds.
Peer ID: 16Uiu2HAmQ7HkzrjJbVkRMbHGe9tfoTDrqHtLZ1J2fsPcsZvqrRMC is within the correct bounds.
Peer ID: 16Uiu2HAmGyu1Vu4QX9axCEacWJotzXzXGzoJ6bSAhpa2aJuNRcUP is within the correct bounds.
Failures - 32 nodes did not pass the uptime less than elapsed time test.
Failure - Peer ID: 16Uiu2HAm7kV7i3JspkSkYX9tqCqy97cYueNZ3gLRZBGXBSms4Mmx, Error: Accumulated uptime 45 days, 10:48:00 is not less than elapsed time 14:03:01.787544
Failure - Peer ID: 16Uiu2HAmKsyNVMXagpV6dUEJSoRPYHdjSRrCXst4RexNJ6wrLZxD, Error: Accumulated uptime 40 days, 12:46:00 is not less than elapsed time 13:14:38.787550
Failure - Peer ID: 16Uiu2HAmKesNMGFvsR3idgvYbNo638aacKdNY3W9hZ6LdhJgRico, Error: Accumulated uptime 39 days, 22:54:00 is not less than elapsed time 13:06:32.787554
Failure - Peer ID: 16Uiu2HAm9tzx58JS8EzrtoJveVf3y4gMjUUded5JN3GJTrH7Jg3R, Error: Accumulated uptime 43 days, 19:37:00 is not less than elapsed time 13:51:06.787558
Failure - Peer ID: 16Uiu2HAmLmPtiPZU1Y6rLQxTHj2eoTaxF4ugszr9zgeyiSmfpkEk, Error: Accumulated uptime 41 days, 15:22:00 is not less than elapsed time 13:34:16.787561
Failure - Peer ID: 16Uiu2HAmS8TtaSwghkPHuiGWL2uzzidCfscjfQFSTLc9E5xoBw5a, Error: Accumulated uptime 80 days, 7:09:00 is not less than elapsed time 19:53:23.787564
Failure - Peer ID: 16Uiu2HAmQPrR2jdnaC3kFZgQXUgyG6e8XF3BvNH66ev5bEebf4bW, Error: Accumulated uptime 13 days, 14:11:00 is not less than elapsed time 10:01:47.787567
Failure - Peer ID: 16Uiu2HAkx1SVDmGdxafeKeSRGMimknbN4wCQ7XPmgZw8ggc7wNxs, Error: Accumulated uptime 51 days, 5:00:00 is not less than elapsed time 15:02:34.787570
Failure - Peer ID: 16Uiu2HAkvKys9BoqdBBEkNytwkJWsPAkJ95psmVJoowWqxXCJbEf, Error: Accumulated uptime 37 days, 11:33:00 is not less than elapsed time 12:51:11.787573
Failure - Peer ID: 16Uiu2HAmJc8qanqkq6j9npX58fDNnR5biB5XbVT5uKaQoh8m2Bi4, Error: Accumulated uptime 9 days, 11:55:00 is not less than elapsed time 8:27:00.787576
Failure - Peer ID: 16Uiu2HAmRsNKycBgvqwPt1AjCwp3pUW55TSoZsvmnQtgkKZTSct5, Error: Accumulated uptime 41 days, 5:18:00 is not less than elapsed time 15:25:55.787582
Failure - Peer ID: 16Uiu2HAmLGgZPXmux6ULeP6H9oitQ4nUkRaAAhJsVMjzPjZznrW1, Error: Accumulated uptime 4 days, 13:38:00 is not less than elapsed time 4:46:37.787585
Failure - Peer ID: 16Uiu2HAmUgcLEkRtDPoHve4v5XAWE2xTxjQLQPpa7DVm2gzpsVpL, Error: Accumulated uptime 4 days, 3:34:00 is not less than elapsed time 10:51:24.787588
Failure - Peer ID: 16Uiu2HAmJSvd8S2usTvYLQgGCmeavtCxxBEd9VxKC6nphLDeLgBf, Error: Accumulated uptime 2 days, 0:52:00 is not less than elapsed time 3:04:51.787591
Failure - Peer ID: 16Uiu2HAm3HswgNGUPqDT1FbX49tTSxXgByZ8c4gAYDFhYzgmNuFR, Error: Accumulated uptime 5 days, 7:55:00 is not less than elapsed time 11:22:01.787596
Failure - Peer ID: 16Uiu2HAmV7reMF6Z1N5uCDc2RiqvTFmX82umXMLSNKBTBA2RbBsk, Error: Accumulated uptime 44 days, 9:00:00 is not less than elapsed time 15:08:03.787599
Failure - Peer ID: 16Uiu2HAkysM8hGmg23v7TwCkemet9bwvg7SKPS4SQXjmUjFc8ofn, Error: Accumulated uptime 2 days, 4:43:00 is not less than elapsed time 3:15:39.787602
Failure - Peer ID: 16Uiu2HAm3tDZvT6xmwqTwxwKDZ6fGXbaiwr26kaVN6HqHacnbLFC, Error: Accumulated uptime 2 days, 16:10:00 is not less than elapsed time 17:19:00.787607
Failure - Peer ID: 16Uiu2HAmJak8NXQaQG3zk5RYLZRtZm4rfqVRGSn6nYAFKDX7DX7b, Error: Accumulated uptime 3:46:00 is not less than elapsed time 0:29:49.787610
Failure - Peer ID: 16Uiu2HAmGLRJsEidseSPYqKYXV67kDdEm6rbo8Vx2Y9AgwXbCJCM, Error: Accumulated uptime 12 days, 15:30:00 is not less than elapsed time 19:17:10.787613
Failure - Peer ID: 16Uiu2HAkx7ctGnTuoT1c2ZBvGFhQrth94CHJmxntCsjaqpkDuG2n, Error: Accumulated uptime 5 days, 5:13:00 is not less than elapsed time 8:24:01.787618
Failure - Peer ID: 16Uiu2HAm1TLgFmY3WVmscWYn7iUmcBAARyCQUDFF5DDvWcXdxVD8, Error: Accumulated uptime 5 days, 4:35:00 is not less than elapsed time 9:27:52.787621
Failure - Peer ID: 16Uiu2HAmH5zRXB1DKzppjHFsZgsbNFW6Kv2WmajaJbtJNNaCrQQh, Error: Accumulated uptime 11 days, 7:39:00 is not less than elapsed time 10:27:36.787624
Failure - Peer ID: 16Uiu2HAm5uCWFT129tbXBNbxMHbZ7r7D83bZCGrp2dyXwdiah5cN, Error: Accumulated uptime 15:43:00 is not less than elapsed time 6:48:06.787627
Failure - Peer ID: 16Uiu2HAmM9aK4befL2t62JjMgfCmAHiSFdNZwpmFqTGnokhwmeqj, Error: Accumulated uptime 1 day, 21:54:00 is not less than elapsed time 8:45:24.787630
Failure - Peer ID: 16Uiu2HAkytpME6VgsrowYbfBP8MMWo3SgWvcvT281GpeovoJHfzK, Error: Accumulated uptime 15:16:00 is not less than elapsed time 14:55:02.787633
Failure - Peer ID: 16Uiu2HAm1Ypq3ELdMgwdWQujxphrtoF9whaUQRFBueUobdtvk6Zu, Error: Accumulated uptime 4 days, 17:22:00 is not less than elapsed time 9:31:12.787636
Failure - Peer ID: 16Uiu2HAm8bW485axj7SPA3rPWLhbNNFnkyGDuDsPdF67g8x5wpzV, Error: Accumulated uptime 1 day, 8:55:00 is not less than elapsed time 4:43:48.787639
Failure - Peer ID: 16Uiu2HAmHtehnbuDeXfivvLPiw3YD4JDyLVwRdtGiedShNimwQtK, Error: Accumulated uptime 11 days, 1:49:00 is not less than elapsed time 18:23:58.787642
Failure - Peer ID: 16Uiu2HAmCSkus3wP9Tk7EgXguiDjgsMheuXNhvhnd4CgBDKdMZiJ, Error: Accumulated uptime 10 days, 19:03:00 is not less than elapsed time 8:08:53.787645
Failure - Peer ID: 16Uiu2HAmTi9VdFnQ5t69nAmkCMueekhe67Goud9uYAVBviPTXzCC, Error: Accumulated uptime 3 days, 9:01:00 is not less than elapsed time 7:31:41.787648
Failure - Peer ID: 16Uiu2HAmEVHcvwjDE1qyHagdmEjoABAk9v4MXKYYU7YdEzwhgU9Y, Error: Accumulated uptime 14 days, 5:15:00 is not less than elapsed time 9:59:52.787653

@restevens402
Copy link
Contributor

@teslashibe thank you for the list. Discouraging. I will look at the individual records and see if I can diagnose. This isn't the egregious problem that was solved, but there is something quite off here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: overestimation of AccumulatedUptime in nodData
4 participants