Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: switch RPC if block number is less than last recorded block number #1264

Draft
wants to merge 1 commit into
base: releases/v2.1.0
Choose a base branch
from

Conversation

Yashk767
Copy link
Contributor

@Yashk767 Yashk767 commented Jan 20, 2025

Description

There was an instance when an RPC reported an outdated block number and current block monitor doesn't handle the scenario.
So there is a need for Block Monitor module to detect scenarios where the newly fetched block number is less than the last recorded block number.

Fixes https://linear.app/interstellar-research/issue/RAZ-1178

Solution

Compared the newly fetched block number with the last recorded block number in updateLatestBlock() in block monitor module and triggered an RPC switch if the new block number is lower than the recorded block number.

How Has This Been Tested?

Created scenarios to return an older block number from HeaderByNumber call by modifying the function and tested whether RPC was switched correctly or not.

Copy link

@astronaut-aldrin astronaut-aldrin bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! 👨‍🚀 Aldrin here, ready to launch into this code review! Let's make this code as reliable as a Saturn V rocket! 🚀

Overall Summary

The PR addresses an important edge case where an RPC endpoint might report an outdated block number, which could lead to inconsistent blockchain state tracking. The solution implements a failover mechanism to switch RPCs when this occurs. The approach is generally sound, but there are some areas that could use additional hardening.

Issues Found

🔥🔥 Potential Race Condition in RPC Client Update
if bm.rpcManager != nil {
    switched, err := bm.rpcManager.SwitchToNextBestRPCClient()
    if err != nil {
        logrus.Errorf("Failed to switch RPC endpoint: %v", err)
    } else if switched {
        logrus.Info("Switched to the next best RPC endpoint.")
        bm.updateClient()
    }
}

The updateClient() call after switching RPCs could potentially race with other operations. We should ensure thread safety here.

Suggested fix:

if bm.rpcManager != nil {
    bm.mu.Unlock() // Release lock before RPC switch
    switched, err := bm.rpcManager.SwitchToNextBestRPCClient()
    bm.mu.Lock() // Reacquire lock
    if err != nil {
        logrus.Errorf("Failed to switch RPC endpoint: %v", err)
    } else if switched {
        logrus.Info("Switched to the next best RPC endpoint.")
        bm.updateClient()
    }
}
⚠️⚠️ Missing Error Handling for nil Header
if bm.latestBlock != nil && header.Number.Uint64() < bm.latestBlock.Number.Uint64() {

The code doesn't check if header is nil before accessing it.

Suggested fix:

if header == nil {
    logrus.Error("Received nil header from RPC endpoint")
    return
}

if bm.latestBlock != nil && header.Number.Uint64() < bm.latestBlock.Number.Uint64() {
⚠️ Missing Metrics/Monitoring

The RPC switching event should be tracked with metrics for observability.

Suggested addition:

if switched {
    logrus.Info("Switched to the next best RPC endpoint.")
    metrics.IncCounter("rpc_switches_total")
    bm.updateClient()
}

Reviewed Files

  • block/block.go

Positive Feedback

🌟 Great job on:

  • Clear and descriptive logging messages that will help with debugging
  • Good use of mutex for thread safety
  • Thoughtful handling of the edge case where no alternative RPC is available
  • Clean code structure with clear conditional logic

Additional Recommendations

  1. Consider adding unit tests specifically for this new RPC switching logic
  2. Add documentation explaining the RPC switching behavior
  3. Consider adding a maximum number of RPC switches within a time window to prevent thrashing

Keep up the great work! 🚀 Let me know if you need any clarification on these points!

Remember: In space, no one can hear you debug 😉


Review generated by: anthropic/claude-3.5-sonnet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant