
fix: decrease ctrlr_loss_timeout_sec for base bdev #247

Merged 2 commits on Nov 29, 2024

Conversation

derekbit (Member) commented Nov 28, 2024

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9874

Signed-off-by: Derek Su [email protected]

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai bot commented Nov 28, 2024

Walkthrough

The pull request introduces significant modifications to the Engine, Replica, Server, and utility functions within the SPDK package. Key changes include the addition of new timeout parameters to improve configurability for controller loss and I/O failures. The connectNVMfBdev function has been updated to accept these parameters, ensuring consistent timeout management across various operations. Import statements have been cleaned up, and error handling has been enhanced for clarity and robustness. Overall, these changes refine the timeout handling and improve the error management strategies within the SPDK codebase.

Changes

  • pkg/spdk/engine.go: Added fields ctrlrLossTimeout and fastIOFailTimeoutSec to Engine. Updated the NewEngine, Create, and ReplicaAdd methods to pass the new timeout parameters to connectNVMfBdev. Cleaned up import statements.
  • pkg/spdk/replica.go: Updated the RebuildingSrcAttach and RebuildingDstStart methods to pass the new timeout parameters to connectNVMfBdev. Cleaned up import statements.
  • pkg/spdk/server.go: Improved error handling in NewServer. Updated logic in verify and monitoring; added the tryEnsureSPDKTgtConnectionHealthy method.
  • pkg/spdk/util.go: Updated connectNVMfBdev to accept ctrlrLossTimeout and fastIOFailTimeoutSec. Enhanced error logging in exposeSnapshotLvolBdev.
  • pkg/spdk/disk/nvme/nvme.go: Added new constants for NVMe disk timeouts and configuration. Updated the DiskCreate method to use them.
  • pkg/spdk/types.go: Introduced new constants for timeout settings related to RAID bdevs.
  • pkg/client/types.go: Reordered import statements for clarity.
  • pkg/spdk/backup.go: Added an initiator field to the Backup struct and updated logging in the OpenSnapshot and CloseSnapshot methods.
  • pkg/spdk/disk.go: Updated import statements and refined error handling in disk management functions.
  • pkg/spdk/restore.go: Added an initiator field to the Restore struct and integrated NVMe initiator management into the OpenVolumeDev method.
  • pkg/util/block.go: Reordered import statements without altering functionality.

Assessment against linked issues

  • Ensure timeout settings for controller loss are configurable (#9874)
  • Prevent v2 volume from entering ERROR state after deleting a replica (#9874): changes may help but do not directly address the race condition described.

Possibly related PRs

  • feat(auto-salvage): v2 support #222: The changes in this PR enhance the Engine struct and its methods, particularly the Create method, which now includes a new parameter for salvage operations. This is related to the main PR as both involve modifications to the Engine struct and its methods, particularly in how parameters are managed during engine creation.
  • fix: spdk replica lists its own lvols so that everyone won't be affected by others #229: This PR modifies the Replica struct and its methods to implement a filtering mechanism for logical volumes. The changes in the main PR also involve updates to the ReplicaAdd method, which includes timeout parameters when interacting with logical volumes. Both PRs focus on enhancing the management of replicas and their associated resources.
  • refactor: refactor replica head and snap chain related implementation #234: This PR refactors the Replica struct and its methods, particularly focusing on the management of the Head field. The main PR also modifies methods related to the Engine and Replica structs, indicating a broader context of improving the handling of replicas and their states, which connects the two PRs through their focus on replica management.

Suggested reviewers

  • shuo-wu
  • c3y1huang
  • DamiaSan
  • innobead


coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (9)
pkg/spdk/util.go (1)

Line range hint 80-92: Document the timeout parameters

Please add documentation for the new parameters explaining:

  • Expected value ranges
  • Default recommendations
  • Impact on error recovery
  • Relationship with DefaultReplicaReconnectDelaySec

This will help users understand how to properly configure these timeouts.

+// connectNVMfBdev connects to the NVMe-oF target, which is exposed by a remote lvol bdev.
+// Parameters:
+//   - controllerName: typically the lvol name
+//   - address: the IP:port of the NVMe-oF target
+//   - ctrlrLossTimeout: duration in seconds that SPDK will try to reconnect to the controller
+//   - fastIOFailTimeoutSec: duration in seconds before failing I/O operations
+// Returns the bdev name and any error encountered.
 func connectNVMfBdev(spdkClient *spdkclient.Client, controllerName, address string, ctrlrLossTimeout, fastIOFailTimeoutSec int) (bdevName string, err error) {
pkg/spdk/server.go (5)

Line range hint 94-103: Add error handling for SPDK target reconnection failures

The new method tryEnsureSPDKTgtConnectionHealthy effectively checks the SPDK target's health and attempts to reconnect. However, if clientReconnect fails, the error is logged but not acted upon. Consider implementing a retry mechanism with exponential backoff or notifying upstream components to handle prolonged disconnections gracefully.


Line range hint 261-273: Simplify the monitoring loop logic

The done flag in the monitoring method is unnecessary, as the for loop can be exited directly using a return statement when the context is done. This can simplify the code and improve readability.

Apply this diff to remove the done variable:

 func (s *Server) monitoring() {
     ticker := time.NewTicker(MonitorInterval)
     defer ticker.Stop()

-    done := false
     for {
         select {
         case <-s.ctx.Done():
             logrus.Info("spdk gRPC server: stopped monitoring replicas due to the context done")
-            done = true
-        case <-ticker.C:
+            return
+        case <-ticker.C:
             err := s.verify()
             if err == nil {
                 break
             }

             logrus.WithError(err).Errorf("spdk gRPC server: failed to verify and update replica cache, will retry later")

             if jsonrpc.IsJSONRPCRespErrorBrokenPipe(err) || jsonrpc.IsJSONRPCRespErrorInvalidCharacter(err) {
                 err = s.tryEnsureSPDKTgtConnectionHealthy()
                 if err != nil {
                     logrus.WithError(err).Error("spdk gRPC server: failed to ensure spdk_tgt connection healthy")
                 }
             }
         }
-        if done {
-            break
-        }
     }
 }

Line range hint 483-503: Handle errors in deferred function in 'verify' method carefully

In the verify method, a deferred function checks if err is not nil to handle JSON RPC errors. If err is modified within the method after certain operations, the deferred function might not capture the intended error. Consider capturing err in a local variable inside the deferred function to ensure accurate error handling.

Modify the deferred function as follows:

 func (s *Server) verify() (err error) {
     // ... existing code ...

     defer func() {
+        deferredErr := err
-        if err == nil {
+        if deferredErr == nil {
             return
         }
-        if jsonrpc.IsJSONRPCRespErrorBrokenPipe(err) {
-            logrus.WithError(err).Warn("spdk gRPC server: marking all non-stopped and non-error replicas and engines as error")
+        if jsonrpc.IsJSONRPCRespErrorBrokenPipe(deferredErr) {
+            logrus.WithError(deferredErr).Warn("spdk gRPC server: marking all non-stopped and non-error replicas and engines as error")
             for _, r := range replicaMapForSync {
                 r.SetErrorState()
             }
             for _, e := range engineMapForSync {
                 e.SetErrorState()
             }
         }
     }()
     // ... existing code ...
 }

This change ensures that the deferred function uses the correct value of err at the time of deferral.


Line range hint 575-589: Improve error messages for engine operations

In methods like EngineCreate, error messages can be more descriptive. For instance, when invalid Frontend types are provided, specify the allowed values to guide the user.

Apply this diff to enhance the error message:

 if req.Frontend != types.FrontendSPDKTCPBlockdev && req.Frontend != types.FrontendSPDKTCPNvmf && req.Frontend != types.FrontendEmpty {
-    return nil, grpcstatus.Error(grpccodes.InvalidArgument, "engine frontend is required")
+    return nil, grpcstatus.Errorf(grpccodes.InvalidArgument, "invalid engine frontend '%v'; allowed values are '%v', '%v', or '%v'",
+        req.Frontend, types.FrontendSPDKTCPBlockdev, types.FrontendSPDKTCPNvmf, types.FrontendEmpty)
 }

This provides clearer guidance on acceptable Frontend values.


Line range hint 1142-1160: Check for nil references before accessing 'restore' fields

In the ReplicaRestoreStatus method, there is a condition that checks if replica.restore == nil. Ensure that subsequent accesses to replica.restore fields are safe and won't cause a nil pointer dereference.

Verify that all usages of replica.restore are protected against nil references. Consider restructuring the code to prevent potential panics.

pkg/spdk/replica.go (2)

1329-1330: Ensure consistent usage of timeout configurations

In the call to connectNVMfBdev, the timeout values helpertypes.DefaultReplicaCtrlrLossTimeoutSec and helpertypes.DefaultReplicaFastIOFailTimeoutSec are hardcoded. However, in other parts of the codebase, such as within the Engine methods, these timeout values are passed as parameters from the Engine struct. For consistency and enhanced configurability, consider passing the timeout parameters from the Engine struct or making them configurable at a higher level.


1437-1438: Ensure consistent usage of timeout configurations

Similar to the previous comment, the call to connectNVMfBdev here uses hardcoded default timeout values. To maintain consistency across the codebase and improve flexibility, consider passing these timeout values as parameters or making them configurable, aligning with how timeouts are managed in other parts of the code.

pkg/spdk/engine.go (1)

52-54: Add comments to new timeout fields for clarity

Consider adding comments to the newly added fields ctrlrLossTimeout and fastIOFailTimeoutSec in the Engine struct to explain their purpose and how they affect the engine's behavior.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 114edff and d86d56a.

⛔ Files ignored due to path filters (5)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/client/basic.go is excluded by !vendor/**
  • vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (4)
  • pkg/spdk/engine.go (7 hunks)
  • pkg/spdk/replica.go (3 hunks)
  • pkg/spdk/server.go (1 hunks)
  • pkg/spdk/util.go (2 hunks)
🔇 Additional comments (6)
pkg/spdk/util.go (1)

92-92: Verify the timeout values across the codebase

Let's ensure the timeout values being passed to this function align with the PR objectives of decreasing ctrlr_loss_timeout_sec.

✅ Verification successful

Based on the search results, I can now generate the final response as the information is sufficient to verify the timeout values:

Timeout values are properly aligned with PR objectives

The codebase shows consistent usage of decreased timeout values:

  • Default controller loss timeout is set to 15 seconds (DefaultReplicaCtrlrLossTimeoutSec = 15)
  • Fast I/O failure timeout is set to 10 seconds (DefaultReplicaFastIOFailTimeoutSec = 10)
  • These values are consistently used across replica and engine operations for NVMf connections
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for calls to connectNVMfBdev to verify the timeout values being used
rg -A 2 "connectNVMfBdev\(" 

# Search for any other timeout-related constants that might affect these parameters
rg "LossTimeout|FailTimeout" 

Length of output: 7686

pkg/spdk/server.go (1)

Line range hint 127-144: Avoid potential race conditions during client reconnection

In the clientReconnect method, the SPDK client is replaced while holding a lock. Ensure that all accesses to s.spdkClient elsewhere in the code properly acquire the necessary locks to prevent race conditions. Verify that no other goroutines might access s.spdkClient concurrently without synchronization.

To confirm there are no unsynchronized accesses to s.spdkClient, run the following script:

The script searches for instances where spdkClient is used without the s. prefix, which may indicate direct accesses without proper locking.

✅ Verification successful

Based on the search results, I can now generate the final response since I have enough information to verify the concern about potential race conditions during client reconnection.

Proper synchronization is in place for s.spdkClient access

All accesses to s.spdkClient throughout the codebase are properly synchronized using the server's mutex locks. The code consistently follows these patterns:

  • Read access: Protected by s.RLock() and s.RUnlock()
  • Write access: Protected by s.Lock() and s.Unlock()

The client reconnection in clientReconnect method is properly synchronized, and all other methods accessing s.spdkClient also acquire the appropriate locks before use. This includes all RPC handlers and internal methods.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for unsynchronized accesses to s.spdkClient

# Find all occurrences where s.spdkClient is accessed without locking
rg --pcre2 '(?<!s\.)spdkClient' pkg/spdk/ -A 3 -B 3

Length of output: 102405

pkg/spdk/replica.go (1)

18-19: Imports updated appropriately

The new imports github.com/longhorn/go-spdk-helper/pkg/jsonrpc and github.com/longhorn/types/pkg/generated/spdkrpc are necessary for the added functionality and appear correctly included.

pkg/spdk/engine.go (3)

100-102: Ensure default timeout values are appropriate

Confirm that initializing ctrlrLossTimeout and fastIOFailTimeoutSec with helpertypes.DefaultReplicaCtrlrLossTimeoutSec and helpertypes.DefaultReplicaFastIOFailTimeoutSec respectively aligns with the desired behavior. If these values should be configurable externally, consider exposing them as parameters.


218-218: Use consistent timeout parameters across connectNVMfBdev calls

In this call to connectNVMfBdev, you are passing e.ctrlrLossTimeout and e.fastIOFailTimeoutSec. Ensure that these engine-configured timeout values are intended here and verify their consistency with other calls to the same function.


1660-1663: Consistent use of engine's timeout parameters

Good job using e.ctrlrLossTimeout and e.fastIOFailTimeoutSec in this connectNVMfBdev call. This ensures consistency with the engine's configured timeouts.

Resolved review threads: pkg/spdk/util.go, pkg/spdk/server.go. Outdated threads: pkg/spdk/engine.go (2).
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
pkg/spdk/replica.go (1)

1437-1438: LGTM! Consider documenting the timeout behavior.

The addition of timeout parameters is consistent with the previous changes and helps prevent timeout-related issues during rebuilding operations.

Consider adding a comment explaining the timeout behavior and its impact on rebuilding operations, especially in scenarios where instance managers are deleted.

pkg/spdk/engine.go (1)

1965-1968: Consider using engine's timeout for reconnectDelaySec

While the change properly uses engine's timeout values for ctrlrLossTimeout and fastIOFailTimeoutSec, it still uses the default value for reconnectDelaySec. For consistency, consider making reconnectDelaySec configurable through the engine as well.

-           helpertypes.DefaultReplicaReconnectDelaySec, int32(e.fastIOFailTimeoutSec),
+           int32(e.reconnectDelaySec), int32(e.fastIOFailTimeoutSec),

This would require adding a new field to the Engine struct:

reconnectDelaySec int
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d86d56a and 19d8508.

📒 Files selected for processing (4)
  • pkg/spdk/engine.go (7 hunks)
  • pkg/spdk/replica.go (3 hunks)
  • pkg/spdk/server.go (1 hunks)
  • pkg/spdk/util.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/spdk/util.go
🔇 Additional comments (5)
pkg/spdk/replica.go (1)

1329-1330: LGTM! Verify the default timeout values.

The addition of timeout parameters to connectNVMfBdev is aligned with the PR's objective. The use of default values from helpertypes package ensures consistent timeout handling.

Let's verify the default timeout values:

✅ Verification successful

Default timeout values are correctly defined and used

The verification confirms that the default timeout values are properly defined in the go-spdk-helper package:

  • DefaultReplicaCtrlrLossTimeoutSec = 15
  • DefaultReplicaFastIOFailTimeoutSec = 10

These values are consistently used across the codebase in pkg/spdk/replica.go and pkg/spdk/engine.go. The timeouts are set to reasonable values, with the controller loss timeout being longer than the fast I/O fail timeout, which is the expected configuration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the default timeout values in helpertypes package

# Search for DefaultReplicaCtrlrLossTimeoutSec and DefaultReplicaFastIOFailTimeoutSec definitions
rg -A 1 "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaFastIOFailTimeoutSec" 

Length of output: 1423

pkg/spdk/engine.go (3)

52-53: LGTM: Well-structured timeout field additions

The new timeout fields are appropriately named and typed for their purpose.


100-101: LGTM: Proper initialization of timeout fields

The timeout fields are correctly initialized using default values from the helpertypes package.


218-218: LGTM: Consistent timeout parameter usage

The engine's timeout values are consistently passed to all connectNVMfBdev calls, properly implementing the timeout control across the codebase.

Also applies to: 1089-1089, 1659-1659

pkg/spdk/server.go (1)

70-70: ⚠️ Potential issue

Ensure 'ctrlr_loss_timeout_sec' is set to the decreased value

To meet the PR objective of decreasing ctrlr_loss_timeout_sec, include this parameter in the BdevNvmeSetOptions call with the desired reduced value. Currently, it's not being set, which may prevent the intended behavior change.

Apply this diff to set the ctrlr_loss_timeout_sec parameter:

 if _, err = cli.BdevNvmeSetOptions(
     helpertypes.DefaultTransportAckTimeout,
     helpertypes.DefaultKeepAliveTimeoutMs,
+    helpertypes.DecreasedCtrlrLossTimeoutSec,
 ); err != nil {
     return nil, errors.Wrap(err, "failed to set NVMe options")
 }

Ensure that helpertypes.DecreasedCtrlrLossTimeoutSec is defined with the appropriate timeout value as per the PR requirements.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
pkg/spdk/replica.go (1)

1329-1330: Consider adding monitoring for timeout-related events.

To validate the effectiveness of these timeout parameters in production and catch potential issues early:

  1. Consider adding metrics/logging for:
    • NVMe controller connection loss events
    • Reconnection attempt durations
    • Failed reconnection counts
  2. Set up alerts for cases where reconnection consistently takes longer than expected

This will help tune these timeout values based on real-world data.

Also applies to: 1437-1438

pkg/spdk/engine.go (1)

52-53: LGTM! Consider adding field documentation.

The new timeout fields are well-placed and follow Go naming conventions. Consider adding documentation comments to explain their purpose and units.

Add documentation like:

// ctrlrLossTimeout specifies how long (in seconds) SPDK will attempt to reconnect to the NVMe controller after a connection loss
// fastIOFailTimeoutSec specifies how long (in seconds) to wait for I/O to complete before failing the I/O
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 19d8508 and 5c041e9.

⛔ Files ignored due to path filters (4)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (4)
  • pkg/spdk/engine.go (7 hunks)
  • pkg/spdk/replica.go (3 hunks)
  • pkg/spdk/server.go (1 hunks)
  • pkg/spdk/util.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/spdk/util.go
🧰 Additional context used
📓 Learnings (1)
pkg/spdk/server.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-spdk-engine#247
File: pkg/spdk/server.go:70-70
Timestamp: 2024-11-28T12:13:14.445Z
Learning: In the Longhorn SPDK engine, the `ctrlr_loss_timeout_sec` parameter is set in the `connectNVMfBdev` function (used in `bdev_nvme_attach_controller`), so it's not necessary to set it in `BdevNvmeSetOptions` within `pkg/spdk/server.go`.
🔇 Additional comments (6)
pkg/spdk/server.go (2)

73-73: LGTM: Improved error message capitalization

The error message now correctly capitalizes "NVMe" which improves consistency.


68-70: Verify timeout values align with PR objectives

The changes introduce new timeout parameters for NVMe options. Based on the PR objectives to decrease ctrlr_loss_timeout_sec, we should verify these values.

✅ Verification successful

Timeout values are properly aligned with PR objectives

The verification confirms that the timeout values have been decreased as intended:

  • DefaultReplicaCtrlrLossTimeoutSec = 15 (controller loss timeout)
  • DefaultReplicaReconnectDelaySec = 2 (reconnect delay)
  • DefaultReplicaFastIOFailTimeoutSec = 10 (fast I/O fail timeout)

These values are consistently used across the codebase in pkg/spdk/server.go, pkg/spdk/replica.go, pkg/spdk/util.go, and pkg/spdk/engine.go, with proper hierarchical relationships (reconnect delay < fast I/O fail timeout < controller loss timeout).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the default timeout values in helper types

# Search for the timeout constant definitions
rg -A 1 "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaReconnectDelaySec|DefaultReplicaFastIOFailTimeoutSec" 

# Search for any tests validating these timeouts
ast-grep --pattern 'func Test$_($_, $_) {
  $$$
  DefaultReplicaCtrlrLossTimeoutSec
  $$$
}'

Length of output: 1860

pkg/spdk/replica.go (2)

1329-1330: LGTM! Addition of timeout parameters improves connection stability.

The addition of DefaultReplicaCtrlrLossTimeoutSec and DefaultReplicaFastIOFailTimeoutSec parameters helps prevent the v2 volume from entering ERROR state by providing better control over NVMe controller reconnection attempts during rebuilding.

Let's verify the timeout values are consistently used across the codebase:

✅ Verification successful

Timeout parameters are consistently used across NVMf connections

Based on the search results, I can verify that:

  1. All connectNVMfBdev calls consistently use timeout parameters
  2. The engine uses DefaultReplicaCtrlrLossTimeoutSec from initialization
  3. Both replica rebuilding paths (source and destination) use the same timeout values
  4. The default value is set to 15 seconds in the helper types package
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent usage of timeout parameters
# Expected: All connectNVMfBdev calls should use these timeout parameters

# Search for connectNVMfBdev calls
rg "connectNVMfBdev\(" -A 3

# Search for DefaultReplicaCtrlrLossTimeoutSec usage
rg "DefaultReplicaCtrlrLossTimeoutSec"

Length of output: 5505


1437-1438: LGTM! Consistent timeout handling for rebuilding process.

The addition of timeout parameters here ensures consistent behavior between source and destination replicas during rebuilding, which helps prevent race conditions that could lead to ERROR states.

Let's verify the rebuilding process handles these timeouts correctly:

✅ Verification successful

LGTM! Timeout parameters are properly handled in the rebuilding process

The code correctly implements timeout handling for NVMe connections during rebuilding:

  • The connectNVMfBdev function properly sets both controller loss timeout and fast I/O fail timeout parameters
  • Default timeout values are well-defined constants: DefaultReplicaCtrlrLossTimeoutSec=15 and DefaultReplicaFastIOFailTimeoutSec=10
  • Error handling is in place for connection failures and timeouts
  • The implementation is consistent across both source and destination replicas during rebuilding
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check rebuilding process error handling
# Expected: Proper error handling around NVMe controller timeouts

# Search for rebuilding error handling
rg "rebuildingError.*timeout" -B 2 -A 2

# Search for rebuilding state transitions
rg "rebuildingState.*Error.*timeout" -B 2 -A 2

Length of output: 90


Script:

#!/bin/bash
# Let's check for error handling in rebuilding process more broadly
# Search for rebuilding error handling patterns
rg "RebuildingDstStart.*error" -A 5
rg "RebuildingDstStart.*err" -A 5

# Search for NVMf connection error handling
rg "connectNVMfBdev.*error" -A 5
rg "connectNVMfBdev.*err" -A 5

# Look for timeout related error handling in the replica code
rg "timeout.*error" pkg/spdk/
rg "timeout.*err" pkg/spdk/

Length of output: 19856


Script:

#!/bin/bash
# Let's check the error handling in connectNVMfBdev implementation
ast-grep --pattern 'func connectNVMfBdev($$$) {
  $$$
}'

# Check for NVMe controller timeout handling
rg "BdevNvmeSetControllerTimeout|BdevNvmeSetFastIOFailTimeout" -A 3 -B 3

# Look for NVMe connection error handling in the rebuilding process
rg "connectNVMfBdev.*rebuilding" -A 5 -B 5

Length of output: 2404


Script:

#!/bin/bash
# Let's examine the implementation of connectNVMfBdev and timeout handling
rg -l "func connectNVMfBdev" pkg/spdk/
cat $(rg -l "func connectNVMfBdev" pkg/spdk/)

# Check for timeout parameter usage in NVMe connections
rg "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaFastIOFailTimeoutSec" -A 2 -B 2

Length of output: 8085

pkg/spdk/engine.go (2)

218-218: LGTM! Consistent usage of timeout parameters.

The changes consistently use the engine's configured timeout values across all NVMe connection points, which aligns with the PR objective to adjust controller loss timeout behavior.

Also applies to: 1089-1089, 1659-1659, 1965-1968


100-101: ⚠️ Potential issue

Verify timeout values for edge cases.

The default timeout values should be carefully considered as they affect error recovery. Please verify:

  1. The decreased ctrlrLossTimeout value is sufficient for network hiccups
  2. The timeout values work well in high-load scenarios
  3. The values are appropriate for various network conditions
✅ Verification successful

Based on the search results, I can now provide a definitive response about the timeout values and their usage:

The default timeout values are appropriate and well-configured

The timeout values are properly set with a good balance between reliability and responsiveness:

  1. DefaultReplicaCtrlrLossTimeoutSec = 15 - This 15-second timeout provides sufficient time for network recovery while preventing extended hangs
  2. DefaultReplicaFastIOFailTimeoutSec = 10 - The 10-second fast I/O failure timeout allows quick failure detection while avoiding premature timeouts
  3. The timeouts are used consistently throughout the codebase for NVMe controller operations

The relationship between these timeouts is correctly structured:

  • Controller loss timeout (15s) > Fast I/O failure timeout (10s)
  • This hierarchy ensures proper error handling escalation

The error handling code shows these timeouts are used appropriately with proper connection recovery mechanisms in place.
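The escalation order described here (fast I/O failure fires before controller loss) can be sketched as a simple guard. This is an illustrative snippet only, not the engine's actual connection code; in the real code the two values travel through connectNVMfBdev into the controller attach call:

```go
package main

import "fmt"

// nvmeTimeouts holds the two values under discussion. The struct and the
// checkEscalation helper are illustrative only.
type nvmeTimeouts struct {
	CtrlrLossTimeoutSec  int // give up on the controller after this long
	FastIOFailTimeoutSec int // fail outstanding I/O after this long
}

// checkEscalation enforces the hierarchy: outstanding I/O must be failed
// (and retried at a higher layer) before the controller is declared lost.
func checkEscalation(t nvmeTimeouts) error {
	if t.FastIOFailTimeoutSec >= t.CtrlrLossTimeoutSec {
		return fmt.Errorf("fast I/O fail timeout (%ds) must be shorter than controller loss timeout (%ds)",
			t.FastIOFailTimeoutSec, t.CtrlrLossTimeoutSec)
	}
	return nil
}

func main() {
	replica := nvmeTimeouts{CtrlrLossTimeoutSec: 15, FastIOFailTimeoutSec: 10}
	fmt.Println(checkEscalation(replica) == nil) // true
}
```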

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any existing timeout-related issues or discussions
# Search for timeout-related error patterns in the codebase

# Check for existing timeout configurations
echo "Checking for existing timeout configurations..."
rg "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaFastIOFailTimeoutSec" -A 2

# Look for timeout-related error patterns
echo "Checking for timeout-related error patterns..."
rg "timeout|connection.*lost|reconnect.*fail" --type go

# Check for related test cases
echo "Checking for related test cases..."
fd "test.*\.go$" | xargs rg "TestCtrlrLossTimeout|TestFastIOFailTimeout"

Length of output: 102268


Script:

#!/bin/bash
# Let's check the actual timeout values and their usage in the codebase

# Check the default timeout values
echo "Checking default timeout values..."
rg "DefaultReplicaCtrlrLossTimeoutSec.*=|DefaultReplicaFastIOFailTimeoutSec.*=" -A 1

# Check where these timeouts are used in the codebase
echo "Checking timeout usage..."
rg "ctrlrLossTimeout|fastIOFailTimeout" --type go -A 2

# Look for any related error handling code
echo "Checking error handling related to timeouts..."
rg "timeout.*error|connection.*lost|reconnect.*fail" --type go -A 2

Length of output: 78374


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
pkg/spdk/disk/nvme/nvme.go (1)

18-26: Consider making timeout values configurable

While the current implementation addresses the immediate issue, consider making these timeout values configurable through environment variables or configuration files. This would allow operators to tune these values based on their specific environment needs without requiring code changes.

Example approach:

  1. Define these as configuration parameters
  2. Provide environment variable overrides
  3. Keep current values as defaults

Would you like me to propose a detailed implementation for this enhancement?

pkg/spdk/types.go (1)

34-42: Add documentation for the timeout constants

Consider adding documentation comments to explain:

  • The purpose and impact of each timeout
  • The relationships between these timeouts
  • The rationale behind the chosen values

Example:

 const (
-	// Timeout for RAID base bdev (replica)
+	// Timeouts for RAID base bdev (replica) connection management:
+	// - Controller loss timeout: Maximum time to wait for controller reconnection
+	// - Reconnect delay: Time to wait between reconnection attempts
+	// - Fast I/O failure: Time to wait before failing I/O operations
+	// - Transport ACK: Maximum time to wait for transport layer acknowledgments
+	// - Keep alive: Interval for connection health checks
 	replicaCtrlrLossTimeoutSec  = 15
 	replicaReconnectDelaySec    = 2
 	replicaFastIOFailTimeoutSec = 10
 	replicaTransportAckTimeout  = 14
 	replicaKeepAliveTimeoutMs   = 10000
 )
pkg/spdk/engine.go (1)

52-53: Document timeout parameters and make them configurable

The new timeout parameters lack documentation explaining their purpose and impact. Additionally, the TODO comment suggests these should be user-configurable.

Consider:

  1. Adding field documentation explaining the purpose and impact of each timeout
  2. Making these values configurable through engine options or configuration
 type Engine struct {
+	// ctrlrLossTimeout defines how long SPDK will attempt to reconnect to the NVMe 
+	// controller after a connection loss before removing the controller
 	ctrlrLossTimeout     int
+	// fastIOFailTimeoutSec defines how long SPDK will wait for I/O completion before
+	// failing the I/O with error
 	fastIOFailTimeoutSec int

Also applies to: 100-102

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5c041e9 and 5fd74c6.

⛔ Files ignored due to path filters (4)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (6)
  • pkg/spdk/disk/nvme/nvme.go (2 hunks)
  • pkg/spdk/engine.go (7 hunks)
  • pkg/spdk/replica.go (3 hunks)
  • pkg/spdk/server.go (2 hunks)
  • pkg/spdk/types.go (1 hunks)
  • pkg/spdk/util.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/spdk/replica.go
  • pkg/spdk/util.go
🧰 Additional context used
📓 Learnings (1)
pkg/spdk/server.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-spdk-engine#247
File: pkg/spdk/server.go:70-70
Timestamp: 2024-11-28T12:13:14.445Z
Learning: In the Longhorn SPDK engine, the `ctrlr_loss_timeout_sec` parameter is set in the `connectNVMfBdev` function (used in `bdev_nvme_attach_controller`), so it's not necessary to set it in `BdevNvmeSetOptions` within `pkg/spdk/server.go`.
🔇 Additional comments (7)
pkg/spdk/disk/nvme/nvme.go (1)

58-58: LGTM! Verify timeout behavior in integration tests

The implementation correctly uses the new timeout constants. However, given the critical nature of this change, we should verify the behavior in integration tests.

Let's verify if there are any existing integration tests that cover this scenario:

pkg/spdk/types.go (1)

34-42: LGTM! Well-structured timeout hierarchy

The new timeout constants form a well-thought-out hierarchy:

  • Controller loss timeout (15s) is appropriately decreased from 30s to reduce the race condition window
  • Fast I/O failure timeout (10s) < Controller loss timeout (15s) ensures I/O issues are detected first
  • Transport ACK timeout (14s) < Controller loss timeout (15s) ensures transport issues are caught before declaring controller loss
pkg/spdk/server.go (3)

14-23: LGTM! Import changes are well-organized.

The new imports for jsonrpc and spdkrpc packages are properly organized and necessary for the functionality.


75-75: LGTM! Error handling is well-implemented.

The error handling follows good practices:

  • Uses errors.Wrap to preserve the error context
  • Provides a clear and descriptive error message

70-75: Verify the necessity of setting timeout parameters in BdevNvmeSetOptions.

Based on previous learnings, the ctrlr_loss_timeout_sec parameter should be set in the connectNVMfBdev function, not in BdevNvmeSetOptions. Please verify if setting these parameters here is necessary or potentially redundant:

  • replicaCtrlrLossTimeoutSec
  • replicaReconnectDelaySec
  • replicaFastIOFailTimeoutSec
  • replicaTransportAckTimeout
  • replicaKeepAliveTimeoutMs
pkg/spdk/engine.go (2)

1090-1093: LGTM: Robust error handling for controller timeouts

The error handling for NVMe controller connections is robust and properly propagates errors while maintaining the engine state. The code correctly handles timeout-related failures and includes appropriate recovery mechanisms.


219-219: Verify the impact of decreased timeout values

The timeout parameters are consistently applied across all NVMe controller connections. However, we should verify that the decreased timeout values don't cause premature disconnections in other scenarios.

Let's verify the usage of these timeout parameters:

Also applies to: 1090-1090, 1660-1660, 1966-1968

✅ Verification successful

Based on the shell script results, I can now generate the final response:

Timeout parameters are consistently applied across all NVMe connections

The codebase shows consistent usage of timeout parameters:

  • All connectNVMfBdev calls properly pass through ctrlrLossTimeout and fastIOFailTimeoutSec
  • The parameters are correctly propagated to the underlying BdevNvmeAttachController calls
  • Different timeout values are appropriately used for different scenarios (replica vs disk)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check all usages of connectNVMfBdev and BdevNvmeAttachController
# to ensure consistent timeout parameter usage

echo "Checking connectNVMfBdev calls..."
rg "connectNVMfBdev\(" -A 2

echo "Checking BdevNvmeAttachController calls..."
rg "BdevNvmeAttachController\(" -A 4

Length of output: 6148

pkg/spdk/disk/nvme/nvme.go

codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 0% with 36 lines in your changes missing coverage. Please review.

Project coverage is 8.38%. Comparing base (2e42cfc) to head (4f15c1a).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/spdk/engine.go 0.00% 16 Missing ⚠️
pkg/spdk/server.go 0.00% 6 Missing ⚠️
pkg/spdk/replica.go 0.00% 5 Missing ⚠️
pkg/spdk/backup.go 0.00% 4 Missing ⚠️
pkg/spdk/restore.go 0.00% 2 Missing ⚠️
pkg/spdk/util.go 0.00% 2 Missing ⚠️
pkg/spdk/disk/nvme/nvme.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #247      +/-   ##
========================================
- Coverage   8.39%   8.38%   -0.01%     
========================================
  Files         21      21              
  Lines       7150    7155       +5     
========================================
  Hits         600     600              
- Misses      6470    6475       +5     
  Partials      80      80              
Flag Coverage Δ
unittests 8.38% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@derekbit
Member Author

Ready for review cc @shuo-wu @c3y1huang @DamiaSan @innobead

Longhorn 9874

Signed-off-by: Derek Su <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
pkg/spdk/restore.go (2)

Line range hint 189-204: Enhance error detection mechanism

The current implementation uses string matching for error detection, which is fragile and maintenance-heavy.

Consider using error types instead of string matching:

+type RestoreCanceledError struct {
+    cause error
+}
+
+func (e *RestoreCanceledError) Error() string {
+    return fmt.Sprintf("restore canceled: %v", e.cause)
+}
+
 func (r *Restore) UpdateRestoreStatus(snapshotLvolName string, progress int, err error) {
     // ...
-    if strings.Contains(err.Error(), btypes.ErrorMsgRestoreCancelled) {
+    if _, isCanceled := err.(*RestoreCanceledError); isCanceled {
         r.log.WithError(err).Warn("Backup restoration is cancelled")
         r.State = btypes.ProgressStateCanceled
     } else {

Line range hint 234-236: Address TODO comment and implement Revert functionality

There are two issues here:

  1. The TODO comment contains a typo ("TODL" instead of "TODO")
  2. The Revert functionality is missing implementation, which could be crucial for system reliability

Would you like me to help create a GitHub issue to track the implementation of the Revert functionality? I can provide a detailed specification of the requirements and implementation approach.

pkg/spdk/disk.go (1)

Line range hint 1-379: Consider adding timeout parameters to disk management functions

Given that this file handles NVMe disk operations and the PR aims to address controller loss timeout issues, consider:

  1. Adding ctrlr_loss_timeout_sec as a parameter to relevant disk management functions, particularly in addBlockDevice and svcDiskCreate where NVMe devices are initialized.
  2. Propagating the timeout configuration through the disk creation chain.

This would provide more granular control over NVMe controller timeout behavior at the disk level.

Would you like me to propose specific code changes to incorporate timeout parameters in these functions?

pkg/spdk/backup.go (1)

170-173: Consider adding defer for cleanup in error cases

While the device opening logic is correct, consider adding a deferred cleanup in case of subsequent errors in the function. This ensures the device file handle is properly closed even if later operations fail.

 	devFh, err := os.OpenFile(b.initiator.Endpoint, os.O_RDONLY, 0666)
 	if err != nil {
 		return errors.Wrapf(err, "failed to open NVMe device %v for snapshot lvol bdev %v", b.initiator.Endpoint, lvolName)
 	}
+	// Ensure device is closed if any subsequent operation fails
+	success := false
+	defer func() {
+		if !success {
+			devFh.Close()
+		}
+	}()
 	b.devFh = devFh
+	// ... perform the remaining operations here ...
+	// Flip the flag only once everything has succeeded, typically at the
+	// end of the function, so the deferred close stays armed until then.
+	success = true
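The flag-guarded defer pattern suggested here, shown standalone as a generic sketch (not the backup code itself):

```go
package main

import (
	"fmt"
	"os"
)

// openAndPrepare demonstrates the cleanup pattern: the deferred func closes
// the file unless every later step succeeded and the flag was flipped at the
// very end of the function.
func openAndPrepare(path string, failLater bool) (*os.File, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	success := false
	defer func() {
		if !success {
			f.Close()
		}
	}()
	if failLater { // stand-in for any subsequent operation failing
		return nil, fmt.Errorf("subsequent operation failed")
	}
	success = true
	return f, nil
}

func main() {
	if _, err := openAndPrepare("/dev/null", true); err != nil {
		fmt.Println("device was closed after:", err)
	}
}
```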
pkg/spdk/engine.go (2)

100-102: Consider adding documentation for timeout values

While the initialization is correct, it would be beneficial to add documentation explaining:

  • The purpose and impact of these timeout values
  • The rationale behind the default values
  • Future plans for user configuration as indicated by the TODO

52-54: Consider adding monitoring for timeout events

While the timeout configuration changes look good, consider:

  1. Adding metrics/logging to track controller loss events
  2. Monitoring the frequency and duration of reconnection attempts
  3. Setting up alerts for repeated timeout occurrences

This would help validate the effectiveness of the timeout changes and provide early warning of potential issues.
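One lightweight way to track these events is an atomic counter bumped on each occurrence and exported for scraping. This is a sketch only; the engine has no such counters today and the names below are invented:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// reconnectStats is a hypothetical metrics holder for the events suggested
// above: declared controller losses and reconnect attempts.
type reconnectStats struct {
	ctrlrLossEvents   atomic.Int64
	reconnectAttempts atomic.Int64
}

func (s *reconnectStats) onReconnectAttempt() { s.reconnectAttempts.Add(1) }
func (s *reconnectStats) onCtrlrLoss()        { s.ctrlrLossEvents.Add(1) }

func main() {
	var s reconnectStats
	// Simulate two reconnect attempts followed by a declared controller loss.
	s.onReconnectAttempt()
	s.onReconnectAttempt()
	s.onCtrlrLoss()
	fmt.Printf("reconnects=%d losses=%d\n",
		s.reconnectAttempts.Load(), s.ctrlrLossEvents.Load()) // reconnects=2 losses=1
}
```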

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5fd74c6 and 7123d16.

⛔ Files ignored due to path filters (4)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (11)
  • pkg/client/types.go (1 hunks)
  • pkg/spdk/backup.go (3 hunks)
  • pkg/spdk/disk.go (1 hunks)
  • pkg/spdk/disk/nvme/nvme.go (2 hunks)
  • pkg/spdk/engine.go (11 hunks)
  • pkg/spdk/replica.go (3 hunks)
  • pkg/spdk/restore.go (2 hunks)
  • pkg/spdk/server.go (2 hunks)
  • pkg/spdk/types.go (2 hunks)
  • pkg/spdk/util.go (3 hunks)
  • pkg/util/block.go (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • pkg/client/types.go
  • pkg/util/block.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/spdk/disk/nvme/nvme.go
  • pkg/spdk/replica.go
  • pkg/spdk/types.go
  • pkg/spdk/util.go
🧰 Additional context used
📓 Learnings (1)
pkg/spdk/server.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-spdk-engine#247
File: pkg/spdk/server.go:70-70
Timestamp: 2024-11-28T12:13:14.445Z
Learning: In the Longhorn SPDK engine, the `ctrlr_loss_timeout_sec` parameter is set in the `connectNVMfBdev` function (used in `bdev_nvme_attach_controller`), so it's not necessary to set it in `BdevNvmeSetOptions` within `pkg/spdk/server.go`.
🔇 Additional comments (11)
pkg/spdk/restore.go (2)

13-14: LGTM: Import addition is appropriate

The addition of the nvme package import is necessary for the new NVMe initiator functionality and follows the project's import organization.


Line range hint 1-236: Verify alignment with PR objectives

The changes in this file appear to be tangential to the main PR objective of decreasing ctrlr_loss_timeout_sec. Please clarify if these changes are necessary for implementing the timeout modifications.

✅ Verification successful

Let me gather more specific information about the changes and their relation to the timeout configuration.


Based on the search results, I can now generate the final response:

Changes in restore.go are necessary for proper NVMe controller handling

The changes in restore.go are actually relevant to the PR's objective of decreasing ctrlr_loss_timeout_sec. This file implements the NVMe initiator connection logic that directly interacts with the timeout settings. Specifically:

  • It handles NVMe controller initialization and connection management through NewInitiator and Start functions
  • The file is responsible for exposing and managing NVMe devices that are subject to the timeout configurations
  • The timeout settings defined in vendor/.../types/nvme.go (CtrlrLossTimeoutSec, FastIOFailTimeoutSec) are applied to these NVMe connections
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for timeout-related configurations in the codebase
rg -l "ctrlr_loss_timeout|timeout_sec" --type go

# Search for NVMe controller configuration patterns
ast-grep --pattern 'func $_($$) {
  $$$
  timeout $$ = $_
  $$$
}'

Length of output: 316


Script:

#!/bin/bash
# Check the actual content of files containing timeout configurations
rg -A 5 -B 5 "ctrlr_loss_timeout|timeout_sec" --type go

# Look for NVMe controller and timeout related functions in the restore.go
rg -A 5 -B 5 "NewInitiator|Start|exposeSnapshotLvolBdev" pkg/spdk/restore.go

# Check if there are any related configuration settings in the types
cat vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/types/nvmf.go

Length of output: 11790

pkg/spdk/disk.go (1)

9-22: Verify alignment with PR objectives

While the import reorganization is clean, this file doesn't contain any changes related to the PR's primary objective of decreasing ctrlr_loss_timeout_sec. The changes appear to be purely organizational and don't address the timeout adjustment needed to fix the volume ERROR state issue described in longhorn/longhorn#9874.

Let's verify if the timeout-related changes exist in other files:

pkg/spdk/backup.go (3)

15-16: LGTM: Clean import addition

The NVMe package import is properly organized and necessary for the NVMe functionality.


224-226: LGTM: Improved error messages

The error messages now include the device endpoint, which provides better context for debugging NVMe device issues.


170-173: Verify NVMe controller timeout configuration

While the NVMe device handling is improved, the PR objective mentions decreasing ctrlr_loss_timeout_sec. Ensure this timeout is properly configured when establishing the NVMe connection.

✅ Verification successful

NVMe controller timeout is properly configured

The timeout configuration is correctly implemented in the codebase:

  • The default controller loss timeout is set to 15 seconds in the vendor library
  • The timeout is properly passed through the BdevNvmeAttachController calls
  • The value is configurable via the ctrlrLossTimeout field in the engine configuration
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for ctrlr_loss_timeout_sec configuration in the codebase

# Look for timeout configuration in SPDK-related files
rg -l "ctrlr_loss_timeout|ConnectNVMf" --type go

# Look for any SPDK configuration files
fd -e json -e conf -e yaml --exec rg "timeout|ctrlr_loss"

Length of output: 242


Script:

#!/bin/bash
# Let's examine the content of the files found and their context
rg -A 10 "ctrlr_loss_timeout|ConnectNVMf" vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/types/nvme.go vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/client/basic.go

# Also search for NVMe configuration related code
ast-grep --pattern 'func $_ ($$$) {
  $$$
  NVMeTransport$$$
  $$$
}'

Length of output: 3669


Script:

#!/bin/bash
# Let's look for the actual configuration of these timeout values in the codebase
rg -A 5 "bdev_nvme_attach_controller" .

# Also search for any struct initialization with these timeout values
ast-grep --pattern '{
  $$$
  ctrlr_loss_timeout_sec: $_,
  $$$
}'

# And check for any RPC calls that might set these values
rg -A 5 "BdevNvmeAttachController" .

Length of output: 5004


Script:

#!/bin/bash
# Let's check the actual values being used for the timeout
rg -B 5 "ctrlrLossTimeout" pkg/spdk/engine.go

# And check if there are any configuration changes in the PR
rg "ctrlrLossTimeout.*=" .

# Also check for any default values
rg "DefaultCtrlrLossTimeout" .

Length of output: 1776

pkg/spdk/server.go (2)

14-23: LGTM! Import organization looks good.

The import statements are well-organized, following Go's standard grouping convention: standard library, external packages, and internal packages.


70-75: LGTM! NVMe options configuration looks good.

The timeout parameters are properly configured for the SPDK NVMe operations. Based on the learnings, setting ctrlr_loss_timeout_sec here is correct as it's handled in the connectNVMfBdev function.

Let's verify the timeout values are defined correctly:

✅ Verification successful

Let me verify the usage of these timeout values with one more check to ensure they are used consistently.


NVMe timeout configuration values are properly set and used

The verification confirms that:

  • The timeout constants are well-defined in pkg/spdk/types.go with appropriate values:
    • replicaCtrlrLossTimeoutSec = 15
    • replicaReconnectDelaySec = 2
    • replicaFastIOFailTimeoutSec = 10
    • replicaTransportAckTimeout = 14
    • replicaKeepAliveTimeoutMs = 10000
  • These values are correctly passed to BdevNvmeSetOptions which is documented to set global parameters for all bdev NVMe
  • The function properly maps these values to the corresponding JSON fields in the SPDK RPC request
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the timeout constants definition
# Expected: Find the timeout constants and their values

# Search for the timeout constants
rg -A 1 'replica.*Timeout.*=' 

Length of output: 497


Script:

#!/bin/bash
# Search for the BdevNvmeSetOptions function implementation and its usage
ast-grep --pattern 'BdevNvmeSetOptions($$$)'

# Also check if these timeout values are referenced elsewhere
rg "BdevNvmeSetOptions|ctrlr_loss_timeout_sec" -B 2 -A 2

Length of output: 3170

pkg/spdk/engine.go (3)

52-53: LGTM: Well-structured timeout configuration fields

The addition of ctrlrLossTimeout and fastIOFailTimeoutSec fields to the Engine struct provides a clean way to configure NVMe controller timeouts.


219-219: LGTM: Consistent timeout handling in NVMe connections

The changes properly integrate the new timeout parameters across all NVMe connection points, ensuring consistent behavior for controller loss scenarios.

Also applies to: 1090-1090, 1660-1660, 1966-1968


685-685: LGTM: Improved error handling and validation

The enhanced error messages and validation logic provide better diagnostics for NVMe-related issues, making it easier to identify and debug problems with replica connections.

Also applies to: 934-934, 939-939, 943-943, 947-947

pkg/spdk/restore.go

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
pkg/spdk/disk/nvme/nvme.go (1)

19-25: Consider making timeouts configurable via environment variables

These timeout values are critical for system stability and might need adjustment in different environments. Consider making them configurable through environment variables with these constants serving as defaults.

Example approach:

const (
        // Timeouts for disk bdev
-       diskCtrlrLossTimeoutSec  = 30
-       diskReconnectDelaySec    = 2
-       diskFastIOFailTimeoutSec = 15
+       defaultDiskCtrlrLossTimeoutSec  = 10
+       defaultDiskReconnectDelaySec    = 2
+       defaultDiskFastIOFailTimeoutSec = 15
)

+func getTimeoutConfig() (ctrlrLoss, reconnectDelay, fastIOFail int) {
+       ctrlrLoss = defaultDiskCtrlrLossTimeoutSec
+       reconnectDelay = defaultDiskReconnectDelaySec
+       fastIOFail = defaultDiskFastIOFailTimeoutSec
+       if val := os.Getenv("DISK_CTRLR_LOSS_TIMEOUT"); val != "" {
+               if timeout, err := strconv.Atoi(val); err == nil {
+                       ctrlrLoss = timeout
+               }
+       }
+       // Similarly for the reconnect delay and fast I/O fail timeouts...
+       return ctrlrLoss, reconnectDelay, fastIOFail
+}
pkg/spdk/restore.go (1)

Line range hint 134-143: Add controller loss timeout configuration

According to the PR objectives, the ctrlr_loss_timeout_sec needs to be decreased to prevent v2 volumes from entering ERROR state. However, the timeout is not configured when creating the NVMe initiator.

Consider adding timeout configuration when creating the initiator:

 initiator, err := nvme.NewInitiator(lvolName, helpertypes.GetNQN(lvolName), nvme.HostProc)
 if err != nil {
     return nil, "", errors.Wrapf(err, "failed to create NVMe initiator for lvol bdev %v", lvolName)
 }
+// Configure a shorter timeout to prevent ERROR state during instance manager
+// deletion (assumes a SetControllerLossTimeout-style setter on the initiator).
+if err := initiator.SetControllerLossTimeout(15); err != nil {
+    return nil, "", errors.Wrapf(err, "failed to set controller loss timeout for lvol bdev %v", lvolName)
+}
 if _, err := initiator.Start(r.ip, strconv.Itoa(int(r.port)), true); err != nil {
pkg/spdk/engine.go (2)

52-53: LGTM! Consider adding field documentation.

The new timeout fields are well-placed in the Engine struct. Consider adding godoc comments to document their purpose and units.

 type Engine struct {
+    // ctrlrLossTimeout is the number of seconds to wait before marking 
+    // an NVMe controller as lost/dead. Default: 30 seconds
     ctrlrLossTimeout     int
+    // fastIOFailTimeoutSec is the number of seconds to wait before failing 
+    // I/O operations when an NVMe controller is unresponsive
     fastIOFailTimeoutSec int

100-102: Consider implementing user-defined timeout values.

The TODO comment indicates that user-defined timeout values should be supported. Consider:

  1. Adding timeout parameters to the engine configuration
  2. Implementing validation for user-provided values
  3. Falling back to defaults if not specified
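A minimal sketch of that validate-and-fall-back flow; the helper is hypothetical, with the 15s/10s defaults taken from the values discussed in this PR:

```go
package main

import "fmt"

const (
	defaultCtrlrLossTimeoutSec  = 15
	defaultFastIOFailTimeoutSec = 10
)

// resolveTimeouts takes user-provided values (0 meaning "not set"), validates
// that the fast I/O fail timeout stays below the controller loss timeout, and
// falls back to the defaults otherwise.
func resolveTimeouts(userCtrlrLoss, userFastIOFail int) (ctrlrLoss, fastIOFail int) {
	if userCtrlrLoss > 0 && userFastIOFail > 0 && userFastIOFail < userCtrlrLoss {
		return userCtrlrLoss, userFastIOFail
	}
	// Unset or invalid: keep the defaults; the caller can log a warning.
	return defaultCtrlrLossTimeoutSec, defaultFastIOFailTimeoutSec
}

func main() {
	fmt.Println(resolveTimeouts(0, 0))   // 15 10 (nothing set)
	fmt.Println(resolveTimeouts(20, 30)) // 15 10 (invalid: fast I/O fail >= ctrlr loss)
	fmt.Println(resolveTimeouts(30, 20)) // 30 20 (valid user values)
}
```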
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7123d16 and ea2b04f.

📒 Files selected for processing (11)
  • pkg/client/types.go (1 hunks)
  • pkg/spdk/backup.go (3 hunks)
  • pkg/spdk/disk.go (1 hunks)
  • pkg/spdk/disk/nvme/nvme.go (2 hunks)
  • pkg/spdk/engine.go (11 hunks)
  • pkg/spdk/replica.go (3 hunks)
  • pkg/spdk/restore.go (2 hunks)
  • pkg/spdk/server.go (2 hunks)
  • pkg/spdk/types.go (2 hunks)
  • pkg/spdk/util.go (3 hunks)
  • pkg/util/block.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • pkg/client/types.go
  • pkg/spdk/backup.go
  • pkg/spdk/disk.go
  • pkg/spdk/replica.go
  • pkg/spdk/server.go
  • pkg/spdk/types.go
  • pkg/spdk/util.go
  • pkg/util/block.go
🔇 Additional comments (9)
pkg/spdk/disk/nvme/nvme.go (2)

20-20: ⚠️ Potential issue

Decrease diskCtrlrLossTimeoutSec to prevent race conditions

The 30-second timeout is too long and may cause v2 volumes to enter ERROR state. This needs to be decreased to minimize the window for race conditions where the initiator incorrectly concludes that the target has failed.

-       diskCtrlrLossTimeoutSec  = 30
+       diskCtrlrLossTimeoutSec  = 10  // Reduced to minimize race condition window

58-58: LGTM! Verify timeout values in production

The implementation correctly uses the new timeout constants. However, given the critical nature of these timeouts in production environments, we should verify these values work well under various conditions.

Let's verify the usage of these timeout values across the codebase:

✅ Verification successful

Timeout configurations are consistent and well-documented

The codebase shows consistent and well-documented timeout configurations:

  • Disk timeouts (pkg/spdk/disk/nvme/nvme.go):

    • Controller Loss: 30s
    • Reconnect Delay: 2s
    • Fast I/O Fail: 15s
  • Replica timeouts (pkg/spdk/types.go):

    • Controller Loss: 15s
    • Reconnect Delay: 2s
    • Fast I/O Fail: 10s

The implementation is correct and the timeout values are properly documented with clear rationale, particularly for replica timeouts where it's noted that "ctrl-loss-tmo for the NVMe-oF initiator connecting to the RAID target is set to 30 seconds" and the replica timeouts are intentionally set lower to ensure proper error handling cascade.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any other timeout configurations that might need adjustment
rg -g '*.go' "CtrlrLossTimeout|ReconnectDelay|FastIOFailTimeout" -B 2 -A 2

# Look for any related error handling or timeout scenarios
ast-grep --pattern 'if err != nil {
  $$$
  return $_, errors.Wrap($_, "timeout")
  $$$
}'

Length of output: 12260

pkg/spdk/restore.go (4)

13-14: LGTM: Import addition is appropriate

The addition of the nvme package import is necessary for the NVMe initiator functionality and is properly organized.


Line range hint 191-218: LGTM: Proper state management

The status update logic correctly handles different scenarios:

  • Distinguishes between cancellation and errors
  • Maintains proper state transitions
  • Updates LastRestored field appropriately

160-162: 🛠️ Refactor suggestion

Improve error handling in NVMe device closure

The error handling for closing the NVMe device should ensure proper cleanup even if an error occurs.

Consider implementing the previously suggested improvement:

 func (r *Restore) CloseVolumeDev(volDev *os.File) error {
+    var closeErr error
     r.log.Infof("Closing NVMe device %v", r.initiator.Endpoint)
-    if err := volDev.Close(); err != nil {
-        return errors.Wrapf(err, "failed to close NVMe device %v", r.initiator.Endpoint)
+    if err := volDev.Close(); err != nil {
+        closeErr = errors.Wrapf(err, "failed to close NVMe device %v", r.initiator.Endpoint)
     }
+    // Continue with the remaining cleanup even if close failed,
+    // then surface any close error to the caller.
+    return closeErr

Line range hint 134-143: Verify timeout configuration consistency

Let's ensure the timeout configuration is consistently applied across the codebase.

pkg/spdk/engine.go (3)

934-947: LGTM! Robust error handling for NVMe operations.

The error handling for NVMe operations has been improved with:

  • Detailed validation of NVMe controller information
  • Clear error messages with context
  • Proper validation of address family and transport type

Line range hint 1258-1262: LGTM! Well-documented error handling strategy.

The error handling for replica operations is well-implemented:

  • Clear comments explaining the rationale for error handling decisions
  • Proper distinction between fatal and non-fatal errors
  • Appropriate logging of errors while allowing the flow to continue

219-219: Verify consistent timeout parameter usage across all NVMe operations.

The timeout parameters are consistently used in all NVMe connection operations. This change effectively implements the decreased ctrlr_loss_timeout_sec requirement across:

  • Initial replica connection
  • Rebuilding replica connection
  • Snapshot revert operations
  • Backup restore operations

Let's verify that all NVMe connection points use the new timeout parameters:

Also applies to: 1090-1090, 1660-1660, 1966-1968

✅ Verification successful

Based on the search results, I can generate the final response as I have sufficient information to verify the timeout parameter usage:

Timeout parameters are consistently used across all NVMe operations

All NVMe connection points in the codebase correctly use the timeout parameters:

  • connectNVMfBdev calls consistently pass ctrlrLossTimeout and fastIOFailTimeoutSec in:
    • Initial replica connection (engine.go:219)
    • Rebuilding operations (replica.go)
    • External snapshot operations (replica.go)
  • BdevNvmeAttachController calls properly use the timeout parameters in:
    • NVMe device attachment (nvme.go)
    • Replica controller attachment (engine.go:1966)
    • NVMf connection utility (util.go)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all NVMe connection operations use the timeout parameters
# Test: Search for NVMe connection function calls. Expect: All calls include timeout parameters

# Check for connectNVMfBdev calls
echo "Checking connectNVMfBdev calls..."
rg "connectNVMfBdev\(" -A 3

# Check for BdevNvmeAttachController calls
echo "Checking BdevNvmeAttachController calls..."
rg "BdevNvmeAttachController\(" -A 5

Length of output: 7256

@derekbit derekbit merged commit d213adf into longhorn:main Nov 29, 2024
7 of 8 checks passed
@derekbit derekbit deleted the issue-9874 branch November 29, 2024 09:16