fix: decrease ctrlr_loss_timeout_sec for base bdev #247
Conversation
Walkthrough: The pull request introduces significant modifications to the NVMe controller timeout handling for base bdevs in the SPDK engine, decreasing `ctrlr_loss_timeout_sec` to address longhorn/longhorn#9874.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (9)
pkg/spdk/util.go (1)
Line range hint 80-92: Document the timeout parameters

Please add documentation for the new parameters explaining:
- Expected value ranges
- Default recommendations
- Impact on error recovery
- Relationship with `DefaultReplicaReconnectDelaySec`
This will help users understand how to properly configure these timeouts.
```diff
+// connectNVMfBdev connects to the NVMe-oF target, which is exposed by a remote lvol bdev.
+// Parameters:
+//   - controllerName: typically the lvol name
+//   - address: the IP:port of the NVMe-oF target
+//   - ctrlrLossTimeout: duration in seconds that SPDK will try to reconnect to the controller
+//   - fastIOFailTimeoutSec: duration in seconds before failing I/O operations
+// Returns the bdev name and any error encountered.
 func connectNVMfBdev(spdkClient *spdkclient.Client, controllerName, address string, ctrlrLossTimeout, fastIOFailTimeoutSec int) (bdevName string, err error) {
```

pkg/spdk/server.go (5)
Line range hint 94-103: Add error handling for SPDK target reconnection failures

The new method `tryEnsureSPDKTgtConnectionHealthy` effectively checks the SPDK target's health and attempts to reconnect. However, if `clientReconnect` fails, the error is logged but not acted upon. Consider implementing a retry mechanism with exponential backoff, or notifying upstream components to handle prolonged disconnections gracefully.
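As a rough illustration of the suggested retry mechanism (a sketch only; the attempt count, delays, and the way `clientReconnect` is invoked are assumptions, not the project's actual API):

```go
// retryWithBackoff retries fn with exponential backoff, so a failed
// clientReconnect is reattempted instead of only being logged.
func (s *Server) retryWithBackoff(maxAttempts int, baseDelay time.Duration, fn func() error) error {
	var err error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		logrus.WithError(err).Warnf("spdk gRPC server: reconnect attempt %d/%d failed, retrying in %v", attempt, maxAttempts, delay)
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
	return err
}

// Usage sketch: err := s.retryWithBackoff(5, time.Second, s.clientReconnect)
```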
Line range hint 261-273: Simplify the monitoring loop logic

The `done` flag in the `monitoring` method is unnecessary, as the `for` loop can be exited directly with a `return` statement when the context is done. This simplifies the code and improves readability.

Apply this diff to remove the `done` variable:

```diff
 func (s *Server) monitoring() {
 	ticker := time.NewTicker(MonitorInterval)
 	defer ticker.Stop()
-	done := false
 	for {
 		select {
 		case <-s.ctx.Done():
 			logrus.Info("spdk gRPC server: stopped monitoring replicas due to the context done")
-			done = true
+			return
 		case <-ticker.C:
 			err := s.verify()
 			if err == nil {
 				break
 			}
 			logrus.WithError(err).Errorf("spdk gRPC server: failed to verify and update replica cache, will retry later")
 			if jsonrpc.IsJSONRPCRespErrorBrokenPipe(err) || jsonrpc.IsJSONRPCRespErrorInvalidCharacter(err) {
 				err = s.tryEnsureSPDKTgtConnectionHealthy()
 				if err != nil {
 					logrus.WithError(err).Error("spdk gRPC server: failed to ensure spdk_tgt connection healthy")
 				}
 			}
 		}
-		if done {
-			break
-		}
 	}
 }
```
Line range hint 483-503: Handle errors in the deferred function in the `verify` method carefully

In the `verify` method, a deferred function checks whether `err` is not `nil` to handle JSON-RPC errors. If `err` is modified within the method after certain operations, the deferred function might not capture the intended error. Consider capturing `err` in a local variable inside the deferred function to ensure accurate error handling.

Modify the deferred function as follows:

```diff
 func (s *Server) verify() (err error) {
 	// ... existing code ...
 	defer func() {
+		deferredErr := err
 		if deferredErr == nil {
 			return
 		}
 		if jsonrpc.IsJSONRPCRespErrorBrokenPipe(deferredErr) {
 			logrus.WithError(deferredErr).Warn("spdk gRPC server: marking all non-stopped and non-error replicas and engines as error")
 			for _, r := range replicaMapForSync {
 				r.SetErrorState()
 			}
 			for _, e := range engineMapForSync {
 				e.SetErrorState()
 			}
 		}
 	}()
 	// ... existing code ...
 }
```
This change ensures that the deferred function uses the correct value of `err` at the time of deferral.
Line range hint 575-589: Improve error messages for engine operations

In methods like `EngineCreate`, error messages can be more descriptive. For instance, when an invalid `Frontend` type is provided, specify the allowed values to guide the user.

Apply this diff to enhance the error message:

```diff
 if req.Frontend != types.FrontendSPDKTCPBlockdev && req.Frontend != types.FrontendSPDKTCPNvmf && req.Frontend != types.FrontendEmpty {
-	return nil, grpcstatus.Error(grpccodes.InvalidArgument, "engine frontend is required")
+	return nil, grpcstatus.Errorf(grpccodes.InvalidArgument, "invalid engine frontend '%v'; allowed values are '%v', '%v', or '%v'",
+		req.Frontend, types.FrontendSPDKTCPBlockdev, types.FrontendSPDKTCPNvmf, types.FrontendEmpty)
 }
```

This provides clearer guidance on acceptable `Frontend` values.
Line range hint 1142-1160: Check for nil references before accessing 'restore' fields

In the `ReplicaRestoreStatus` method, there is a condition that checks `if replica.restore == nil`. Ensure that subsequent accesses to `replica.restore` fields are safe and won't cause a nil pointer dereference.

Verify that all usages of `replica.restore` are protected against `nil` references. Consider restructuring the code to prevent potential panics.

pkg/spdk/replica.go (2)
1329-1330: Ensure consistent usage of timeout configurations

In the call to `connectNVMfBdev`, the timeout values `helpertypes.DefaultReplicaCtrlrLossTimeoutSec` and `helpertypes.DefaultReplicaFastIOFailTimeoutSec` are hardcoded. However, in other parts of the codebase, such as within the `Engine` methods, these timeout values are passed as parameters from the `Engine` struct. For consistency and enhanced configurability, consider passing the timeout parameters from the `Engine` struct or making them configurable at a higher level.
1437-1438: Ensure consistent usage of timeout configurations

Similar to the previous comment, the call to `connectNVMfBdev` here uses hardcoded default timeout values. To maintain consistency across the codebase and improve flexibility, consider passing these timeout values as parameters or making them configurable, aligning with how timeouts are managed elsewhere in the code.

pkg/spdk/engine.go (1)
52-54: Add comments to new timeout fields for clarity

Consider adding comments to the newly added fields `ctrlrLossTimeout` and `fastIOFailTimeoutSec` in the `Engine` struct to explain their purpose and how they affect the engine's behavior.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (5)
- `go.mod` is excluded by `!go.mod`
- `go.sum` is excluded by `!**/*.sum`, `!go.sum`
- `vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/client/basic.go` is excluded by `!vendor/**`
- `vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go` is excluded by `!vendor/**`
- `vendor/modules.txt` is excluded by `!vendor/**`
📒 Files selected for processing (4)
- `pkg/spdk/engine.go` (7 hunks)
- `pkg/spdk/replica.go` (3 hunks)
- `pkg/spdk/server.go` (1 hunks)
- `pkg/spdk/util.go` (2 hunks)
🔇 Additional comments (6)
pkg/spdk/util.go (1)
92-92: Verify the timeout values across the codebase

Let's ensure the timeout values being passed to this function align with the PR objective of decreasing `ctrlr_loss_timeout_sec`.
✅ Verification successful
Based on the search results, I can now generate the final response as the information is sufficient to verify the timeout values:
Timeout values are properly aligned with PR objectives
The codebase shows consistent usage of decreased timeout values:
- Default controller loss timeout is set to 15 seconds (`DefaultReplicaCtrlrLossTimeoutSec = 15`)
- Fast I/O failure timeout is set to 10 seconds (`DefaultReplicaFastIOFailTimeoutSec = 10`)
- These values are consistently used across replica and engine operations for NVMe-oF connections
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for calls to connectNVMfBdev to verify the timeout values being used
rg -A 2 "connectNVMfBdev\("
# Search for any other timeout-related constants that might affect these parameters
rg "LossTimeout|FailTimeout"
```
Length of output: 7686
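Given the `connectNVMfBdev` signature documented earlier in this review, a call site using the decreased defaults would look roughly like this (a sketch; surrounding error handling is abbreviated):

```go
// Connect to the remote lvol's NVMe-oF target with the decreased defaults:
// 15s controller loss timeout and 10s fast I/O fail timeout.
bdevName, err := connectNVMfBdev(spdkClient, controllerName, address,
	helpertypes.DefaultReplicaCtrlrLossTimeoutSec, helpertypes.DefaultReplicaFastIOFailTimeoutSec)
if err != nil {
	return errors.Wrapf(err, "failed to connect NVMe-oF bdev for %v", controllerName)
}
```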
pkg/spdk/server.go (1)
Line range hint 127-144: Avoid potential race conditions during client reconnection

In the `clientReconnect` method, the SPDK client is replaced while holding a lock. Ensure that all accesses to `s.spdkClient` elsewhere in the code properly acquire the necessary locks to prevent race conditions, and verify that no other goroutines might access `s.spdkClient` concurrently without synchronization.

To confirm there are no unsynchronized accesses to `s.spdkClient`, run the following script. It searches for instances where `spdkClient` is used without the `s.` prefix, which may indicate direct accesses without proper locking.
✅ Verification successful
Based on the search results, I can now generate the final response since I have enough information to verify the concern about potential race conditions during client reconnection.
Proper synchronization is in place for `s.spdkClient` access

All accesses to `s.spdkClient` throughout the codebase are properly synchronized using the server's mutex locks. The code consistently follows these patterns:

- Read access: protected by `s.RLock()` and `s.RUnlock()`
- Write access: protected by `s.Lock()` and `s.Unlock()`

The client reconnection in the `clientReconnect` method is properly synchronized, and all other methods accessing `s.spdkClient` also acquire the appropriate locks before use. This includes all RPC handlers and internal methods.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Search for unsynchronized accesses to s.spdkClient
# Find all occurrences where s.spdkClient is accessed without locking
rg --pcre2 '(?<!s\.)spdkClient' pkg/spdk/ -A 3 -B 3
```
Length of output: 102405
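A minimal sketch of the locking pattern being verified (the struct layout and method names here are illustrative, not the actual server implementation):

```go
// Reads of s.spdkClient take the read lock; reconnection swaps the client
// under the write lock, so no goroutine observes a half-replaced client.
type Server struct {
	sync.RWMutex
	spdkClient *spdkclient.Client
}

func (s *Server) getSPDKClient() *spdkclient.Client {
	s.RLock()
	defer s.RUnlock()
	return s.spdkClient
}

func (s *Server) replaceSPDKClient(newClient *spdkclient.Client) {
	s.Lock()
	defer s.Unlock()
	s.spdkClient = newClient
}
```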
pkg/spdk/replica.go (1)
18-19: Imports updated appropriately

The new imports `github.com/longhorn/go-spdk-helper/pkg/jsonrpc` and `github.com/longhorn/types/pkg/generated/spdkrpc` are necessary for the added functionality and appear correctly included.
pkg/spdk/engine.go (3)
100-102: Ensure default timeout values are appropriate

Confirm that initializing `ctrlrLossTimeout` and `fastIOFailTimeoutSec` with `helpertypes.DefaultReplicaCtrlrLossTimeoutSec` and `helpertypes.DefaultReplicaFastIOFailTimeoutSec` respectively aligns with the desired behavior. If these values should be configurable externally, consider exposing them as parameters.
218-218: Use consistent timeout parameters across `connectNVMfBdev` calls

In this call to `connectNVMfBdev`, you are passing `e.ctrlrLossTimeout` and `e.fastIOFailTimeoutSec`. Ensure that these engine-configured timeout values are intended here and verify their consistency with other calls to the same function.
1660-1663: Consistent use of engine's timeout parameters

Good job using `e.ctrlrLossTimeout` and `e.fastIOFailTimeoutSec` in this `connectNVMfBdev` call. This ensures consistency with the engine's configured timeouts.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
pkg/spdk/replica.go (1)
1437-1438: LGTM! Consider documenting the timeout behavior.

The addition of timeout parameters is consistent with the previous changes and helps prevent timeout-related issues during rebuilding operations.
Consider adding a comment explaining the timeout behavior and its impact on rebuilding operations, especially in scenarios where instance managers are deleted.
pkg/spdk/engine.go (1)
1965-1968: Consider using engine's timeout for `reconnectDelaySec`

While the change properly uses the engine's timeout values for `ctrlrLossTimeout` and `fastIOFailTimeoutSec`, it still uses the default value for `reconnectDelaySec`. For consistency, consider making `reconnectDelaySec` configurable through the engine as well.

```diff
- helpertypes.DefaultReplicaReconnectDelaySec, int32(e.fastIOFailTimeoutSec),
+ int32(e.reconnectDelaySec), int32(e.fastIOFailTimeoutSec),
```

This would require adding a new field to the Engine struct: `reconnectDelaySec int`.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (4)
- `pkg/spdk/engine.go` (7 hunks)
- `pkg/spdk/replica.go` (3 hunks)
- `pkg/spdk/server.go` (1 hunks)
- `pkg/spdk/util.go` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/spdk/util.go
🔇 Additional comments (5)
pkg/spdk/replica.go (1)
1329-1330: LGTM! Verify the default timeout values.

The addition of timeout parameters to `connectNVMfBdev` is aligned with the PR's objective. The use of default values from the `helpertypes` package ensures consistent timeout handling.

Let's verify the default timeout values:
✅ Verification successful
Default timeout values are correctly defined and used
The verification confirms that the default timeout values are properly defined in the `go-spdk-helper` package:

- `DefaultReplicaCtrlrLossTimeoutSec = 15`
- `DefaultReplicaFastIOFailTimeoutSec = 10`

These values are consistently used across the codebase in `pkg/spdk/replica.go` and `pkg/spdk/engine.go`. The timeouts are set to reasonable values, with the controller loss timeout being longer than the fast I/O fail timeout, which is the expected configuration.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check the default timeout values in helpertypes package
# Search for DefaultReplicaCtrlrLossTimeoutSec and DefaultReplicaFastIOFailTimeoutSec definitions
rg -A 1 "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaFastIOFailTimeoutSec"
```
Length of output: 1423
pkg/spdk/engine.go (3)
52-53: LGTM: Well-structured timeout field additions
The new timeout fields are appropriately named and typed for their purpose.
100-101: LGTM: Proper initialization of timeout fields
The timeout fields are correctly initialized using default values from the helpertypes package.
218-218: LGTM: Consistent timeout parameter usage
The engine's timeout values are consistently passed to all connectNVMfBdev calls, properly implementing the timeout control across the codebase.
Also applies to: 1089-1089, 1659-1659
pkg/spdk/server.go (1)
70-70: Ensure `ctrlr_loss_timeout_sec` is set to the decreased value

To meet the PR objective of decreasing `ctrlr_loss_timeout_sec`, include this parameter in the `BdevNvmeSetOptions` call with the desired reduced value. Currently, it's not being set, which may prevent the intended behavior change.
Apply this diff to set the `ctrlr_loss_timeout_sec` parameter:

```diff
 if _, err = cli.BdevNvmeSetOptions(
 	helpertypes.DefaultTransportAckTimeout,
 	helpertypes.DefaultKeepAliveTimeoutMs,
+	helpertypes.DecreasedCtrlrLossTimeoutSec,
 ); err != nil {
 	return nil, errors.Wrap(err, "failed to set NVMe options")
 }
```

Ensure that `helpertypes.DecreasedCtrlrLossTimeoutSec` is defined with the appropriate timeout value as per the PR requirements.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
pkg/spdk/replica.go (1)
1329-1330: Consider adding monitoring for timeout-related events.

To validate the effectiveness of these timeout parameters in production and catch potential issues early:
- Consider adding metrics/logging for:
- NVMe controller connection loss events
- Reconnection attempt durations
- Failed reconnection counts
- Set up alerts for cases where reconnection consistently takes longer than expected
This will help tune these timeout values based on real-world data.
Also applies to: 1437-1438
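As an illustrative sketch of the suggested metrics (Prometheus-style counters; every name below is hypothetical, not an existing Longhorn metric):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counters for NVMe controller connection events.
var (
	ctrlrLossEvents = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "spdk_nvme_ctrlr_loss_events_total",
			Help: "NVMe controller connection loss events.",
		},
		[]string{"replica"},
	)
	reconnectFailures = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "spdk_nvme_reconnect_failures_total",
			Help: "Failed NVMe controller reconnection attempts.",
		},
		[]string{"replica"},
	)
)

func init() {
	prometheus.MustRegister(ctrlrLossEvents, reconnectFailures)
}

// Example: ctrlrLossEvents.WithLabelValues(replicaName).Inc() at the point
// where a controller loss is detected.
```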
pkg/spdk/engine.go (1)
52-53: LGTM! Consider adding field documentation.

The new timeout fields are well-placed and follow Go naming conventions. Consider adding documentation comments to explain their purpose and units.
Add documentation like:
```go
// ctrlrLossTimeout specifies how long (in seconds) SPDK will attempt to
// reconnect to the NVMe controller after a connection loss
// fastIOFailTimeoutSec specifies how long (in seconds) to wait for I/O to
// complete before failing the I/O
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (4)
- `go.mod` is excluded by `!go.mod`
- `go.sum` is excluded by `!**/*.sum`, `!go.sum`
- `vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go` is excluded by `!vendor/**`
- `vendor/modules.txt` is excluded by `!vendor/**`
📒 Files selected for processing (4)
- `pkg/spdk/engine.go` (7 hunks)
- `pkg/spdk/replica.go` (3 hunks)
- `pkg/spdk/server.go` (1 hunks)
- `pkg/spdk/util.go` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/spdk/util.go
🧰 Additional context used
📓 Learnings (1)
pkg/spdk/server.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-spdk-engine#247
File: pkg/spdk/server.go:70-70
Timestamp: 2024-11-28T12:13:14.445Z
Learning: In the Longhorn SPDK engine, the `ctrlr_loss_timeout_sec` parameter is set in the `connectNVMfBdev` function (used in `bdev_nvme_attach_controller`), so it's not necessary to set it in `BdevNvmeSetOptions` within `pkg/spdk/server.go`.
🔇 Additional comments (6)
pkg/spdk/server.go (2)
73-73: LGTM: Improved error message capitalization
The error message now correctly capitalizes "NVMe" which improves consistency.
68-70: Verify timeout values align with PR objectives

The changes introduce new timeout parameters for NVMe options. Based on the PR objective to decrease `ctrlr_loss_timeout_sec`, we should verify these values.
✅ Verification successful
Timeout values are properly aligned with PR objectives
The verification confirms that the timeout values have been decreased as intended:
- `DefaultReplicaCtrlrLossTimeoutSec = 15` (controller loss timeout)
- `DefaultReplicaReconnectDelaySec = 2` (reconnect delay)
- `DefaultReplicaFastIOFailTimeoutSec = 10` (fast I/O fail timeout)

These values are consistently used across the codebase in `pkg/spdk/server.go`, `pkg/spdk/replica.go`, `pkg/spdk/util.go`, and `pkg/spdk/engine.go`, with the proper hierarchical relationship (reconnect delay < fast I/O fail timeout < controller loss timeout).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check the default timeout values in helper types
# Search for the timeout constant definitions
rg -A 1 "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaReconnectDelaySec|DefaultReplicaFastIOFailTimeoutSec"
# Search for any tests validating these timeouts
ast-grep --pattern 'func Test$_($_, $_) {
  $$$
  DefaultReplicaCtrlrLossTimeoutSec
  $$$
}'
```
Length of output: 1860
pkg/spdk/replica.go (2)
1329-1330: LGTM! Addition of timeout parameters improves connection stability.

The addition of the `DefaultReplicaCtrlrLossTimeoutSec` and `DefaultReplicaFastIOFailTimeoutSec` parameters helps prevent the v2 volume from entering ERROR state by providing better control over NVMe controller reconnection attempts during rebuilding.

Let's verify the timeout values are consistently used across the codebase:
✅ Verification successful
Timeout parameters are consistently used across NVMf connections
Based on the search results, I can verify that:
- All `connectNVMfBdev` calls consistently use timeout parameters
- The engine uses `DefaultReplicaCtrlrLossTimeoutSec` from initialization
- Both replica rebuilding paths (source and destination) use the same timeout values
- The default value is set to 15 seconds in the helper types package
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for consistent usage of timeout parameters
# Expected: All connectNVMfBdev calls should use these timeout parameters

# Search for connectNVMfBdev calls
rg "connectNVMfBdev\(" -A 3
# Search for DefaultReplicaCtrlrLossTimeoutSec usage
rg "DefaultReplicaCtrlrLossTimeoutSec"
```
Length of output: 5505
1437-1438: LGTM! Consistent timeout handling for rebuilding process.
The addition of timeout parameters here ensures consistent behavior between source and destination replicas during rebuilding, which helps prevent race conditions that could lead to ERROR states.
Let's verify the rebuilding process handles these timeouts correctly:
✅ Verification successful
LGTM! Timeout parameters are properly handled in the rebuilding process
The code correctly implements timeout handling for NVMe connections during rebuilding:
- The `connectNVMfBdev` function properly sets both the controller loss timeout and the fast I/O fail timeout parameters
- Default timeout values are well-defined constants: `DefaultReplicaCtrlrLossTimeoutSec = 15` and `DefaultReplicaFastIOFailTimeoutSec = 10`
- Error handling is in place for connection failures and timeouts
- The implementation is consistent across both source and destination replicas during rebuilding
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check rebuilding process error handling
# Expected: Proper error handling around NVMe controller timeouts

# Search for rebuilding error handling
rg "rebuildingError.*timeout" -B 2 -A 2
# Search for rebuilding state transitions
rg "rebuildingState.*Error.*timeout" -B 2 -A 2
```
Length of output: 90
Script:
```bash
#!/bin/bash
# Let's check for error handling in rebuilding process more broadly

# Search for rebuilding error handling patterns
rg "RebuildingDstStart.*error" -A 5
rg "RebuildingDstStart.*err" -A 5
# Search for NVMf connection error handling
rg "connectNVMfBdev.*error" -A 5
rg "connectNVMfBdev.*err" -A 5
# Look for timeout related error handling in the replica code
rg "timeout.*error" pkg/spdk/
rg "timeout.*err" pkg/spdk/
```
Length of output: 19856
Script:
```bash
#!/bin/bash
# Let's check the error handling in connectNVMfBdev implementation
ast-grep --pattern 'func connectNVMfBdev($$$) {
  $$$
}'
# Check for NVMe controller timeout handling
rg "BdevNvmeSetControllerTimeout|BdevNvmeSetFastIOFailTimeout" -A 3 -B 3
# Look for NVMe connection error handling in the rebuilding process
rg "connectNVMfBdev.*rebuilding" -A 5 -B 5
```
Length of output: 2404
Script:
```bash
#!/bin/bash
# Let's examine the implementation of connectNVMfBdev and timeout handling
rg -l "func connectNVMfBdev" pkg/spdk/
cat $(rg -l "func connectNVMfBdev" pkg/spdk/)
# Check for timeout parameter usage in NVMe connections
rg "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaFastIOFailTimeoutSec" -A 2 -B 2
```
Length of output: 8085
pkg/spdk/engine.go (2)
218-218: LGTM! Consistent usage of timeout parameters.
The changes consistently use the engine's configured timeout values across all NVMe connection points, which aligns with the PR objective to adjust controller loss timeout behavior.
Also applies to: 1089-1089, 1659-1659, 1965-1968
100-101: Verify timeout values for edge cases.
The default timeout values should be carefully considered as they affect error recovery. Please verify:
- The decreased `ctrlrLossTimeout` value is sufficient for network hiccups
- The timeout values work well in high-load scenarios
- The values are appropriate for various network conditions
✅ Verification successful
Based on the search results, I can now provide a definitive response about the timeout values and their usage:
The default timeout values are appropriate and well-configured
The timeout values are properly set with a good balance between reliability and responsiveness:
- `DefaultReplicaCtrlrLossTimeoutSec = 15` - this 15-second timeout provides sufficient time for network recovery while preventing extended hangs
- `DefaultReplicaFastIOFailTimeoutSec = 10` - the 10-second fast I/O failure timeout allows quick failure detection while avoiding premature timeouts
- The timeouts are used consistently throughout the codebase for NVMe controller operations
The relationship between these timeouts is correctly structured:
- Controller loss timeout (15s) > Fast I/O failure timeout (10s)
- This hierarchy ensures proper error handling escalation
The error handling code shows these timeouts are used appropriately with proper connection recovery mechanisms in place.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for any existing timeout-related issues or discussions
# Search for timeout-related error patterns in the codebase

# Check for existing timeout configurations
echo "Checking for existing timeout configurations..."
rg "DefaultReplicaCtrlrLossTimeoutSec|DefaultReplicaFastIOFailTimeoutSec" -A 2
# Look for timeout-related error patterns
echo "Checking for timeout-related error patterns..."
rg "timeout|connection.*lost|reconnect.*fail" --type go
# Check for related test cases
echo "Checking for related test cases..."
fd "test.*\.go$" | xargs rg "TestCtrlrLossTimeout|TestFastIOFailTimeout"
```
Length of output: 102268
Script:
```bash
#!/bin/bash
# Let's check the actual timeout values and their usage in the codebase

# Check the default timeout values
echo "Checking default timeout values..."
rg "DefaultReplicaCtrlrLossTimeoutSec.*=|DefaultReplicaFastIOFailTimeoutSec.*=" -A 1
# Check where these timeouts are used in the codebase
echo "Checking timeout usage..."
rg "ctrlrLossTimeout|fastIOFailTimeout" --type go -A 2
# Look for any related error handling code
echo "Checking error handling related to timeouts..."
rg "timeout.*error|connection.*lost|reconnect.*fail" --type go -A 2
```
Length of output: 78374
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
pkg/spdk/disk/nvme/nvme.go (1)
18-26: Consider making timeout values configurable

While the current implementation addresses the immediate issue, consider making these timeout values configurable through environment variables or configuration files. This would allow operators to tune these values based on their specific environment needs without requiring code changes.
Example approach:
- Define these as configuration parameters
- Provide environment variable overrides
- Keep current values as defaults
Would you like me to propose a detailed implementation for this enhancement?
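A minimal sketch of the environment-variable approach (the variable name and the helper below are hypothetical):

```go
// getTimeoutSec returns the named environment variable as an integer number
// of seconds, falling back to defaultSec when unset or invalid. The variable
// name used here is illustrative, not an existing Longhorn setting.
func getTimeoutSec(envVar string, defaultSec int) int {
	if val := os.Getenv(envVar); val != "" {
		if sec, err := strconv.Atoi(val); err == nil && sec > 0 {
			return sec
		}
	}
	return defaultSec
}

// Usage sketch:
// ctrlrLossTimeoutSec := getTimeoutSec("DISK_CTRLR_LOSS_TIMEOUT_SEC", diskCtrlrLossTimeoutSec)
```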
pkg/spdk/types.go (1)
34-42: Add documentation for the timeout constants

Consider adding documentation comments to explain:
- The purpose and impact of each timeout
- The relationships between these timeouts
- The rationale behind the chosen values
Example:
```diff
 const (
-	// Timeout for RAID base bdev (replica)
+	// Timeouts for RAID base bdev (replica) connection management:
+	// - Controller loss timeout: Maximum time to wait for controller reconnection
+	// - Reconnect delay: Time to wait between reconnection attempts
+	// - Fast I/O failure: Time to wait before failing I/O operations
+	// - Transport ACK: Maximum time to wait for transport layer acknowledgments
+	// - Keep alive: Interval for connection health checks
 	replicaCtrlrLossTimeoutSec  = 15
 	replicaReconnectDelaySec    = 2
 	replicaFastIOFailTimeoutSec = 10
```

pkg/spdk/engine.go (1)
52-53: Document timeout parameters and make them configurable

The new timeout parameters lack documentation explaining their purpose and impact. Additionally, the TODO comment suggests these should be user-configurable.
Consider:
- Adding field documentation explaining the purpose and impact of each timeout
- Making these values configurable through engine options or configuration
```diff
 type Engine struct {
+	// ctrlrLossTimeout defines how long SPDK will attempt to reconnect to the NVMe
+	// controller after a connection loss before removing the controller
 	ctrlrLossTimeout int
+	// fastIOFailTimeoutSec defines how long SPDK will wait for I/O completion before
+	// failing the I/O with error
 	fastIOFailTimeoutSec int
```

Also applies to: 100-102
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (4)
- `go.mod` is excluded by `!go.mod`
- `go.sum` is excluded by `!**/*.sum`, `!go.sum`
- `vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go` is excluded by `!vendor/**`
- `vendor/modules.txt` is excluded by `!vendor/**`
📒 Files selected for processing (6)
- `pkg/spdk/disk/nvme/nvme.go` (2 hunks)
- `pkg/spdk/engine.go` (7 hunks)
- `pkg/spdk/replica.go` (3 hunks)
- `pkg/spdk/server.go` (2 hunks)
- `pkg/spdk/types.go` (1 hunks)
- `pkg/spdk/util.go` (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/spdk/replica.go
- pkg/spdk/util.go
🧰 Additional context used
📓 Learnings (1)
pkg/spdk/server.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-spdk-engine#247
File: pkg/spdk/server.go:70-70
Timestamp: 2024-11-28T12:13:14.445Z
Learning: In the Longhorn SPDK engine, the `ctrlr_loss_timeout_sec` parameter is set in the `connectNVMfBdev` function (used in `bdev_nvme_attach_controller`), so it's not necessary to set it in `BdevNvmeSetOptions` within `pkg/spdk/server.go`.
🔇 Additional comments (7)
pkg/spdk/disk/nvme/nvme.go (1)
58-58: LGTM! Verify timeout behavior in integration tests
The implementation correctly uses the new timeout constants. However, given the critical nature of this change, we should verify the behavior in integration tests.
Let's verify if there are any existing integration tests that cover this scenario:
pkg/spdk/types.go (1)
34-42: LGTM! Well-structured timeout hierarchy
The new timeout constants form a well-thought-out hierarchy:
- Controller loss timeout (15s) is appropriately decreased from 30s to reduce the race condition window
- Fast I/O failure timeout (10s) < Controller loss timeout (15s) ensures I/O issues are detected first
- Transport ACK timeout (14s) < Controller loss timeout (15s) ensures transport issues are caught before declaring controller loss
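Restating the hierarchy as code for reference (values as quoted in this review; the comments are an interpretation, not the file's actual documentation):

```go
const (
	// Ordering: reconnect delay < fast I/O fail < transport ACK < controller loss,
	// so I/O and transport failures surface before the controller is declared lost.
	replicaReconnectDelaySec    = 2     // wait between reconnection attempts
	replicaFastIOFailTimeoutSec = 10    // fail outstanding I/O after 10s
	replicaTransportAckTimeout  = 14    // transport ACK timeout (14s per this review)
	replicaCtrlrLossTimeoutSec  = 15    // give up on the controller after 15s
	replicaKeepAliveTimeoutMs   = 10000 // keep-alive timeout in milliseconds
)
```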
pkg/spdk/server.go (3)
14-23: LGTM! Import changes are well-organized.

The new imports for the `jsonrpc` and `spdkrpc` packages are properly organized and necessary for the functionality.
75-75: LGTM! Error handling is well-implemented.
The error handling follows good practices:
- Uses `errors.Wrap` to preserve the error context
- Provides a clear and descriptive error message
70-75: Verify the necessity of setting timeout parameters in `BdevNvmeSetOptions`.

Based on previous learnings, the `ctrlr_loss_timeout_sec` parameter should be set in the `connectNVMfBdev` function, not in `BdevNvmeSetOptions`. Please verify whether setting these parameters here is necessary or potentially redundant:
- replicaCtrlrLossTimeoutSec
- replicaReconnectDelaySec
- replicaFastIOFailTimeoutSec
- replicaTransportAckTimeout
- replicaKeepAliveTimeoutMs
pkg/spdk/engine.go (2)
1090-1093: LGTM: Robust error handling for controller timeouts
The error handling for NVMe controller connections is robust and properly propagates errors while maintaining the engine state. The code correctly handles timeout-related failures and includes appropriate recovery mechanisms.
219-219
: Verify the impact of decreased timeout values
The timeout parameters are consistently applied across all NVMe controller connections. However, we should verify that the decreased timeout values don't cause premature disconnections in other scenarios.
Let's verify the usage of these timeout parameters:
Also applies to: 1090-1090, 1660-1660, 1966-1968
✅ Verification successful
Based on the shell script results, I can now generate the final response:
Timeout parameters are consistently applied across all NVMe connections
The codebase shows consistent usage of timeout parameters:
- All `connectNVMfBdev` calls properly pass through `ctrlrLossTimeout` and `fastIOFailTimeoutSec`
- The parameters are correctly propagated to the underlying `BdevNvmeAttachController` calls
- Different timeout values are appropriately used for different scenarios (replica vs disk)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check all usages of connectNVMfBdev and BdevNvmeAttachController
# to ensure consistent timeout parameter usage
echo "Checking connectNVMfBdev calls..."
rg "connectNVMfBdev\(" -A 2
echo "Checking BdevNvmeAttachController calls..."
rg "BdevNvmeAttachController\(" -A 4
```
Length of output: 6148
Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage.

Additional details and impacted files:

```diff
@@            Coverage Diff            @@
##             main     #247     +/-   ##
=========================================
- Coverage    8.39%    8.38%   -0.01%
=========================================
  Files          21       21
  Lines        7150     7155       +5
=========================================
  Hits          600      600
- Misses       6470     6475       +5
  Partials       80       80
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Ready for review. cc @shuo-wu @c3y1huang @DamiaSan @innobead
Longhorn 9874

Signed-off-by: Derek Su <[email protected]>
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (6)
pkg/spdk/restore.go (2)
Line range hint 189-204: Enhance error detection mechanism

The current implementation uses string matching for error detection, which is fragile and maintenance-heavy. Consider using error types instead of string matching:

```diff
+type RestoreCanceledError struct {
+	cause error
+}
+
+func (e *RestoreCanceledError) Error() string {
+	return fmt.Sprintf("restore canceled: %v", e.cause)
+}
+
 func (r *Restore) UpdateRestoreStatus(snapshotLvolName string, progress int, err error) {
 	// ...
-	if strings.Contains(err.Error(), btypes.ErrorMsgRestoreCancelled) {
+	if _, isCanceled := err.(*RestoreCanceledError); isCanceled {
 		r.log.WithError(err).Warn("Backup restoration is cancelled")
 		r.State = btypes.ProgressStateCanceled
 	} else {
```
Line range hint 234-236: Address TODO comment and implement Revert functionality

There are two issues here:
- The TODO comment contains a typo ("TODL" instead of "TODO")
- The `Revert` functionality is missing implementation, which could be crucial for system reliability

Would you like me to help create a GitHub issue to track the implementation of the `Revert` functionality? I can provide a detailed specification of the requirements and implementation approach.

pkg/spdk/disk.go (1)
Line range hint 1-379: Consider adding timeout parameters to disk management functions

Given that this file handles NVMe disk operations and the PR aims to address controller loss timeout issues, consider:
- Adding `ctrlr_loss_timeout_sec` as a parameter to relevant disk management functions, particularly in `addBlockDevice` and `svcDiskCreate` where NVMe devices are initialized.
- Propagating the timeout configuration through the disk creation chain.
This would provide more granular control over NVMe controller timeout behavior at the disk level.
Would you like me to propose specific code changes to incorporate timeout parameters in these functions?
pkg/spdk/backup.go (1)
170-173: Consider adding defer for cleanup in error cases

While the device opening logic is correct, consider adding a deferred cleanup in case of subsequent errors in the function. This ensures the device file handle is properly closed even if later operations fail.

```diff
 devFh, err := os.OpenFile(b.initiator.Endpoint, os.O_RDONLY, 0666)
 if err != nil {
 	return errors.Wrapf(err, "failed to open NVMe device %v for snapshot lvol bdev %v", b.initiator.Endpoint, lvolName)
 }
+// Ensure device is closed if any subsequent operation fails
+success := false
+defer func() {
+	if !success {
+		devFh.Close()
+	}
+}()
 b.devFh = devFh
+success = true
```

pkg/spdk/engine.go (2)
100-102: Consider adding documentation for timeout values

While the initialization is correct, it would be beneficial to add documentation explaining:
- The purpose and impact of these timeout values
- The rationale behind the default values
- Future plans for user configuration as indicated by the TODO
52-54: Consider adding monitoring for timeout events

While the timeout configuration changes look good, consider:
- Adding metrics/logging to track controller loss events
- Monitoring the frequency and duration of reconnection attempts
- Setting up alerts for repeated timeout occurrences
This would help validate the effectiveness of the timeout changes and provide early warning of potential issues.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (4)
- `go.mod` is excluded by `!go.mod`
- `go.sum` is excluded by `!**/*.sum`, `!go.sum`
- `vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go` is excluded by `!vendor/**`
- `vendor/modules.txt` is excluded by `!vendor/**`
is excluded by!vendor/**
📒 Files selected for processing (11)
- `pkg/client/types.go` (1 hunks)
- `pkg/spdk/backup.go` (3 hunks)
- `pkg/spdk/disk.go` (1 hunks)
- `pkg/spdk/disk/nvme/nvme.go` (2 hunks)
- `pkg/spdk/engine.go` (11 hunks)
- `pkg/spdk/replica.go` (3 hunks)
- `pkg/spdk/restore.go` (2 hunks)
- `pkg/spdk/server.go` (2 hunks)
- `pkg/spdk/types.go` (2 hunks)
- `pkg/spdk/util.go` (3 hunks)
- `pkg/util/block.go` (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- pkg/client/types.go
- pkg/util/block.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/spdk/disk/nvme/nvme.go
- pkg/spdk/replica.go
- pkg/spdk/types.go
- pkg/spdk/util.go
🧰 Additional context used
📓 Learnings (1)
pkg/spdk/server.go (1)
Learnt from: derekbit
PR: longhorn/longhorn-spdk-engine#247
File: pkg/spdk/server.go:70-70
Timestamp: 2024-11-28T12:13:14.445Z
Learning: In the Longhorn SPDK engine, the `ctrlr_loss_timeout_sec` parameter is set in the `connectNVMfBdev` function (used in `bdev_nvme_attach_controller`), so it's not necessary to set it in `BdevNvmeSetOptions` within `pkg/spdk/server.go`.
🔇 Additional comments (11)
pkg/spdk/restore.go (2)
13-14: LGTM: Import addition is appropriate
The addition of the nvme package import is necessary for the new NVMe initiator functionality and follows the project's import organization.
Line range hint 1-236: Verify alignment with PR objectives

The changes in this file appear to be tangential to the main PR objective of decreasing `ctrlr_loss_timeout_sec`. Please clarify if these changes are necessary for implementing the timeout modifications.
✅ Verification successful
Let me gather more specific information about the changes and their relation to the timeout configuration.
Based on the search results, I can now generate the final response:
Changes in restore.go are necessary for proper NVMe controller handling
The changes in `restore.go` are actually relevant to the PR's objective of decreasing `ctrlr_loss_timeout_sec`. This file implements the NVMe initiator connection logic that directly interacts with the timeout settings. Specifically:

- It handles NVMe controller initialization and connection management through the `NewInitiator` and `Start` functions
- The file is responsible for exposing and managing NVMe devices that are subject to the timeout configurations
- The timeout settings defined in `vendor/.../types/nvme.go` (`CtrlrLossTimeoutSec`, `FastIOFailTimeoutSec`) are applied to these NVMe connections
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for timeout-related configurations in the codebase
rg -l "ctrlr_loss_timeout|timeout_sec" --type go
# Search for NVMe controller configuration patterns
ast-grep --pattern 'func $_($$) {
  $$$
  timeout $$ = $_
  $$$
}'
```
Length of output: 316
Script:
```bash
#!/bin/bash
# Check the actual content of files containing timeout configurations
rg -A 5 -B 5 "ctrlr_loss_timeout|timeout_sec" --type go
# Look for NVMe controller and timeout related functions in the restore.go
rg -A 5 -B 5 "NewInitiator|Start|exposeSnapshotLvolBdev" pkg/spdk/restore.go
# Check if there are any related configuration settings in the types
cat vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/types/nvmf.go
```
Length of output: 11790
pkg/spdk/disk.go (1)
9-22: Verify alignment with PR objectives

While the import reorganization is clean, this file doesn't contain any changes related to the PR's primary objective of decreasing `ctrlr_loss_timeout_sec`. The changes appear to be purely organizational and don't address the timeout adjustment needed to fix the volume ERROR state issue described in longhorn/longhorn#9874.
Let's verify if the timeout-related changes exist in other files:
pkg/spdk/backup.go (3)
15-16: LGTM: Clean import addition
The NVMe package import is properly organized and necessary for the NVMe functionality.
224-226: LGTM: Improved error messages
The error messages now include the device endpoint, which provides better context for debugging NVMe device issues.
170-173: Verify NVMe controller timeout configuration

While the NVMe device handling is improved, the PR objective mentions decreasing `ctrlr_loss_timeout_sec`. Ensure this timeout is properly configured when establishing the NVMe connection.
✅ Verification successful
NVMe controller timeout is properly configured
The timeout configuration is correctly implemented in the codebase:
- The default controller loss timeout is set to 15 seconds in the vendor library
- The timeout is properly passed through the `BdevNvmeAttachController` calls
- The value is configurable via the `ctrlrLossTimeout` field in the engine configuration
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Search for ctrlr_loss_timeout_sec configuration in the codebase
# Look for timeout configuration in SPDK-related files
rg -l "ctrlr_loss_timeout|ConnectNVMf" --type go
# Look for any SPDK configuration files
fd -e json -e conf -e yaml --exec rg "timeout|ctrlr_loss"
```
Length of output: 242
Script:
```bash
#!/bin/bash
# Let's examine the content of the files found and their context
rg -A 10 "ctrlr_loss_timeout|ConnectNVMf" vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/types/nvme.go vendor/github.com/longhorn/go-spdk-helper/pkg/spdk/client/basic.go
# Also search for NVMe configuration related code
ast-grep --pattern 'func $_ ($$$) {
  $$$
  NVMeTransport$$$
  $$$
}'
```
Length of output: 3669
Script:
```bash
#!/bin/bash
# Let's look for the actual configuration of these timeout values in the codebase
rg -A 5 "bdev_nvme_attach_controller" .
# Also search for any struct initialization with these timeout values
ast-grep --pattern '{
  $$$
  ctrlr_loss_timeout_sec: $_,
  $$$
}'
# And check for any RPC calls that might set these values
rg -A 5 "BdevNvmeAttachController" .
```
Length of output: 5004
Script:
```bash
#!/bin/bash
# Let's check the actual values being used for the timeout
rg -B 5 "ctrlrLossTimeout" pkg/spdk/engine.go
# And check if there are any configuration changes in the PR
rg "ctrlrLossTimeout.*=" .
# Also check for any default values
rg "DefaultCtrlrLossTimeout" .
```
Length of output: 1776
pkg/spdk/server.go (2)
14-23: LGTM! Import organization looks good.
The import statements are well-organized, following Go's standard grouping convention: standard library, external packages, and internal packages.
70-75: LGTM! NVMe options configuration looks good.

The timeout parameters are properly configured for the SPDK NVMe operations. Based on the learnings, leaving `ctrlr_loss_timeout_sec` out here is correct, as it's handled in the `connectNVMfBdev` function.

Let's verify the timeout values are defined correctly:
✅ Verification successful
Let me verify the usage of these timeout values with one more check to ensure they are used consistently.
NVMe timeout configuration values are properly set and used
The verification confirms that:
- The timeout constants are well-defined in `pkg/spdk/types.go` with appropriate values:
  - `replicaCtrlrLossTimeoutSec = 15`
  - `replicaReconnectDelaySec = 2`
  - `replicaFastIOFailTimeoutSec = 10`
  - `replicaTransportAckTimeout = 14`
  - `replicaKeepAliveTimeoutMs = 10000`
- These values are correctly passed to `BdevNvmeSetOptions`, which is documented to set global parameters for all NVMe bdevs
- The function properly maps these values to the corresponding JSON fields in the SPDK RPC request
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check the timeout constants definition
# Expected: Find the timeout constants and their values
# Search for the timeout constants
rg -A 1 'replica.*Timeout.*='
```
Length of output: 497
Script:
```bash
#!/bin/bash
# Search for the BdevNvmeSetOptions function implementation and its usage
ast-grep --pattern 'BdevNvmeSetOptions($$$)'
# Also check if these timeout values are referenced elsewhere
rg "BdevNvmeSetOptions|ctrlr_loss_timeout_sec" -B 2 -A 2
```
Length of output: 3170
pkg/spdk/engine.go (3)
52-53: LGTM: Well-structured timeout configuration fields

The addition of the `ctrlrLossTimeout` and `fastIOFailTimeoutSec` fields to the Engine struct provides a clean way to configure NVMe controller timeouts.
219-219
: LGTM: Consistent timeout handling in NVMe connections
The changes properly integrate the new timeout parameters across all NVMe connection points, ensuring consistent behavior for controller loss scenarios.
Also applies to: 1090-1090, 1660-1660, 1966-1968
685-685: LGTM: Improved error handling and validation
The enhanced error messages and validation logic provide better diagnostics for NVMe-related issues, making it easier to identify and debug problems with replica connections.
Also applies to: 934-934, 939-939, 943-943, 947-947
Longhorn 9874

Signed-off-by: Derek Su <[email protected]>
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
pkg/spdk/disk/nvme/nvme.go (1)
19-25: Consider making timeouts configurable via environment variables

These timeout values are critical for system stability and might need adjustment in different environments. Consider making them configurable through environment variables, with these constants serving as defaults.
Example approach:
```diff
 const (
 	// Timeouts for disk bdev
-	diskCtrlrLossTimeoutSec  = 30
-	diskReconnectDelaySec    = 2
-	diskFastIOFailTimeoutSec = 15
+	defaultDiskCtrlrLossTimeoutSec  = 10
+	defaultDiskReconnectDelaySec    = 2
+	defaultDiskFastIOFailTimeoutSec = 15
 )
+
+func getTimeoutConfig() (ctrlrLoss, reconnectDelay, fastIOFail int) {
+	if val := os.Getenv("DISK_CTRLR_LOSS_TIMEOUT"); val != "" {
+		if timeout, err := strconv.Atoi(val); err == nil {
+			ctrlrLoss = timeout
+		}
+	}
+	// Similar for other timeouts...
+	return defaultDiskCtrlrLossTimeoutSec, defaultDiskReconnectDelaySec, defaultDiskFastIOFailTimeoutSec
+}
```

pkg/spdk/restore.go (1)
Line range hint 134-143: Add controller loss timeout configuration

According to the PR objectives, `ctrlr_loss_timeout_sec` needs to be decreased to prevent v2 volumes from entering ERROR state. However, the timeout is not configured when creating the NVMe initiator.

Consider adding timeout configuration when creating the initiator:

```diff
 initiator, err := nvme.NewInitiator(lvolName, helpertypes.GetNQN(lvolName), nvme.HostProc)
 if err != nil {
 	return nil, "", errors.Wrapf(err, "failed to create NVMe initiator for lvol bdev %v", lvolName)
 }
+// Configure shorter timeout to prevent ERROR state during instance manager deletion
+if err := initiator.SetControllerLossTimeout(15); err != nil {
+	return nil, "", errors.Wrapf(err, "failed to set controller loss timeout for lvol bdev %v", lvolName)
+}
 if _, err := initiator.Start(r.ip, strconv.Itoa(int(r.port)), true); err != nil {
```

pkg/spdk/engine.go (2)
52-53: LGTM! Consider adding field documentation.

The new timeout fields are well-placed in the Engine struct. Consider adding godoc comments to document their purpose and units.
```diff
 type Engine struct {
+	// ctrlrLossTimeout is the number of seconds to wait before marking
+	// an NVMe controller as lost/dead. Default: 30 seconds
 	ctrlrLossTimeout int
+	// fastIOFailTimeoutSec is the number of seconds to wait before failing
+	// I/O operations when an NVMe controller is unresponsive
 	fastIOFailTimeoutSec int
```
100-102: Consider implementing user-defined timeout values.

The TODO comment indicates that user-defined timeout values should be supported. Consider:
- Adding timeout parameters to the engine configuration
- Implementing validation for user-provided values
- Falling back to defaults if not specified
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (11)
- `pkg/client/types.go` (1 hunks)
- `pkg/spdk/backup.go` (3 hunks)
- `pkg/spdk/disk.go` (1 hunks)
- `pkg/spdk/disk/nvme/nvme.go` (2 hunks)
- `pkg/spdk/engine.go` (11 hunks)
- `pkg/spdk/replica.go` (3 hunks)
- `pkg/spdk/restore.go` (2 hunks)
- `pkg/spdk/server.go` (2 hunks)
- `pkg/spdk/types.go` (2 hunks)
- `pkg/spdk/util.go` (3 hunks)
- `pkg/util/block.go` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- pkg/client/types.go
- pkg/spdk/backup.go
- pkg/spdk/disk.go
- pkg/spdk/replica.go
- pkg/spdk/server.go
- pkg/spdk/types.go
- pkg/spdk/util.go
- pkg/util/block.go
🔇 Additional comments (9)
pkg/spdk/disk/nvme/nvme.go (2)
20-20: Decrease `diskCtrlrLossTimeoutSec` to prevent race conditions
The 30-second timeout is too long and may cause v2 volumes to enter ERROR state. This needs to be decreased to minimize the window for race conditions where the initiator incorrectly concludes that the target has failed.
```diff
- diskCtrlrLossTimeoutSec = 30
+ diskCtrlrLossTimeoutSec = 10 // Reduced to minimize race condition window
```
58-58: LGTM! Verify timeout values in production
The implementation correctly uses the new timeout constants. However, given the critical nature of these timeouts in production environments, we should verify these values work well under various conditions.
Let's verify the usage of these timeout values across the codebase:
✅ Verification successful
Timeout configurations are consistent and well-documented
The codebase shows consistent and well-documented timeout configurations:
- Disk timeouts (`pkg/spdk/disk/nvme/nvme.go`):
  - Controller loss: 30s
  - Reconnect delay: 2s
  - Fast I/O fail: 15s
- Replica timeouts (`pkg/spdk/types.go`):
  - Controller loss: 15s
  - Reconnect delay: 2s
  - Fast I/O fail: 10s

The implementation is correct and the timeout values are properly documented with clear rationale, particularly for the replica timeouts, where it's noted that the ctrl-loss-tmo for the NVMe-oF initiator connecting to the RAID target is set to 30 seconds and the replica timeouts are intentionally set lower to ensure a proper error handling cascade.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for any other timeout configurations that might need adjustment
rg -g '*.go' "CtrlrLossTimeout|ReconnectDelay|FastIOFailTimeout" -B 2 -A 2
# Look for any related error handling or timeout scenarios
ast-grep --pattern 'if err != nil {
  $$$
  return $_, errors.Wrap($_, "timeout")
  $$$
}'
```
Length of output: 12260
pkg/spdk/restore.go (4)
13-14: LGTM: Import addition is appropriate
The addition of the nvme package import is necessary for the NVMe initiator functionality and is properly organized.
Line range hint 191-218: LGTM: Proper state management
The status update logic correctly handles different scenarios:
- Distinguishes between cancellation and errors
- Maintains proper state transitions
- Updates LastRestored field appropriately
160-162: 🛠️ Refactor suggestion: Improve error handling in NVMe device closure
The error handling for closing the NVMe device should ensure proper cleanup even if an error occurs.
Consider implementing the previously suggested improvement:
```diff
 func (r *Restore) CloseVolumeDev(volDev *os.File) error {
+	var closeErr error
 	r.log.Infof("Closing NVMe device %v", r.initiator.Endpoint)
-	if err := volDev.Close(); err != nil {
-		return errors.Wrapf(err, "failed to close NVMe device %v", r.initiator.Endpoint)
+	if err := volDev.Close(); err != nil {
+		closeErr = errors.Wrapf(err, "failed to close NVMe device %v", r.initiator.Endpoint)
 	}
+	// Continue with cleanup even if close failed
```
Line range hint 134-143: Verify timeout configuration consistency
Let's ensure the timeout configuration is consistently applied across the codebase.
pkg/spdk/engine.go (3)
934-947: LGTM! Robust error handling for NVMe operations.
The error handling for NVMe operations has been improved with:
- Detailed validation of NVMe controller information
- Clear error messages with context
- Proper validation of address family and transport type
Line range hint 1258-1262: LGTM! Well-documented error handling strategy.
The error handling for replica operations is well-implemented:
- Clear comments explaining the rationale for error handling decisions
- Proper distinction between fatal and non-fatal errors
- Appropriate logging of errors while allowing the flow to continue
219-219: Verify consistent timeout parameter usage across all NVMe operations.

The timeout parameters are consistently used in all NVMe connection operations. This change effectively implements the decreased `ctrlr_loss_timeout_sec` requirement across:
- Initial replica connection
- Rebuilding replica connection
- Snapshot revert operations
- Backup restore operations
Let's verify that all NVMe connection points use the new timeout parameters:
Also applies to: 1090-1090, 1660-1660, 1966-1968
✅ Verification successful
Based on the search results, I can generate the final response as I have sufficient information to verify the timeout parameter usage:
Timeout parameters are consistently used across all NVMe operations
All NVMe connection points in the codebase correctly use the timeout parameters:

- `connectNVMfBdev` calls consistently pass `ctrlrLossTimeout` and `fastIOFailTimeoutSec` in:
  - Initial replica connection (engine.go:219)
  - Rebuilding operations (replica.go)
  - External snapshot operations (replica.go)
- `BdevNvmeAttachController` calls properly use the timeout parameters in:
  - NVMe device attachment (nvme.go)
  - Replica controller attachment (engine.go:1966)
  - NVMf connection utility (util.go)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify all NVMe connection operations use the timeout parameters
# Test: Search for NVMe connection function calls. Expect: All calls include timeout parameters
# Check for connectNVMfBdev calls
echo "Checking connectNVMfBdev calls..."
rg "connectNVMfBdev\(" -A 3
# Check for BdevNvmeAttachController calls
echo "Checking BdevNvmeAttachController calls..."
rg "BdevNvmeAttachController\(" -A 5
```
Length of output: 7256
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9874
Signed-off-by: Derek Su [email protected]
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context