[BUGFIX] add Health Check for Range over gRPC Connection Loop #2798

kpango · 2025-01-09T09:40:13Z

Description

Related Issue

Versions

Vald Version: v1.7.15
Go Version: v1.23.4
Rust Version: v1.83.0
Docker Version: v27.4.0
Kubernetes Version: v1.32.0
Helm Version: v3.16.3
NGT Version: v2.3.5
Faiss Version: v1.9.0

Checklist

I have read the CONTRIBUTING document and completed our CLA agreement.
I have checked open Pull Requests for the similar feature or fixes?

Special notes for your reviewer

Summary by CodeRabbit

New Features
- Enhanced context management across gRPC client and connection handling methods.
- Improved logging for error tracking in search request broadcasting.
Bug Fixes
- Refined connection management with context-aware operations.
- Updated method signatures to support context-based timeout and cancellation.
Refactor
- Integrated context parameters in multiple service methods.
- Added more robust error handling and logging mechanisms.

coderabbitai · 2025-01-09T09:40:21Z

Caution

Review failed

The head commit changed during the review from 7139c17 to 5b4d93b.

📝 Walkthrough

Walkthrough

This pull request introduces comprehensive context-aware modifications across multiple gRPC-related files. The primary focus is on enhancing method signatures to include context.Context parameters, particularly in connection management and client interaction methods. Changes span files in internal/net/grpc, pkg/gateway/mirror, and test mock implementations, with the goal of improving context handling, error logging, and connection management. The modifications enable more robust error handling, cancellation support, and provide better control over network operations.

Changes

File	Change Summary
`internal/net/grpc/client.go`	- Added `ctx context.Context` to `ConnectedAddrs` method - Updated `rangeConns` to accept context parameter - Modified multiple methods to propagate context
`internal/net/grpc/pool/pool.go`	- Updated `Disconnect` method to require context - Modified `Get` method signature to accept context - Added `group` field of type `singleflight.Group[Conn]`
`internal/test/mock/grpc/grpc_client_mock.go`	- Updated `ConnectedAddrs` method signature to include context
`internal/test/mock/grpc_testify_mock.go`	- Modified `ConnectedAddrs` method to accept context
`pkg/gateway/lb/handler/grpc/aggregation.go`	- Added debug logging for error handling
`pkg/gateway/mirror/service/mirror.go`	- Updated `IsConnected` and `MirrorTargets` methods to accept context

Sequence Diagram

sequenceDiagram
    participant Client
    participant gRPCClient
    participant ConnectionPool
    
    Client->>gRPCClient: ConnectedAddrs(ctx)
    gRPCClient->>ConnectionPool: rangeConns(ctx)
    ConnectionPool-->>gRPCClient: Return connected addresses
    gRPCClient-->>Client: Return addresses

Possibly related PRs

Add option to disable dns resolve #2634: This PR introduces a new method to the gRPCClient struct, which is related to the handling of DNS resolution, similar to the context-aware modifications made in the main PR.
Fix gRPC error msg handling for lb-gateway handler #2663: This PR enhances error handling in the lb-gateway handler, which aligns with the overall improvements in error handling and context management in the main PR.
Refactor internal/net/grpc/client.go #2674: This PR refactors the client.go file, which is directly related to the changes made in the main PR regarding the gRPCClient methods.

Suggested labels

size/XL, actions/e2e-deploy

Suggested reviewers

hlts2
vankichi

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

cloudflare-workers-and-pages · 2025-01-09T09:40:23Z

Deploying vald with Cloudflare Pages

Latest commit:	`da515e3`
Status:	⚡️ Build in progress...

View logs

vdaas-ci · 2025-01-09T09:40:31Z

[CHATOPS:HELP] ChatOps commands.

🙆‍♀️ /approve - approve
🍱 /format - format codes and add licenses
✅ /gen-test - generate test codes
🏷️ /label - add labels
🔚 2️⃣ 🔚 /label actions/e2e-deploy - run E2E deploy & integration test

coderabbitai · 2025-01-09T09:50:48Z

Caution

No docstrings were generated.

codecov · 2025-01-09T10:00:26Z

Codecov Report

Attention: Patch coverage is 0% with 98 lines in your changes missing coverage. Please review.

Project coverage is 23.91%. Comparing base (6065fd9) to head (288aec0).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/net/grpc/pool/pool.go	0.00%	59 Missing ⚠️
internal/net/grpc/client.go	0.00%	32 Missing ⚠️
pkg/gateway/lb/handler/grpc/aggregation.go	0.00%	4 Missing ⚠️
internal/test/mock/grpc/grpc_client_mock.go	0.00%	1 Missing ⚠️
internal/test/mock/grpc_testify_mock.go	0.00%	1 Missing ⚠️
pkg/gateway/mirror/service/mirror.go	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2798      +/-   ##
==========================================
- Coverage   23.93%   23.91%   -0.03%     
==========================================
  Files         546      546              
  Lines       54555    54578      +23     
==========================================
- Hits        13058    13051       -7     
- Misses      40712    40742      +30     
  Partials      785      785

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

internal/net/grpc/client.go (2)
1117-1130: Enhance error handling and logging in rangeConns.

The connection management could be improved in several ways:

The error messages could be more descriptive

The health check logging could be structured

The connection state transitions should be logged at debug level

Consider these improvements:
 func (g *gRPCClient) rangeConns(ctx context.Context, fn func(addr string, p pool.Conn) bool) error {
     var cnt int
     g.conns.Range(func(addr string, p pool.Conn) bool {
         if p == nil || !p.IsHealthy(ctx) {
             pc, err := p.Connect(ctx)
             if pc == nil || err != nil || !pc.IsHealthy(ctx) {
                 if pc != nil {
                     pc.Disconnect()
                 }
-                log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
+                log.WithFields(log.Fields{
+                    "addr": addr,
+                    "error": err,
+                    "connection": p.String(),
+                }).Debug("Unhealthy connection detected during gRPC connection range")
                 return true
             }
             p = pc
+            log.WithFields(log.Fields{
+                "addr": addr,
+            }).Debug("Successfully reconnected during gRPC connection range")
         }
         cnt++
         return fn(addr, p)
     })
     if cnt == 0 {
-        return errors.ErrGRPCClientConnNotFound("*")
+        return errors.ErrGRPCClientConnNotFound("no healthy connections available")
     }
     return nil
 }
🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1117-1117: internal/net/grpc/client.go#L1117
Added line #L1117 was not covered by tests

[warning] 1120-1127: internal/net/grpc/client.go#L1120-L1127
Added lines #L1120 - L1127 were not covered by tests

[warning] 1129-1129: internal/net/grpc/client.go#L1129
Added line #L1129 was not covered by tests

Line range hint 1-1130: Consider architectural improvements for connection management.

While the health check implementation improves reliability, consider these architectural enhancements:

Implement a connection state machine to manage transitions more reliably

Add metrics for connection health and reconnection attempts

Consider implementing circuit breakers at the connection pool level

These changes would improve:

Reliability: Clear connection state transitions

Observability: Better monitoring of connection health

Resilience: Faster failure detection and recovery

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1117-1117: internal/net/grpc/client.go#L1117
Added line #L1117 was not covered by tests

[warning] 1120-1127: internal/net/grpc/client.go#L1120-L1127
Added lines #L1120 - L1127 were not covered by tests

[warning] 1129-1129: internal/net/grpc/client.go#L1129
Added line #L1129 was not covered by tests

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6065fd9 and d9e319e.

📒 Files selected for processing (1)

internal/net/grpc/client.go (9 hunks)

🧰 Additional context used

🪛 GitHub Check: codecov/patch

internal/net/grpc/client.go

[warning] 252-252: internal/net/grpc/client.go#L252
Added line #L252 was not covered by tests

[warning] 289-289: internal/net/grpc/client.go#L289
Added line #L289 was not covered by tests

[warning] 418-418: internal/net/grpc/client.go#L418
Added line #L418 was not covered by tests

[warning] 481-481: internal/net/grpc/client.go#L481
Added line #L481 was not covered by tests

[warning] 568-568: internal/net/grpc/client.go#L568
Added line #L568 was not covered by tests

[warning] 637-637: internal/net/grpc/client.go#L637
Added line #L637 was not covered by tests

[warning] 704-704: internal/net/grpc/client.go#L704
Added line #L704 was not covered by tests

[warning] 1090-1092: internal/net/grpc/client.go#L1090-L1092
Added lines #L1090 - L1092 were not covered by tests

[warning] 1117-1117: internal/net/grpc/client.go#L1117
Added line #L1117 was not covered by tests

[warning] 1120-1127: internal/net/grpc/client.go#L1120-L1127
Added lines #L1120 - L1127 were not covered by tests

[warning] 1129-1129: internal/net/grpc/client.go#L1129
Added line #L1129 was not covered by tests

⏰ Context from checks skipped due to timeout of 90000ms (10)

GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build
GitHub Check: runner / go build

🔇 Additional comments (3)

internal/net/grpc/client.go (3)

Line range hint 252-289: Add test coverage for connection monitoring.

The connection monitoring logic for both pool rebalancing and health checks lacks test coverage. This is critical functionality that should be thoroughly tested.

Would you like me to help generate comprehensive test cases for:

Pool rebalancing with various connection states

Health check reconnection scenarios

Error handling paths

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 252-252: internal/net/grpc/client.go#L252
Added line #L252 was not covered by tests

1090-1092: Add test coverage for ConnectedAddrs.

The ConnectedAddrs method lacks test coverage. This is important functionality for connection management and monitoring.

Would you like me to help generate test cases covering:

Empty connection pool

Mixed healthy/unhealthy connections

Connection state transitions during enumeration

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1090-1092: internal/net/grpc/client.go#L1090-L1092
Added lines #L1090 - L1092 were not covered by tests

Line range hint 418-704: Add test coverage and consider potential race conditions.

The range operations lack test coverage. Additionally, there's a potential race condition in connection health checks where IsHealthy is called before using the connection, but the connection state could change between these calls.

Consider using a connection state lock or implementing a transactional health check. Example approach:
- if p != nil && p.IsHealthy(ctx) {
+ if p != nil {
+   if err := p.CheckHealth(ctx); err == nil {
      // use connection
+   }
+ }
Would you like me to help generate test cases covering:

Concurrent range operations

Connection state transitions

Error handling scenarios

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 481-481: internal/net/grpc/client.go#L481
Added line #L481 was not covered by tests

coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (2)

internal/net/grpc/client.go (2)
Test coverage needed for OrderedRange operations

The review comment is valid. The test file shows test templates exist but are not implemented for OrderedRange operations. Critical paths including health checks and error handling need test coverage.

Test_gRPCClient_OrderedRange: Template exists but needs implementation

Test_gRPCClient_OrderedRangeConcurrent: Template exists but needs implementation

🔗 Analysis chain

Line range hint 568-637: Add test coverage for OrderedRange operations.

The OrderedRange and OrderedRangeConcurrent implementations include proper health checks and error handling. However, these critical paths lack test coverage.

Run the following script to check existing test patterns:

Would you like me to help generate test cases for the OrderedRange operations?
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check existing test patterns for OrderedRange operations
# Look for similar test patterns that we can follow

rg -A 5 "TestOrderedRange|TestOrderedRangeConcurrent" --type go
Length of output: 65

Script:
#!/bin/bash
# Find test files in the same package directory and check their content
fd "client_test.go" internal/net/grpc/ --type f --exec cat {}

# Search for any test cases related to gRPC client operations
rg -B 2 -A 5 "func Test.*Client" internal/net/grpc/ --type go
Length of output: 65663
🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 568-568: internal/net/grpc/client.go#L568
Added line #L568 was not covered by tests

Test coverage is indeed missing for Range operations

The review comment is correct. While test stubs exist in internal/net/grpc/client_test.go for both Range and RangeConcurrent operations, they are commented out and contain only type definitions without actual test cases or assertions. These stubs can serve as templates for implementing the required test coverage.

Test_gRPCClient_Range: Only contains type definitions, no test implementation

Test_gRPCClient_RangeConcurrent: Only contains type definitions, no test implementation

The existing stubs provide a good foundation for implementing comprehensive test coverage, as they already define the necessary parameter and field types.

🔗 Analysis chain

Line range hint 418-481: Add test coverage for Range operations.

The Range and RangeConcurrent implementations properly handle context propagation and include comprehensive tracing. However, these critical paths lack test coverage.

Run the following script to check existing test patterns:

Would you like me to help generate test cases for the Range operations?
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check existing test patterns for Range operations
# Look for similar test patterns that we can follow

rg -A 5 "TestRange|TestRangeConcurrent" --type go
Length of output: 51

Script:
#!/bin/bash
# Search for test files in the grpc package directory
fd ".*_test.go" internal/net/grpc/

# Look for any client-related test functions
rg "func Test.*Client" internal/net/grpc/ --type go -A 5

# Check if there are any existing tests that call Range or RangeConcurrent
rg "Range\(|RangeConcurrent\(" internal/net/grpc/ --type go
Length of output: 11893

Script:
#!/bin/bash
# Check the actual content of the Range-related tests
cat internal/net/grpc/client_test.go | grep -A 30 "func Test_gRPCClient_Range" 
cat internal/net/grpc/client_test.go | grep -A 30 "func Test_gRPCClient_RangeConcurrent"
Length of output: 3960
🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 418-418: internal/net/grpc/client.go#L418
Added line #L418 was not covered by tests

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6065fd9 and d9e319e.

📒 Files selected for processing (1)

internal/net/grpc/client.go (9 hunks)