
[BUGFIX] add Health Check for Range over gRPC Connection Loop #2828

Open. Wants to merge 5 commits into base main from branch bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop.
Conversation

kpango
Collaborator

@kpango kpango commented Feb 3, 2025

Description

[WIP]

Related Issue

Versions

  • Vald Version: v1.7.16
  • Go Version: v1.23.5
  • Rust Version: v1.83.0
  • Docker Version: v27.4.0
  • Kubernetes Version: v1.32.0
  • Helm Version: v3.16.3
  • NGT Version: v2.3.5
  • Faiss Version: v1.9.0

Checklist

Special notes for your reviewer

Summary by CodeRabbit

  • Documentation

    • API documentation has been extensively enhanced with expanded overviews, structured status code tables, and detailed troubleshooting guides for key endpoints.
    • Typographical errors have been corrected for improved clarity.
  • Chores

    • Multiple dependency and tooling versions have been updated (e.g., container orchestration, messaging, and utility tools) to the latest releases.
    • Build and configuration processes have been refined to bolster overall system stability.

Contributor

coderabbitai bot commented Feb 3, 2025

📝 Walkthrough

Walkthrough

This pull request introduces extensive changes across the repository. The updates include modifications to Makefiles (adding new variables and updating command invocations), refactoring various Go packages to pass context parameters in gRPC client, pool, and test methods, and improvements to error logging and handling. API documentation in Swagger JSON and additional docs has been greatly enhanced with expanded summaries, structured status code tables, and troubleshooting information. Additionally, numerous dependency and version updates have been applied—including changes to Kubernetes deployment annotations, GitHub action versions, and Rust toolchain updates—along with minor typographical corrections in API and documentation files.

Changes

File(s) Change Summary
Makefile.d/k3d.mk, Makefile.d/functions.mk Added new variable K3D_NETWORK, updated k3d installation command (setting K3D_INSTALL_DIR), duplicated chmod command placement, and revised the method for fetching GitHub actions versions.
internal/backoff/backoff.go Replaced manual map copy with maps.Clone for simpler code in the Metrics function.
internal/net/grpc/client.go
internal/net/grpc/client_test.go
Updated gRPC client methods (e.g. ConnectedAddrs, rangeConns) to accept a context.Context parameter; refined error handling and updated test calls accordingly.
internal/net/grpc/pool/pool.go
internal/net/grpc/pool/pool_test.go
internal/observability/metrics/grpc/grpc.go
Modified pool methods to be context‑aware (renamed Reconnect to singleTargetConnect and added a new Metrics function), enhanced logging and error handling, and introduced new observable metrics for pool connection health.
internal/test/mock/grpc/grpc_client_mock.go
internal/test/mock/grpc_testify_mock.go
Updated mock methods, such as ConnectedAddrs, to include a context parameter.
pkg/gateway/lb/handler/grpc/aggregation.go
pkg/gateway/mirror/service/mirror.go
Added additional error logging in aggregation search; adjusted the mirror service’s Start method to pass context into gRPC client calls.
apis/swagger/v1/* (e.g. mirror, vald/filter, flush, index, insert, object, remove, search, update, upsert) Expanded and refined API documentation: enhanced endpoint summaries with structured status code tables, troubleshooting sections, and corrected typographical errors to improve clarity.
.gitfiles
.github/ISSUE_TEMPLATE/*
.github/PULL_REQUEST_TEMPLATE.md
Added new workflow and documentation files (including Rust unit test workflows), and updated environment version numbers in the GitHub issue and pull request templates.
go.mod, example/client/go.mod, versions/* Updated dependency versions and module replacements across the Go modules and version files to newer releases.
k8s/*/deployment.yaml Updated ConfigMap checksum annotations in multiple Kubernetes deployment YAML files to reflect configuration changes.
rust/rust-toolchain Updated the Rust toolchain version from 1.83.0 to 1.84.1.
apis/docs/v1/insert.md
apis/proto/v1/vald/insert.proto
docs/api/insert.md
Corrected typographical errors in the documentation for the Insert RPC method.
versions/actions/* Incremented version numbers for various GitHub actions used in the project.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant gRPCClient
    participant ConnectionPool

    Client->>+gRPCClient: ConnectedAddrs(ctx)
    gRPCClient->>+ConnectionPool: rangeConns(ctx, force, callback)
    ConnectionPool-->>-gRPCClient: Connection health data
    gRPCClient-->>-Client: List of active addresses

Suggested labels

size/XL, actions/e2e-deploy, actions/backport/release/v1.7

Suggested reviewers

  • vankichi
  • kmrmt

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56a7e41 and 9a713bb.

⛔ Files ignored due to path filters (1)
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (5)
  • Makefile.d/functions.mk (1 hunks)
  • apis/docs/v1/insert.md (1 hunks)
  • apis/proto/v1/vald/insert.proto (1 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • docs/api/insert.md (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • docs/api/insert.md
  • apis/docs/v1/insert.md
  • apis/proto/v1/vald/insert.proto
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile.d/functions.mk
⏰ Context from checks skipped due to timeout of 90000ms (110)
  • GitHub Check: build / build
🔇 Additional comments (2)
apis/swagger/v1/vald/insert.swagger.json (2)

17-18: Enhanced API Documentation for /insert Endpoint
The updated summary now provides a detailed overview of the insert operation, including a clear status code table and troubleshooting guidance. This change successfully corrects the previous typographical error (from "Inset" to "Insert") and enhances clarity.


50-51: Enhanced API Documentation for /insert/multiple Endpoint
The updated summary and description offer a comprehensive explanation for handling multiple insert operations. The inclusion of an HTML notice about gRPC message size limitations, along with a structured status code table and detailed troubleshooting section, significantly improves the documentation’s clarity and usefulness.



@vdaas-ci
Collaborator

vdaas-ci commented Feb 3, 2025

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • /label actions/e2e-deploy - run E2E deploy & integration test

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
internal/net/grpc/pool/pool.go (2)

143-160: Fallback initial connection dial.
The fallback logic for re-dialing on the same or a new port is well-structured. Consider extracting this repeated dial fallback into a helper function if it grows more complex.


437-464: New Reconnect method logic.
The flow checks if the pool is closing or already healthy, then conditionally triggers fresh DNS resolution. This seems correct; however, ensure that concurrent calls to Reconnect won’t lead to inconsistent states (e.g., changing reconnectHash while another goroutine is also resolving DNS). Overall, a good expansion of functionality.

pkg/gateway/lb/handler/grpc/aggregation.go (1)

108-108: Consider using structured logging with error context.

The added debug logging is good for visibility, but consider using structured logging to include additional context like target address, request ID, etc. This will make debugging easier by providing more context around the errors.

Example:

-log.Debug(err)
+log.Debug("aggregation search error",
+    "error", err,
+    "target", target,
+    "request_id", bcfg.GetRequestId())

Also applies to: 120-120, 177-177, 189-189

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aee34f9 and 3b75a31.

📒 Files selected for processing (12)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (12 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • versions/K3S_VERSION
  • pkg/gateway/mirror/service/mirror.go
🔇 Additional comments (36)
internal/net/grpc/pool/pool.go (17)

23-40: Imports look consistent.
These newly added imports (maps, codes, status, singleflight) seem fitting for the added functionalities (maps.Clone usage, gRPC error codes, singleflight to handle concurrency). No issues spotted.


50-54: Context-aware method signatures.
Introducing context parameters into Disconnect and Get helps ensure clean cancellations and timeouts. This update aligns well with modern best practices.


79-79: Singleflight for connection concurrency control.
Defining group singleflight.Group[Conn] is a neat approach for preventing duplicate connections and consolidating in-flight requests.


93-94: Shared metrics map under locking.
Using mu to guard the metrics map ensures thread safety. Confirm consistent locking around every access to avoid data races.


107-107: Instantiating singleflight.
Initializing p.group = singleflight.New[Conn]() on creation is helpful for grouping connection calls. Good addition.


193-193: Debug log usage.
The debug log here is straightforward. No issues noted.


241-241: Lock usage within store method.
Acquiring a write lock for the pool store operation is correct. Ensures correct pointer updates.


355-356: Additional debug message.
Just a debug statement; no functional impact.


368-370: Check if pool is closing before singleTargetConnect.
Short-circuiting when p == nil || p.closing.Load() helps avoid redundant dial attempts. Overall logic is sound.

Also applies to: 376-376


466-467: Context-aware disconnection.
Providing a ctx argument to Disconnect is a nice extension for controlled shutdowns.


494-494: Dial debug logging.
This addition is harmless and helpful for tracing dial attempts.


555-559: Calculation of unhealthy connections.
Assigning unhealthy = pl - cnt implies “remaining connections are unhealthy.” If cnt ever surpasses pl, this might yield a negative result. Verify that such a scenario cannot occur, or add a safeguard if needed.


604-633: Enhanced Do method error recovery flow.
• Incorporating status checks and re-closing the connection if it’s canceled or fails is a solid approach to preserve resilience.
• The subsequent refreshConn call (line 626) properly attempts a retry.
Looks reasonable overall.


640-644: Get method with a context parameter.
Using getHealthyConn matches the “do once, fail fast” logic. Good step for context awareness.


795-843: poolConn.Close with delayed shutdown logic.
The approach thoroughly checks connection states, waiting for a stable state or the context to expire. Good for graceful closings.


845-872: isHealthy extended logging.
The function properly triggers conn.Connect() for IDLE states and recurses upon state changes. This handles transient states robustly.


874-883: Metrics function.
Exposing a safe clone of the metrics map is a good pattern to avoid concurrency issues with external callers.

internal/net/grpc/client.go (7)

93-93: Method now accepts context.
Switching ConnectedAddrs() to be context-aware aligns with gRPC best practices, enabling cancellation/timeouts during address checking.


253-253: Using rangeConns with force=true for internal loops.
These lines trigger a forced iteration over connections for rebalancing/health checks. Ensure concurrency is well-managed when multiple rebalances happen simultaneously.

Also applies to: 289-289


418-418: Non-forced rangeConns calls.
Calling rangeConns(ctx, false, ...) indicates partial reconnection logic only if connections are unhealthy. This approach is consistent with your incremental healing design.

Also applies to: 481-481


568-568: Checking IsHealthy before usage.
The approach to skip or reconnect invalid pools is correct. Logging warns about unreachable addresses, which is helpful for debugging.

Also applies to: 637-637


703-703: RoundRobin approach calls rangeConns.
Applying rangeConns(ctx, false, ...) is a valid pattern for enumerating connections until one is healthy. No concerns here.


1088-1101: ConnectedAddrs now uses rangeConns.
Implementation filters out only healthy connections, returning an updated slice of addresses. Helps external callers handle connection states.


1121-1149: New internal helper rangeConns.
force param usage for forced iteration vs. selective iteration is a nice design.
• The check that re-connects unhealthy pool objects ensures self-healing.
Looks cohesive.

Makefile.d/k3d.mk (3)

20-20: New K3D_NETWORK variable.
Declaring K3D_NETWORK = bridge is straightforward; no issues.


33-34: Customized install directory.
Using K3D_INSTALL_DIR=$(BINDIR) and ensuring chmod a+x is performed afterwards looks good. This keeps binaries organized.


46-47: Eviction policy arguments for K3S.
These additional kubelet arguments help manage disk usage thresholds. The usage is correct for the “agent” node contexts. No issues spotted.

internal/test/mock/grpc/grpc_client_mock.go (1)

54-54: LGTM! Context parameter properly added.

The method signature update is consistent with the interface changes across the codebase.

internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! New metrics for pool connection health.

Clear and descriptive metric name and description for tracking pool connection health.


69-77: LGTM! Proper metric view configuration.

The view is correctly configured with sum aggregation, which is appropriate for connection count metrics.


83-102: LGTM! Well-implemented metrics registration.

The implementation properly:

  • Creates an observable gauge
  • Handles registration errors
  • Sets up callback for metrics collection
internal/backoff/backoff.go (1)

244-244: LGTM! Improved metrics cloning.

Good use of maps.Clone for a more concise and efficient implementation.

internal/net/grpc/client_test.go (1)

3110-3110: LGTM! Context parameter additions improve consistency.

The addition of context parameters to ConnectedAddrs and rangeConns methods aligns with the context-aware design pattern used throughout the codebase. This change enables better control over cancellation and timeouts.

Also applies to: 3306-3306

internal/net/grpc/pool/pool_test.go (3)

2347-2347: LGTM! Context parameter added correctly.

The change to add context.Background() to the Disconnect method call aligns with the updated method signature.


1-14: LGTM! License header is properly maintained.

The Apache 2.0 license header is correctly formatted and includes the appropriate copyright information.


18-108: Implement missing test cases.

The file contains extensive test case structures but most are commented out with TODO markers. Consider implementing these test cases to ensure proper coverage of the pool functionality, especially for the updated Disconnect method.

Let's check the test coverage:

Also applies to: 110-265, 267-422, 424-584, 586-744, 746-909, 911-1061, 1063-1213, 1215-1360, 1362-1531, 1533-1696, 1698-1866, 1868-2035, 2037-2201, 2203-2352, 2354-2522, 2524-2684, 2686-2849, 2851-3015, 3017-3187, 3189-3339, 3341-3491, 3493-3657, 3659-3823, 3825-3975, 3977-4127, 4129-4232, 4234-4321

internal/test/mock/grpc_testify_mock.go: outdated review comment (resolved)
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 3b75a31 to ac007bc Compare February 3, 2025 08:38

cloudflare-workers-and-pages bot commented Feb 3, 2025

Deploying vald with Cloudflare Pages

Latest commit: 9a713bb
Status: ✅  Deploy successful!
Preview URL: https://daffb205.vald.pages.dev
Branch Preview URL: https://bugfix-internal-grpc-add-hea.vald.pages.dev

View logs

@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from ac007bc to e317f7a Compare February 3, 2025 09:02
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
internal/observability/metrics/grpc/grpc.go (1)

81-103: Consider error handling improvements in the callback function.

The callback function for observing pool connection metrics could benefit from more robust error handling.

Consider this improvement:

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
             if len(ms) == 0 {
                 return nil
             }
+            var errs error
             for name, cnt := range ms {
-                o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
+                if err := o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name))); err != nil {
+                    errs = errors.Join(errs, fmt.Errorf("failed to observe metrics for %s: %w", name, err))
+                }
             }
-            return nil
+            return errs
         }, healthyConn,
     )
     return err
 }
internal/net/grpc/pool/pool.go (2)

92-95: Consider using sync.Map for metrics.

The current implementation uses a global map with mutex, which could be replaced with sync.Map for better concurrent access patterns.

Consider this improvement:

-var (
-    mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
-)
+var metrics sync.Map // not generic: values are int64, assert the type on Load

466-492: Consider adding metrics for disconnection events.

The Disconnect method could benefit from tracking disconnection events in metrics.

Consider adding a new metric for tracking disconnection events:

 func (p *pool) Disconnect(ctx context.Context) (err error) {
     log.Debug("Disconnecting...")
     p.closing.Store(true)
     defer p.closing.Store(false)
     emap := make(map[string]error, p.len())
+    disconnectedConns := 0
     err = p.loop(ctx, func(ctx context.Context, _ int, pc *poolConn) bool {
         if pc != nil && pc.conn != nil {
             ierr := pc.conn.Close()
             if ierr != nil {
                 if !errors.Is(ierr, context.DeadlineExceeded) &&
                     !errors.Is(ierr, context.Canceled) {
                     log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, ierr)
                     emap[ierr.Error()] = err
                 } else {
                     log.Debugf("Disconnect loop operation canceled while closing pool member connection to %s,\terror: %v", pc.addr, ierr)
                     return false
                 }
             }
+            disconnectedConns++
         }
         return true
     })
     p.flush()
+    mu.Lock()
+    metrics[p.addr] = 0 // Reset metrics for this pool
+    mu.Unlock()
     for _, e := range emap {
         err = errors.Join(err, e)
     }
     return err
 }
internal/net/grpc/client.go (1)

1128-1140: Consider adding retry mechanism for connection recovery.

The connection recovery logic could benefit from a retry mechanism with backoff.

Consider this improvement:

 if p == nil || !p.IsHealthy(ctx) {
+    var retries int
+    maxRetries := 3
+    backoffDuration := time.Second
+    for retries < maxRetries {
         pc, err := p.Connect(ctx)
         if pc == nil || err != nil || !pc.IsHealthy(ctx) {
             if pc != nil {
                 if derr := pc.Disconnect(ctx); derr != nil {
                     log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
                 }
             }
+            retries++
+            if retries < maxRetries {
+                select {
+                case <-ctx.Done():
+                    return true
+                case <-time.After(backoffDuration * time.Duration(retries)):
+                    continue
+                }
+            }
             log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
             return true
         }
         p = pc
+        break
+    }
 }
internal/net/grpc/pool/pool_test.go (1)

2347-2347: Update test to include context parameter.

The test is calling p.Disconnect(context.Background()) with a default background context. Consider adding test cases that verify context cancellation behavior.

Add test cases to verify:

  • Context cancellation during disconnect
  • Context timeout handling
  • Context propagation to underlying connections
-err := p.Disconnect(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+defer cancel()
+err := p.Disconnect(ctx)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b75a31 and e317f7a.

📒 Files selected for processing (12)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (12 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • versions/K3S_VERSION
  • internal/backoff/backoff.go
  • internal/test/mock/grpc_testify_mock.go
  • pkg/gateway/mirror/service/mirror.go
  • internal/test/mock/grpc/grpc_client_mock.go
  • Makefile.d/k3d.mk
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • internal/net/grpc/client_test.go
⏰ Context from checks skipped due to timeout of 90000ms (93)
  • GitHub Check: build / build
  • GitHub Check: build / dump-contexts-to-log
🔇 Additional comments (9)
internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric names.

The new metric names and descriptions are clear and follow the existing naming pattern.


37-39: LGTM! Well-structured struct update.

The addition of poolTargetAddrKey field with a clear purpose enhances the struct's functionality for tracking pool connection metrics.


69-77: LGTM! Consistent view registration.

The new view for pool connection metrics follows the same pattern as existing views and uses appropriate aggregation type (Sum) for counting connections.

internal/net/grpc/pool/pool.go (2)

586-589: LGTM! Thread-safe metrics update.

The metrics update is properly protected by mutex lock.


873-881: LGTM! Clean metrics retrieval implementation.

The Metrics function safely clones and returns the metrics map.
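The clone-based snapshot pattern described here can be sketched with the standard library's `maps.Clone`. This is a minimal standalone sketch, not Vald's actual `pool` package; the `metricsStore` type and its methods are illustrative names:

```go
package main

import (
	"fmt"
	"maps"
	"sync"
)

// metricsStore guards a map of per-address healthy-connection counts.
type metricsStore struct {
	mu      sync.RWMutex
	metrics map[string]int64
}

func (s *metricsStore) set(addr string, n int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics[addr] = n
}

// snapshot returns a shallow copy, so callers can read it without holding
// the lock and cannot mutate the store's internal map.
func (s *metricsStore) snapshot() map[string]int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if len(s.metrics) == 0 {
		return nil
	}
	return maps.Clone(s.metrics)
}

func main() {
	s := &metricsStore{metrics: make(map[string]int64)}
	s.set("10.0.0.1:8081", 3)
	snap := s.snapshot()
	snap["10.0.0.1:8081"] = 99                 // mutating the copy...
	fmt.Println(s.snapshot()["10.0.0.1:8081"]) // ...does not affect the store
}
```

Because `maps.Clone` is a shallow copy, this isolation holds only for value types like `int64`; a map of pointers would still share the pointed-to data.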

internal/net/grpc/client.go (2)

1121-1148: LGTM! Improved connection health checking.

The rangeConns method now properly handles connection health checks and includes force parameter for different use cases.


1088-1100: LGTM! Context-aware ConnectedAddrs implementation.

The ConnectedAddrs method now properly uses context for health checks.

internal/net/grpc/pool/pool_test.go (2)

16-108: Implement missing test cases for New function.

The test cases for the New function are commented out with TODO placeholders. Since this PR adds health check functionality, it's important to have test coverage for pool initialization.

Run this script to check if there are any existing tests for the New function:

Would you like me to help implement test cases for the New function that verify:

  • Pool initialization with health check configuration
  • Error cases for invalid health check parameters
  • Default health check behavior

4234-4320: Add test cases for health check functionality.

The isHealthy test function is crucial for this PR's health check feature but lacks test implementations.

Run this script to check if there are any existing health check tests:

Would you like me to help implement test cases that verify:

  • Health check success/failure scenarios
  • Health check timeout handling
  • Health check error conditions
  • Health check metrics collection

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 3, 2025
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch 2 times, most recently from 716f8e0 to a453411 on February 3, 2025 09:16

codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Project coverage is 17.18%. Comparing base (c3b5832) to head (9a713bb).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
internal/test/mock/grpc_testify_mock.go 0.00% 2 Missing ⚠️
internal/test/mock/grpc/grpc_client_mock.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2828      +/-   ##
==========================================
- Coverage   23.92%   17.18%   -6.75%     
==========================================
  Files         547      136     -411     
  Lines       54586    15042   -39544     
==========================================
- Hits        13061     2585   -10476     
+ Misses      40741    12154   -28587     
+ Partials      784      303     -481     

☔ View full report in Codecov by Sentry.

@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch 2 times, most recently from 8759e30 to 9ecb684 on February 3, 2025 10:00
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
internal/net/grpc/client_test.go (1)

1-3490: Test Implementation Incompleteness – Add Test Cases

It appears that the test file contains extensive scaffolding with 38 TODO markers and 38 placeholder "test_case_X" sections. Although the placeholders are present, no actual test logic is implemented. To ensure thorough coverage, we need to fill these scaffolds with meaningful tests addressing:

• Validating gRPC client connectivity, including context parameter changes
• Testing both happy paths and error scenarios
• Covering concurrent and ordered operations as defined in methods such as Range, RangeConcurrent, ConnectedAddrs, and rangeConns

Would you like assistance in generating concrete test implementations for these sections?

🔗 Analysis chain

Add test cases for the TODO sections.

The test file has extensive test case structures defined but most are commented out and marked as TODO. This indicates incomplete test coverage which should be addressed.

Let's verify the test coverage:

Would you like me to help generate test cases for the TODO sections? I can provide implementations that cover:

  • Happy path scenarios
  • Error cases
  • Edge cases
  • Concurrent execution scenarios
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Count total number of test functions vs implemented test cases
echo "Test functions with TODO markers:"
rg -c "TODO test cases" internal/net/grpc/client_test.go

echo "Actual implemented test cases:"
rg -c "test_case_[0-9]+" internal/net/grpc/client_test.go

Length of output: 272

🧹 Nitpick comments (6)
internal/observability/metrics/grpc/grpc.go (1)

81-103: Consider adding error handling for empty metrics.

The implementation looks good but could benefit from additional error handling.

Consider adding error handling when ms is nil:

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
-            if len(ms) == 0 {
+            if ms == nil || len(ms) == 0 {
                 return nil
             }
             for name, cnt := range ms {
                 o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
             }
             return nil
         }, healthyConn,
     )
     return err
 }
internal/net/grpc/pool/pool.go (1)

800-848: Consider adding timeout validation.

The connection closing logic could benefit from timeout validation.

Consider validating the delay parameter:

 func (pc *poolConn) Close(ctx context.Context, delay time.Duration) error {
+    if delay <= 0 {
+        return errors.New("delay must be positive")
+    }
     tdelay := delay / 10
     if tdelay < time.Millisecond*5 {
         tdelay = time.Millisecond * 5
internal/net/grpc/client.go (1)

1121-1148: Consider adding connection retry limit.

The connection retry logic in rangeConns could benefit from a retry limit to prevent infinite loops.

Consider adding a retry limit:

 func (g *gRPCClient) rangeConns(ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool) error {
     var cnt int
+    const maxRetries = 3
+    retries := 0
     g.conns.Range(func(addr string, p pool.Conn) bool {
         if force {
             cnt++
             return fn(addr, p)
         }
         if p == nil || !p.IsHealthy(ctx) {
+            if retries >= maxRetries {
+                log.Debugf("Max retries reached for %s", addr)
+                return true
+            }
+            retries++
             pc, err := p.Connect(ctx)
internal/net/grpc/client_test.go (2)

87-87: Consider disabling parallel test execution.

The tests use t.Parallel() but given the shared nature of gRPC connections and the potential for race conditions, parallel execution may lead to flaky tests.

Consider removing t.Parallel() or implementing proper test isolation through:

  • Unique connection pools per test
  • Isolated test servers
  • Proper cleanup between tests

Also applies to: 441-441, 632-632, 824-824, 1019-1019, 1212-1212, 1408-1408, 1611-1611, 1786-1786, 2138-2138, 2322-2322, 2518-2518, 2707-2707, 3071-3071


87-103: Add goleak cleanup in TestMain.

While individual tests use goleak.VerifyNone(), it would be better to centralize this in a TestMain function.

Add a TestMain function:

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m,
        goleak.IgnoreCurrent(),
        // Add other specific goroutine ignore rules
    )
}
internal/net/grpc/pool/pool_test.go (1)

45-46: Implement test cases for the pool operations.

The test file has a well-structured test table setup but lacks actual test cases. I can help implement test cases for various pool operations like connection management, health checks, and reconnection logic.

Would you like me to help generate test cases for any specific pool operations? For example, I can start with:

  1. Connection initialization tests
  2. Health check tests
  3. Reconnection tests
  4. Error handling tests

Also applies to: 147-148, 304-305, 466-467, 624-625, 789-790, 949-950, 1101-1102, 1248-1249, 1407-1408, 1579-1580, 1746-1747, 1915-1916, 2083-2084, 2241-2242, 2402-2403, 2566-2567, 2729-2730, 2897-2898, 3065-3066, 3227-3228, 3379-3380, 3539-3540, 3704-3705, 3863-3864, 4015-4016, 4157-4158, 4257-4258
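The connection/health-check test cases suggested above would typically take a table-driven shape. The sketch below is illustrative and driven from `main` so it runs standalone; real tests would use the `testing` package, and `dial` is a hypothetical stand-in for pool connection setup:

```go
package main

import (
	"errors"
	"fmt"
)

// dial is a stand-in for the pool-connection setup under test.
func dial(addr string) error {
	if addr == "" {
		return errors.New("empty address")
	}
	return nil
}

func main() {
	tests := []struct {
		name    string
		addr    string
		wantErr bool
	}{
		{name: "valid address", addr: "127.0.0.1:8081", wantErr: false},
		{name: "empty address", addr: "", wantErr: true},
	}
	for _, tt := range tests {
		err := dial(tt.addr)
		// Compare the observed error state against the expectation.
		if (err != nil) != tt.wantErr {
			fmt.Printf("%s: unexpected error state: %v\n", tt.name, err)
			continue
		}
		fmt.Printf("%s: ok\n", tt.name)
	}
}
```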

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e317f7a and c45bf8a.

📒 Files selected for processing (13)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (12 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • pkg/gateway/mirror/service/mirror.go
  • internal/test/mock/grpc_testify_mock.go
  • internal/test/mock/grpc/grpc_client_mock.go
  • Makefile.d/k3d.mk
  • versions/K3S_VERSION
  • internal/backoff/backoff.go
  • pkg/gateway/lb/handler/grpc/aggregation.go
🔇 Additional comments (9)
internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric naming.

The new metric for tracking pool connections has a clear name and description that accurately reflects its purpose.


37-39: LGTM! Good field naming and initialization.

The poolTargetAddrKey field is appropriately named and initialized with a descriptive value in the New function.


69-77: LGTM! Proper metric view configuration.

The view for pool connections is correctly configured with AggregationSum which is appropriate for counting healthy connections.

internal/net/net.go (1)

199-199: LGTM! Improved IP address logging.

Using StringExpanded() provides a more detailed representation of IP addresses in logs, which is helpful for debugging.

internal/net/grpc/pool/pool.go (2)

544-607: LGTM! Robust health check implementation.

The health check implementation is thorough and handles various edge cases appropriately:

  • Checks connection state
  • Handles reconnection attempts
  • Updates metrics
  • Proper error handling

92-95: 🛠️ Refactor suggestion

Consider using a more thread-safe metrics initialization.

The global metrics map initialization could be improved for thread safety.

Consider using sync.Once for thread-safe initialization:

 var (
     mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
+    metrics map[string]int64
+    metricsOnce sync.Once
 )

+func init() {
+    metricsOnce.Do(func() {
+        metrics = make(map[string]int64)
+    })
+}

Likely invalid or redundant comment.

internal/net/grpc/client.go (1)

1088-1100: LGTM! Proper context handling in ConnectedAddrs.

The implementation correctly uses context for connection health checks and properly handles errors.

internal/net/grpc/pool/pool_test.go (2)

2347-2347: LGTM! The context parameter addition is consistent.

The change to add context.Background() parameter to p.Disconnect() call aligns with the goal of improving context handling in gRPC connections.


16-108: Implement the TODO test cases to ensure comprehensive test coverage.

The file contains numerous test function stubs with TODO placeholders. Given that this is a gRPC connection pool implementation, having comprehensive test coverage is crucial for ensuring reliability and proper connection handling.

Run the following script to check test coverage:

Also applies to: 109-265, 266-422, 423-584, 585-744, 745-909, 910-1061, 1062-1213, 1214-1360, 1361-1531, 1532-1697, 1698-1866, 1867-2035, 2036-2201, 2202-2353, 2354-2522, 2523-2684, 2685-2849, 2850-3015, 3016-3187, 3188-3339, 3340-3491, 3492-3657, 3658-3823, 3824-3975, 3976-4127, 4128-4232, 4233-4321

internal/net/grpc/client_test.go (resolved)
internal/net/grpc/client_test.go (outdated, resolved)
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch 4 times, most recently from 278a532 to 4247d97 on February 4, 2025 07:14
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
internal/observability/metrics/grpc/grpc.go (1)

81-104: Consider adding error handling for callback registration.

The error from RegisterCallback is correctly propagated, but consider adding specific error handling for the callback function itself to prevent potential panics.

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
 	// The metrics are dynamically registered at the grpc server interceptor package,
 	healthyConn, err := m.Int64ObservableGauge(
 		poolConnMetricsName,
 		metrics.WithDescription(poolConnMetricsDescription),
 		metrics.WithUnit(metrics.Dimensionless),
 	)
 	if err != nil {
 		return err
 	}
 	_, err = m.RegisterCallback(
 		func(ctx context.Context, o api.Observer) error {
 			ms := pool.Metrics(ctx)
 			if len(ms) == 0 {
 				return nil
 			}
+			var errs error
 			for name, cnt := range ms {
-				o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
+				if err := o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name))); err != nil {
+					errs = errors.Join(errs, fmt.Errorf("failed to observe metrics for %s: %w", name, err))
+				}
 			}
-			return nil
+			return errs
 		}, healthyConn,
 	)
 	return err
 }
internal/net/grpc/pool/pool.go (2)

92-95: Consider using sync.Map for metrics.

The global metrics map with mutex could potentially become a bottleneck under high concurrency. Consider using sync.Map for better performance.

-var (
-	mu      sync.RWMutex
-	metrics map[string]int64 = make(map[string]int64)
-)
+var metrics sync.Map[string, int64]

592-594: Consider batching metrics updates.

The metrics update in IsHealthy could be optimized by batching updates or using atomic operations.

-mu.Lock()
-metrics[p.addr] = int64(pl - unhealthy)
-mu.Unlock()
+atomic.StoreInt64(&metrics[p.addr], int64(pl - unhealthy))
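Since Go map values are not addressable, an atomic per-address counter is more naturally expressed with `sync.Map`. The sketch below is a stdlib-only illustration of that shape, with hypothetical helper names, not the project's real metrics code:

```go
package main

import (
	"fmt"
	"sync"
)

// gauges holds one int64 counter per target address.
// sync.Map avoids holding a global mutex on the metrics hot path.
var gauges sync.Map // map[string]int64

func setGauge(addr string, n int64) {
	gauges.Store(addr, n)
}

// snapshot copies the current gauges into a plain map for observation.
func snapshot() map[string]int64 {
	out := make(map[string]int64)
	gauges.Range(func(k, v any) bool {
		out[k.(string)] = v.(int64)
		return true
	})
	return out
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			setGauge("10.0.0.1:8081", int64(i)) // concurrent writers are safe
		}(i)
	}
	wg.Wait()
	fmt.Println(len(snapshot())) // 1
}
```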
internal/net/grpc/client.go (1)

1129-1156: Consider adding timeout for health checks.

The rangeConns method should consider adding a timeout for health checks to prevent long-running operations.

 func (g *gRPCClient) rangeConns(ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool) error {
+	// Add timeout for health checks
+	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
+	defer cancel()
+
 	var cnt int
 	g.conns.Range(func(addr string, p pool.Conn) bool {
 		if force {
 			cnt++
 			return fn(addr, p)
 		}
 		if p == nil || !p.IsHealthy(ctx) {
 			pc, err := p.Connect(ctx)
 			if pc == nil || err != nil || !pc.IsHealthy(ctx) {
 				if pc != nil {
 					if derr := pc.Disconnect(ctx); derr != nil {
 						log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
 					}
 				}
 				log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
 				return true
 			}
 			p = pc
 		}
 		cnt++
 		return fn(addr, p)
 	})
 	if cnt == 0 {
 		return errors.ErrGRPCClientConnNotFound("*")
 	}
 	return nil
 }
internal/net/grpc/client_test.go (1)

3110-3110: Add timeout to context in test.

When testing network operations, it's recommended to use context with timeout to prevent tests from hanging indefinitely.

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c45bf8a and 9f83fdd.

📒 Files selected for processing (15)
  • Makefile (2 hunks)
  • Makefile.d/functions.mk (2 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (16 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • versions/K3S_VERSION
  • pkg/gateway/mirror/service/mirror.go
  • internal/backoff/backoff.go
  • internal/net/net.go
  • Makefile.d/k3d.mk
  • internal/test/mock/grpc/grpc_client_mock.go
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • internal/test/mock/grpc_testify_mock.go
  • internal/net/grpc/pool/pool_test.go
🔇 Additional comments (13)
Makefile.d/functions.mk (1)

62-67: LGTM: CGO flags are properly defined and ordered.

The CGO flags are correctly defined with proper ordering, ensuring that all necessary flags (CFLAGS, CXXFLAGS, FFLAGS) are set before they're used in CGO-specific variables.

Makefile (4)

161-164: Excellent optimization and security flags configuration.

The new COMMON_FLAGS variable provides a comprehensive set of optimization and security flags:

  • Performance: -Ofast, -march=native, -mtune=native, -flto=auto
  • Security: -fvisibility=hidden
  • Size optimization: -ffunction-sections, -fdata-sections

165-167: LGTM: Language standards are properly set.

The language standards are correctly set for each compiler:

  • C: gnu23
  • C++: gnu++23
  • Fortran: f2018

181-182: LGTM: Clean flag handling for arm64.

The flags for arm64 architecture are cleanly handled without unnecessary modifications.


176-178: Verify AVX-512 disabling flags for NGT on amd64.

The NGT_FLAGS disable AVX-512 instructions which might affect performance on newer CPUs that support these instructions.

Run this script to check if the target system supports AVX-512:

internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric names.

The new metric names and descriptions are clear and accurately describe their purpose.


37-39: LGTM! Well-structured struct update.

The addition of poolTargetAddrKey field to grpcServerMetrics struct is appropriate for storing the target address key.


69-77: LGTM! Appropriate metric view configuration.

The new view for pool connection metrics is correctly configured with sum aggregation, which is appropriate for counting healthy connections.

internal/net/grpc/pool/pool.go (1)

879-887: LGTM! Clean metrics retrieval implementation.

The Metrics function correctly uses maps.Clone to return a copy of the metrics map, preventing external modifications.

internal/net/grpc/client.go (2)

1096-1108: LGTM! Context-aware connection listing.

The ConnectedAddrs method now correctly uses context for health checks and connection listing.


1115-1124: LGTM! Proper context handling in Close method.

The Close method now correctly handles context cancellation during disconnection.

internal/net/grpc/client_test.go (2)

3306-3308: LGTM! Context parameter correctly added.

The test args struct has been updated to include the context parameter as previously requested.


20-105: Implement test cases for better coverage.

The test file contains extensive boilerplate but most test cases are commented out with TODO placeholders. Consider implementing these test cases to ensure proper test coverage of the gRPC client functionality.

Let's check the current test coverage:

Would you like me to help generate test cases for the key functionality?

Also applies to: 159-246, 346-435, 535-627, 728-819, 921-1014, 1118-1207, 1312-1403, 1510-1605, 1702-1781, 1876-1957, 2052-2133, 2228-2317, 2422-2513, 2612-2702, 2802-2891, 2986-3067, 3166-3254, 3352-3440

Makefile.d/functions.mk (outdated, resolved)
coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 4, 2025
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 6c78959 to 7ba32c0 on February 4, 2025 09:54
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (9)
internal/observability/metrics/grpc/grpc.go (1)

81-103: Improve error handling in callback registration.

While the implementation is solid, consider handling errors from the callback function more explicitly. Currently, if the callback returns an error, it's only returned from the Register method without any logging or cleanup.

Consider this improvement:

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
             if len(ms) == 0 {
                 return nil
             }
             for name, cnt := range ms {
                 o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
             }
             return nil
         }, healthyConn,
     )
-    return err
+    if err != nil {
+        log.Errorf("Failed to register metrics callback: %v", err)
+        return fmt.Errorf("failed to register metrics callback: %w", err)
+    }
+    return nil
 }
internal/net/grpc/pool/pool.go (3)

92-95: Consider encapsulating metrics storage.

While the implementation is thread-safe, consider encapsulating the metrics storage in a dedicated type to avoid global variables and provide better control over the metrics lifecycle.

Consider this alternative approach:

-var (
-    mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
-)

+type metricsStorage struct {
+    mu      sync.RWMutex
+    metrics map[string]int64
+}
+
+func newMetricsStorage() *metricsStorage {
+    return &metricsStorage{
+        metrics: make(map[string]int64),
+    }
+}
+
+var defaultMetricsStorage = newMetricsStorage()

544-607: Enhance logging consistency in health check.

While the health check implementation is robust, the logging could be more structured and consistent. Consider using structured logging with consistent fields across all log messages.

Consider this improvement:

 func (p *pool) IsHealthy(ctx context.Context) (healthy bool) {
     if p == nil || p.closing.Load() {
         return false
     }
-    log.Debug("Checking health...")
+    log.WithFields(log.Fields{
+        "addr": p.addr,
+        "pool_len": p.len(),
+    }).Debug("Starting health check")
     var cnt, unhealthy int
     pl := p.len()
     err := p.loop(ctx, func(ctx context.Context, idx int, pc *poolConn) bool {
         if pc == nil || !isHealthy(ctx, pc.conn) {
             if p.isIP {
                 if pc != nil && pc.addr != "" {
                     err := p.refreshConn(ctx, idx, pc, pc.addr)
                     if err != nil {
-                        unhealthy = pl - cnt
+                        unhealthy = pl - cnt
+                        log.WithFields(log.Fields{
+                            "addr": pc.addr,
+                            "error": err,
+                        }).Debug("Failed to refresh IP connection")
                         return false
                     }
                     return true
                 }
                 unhealthy = pl - cnt
                 return false
             }

879-887: Consider optimizing metrics map cloning.

While the implementation is thread-safe, consider optimizing the map cloning for better performance with large maps. Pre-allocating the map with the exact size would avoid map growth and rehashing.

Consider this improvement:

 func Metrics(context.Context) map[string]int64 {
     mu.RLock()
-    defer mu.RUnlock()
-
     if len(metrics) == 0 {
+        mu.RUnlock()
         return nil
     }
-    return maps.Clone(metrics)
+    result := make(map[string]int64, len(metrics))
+    for k, v := range metrics {
+        result[k] = v
+    }
+    mu.RUnlock()
+    return result
 }
internal/net/grpc/client.go (3)

1129-1158: Enhance error handling in rangeConns.

While the implementation is solid, consider improving error handling by aggregating errors from connection operations and providing more context in error messages.

Consider this improvement:

 func (g *gRPCClient) rangeConns(
     ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool,
 ) error {
     var cnt int
+    var errs []error
     g.conns.Range(func(addr string, p pool.Conn) bool {
         if force {
             cnt++
             return fn(addr, p)
         }
         if p == nil || !p.IsHealthy(ctx) {
             pc, err := p.Connect(ctx)
             if pc == nil || err != nil || !pc.IsHealthy(ctx) {
                 if pc != nil {
                     if derr := pc.Disconnect(ctx); derr != nil {
-                        log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
+                        errs = append(errs, fmt.Errorf("disconnect %s: %w", addr, derr))
                     }
                 }
-                log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
+                log.WithFields(log.Fields{
+                    "addr": addr,
+                    "error": err,
+                    "pool": p.String(),
+                }).Debug("Unhealthy connection detected")
                 return true
             }
             p = pc
         }
         cnt++
         return fn(addr, p)
     })
     if cnt == 0 {
-        return errors.ErrGRPCClientConnNotFound("*")
+        if len(errs) > 0 {
+            return fmt.Errorf("no healthy connections found: %w", errors.Join(errs...))
+        }
+        return errors.ErrGRPCClientConnNotFound("*")
     }
     return nil
 }

1096-1108: Optimize ConnectedAddrs implementation.

While the implementation is correct, consider pre-allocating the slice with the exact capacity to avoid reallocations during append operations.

Consider this improvement:

 func (g *gRPCClient) ConnectedAddrs(ctx context.Context) (addrs []string) {
-    addrs = make([]string, 0, g.conns.Len())
+    // Pre-count healthy connections to allocate exact capacity
+    healthyCount := 0
+    g.rangeConns(ctx, false, func(addr string, p pool.Conn) bool {
+        if p != nil && p.IsHealthy(ctx) {
+            healthyCount++
+        }
+        return true
+    })
+    addrs = make([]string, 0, healthyCount)
     err := g.rangeConns(ctx, false, func(addr string, p pool.Conn) bool {
         if p != nil && p.IsHealthy(ctx) {
             addrs = append(addrs, addr)
         }
         return true
     })
     if err != nil {
         return nil
     }
     return addrs
 }

1115-1124: Consider parallel disconnection in Close.

While the implementation is correct, consider implementing parallel disconnection for faster cleanup of multiple connections.

Consider this improvement:

-select {
-case <-ctx.Done():
-    return false
-default:
-    derr := g.Disconnect(ctx, addr)
-    if derr != nil && !errors.Is(derr, errors.ErrGRPCClientConnNotFound(addr)) {
-        err = errors.Join(err, derr)
-    }
-    return true
-}
+eg, egctx := errgroup.New(ctx)
+eg.SetLimit(runtime.GOMAXPROCS(0)) // Or another suitable concurrency limit
+g.conns.Range(func(addr string, p pool.Conn) bool {
+    addr := addr // Capture for goroutine
+    eg.Go(safety.RecoverFunc(func() error {
+        select {
+        case <-egctx.Done():
+            return egctx.Err()
+        default:
+            if derr := g.Disconnect(egctx, addr); derr != nil &&
+                !errors.Is(derr, errors.ErrGRPCClientConnNotFound(addr)) {
+                return derr
+            }
+            return nil
+        }
+    }))
+    return true
+})
+if err := eg.Wait(); err != nil {
+    return err
+}
apis/swagger/v1/vald/insert.swagger.json (1)

17-18: Fix typo in API documentation.

There's a typo in the summary: "Inset RPC" should be "Insert RPC"

Apply this diff to fix the typo:

-"summary": "Overview\nInset RPC is the method to add a new single vector.\n
+"summary": "Overview\nInsert RPC is the method to add a new single vector.\n
apis/swagger/v1/vald/update.swagger.json (1)

50-52: MultiUpdate Endpoint Documentation Added:
The changes for the /update/multiple endpoint clearly articulate the multi-update functionality and include a helpful notice regarding gRPC message size limitations. For consistency, consider adding more detailed troubleshooting information in future revisions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ba32c0 and 9c82017.

⛔ Files ignored due to path filters (31)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (68)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile (2 hunks)
  • Makefile.d/functions.mk (2 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • example/client/go.mod (2 hunks)
  • go.mod (12 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (16 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • rust/rust-toolchain (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/RUST_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (55)
  • versions/HELM_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/KIND_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/OPERATOR_SDK_VERSION
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • versions/KUBECTL_VERSION
  • versions/NGT_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/DOCKER_VERSION
  • versions/YQ_VERSION
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/PROTOBUF_VERSION
  • versions/K3S_VERSION
  • versions/actions/CODECOV_CODECOV_ACTION
  • versions/BUF_VERSION
  • versions/TELEPRESENCE_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • versions/RUST_VERSION
  • versions/USEARCH_VERSION
  • apis/swagger/v1/vald/flush.swagger.json
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • versions/actions/GITHUB_ISSUE_METRICS
  • versions/PROMETHEUS_STACK_VERSION
  • versions/GOLANGCILINT_VERSION
  • rust/rust-toolchain
  • .github/ISSUE_TEMPLATE/bug_report.md
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • versions/FAISS_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • pkg/gateway/mirror/service/mirror.go
  • internal/test/mock/grpc_testify_mock.go
  • k8s/index/operator/deployment.yaml
  • Makefile.d/functions.mk
  • internal/test/mock/grpc/grpc_client_mock.go
  • apis/swagger/v1/vald/index.swagger.json
  • internal/backoff/backoff.go
  • k8s/discoverer/deployment.yaml
  • k8s/gateway/gateway/lb/deployment.yaml
  • Makefile.d/k3d.mk
  • versions/CMAKE_VERSION
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • internal/net/grpc/pool/pool_test.go
  • internal/net/net.go
  • k8s/gateway/gateway/mirror/deployment.yaml
  • k8s/manager/index/deployment.yaml
  • Makefile
  • .github/PULL_REQUEST_TEMPLATE.md
  • example/client/go.mod
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • go.mod
🧰 Additional context used
📓 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-01-22T06:26:59.130Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
🔇 Additional comments (34)
apis/swagger/v1/mirror/mirror.swagger.json (1)

15-18: Enhanced Documentation through Detailed Summary:

The updated summary on line 17 now clearly outlines the endpoint's purpose and includes a structured table of status codes. This approach improves clarity and provides valuable context regarding the expected responses from the endpoint. However, please verify that the multiline formatting and table syntax (using newline characters and pipe symbols) is rendered as expected in your Swagger UI/editor. In some cases, certain tools might not fully support this markdown-like formatting within JSON strings.

.gitfiles (3)

126-126: LGTM! Addition of Rust unit test workflow.

The addition of a dedicated Rust unit test workflow improves the testing coverage for Rust components.


161-163: LGTM! New buf.gen configuration files.

The addition of buf.gen configuration files for documentation generation enhances the project's documentation tooling.


164-177: LGTM! Comprehensive API documentation.

The addition of multiple API documentation files improves the project's documentation coverage and maintainability.

internal/observability/metrics/grpc/grpc.go (3)

16-24: LGTM!

The imports and struct changes are well-organized and necessary for implementing the pool connection metrics functionality.

Also applies to: 37-39


33-34: LGTM!

The metric constants are well-named and have clear descriptions.


69-77: LGTM!

The pool connection metrics view is properly configured with sum aggregation, which is appropriate for tracking the count of healthy connections.

internal/net/grpc/client_test.go (2)

3110-3110: Verify context handling in ConnectedAddrs test.

The test should verify context cancellation and timeout scenarios.

Apply this diff to improve context handling:

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)

3306-3308: 🛠️ Refactor suggestion

Update rangeConns test to include context parameter.

The test function signature has been updated but test cases need to be updated accordingly.

Apply this diff to update the test:

type args struct {
+    ctx context.Context
     fn func(addr string, p pool.Conn) bool
}
✅ Verification successful

Test Case Update Required: Update the args struct to include the context parameter.

The test case in internal/net/grpc/client_test.go still contains commented-out code where the ctx context.Context field is present. This confirms that after the signature change, the test cases haven’t been updated to include the context parameter as expected. Please apply the provided diff to uncomment (or add) the ctx context.Context field within the args struct so that it aligns with the updated function signature.

  • Location: internal/net/grpc/client_test.go (lines around 3306–3308)
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts or web queries found


Script:

#!/bin/bash
# List occurrences of the 'args' struct definition in the test file to inspect context usage.
echo "Occurrences of 'type args' in internal/net/grpc/client_test.go:"
rg -n "type args *{" internal/net/grpc/client_test.go

echo "--------------------------------------------"
# List occurrences of 'ctx context.Context' to see if the context parameter is currently present.
echo "Occurrences of 'ctx context.Context' in internal/net/grpc/client_test.go:"
rg -n "ctx context.Context" internal/net/grpc/client_test.go

echo "--------------------------------------------"
# List the lines around where the commented-out code exists to verify removal or changes.
echo "Context around commented code for 'fn func(addr string, p pool.Conn) bool':"
rg -n -C 5 "fn func\(addr string, p pool.Conn\)" internal/net/grpc/client_test.go

Length of output: 2154

apis/swagger/v1/vald/object.swagger.json (2)

17-18: LGTM! Well-structured API documentation.

The documentation for /exists/{id} endpoint is comprehensive and follows best practices:

  • Clear overview section
  • Structured status code table
  • Detailed troubleshooting guide with common errors and resolutions

106-107: LGTM! Well-structured API documentation.

The documentation for /object/{id.id} endpoint is comprehensive and follows best practices:

  • Clear overview section
  • Structured status code table
  • Detailed troubleshooting guide with common errors and resolutions
apis/swagger/v1/vald/insert.swagger.json (1)

50-51: LGTM! Well-structured API documentation.

The documentation for /insert/multiple endpoint is comprehensive and follows best practices:

  • Clear overview section
  • Important notice about gRPC message size limitation
  • Structured status code table
  • Detailed troubleshooting guide with common errors and resolutions
apis/swagger/v1/vald/upsert.swagger.json (2)

17-18: Enhanced Upsert Endpoint Documentation:
The updated summary and description now clearly explain that the Upsert RPC handles both updating an existing vector and inserting a new one. The detailed status code table and troubleshooting section greatly improve clarity. Consider a slight rewording (for example, “update an existing vector or insert a new one”) to make it even clearer.


50-52: Enhanced MultiUpsert Endpoint Documentation:
The revised summary for the /upsert/multiple endpoint now explains the dual functionality of updating existing vectors and adding new ones in a single request. The added notice on gRPC message size limitations and the structured table of status codes/troubleshooting details provide excellent guidance.

apis/swagger/v1/vald/update.swagger.json (1)

17-18: Enhanced Update Endpoint Documentation:
The updated summary and description for the /update endpoint now include a comprehensive status code table with troubleshooting details. This clear presentation helps users understand the purpose and common error scenarios of this RPC.

apis/swagger/v1/vald/remove.swagger.json (3)

17-19: Improved Single Remove Documentation:
The revised summary and description for the /remove endpoint now include a detailed table of possible responses and troubleshooting tips. This enhancement clarifies what a client should expect in error conditions, improving the overall API usability.


50-52: Enhanced MultiRemove Endpoint Documentation:
The updated documentation for the /remove/multiple endpoint now mirrors the style of the single remove endpoint—with a detailed status code table and troubleshooting suggestions. The clarity and consistency across endpoints are commendable.


83-85: Detailed RemoveByTimestamp Documentation:
The summary and description for the /remove/timestamp endpoint now provide a comprehensive explanation—including how multiple timestamps are combined using an AND condition—with a clear status code table and troubleshooting section.

apis/swagger/v1/vald/filter.swagger.json (8)

17-18: InsertObject Endpoint Documentation Enhancement:
The updated summary now specifies that the InsertObject RPC is used for inserting objects via the Vald Filter Gateway. The added status code table clearly lays out possible responses, which enhances user understanding.


49-50: MultiInsertObject Endpoint Documentation Enhancement:
The revised summary for the /insert/object/multiple endpoint makes it clear that multiple new objects can be inserted in a single request. The detailed presentation of status codes improves clarity.


81-82: SearchObject Endpoint Documentation Updated:
The updated summary for the /search/object endpoint clearly articulates that the SearchObject RPC is used for object-based searches. The error handling table and troubleshooting information add valuable context.


112-113: StreamSearchObject Endpoint Documentation Updated:
The enhanced summary for the /search/object/multiple endpoint now includes a notice on bidirectional streaming, which is very useful. The structured status code table and troubleshooting details further improve the documentation.


145-146: UpdateObject Endpoint Documentation Improved:
For the /update/object endpoint, the revised summary now effectively communicates the update operation for a single object, complete with a clear status code table.


177-178: MultiUpdateObject Endpoint Documentation Updated:
The updated summary and description emphasize the multi-update functionality and note the gRPC message size limitation. Clear troubleshooting steps are provided, enhancing the endpoint’s usability.


210-211: UpsertObject Endpoint Documentation Enhanced:
The summary for the /upsert/object endpoint now clearly states that it is used for both updating an existing object and inserting a new one. The concise explanation is very helpful.


242-244: MultiUpsertObject Endpoint Documentation Enhanced:
The changes for the /upsert/object/multiple endpoint now offer a detailed summary with a clear notice on gRPC message size limitations and troubleshooting information. This consistency and level of detail across endpoints is excellent.

apis/swagger/v1/vald/search.swagger.json (8)

16-19: LinearSearch Endpoint Documentation Enhanced:
The updated summary for the /linearsearch endpoint clearly describes the LinearSearch RPC with a detailed status code table and troubleshooting guide. This helps set a clear expectation for users.


49-52: LinearSearchByID Endpoint Documentation Updated:
The revised documentation for the /linearsearch/id endpoint now explains that the search is performed by a user-defined vector ID – including a note that the vector must be indexed prior to querying. This adds useful context regarding prerequisites.


81-85: MultiLinearSearchByID Endpoint Documentation Enhanced:
The enhanced summary for the /linearsearch/id/multiple endpoint now includes a clear notice about gRPC message size limitations, a thorough status code table, and detailed troubleshooting steps, which together offer a comprehensive guide.


115-118: MultiLinearSearch Endpoint Documentation Improved:
The updated summary for /linearsearch/multiple explains the functionality clearly and provides users with detailed troubleshooting information. This ensures that clients are aware of both the capabilities and limitations of the RPC.


148-151: Search Endpoint Documentation Enhanced:
The /search endpoint now features an expanded summary and detailed troubleshooting information. This helps in setting clear expectations regarding both the success and error conditions of the search operation.


180-184: SearchByID Endpoint Documentation Updated:
The summary for the /search/id endpoint now clearly indicates that the RPC searches based on a user-defined vector ID, including important validation notes and detailed error guidance.


213-217: MultiSearchByID Endpoint Documentation Enhanced:
The updated documentation for /search/id/multiple now includes a detailed status code table and clear troubleshooting advice. This helps users diagnose potential issues when searching with multiple IDs.


246-250: MultiSearch Endpoint Documentation Improved:
The enhanced summary and comprehensive troubleshooting information for the /search/multiple endpoint provide a clear and useful guide for users. The documentation now effectively communicates both functional details and considerations (such as gRPC limitations).

@github-actions github-actions bot removed the size/L label Feb 4, 2025
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 9c82017 to 0e02ce2 on February 4, 2025 10:42
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 4

🧹 Nitpick comments (7)
internal/observability/metrics/grpc/grpc.go (1)

81-104: Consider adding error handling for metrics registration.

While the implementation is good, consider adding cleanup or fallback handling if registration fails.

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
+    var registeredMetrics []api.Observable
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
+        // Clean up any previously registered metrics
+        for _, metric := range registeredMetrics {
+            m.UnregisterCallback(metric)
+        }
         return err
     }
+    registeredMetrics = append(registeredMetrics, healthyConn)
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
             if len(ms) == 0 {
                 return nil
             }
             for name, cnt := range ms {
                 o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
             }
             return nil
         }, healthyConn,
     )
     return err
 }
internal/net/grpc/pool/pool.go (1)

592-594: Consider using deferred mutex unlock for metrics update.

The current implementation unlocks manually, which is fragile if the critical section later grows or can panic between Lock and Unlock. Consider using a deferred unlock pattern.

-    mu.Lock()
-    metrics[p.addr] = int64(pl - unhealthy)
-    mu.Unlock()
+    func() {
+        mu.Lock()
+        defer mu.Unlock()
+        metrics[p.addr] = int64(pl - unhealthy)
+    }()
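The suggested pattern — scoping Lock/Unlock inside a function literal so the unlock is deferred — can be shown in isolation; `metricsStore` is a hypothetical type for illustration, not the pool's actual structure:

```go
package main

import (
	"fmt"
	"sync"
)

// metricsStore guards a per-address metrics map. update wraps the critical
// section in a function literal so the deferred unlock runs even if the
// body panics or gains early returns later.
type metricsStore struct {
	mu      sync.Mutex
	metrics map[string]int64
}

func (s *metricsStore) update(addr string, healthy int64) {
	func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		s.metrics[addr] = healthy
	}()
}

func main() {
	s := &metricsStore{metrics: map[string]int64{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int64) {
			defer wg.Done()
			s.update("10.0.0.1:8081", n) // concurrent writers stay serialized
		}(int64(i))
	}
	wg.Wait()
	fmt.Println(len(s.metrics))
}
```

The lock is still held for the same duration as the manual version; the gain is panic-safety and resilience to future edits inside the critical section.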
apis/swagger/v1/vald/insert.swagger.json (1)

17-18: Typo in Endpoint Summary – “Inset RPC” vs “Insert RPC”.
The updated summary for the /insert endpoint now reads “Inset RPC is the method to add a new single vector.” This is likely a typographical mistake. Please update “Inset” to “Insert” for clarity and consistency with the rest of the API documentation.

apis/swagger/v1/vald/filter.swagger.json (1)

177-179: MultiUpdateObject Endpoint – Notice on Message Size Limitation.
The /update/object/multiple endpoint now includes a notice about gRPC message size limitations, which is important for users building bulk requests. Ensure the “exceed” wording is grammatically adjusted (consider “exceeds” instead of “exceed”).

Makefile (3)

161-167: Enhance Compiler and Linker Flags Consolidation
The new definitions for LDFLAGS, COMMON_FLAGS, and the language-specific flag variables are well organized. Using a central COMMON_FLAGS variable helps maintain consistency across C, C++ and Fortran builds.

Suggestion: Consider adding an inline comment above these definitions explaining the rationale behind using aggressive optimizations (e.g. -Ofast) and the potential trade-offs regarding floating‐point precision or portability.


181-182: Arm64 Flag Handling
For the arm64 architecture, the flags remain unchanged as indicated by the reassignment. Although functionally correct, it might be helpful to add a brief comment noting that no extra optimizations or disabling flags are required for arm64 targets.


185-186: Default Flag Preservation for Other Architectures
The explicit reassignment of CFLAGS and CXXFLAGS in the fallback path (for architectures other than amd64 or arm64) confirms that the defaults are preserved. This is clear and acceptable as long as no additional architecture-specific flags are needed.

Suggestion: A comment might be added to explain this branch if future maintainers expect architecture-specific optimizations.

🛑 Comments failed to post (4)
.gitfiles (1)

1951-1951: 💡 Codebase verification

Robust Error Handling: Multiple unwrap() Calls Detected

The QBG implementation and object handler contain numerous calls to unwrap(), which do not align with the robust error handling practices outlined in PR #2807. Specifically:

  • rust/libs/algorithms/qbg/src/lib.rs has multiple unwrap() usages (e.g., lines 379, 397, 448, etc.) that may cause runtime panics if an error occurs.
  • rust/bin/agent/src/handler/object.rs (line 43) calls unwrap() on the conversion of hostname, which can also lead to crashes if the hostname isn’t valid.

It’s recommended to refactor these instances to use proper error handling methods, such as ? for propagating errors or matching on the result to handle potential errors gracefully.

🔗 Analysis chain

LGTM: New Rust implementation files.

The addition of Rust source files for object handling and the QBG algorithm implementation aligns with the project's goals. Based on the retrieved learning from PR #2807, ensure that the Rust implementation follows robust error handling practices.

Run the following script to verify the Rust implementation follows the project's error handling guidelines:

Also applies to: 1970-1974

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for unwrap() usage in Rust files
# Note: unwrap() should be avoided as per PR #2807 learning

rg -n 'unwrap\(\)' 'rust/bin/agent/src/handler/object.rs' 'rust/libs/algorithms/qbg/src/lib.rs'

Length of output: 2839

apis/swagger/v1/vald/object.swagger.json (2)

77-77: 🛠️ Refactor suggestion

Add missing status codes and troubleshooting information.

The documentation for /object/meta/{id.id} endpoint has TODO placeholders for status codes and troubleshooting.

-        "summary": "Overview\nRepresent the RPC to get the vector metadata. This RPC is mainly used for index correction process\n---\nStatus Code\nTODO\n---\nTroubleshooting\nTODO",
+        "summary": "Overview\nRepresent the RPC to get the vector metadata. This RPC is mainly used for index correction process\n---\nStatus Code\n|  0   | OK                |\n|  1   | CANCELLED         |\n|  3   | INVALID_ARGUMENT  |\n|  4   | DEADLINE_EXCEEDED |\n|  5   | NOT_FOUND         |\n|  13  | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
+        "description": "Here are some common reasons and how to resolve each error.\n\n| name              | common reason                                                                                   | how to resolve                                                                           |\n| :---------------- | :---------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- |\n| CANCELLED         | Executed cancel() of rpc from client/server-side or network problems between client and server. | Check the code, especially around timeout and connection management, and fix if needed.  |\n| INVALID_ARGUMENT  | The Requested vector's ID is empty, or some request payload is invalid.                         | Check request payload and fix request payload.                                           |\n| DEADLINE_EXCEEDED | The RPC timeout setting is too short on the client/server side.                                 | Check the gRPC timeout setting on both the client and server sides and fix it if needed. |\n| NOT_FOUND         | Requested ID is NOT inserted.                                                                   | Send a request with an ID that is already inserted.                                      |\n| INTERNAL          | Target Vald cluster or network route has some critical error.                                   | Check target Vald cluster first and check network route including ingress as second.     |",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        "summary": "Overview\nRepresent the RPC to get the vector metadata. This RPC is mainly used for index correction process\n---\nStatus Code\n|  0   | OK                |\n|  1   | CANCELLED         |\n|  3   | INVALID_ARGUMENT  |\n|  4   | DEADLINE_EXCEEDED |\n|  5   | NOT_FOUND         |\n|  13  | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
        "description": "Here are some common reasons and how to resolve each error.\n\n| name              | common reason                                                                                   | how to resolve                                                                           |\n| :---------------- | :---------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- |\n| CANCELLED         | Executed cancel() of rpc from client/server-side or network problems between client and server. | Check the code, especially around timeout and connection management, and fix if needed.  |\n| INVALID_ARGUMENT  | The Requested vector's ID is empty, or some request payload is invalid.                         | Check request payload and fix request payload.                                           |\n| DEADLINE_EXCEEDED | The RPC timeout setting is too short on the client/server side.                                 | Check the gRPC timeout setting on both the client and server sides and fix it if needed. |\n| NOT_FOUND         | Requested ID is NOT inserted.                                                                   | Send a request with an ID that is already inserted.                                      |\n| INTERNAL          | Target Vald cluster or network route has some critical error.                                   | Check target Vald cluster first and check network route including ingress as second.     |",

47-47: 🛠️ Refactor suggestion

Add missing status codes and troubleshooting information.

The documentation for /object/list endpoint has TODO placeholders for status codes and troubleshooting.

-        "summary": "Overview\nA method to get all the vectors with server streaming\n---\nStatus Code\nTODO\n---\nTroubleshooting\nTODO",
+        "summary": "Overview\nA method to get all the vectors with server streaming\n---\nStatus Code\n|  0   | OK                |\n|  1   | CANCELLED         |\n|  3   | INVALID_ARGUMENT  |\n|  4   | DEADLINE_EXCEEDED |\n|  13  | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
+        "description": "Here are some common reasons and how to resolve each error.\n\n| name              | common reason                                                                                   | how to resolve                                                                           |\n| :---------------- | :---------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- |\n| CANCELLED         | Executed cancel() of rpc from client/server-side or network problems between client and server. | Check the code, especially around timeout and connection management, and fix if needed.  |\n| INVALID_ARGUMENT  | Some request payload is invalid.                                                                 | Check request payload and fix request payload.                                           |\n| DEADLINE_EXCEEDED | The RPC timeout setting is too short on the client/server side.                                 | Check the gRPC timeout setting on both the client and server sides and fix it if needed. |\n| INTERNAL          | Target Vald cluster or network route has some critical error.                                   | Check target Vald cluster first and check network route including ingress as second.     |",
apis/swagger/v1/vald/update.swagger.json (1)

83-84: ⚠️ Potential issue

Incomplete Placeholder for UpdateTimestamp Endpoint.
The summary for /update/timestamp still contains “TODO” placeholders for both status codes and troubleshooting. This should be completed before merging—the documentation must reflect actual status values and common error resolutions for the update timestamp operation.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (6)
apis/swagger/v1/vald/upsert.swagger.json (2)

50-51: Clarify MultiUpsert Endpoint Summary and Description.
The updated summary and description for the /upsert/multiple endpoint clearly outline the method's purpose and call out the gRPC message size limitation. One minor nitpick: "a single request" may read more clearly than "1 request".


252-253: Fix Typographical Error in UpsertMulti Request Definition.
There is a typo in the description field: "Represent mthe ultiple upsert request." should be corrected to "Represent the multiple upsert request." This change will improve clarity and maintain the professionalism of the API documentation.

Below is a suggested diff for the correction:

-        "description": "Represent mthe ultiple upsert request."
+        "description": "Represent the multiple upsert request."
internal/net/grpc/pool/pool.go (1)

1129-1158: Enhance error handling in rangeConns.

The rangeConns function has been updated to handle connection health checks, but there are a few improvements that could be made:

  1. Consider adding error propagation from Connect and Disconnect calls
  2. Consider adding retry logic for transient failures

Apply this diff to improve error handling:

 func (g *gRPCClient) rangeConns(
     ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool,
 ) error {
     var cnt int
+    var errs []error
     g.conns.Range(func(addr string, p pool.Conn) bool {
         if force {
             cnt++
             return fn(addr, p)
         }
        if p == nil || !p.IsHealthy(ctx) {
+            if p == nil {
+                log.Debugf("nil connection detected for %s during gRPC Connection Range over Loop", addr)
+                return true
+            }
            pc, err := p.Connect(ctx)
             if pc == nil || err != nil || !pc.IsHealthy(ctx) {
                 if pc != nil {
                     if derr := pc.Disconnect(ctx); derr != nil {
                         log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
+                        errs = append(errs, derr)
                     }
                 }
                 log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
                 return true
             }
             p = pc
         }
         cnt++
         return fn(addr, p)
     })
     if cnt == 0 {
         return errors.ErrGRPCClientConnNotFound("*")
     }
+    if len(errs) > 0 {
+        return errors.Join(errs...)
+    }
     return nil
 }
internal/net/grpc/client.go (1)

257-291: Consider adding retry logic for connection failures.

The connection monitoring loops could benefit from retry logic for transient failures.

Consider adding exponential backoff for connection retries:

 err = g.rangeConns(ctx, true, func(addr string, p pool.Conn) bool {
+    var retries int
+    var backoff = time.Millisecond * 100
     if addr == "" || p == nil {
         disconnectTargets = append(disconnectTargets, addr)
         return true
     }
     var err error
-    p, err = p.Reconnect(ctx, false)
+    for retries < 3 {
+        p, err = p.Reconnect(ctx, false)
+        if err == nil {
+            break
+        }
+        time.Sleep(backoff)
+        backoff *= 2
+        retries++
+    }
     if err != nil {
         // ... error handling ...
     }
     return true
 })

Also applies to: 294-333

apis/swagger/v1/vald/filter.swagger.json (2)

175-178: Revised Summary & Description for /update/object/multiple Endpoint
The updated text now incorporates a notice regarding potential gRPC message size limitations and provides a comprehensive status code table. One suggestion is to verify that the HTML snippet (i.e. <div class="notice">) renders correctly in your Swagger UI.


240-243: Enhanced Summary & Description for /upsert/object/multiple Endpoint
The updated summary and description now indicate that the endpoint performs multi-upsert operations, complete with a detailed status code table and an HTML-formatted notice about gRPC message size limitations. It is recommended to check that the embedded HTML is supported by all Swagger UI renderers in your environment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c82017 and 0e02ce2.

⛔ Files ignored due to path filters (31)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (68)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile (6 hunks)
  • Makefile.d/functions.mk (2 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • example/client/go.mod (2 hunks)
  • go.mod (12 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (16 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • rust/rust-toolchain (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/RUST_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (58)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/KUBECTL_VERSION
  • versions/OPERATOR_SDK_VERSION
  • versions/CMAKE_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/DOCKER_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/HELM_VERSION
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • versions/PROTOBUF_VERSION
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • versions/KIND_VERSION
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • versions/actions/CODECOV_CODECOV_ACTION
  • versions/RUST_VERSION
  • versions/NGT_VERSION
  • versions/PROMETHEUS_STACK_VERSION
  • versions/FAISS_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/BUF_VERSION
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/USEARCH_VERSION
  • versions/K3S_VERSION
  • internal/net/net.go
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • versions/actions/GITHUB_ISSUE_METRICS
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/GOLANGCILINT_VERSION
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • versions/YQ_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • internal/test/mock/grpc/grpc_client_mock.go
  • Makefile.d/functions.mk
  • internal/test/mock/grpc_testify_mock.go
  • k8s/gateway/gateway/lb/deployment.yaml
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • internal/backoff/backoff.go
  • versions/TELEPRESENCE_VERSION
  • .github/PULL_REQUEST_TEMPLATE.md
  • rust/rust-toolchain
  • apis/swagger/v1/vald/flush.swagger.json
  • apis/swagger/v1/mirror/mirror.swagger.json
  • k8s/index/operator/deployment.yaml
  • pkg/gateway/mirror/service/mirror.go
  • apis/swagger/v1/vald/index.swagger.json
  • k8s/discoverer/deployment.yaml
  • .github/ISSUE_TEMPLATE/bug_report.md
  • k8s/gateway/gateway/mirror/deployment.yaml
  • k8s/manager/index/deployment.yaml
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • Makefile.d/k3d.mk
  • internal/net/grpc/pool/pool_test.go
  • apis/swagger/v1/vald/remove.swagger.json
  • Makefile
  • example/client/go.mod
  • apis/swagger/v1/vald/update.swagger.json
  • go.mod
🧰 Additional context used
📓 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-01-22T06:26:59.130Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
🔇 Additional comments (34)
apis/swagger/v1/vald/upsert.swagger.json (1)

17-18: Enhance Upsert Endpoint Documentation.
The updated summary and description now provide a comprehensive overview of the Upsert RPC, including detailed status codes and troubleshooting guidelines. This enriched documentation should be very helpful to API consumers. Please verify that the multiline text with embedded markdown-style tables renders as expected in your Swagger UI.

apis/swagger/v1/vald/object.swagger.json (4)

17-18: Enhanced Documentation for the /exists/{id} Endpoint.
The updated summary and added description now clearly outline the purpose of the Exists RPC along with detailed error codes and troubleshooting steps. Please ensure that these textual details remain in sync with the backend’s error-handling logic and overall documentation standards.


47-47: Incomplete Documentation in /object/list Endpoint.
The summary field still has placeholders ("Status Code TODO" and "Troubleshooting TODO"). Please fill in these sections with the appropriate status codes and troubleshooting information so that API consumers receive complete documentation.


77-77: Incomplete Documentation in /object/meta/{id.id} Endpoint.
Similar to the /object/list endpoint, the summary contains placeholders for "Status Code" and "Troubleshooting". Completing these sections will improve clarity and consistency with other endpoints.


106-107: Enhanced Documentation for the /object/{id.id} Endpoint.
The improved summary and description now offer a detailed explanation of the GetObject RPC, including a clear table of status codes and error scenarios. Verify that these details accurately reflect the actual behavior and error responses from the service, and ensure consistency with similar endpoints such as /exists/{id}.

.gitfiles (2)

1951-1951: Verify error handling in the new Rust handler.

Based on previous feedback, ensure that the new object handler follows proper error handling practices by avoiding unwrap() on Option types.

Run the following script to check for unwrap() usage:

#!/bin/bash
# Description: Check for unwrap() usage in the new Rust handler
rg -A 2 'unwrap\(\)' rust/bin/agent/src/handler/object.rs

1970-1974: Integration Testing and Documentation for QBG Algorithm Are Missing.

The new QBG algorithm implementation requires proper integration testing and documentation.

Run the following script to check for tests and documentation:

#!/bin/bash
# Description: Check for test files and documentation for QBG algorithm
echo "Checking for tests..."
fd -t f "test" rust/libs/algorithms/qbg/

echo "Checking for documentation..."
fd -t f "README|\.md" rust/libs/algorithms/qbg/
internal/observability/metrics/grpc/grpc.go (4)

33-34: LGTM! New metrics for pool connection health.

The new metrics server_pool_conn with description "Count of healthy pool connections by target address" will help monitor the health of gRPC connection pools.


37-39: LGTM! Added field for target address key.

The poolTargetAddrKey field in grpcServerMetrics struct will be used as a label key for the pool connection metrics.


69-77: LGTM! Added view for pool connection metrics.

The new view for poolConnMetricsName with AggregationSum is correctly configured to track the count of healthy pool connections.


83-103: LGTM! Added registration for pool connection metrics.

The registration logic correctly:

  1. Creates an observable gauge for pool connections
  2. Registers a callback to observe connection health
  3. Uses the target address as an attribute
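As a rough illustration of the observe-on-collect flow, the sketch below uses a stub observer type in place of the actual OpenTelemetry metric API; only the shape of the callback registration is meant to match:

```go
package main

import "fmt"

// observer stands in for the OpenTelemetry metric observer: registered
// callbacks run at collection time and report the current
// healthy-connection count per target address.
type observer struct {
	callbacks []func(report func(addr string, v int64))
}

func (o *observer) register(cb func(report func(addr string, v int64))) {
	o.callbacks = append(o.callbacks, cb)
}

func (o *observer) collect() map[string]int64 {
	out := make(map[string]int64)
	for _, cb := range o.callbacks {
		cb(func(addr string, v int64) { out[addr] = v })
	}
	return out
}

func main() {
	// Hypothetical pool metrics keyed by target address.
	metrics := map[string]int64{"vald-agent-0:8081": 3}
	var o observer
	o.register(func(report func(string, int64)) {
		for addr, v := range metrics {
			report(addr, v)
		}
	})
	fmt.Println(o.collect()["vald-agent-0:8081"]) // 3
}
```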
internal/net/grpc/pool/pool.go (3)

93-95: LGTM! Added global metrics storage.

The global metrics map with mutex synchronization will store connection health metrics per target address.


592-594: LGTM! Added metrics update in health check.

The health check function now updates the metrics map with the count of healthy connections per target address.


879-887: LGTM! Added metrics retrieval function.

The Metrics function safely returns a clone of the metrics map using maps.Clone.
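The mutex-plus-maps.Clone pattern can be shown in a few lines; metricsStore below is a simplified stand-in for the pool's metrics map, not the actual implementation, and requires Go 1.21+ for the maps package:

```go
package main

import (
	"fmt"
	"maps"
	"sync"
)

// metricsStore sketches the pattern the review describes: readers get
// a shallow snapshot, so later writes cannot race with the caller.
type metricsStore struct {
	mu sync.RWMutex
	m  map[string]int64
}

func (s *metricsStore) set(addr string, healthy int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m == nil {
		s.m = make(map[string]int64)
	}
	s.m[addr] = healthy
}

func (s *metricsStore) Metrics() map[string]int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return maps.Clone(s.m)
}

func main() {
	var s metricsStore
	s.set("10.0.0.1:8081", 3)
	snap := s.Metrics()
	s.set("10.0.0.1:8081", 0) // mutating the store...
	fmt.Println(snap["10.0.0.1:8081"]) // ...does not change the snapshot: 3
}
```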

internal/net/grpc/client.go (2)

1096-1108: LGTM! Added context to ConnectedAddrs.

The ConnectedAddrs function now accepts a context parameter and uses it for health checks, improving context propagation.


1115-1124: LGTM! Added context handling in Close.

The Close function now properly handles context cancellation during cleanup.

internal/net/grpc/client_test.go (2)

3110-3110: Context parameter added but test cases need update.

The ConnectedAddrs method now accepts a context parameter but the test cases haven't been fully updated to reflect this change.

The test should verify context cancellation and timeout scenarios. Apply this diff:

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)

3306-3308: Context parameter added to rangeConns but test cases need update.

The rangeConns method signature has been updated to include a context parameter but the test cases need to be updated accordingly.

The test should verify context handling. Apply this diff:

type args struct {
+    ctx context.Context
     fn func(addr string, p pool.Conn) bool
}
apis/swagger/v1/vald/insert.swagger.json (2)

17-18: LGTM! Comprehensive API documentation improvements.

The documentation improvements for the /insert endpoint are excellent, providing:

  • Clear status code table
  • Detailed troubleshooting guide
  • Common error scenarios and resolutions

50-51: LGTM! Well-structured documentation for multi-insert endpoint.

The documentation for /insert/multiple endpoint is well-structured with:

  • Clear warning about gRPC message size limitations
  • Comprehensive status code table
  • Detailed troubleshooting section
apis/swagger/v1/vald/filter.swagger.json (6)

15-17: Enhanced Summary for /insert/object Endpoint
The updated summary now clearly specifies that the endpoint uses the InsertObject RPC method and includes a detailed status code table. This increased clarity should help API consumers understand the possible responses.


47-50: Improved Summary for /insert/object/multiple Endpoint
The revised summary explicitly states that the endpoint corresponds to the MultiInsertObject RPC and incorporates a status code table with a notice on gRPC message size limitations. This additional information is useful for users concerned with message size constraints.


79-82: Updated Summary for /search/object Endpoint
The summary now clearly indicates that this endpoint uses the SearchObject RPC and provides a detailed status code table. This update enhances overall clarity and consistency.


111-114: Clarified Summary for /search/object/multiple Endpoint
The updated summary emphasizes that the endpoint employs the bidirectional streaming mechanism (StreamSearchObject RPC) while including a status code table. This change makes it easier for consumers to understand the streaming behavior.


143-146: Refined Summary for /update/object Endpoint
The summary update explicitly denotes the UpdateObject RPC along with a detailed list of status codes, which promotes better understanding of the operation’s expected outcomes.


208-211: Improved Summary for /upsert/object Endpoint
The revised summary clearly describes that this endpoint facilitates both updating and inserting a single object, and it details the corresponding status codes. This enhances the clarity of the API documentation.

apis/swagger/v1/vald/search.swagger.json (8)

15-19: Enhanced Summary and Troubleshooting for /linearsearch Endpoint
The updated summary now explains that the endpoint uses the LinearSearch RPC method and includes a detailed status code table with troubleshooting guidance. This comprehensive information should help users diagnose potential issues.


48-52: Updated Summary for /linearsearch/id Endpoint
The enhanced summary now identifies the endpoint as using the LinearSearchByID RPC, noting that the vector must be indexed into vald-agent beforehand. Detailed status codes and troubleshooting instructions are provided, which improve clarity.


81-85: Improved Summary for /linearsearch/id/multiple Endpoint
The revised summary now clarifies that this endpoint supports multi-linear search by IDs in a single call. The inclusion of an HTML notice regarding gRPC message size limitations and detailed troubleshooting steps significantly enhances the documentation.


114-118: Revised Summary for /linearsearch/multiple Endpoint
The updated summary and description clearly communicate that the endpoint enables multi-linear search with multiple vectors in one request. The detailed status codes and troubleshooting information offer valuable guidance for resolving issues.


147-151: Enhanced Summary for /search Endpoint
The updated summary now includes a detailed status code table and troubleshooting instructions for the Search RPC endpoint, thereby improving usability and user guidance.


180-184: Updated Summary for /search/id Endpoint
The summary has been refined to clearly indicate that the endpoint uses the SearchByID RPC. It also provides important notes about pre-indexing requirements along with a comprehensive list of status codes and troubleshooting advice.


213-217: Enhanced Summary for /search/id/multiple Endpoint
The revised summary now reflects that this endpoint processes multi-search requests by IDs and provides extensive troubleshooting details, including an HTML notice on gRPC message size limitations.


246-250: Updated Summary for /search/multiple Endpoint
The enhanced summary clearly outlines the multi-search functionality using multiple vectors in a single request. The inclusion of a detailed status code table, troubleshooting guidance, and a notice regarding gRPC message size limitations makes the endpoint documentation much clearer.

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 4, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
Makefile.d/build.mk (1)

45-45: LGTM! Switching to static linking improves binary portability.

The change from $(LDFLAGS) to $(STATIC_LDFLAGS) for CGO-enabled targets is a good improvement that:

  • Reduces runtime dependencies by statically linking libraries
  • Makes binaries more portable across different environments
  • Ensures consistent behavior by bundling all dependencies

Note that this may increase binary sizes and build times.

Also applies to: 50-50, 102-102, 110-110, 114-114

Makefile (4)

161-162: Improved linker flags organization.

The separation of LDFLAGS and STATIC_LDFLAGS provides better control over linking behavior. The new flags focus on security (-z relro -z now), optimization (-flto=auto, -Wl,-O2), and size reduction (-Wl,--gc-sections).


170-175: Well-structured compiler optimization flags.

The introduction of COMMON_FLAGS with carefully chosen optimization flags is excellent:

  • Performance optimizations: -O3, -ffast-math, -march=native
  • Size optimizations: -ffunction-sections, -fdata-sections
  • Security improvements: -fvisibility=hidden

177-190: Good architecture-specific optimizations.

The architecture-specific handling is well-implemented:

  • amd64: Disables newer AVX instructions for better compatibility
  • arm64: Uses appropriate march setting
  • Fallback for other architectures

688-706: Consistent CMake configurations across different installations.

The CMake configurations for NGT, Faiss, and usearch are consistently structured with appropriate flags for:

  • Build type (Release)
  • Static linking
  • Optimization flags
  • Installation paths

Also applies to: 711-734, 740-766

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef74e7e and fd9a346.

📒 Files selected for processing (3)
  • Makefile (5 hunks)
  • Makefile.d/build.mk (2 hunks)
  • Makefile.d/tools.mk (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile.d/tools.mk
⏰ Context from checks skipped due to timeout of 90000ms (98)
  • GitHub Check: build / build
  • GitHub Check: build / dump-contexts-to-log

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 4, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
internal/net/grpc/pool/pool.go (4)

303-347: Consider optimizing connection refresh logic.

The refreshConn function could be optimized by:

  1. Reducing nested if conditions
  2. Moving the connection closure logic to a separate helper function
  3. Adding retry mechanism for failed connections
 func (p *pool) refreshConn(ctx context.Context, idx int, pc *poolConn, addr string) (err error) {
+    closeConn := func(pc *poolConn) {
+        if pc != nil && pc.conn != nil {
+            p.eg.Go(safety.RecoverFunc(func() error {
+                log.Debugf("waiting for invalid connection to %s to be closed...", pc.addr)
+                err := pc.Close(ctx, p.roccd)
+                if err != nil {
+                    log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, err)
+                }
+                return nil
+            }))
+        }
+    }
+
     if pc != nil && pc.addr == addr && isHealthy(ctx, pc.conn) {
         return nil
     }
-    if pc != nil {
-        log.Debugf("connection for %s pool %d/%d len %d is unhealthy trying to establish new pool member connection to %s", pc.addr, idx+1, p.Size(), p.Len(), addr)
-    } else {
-        log.Debugf("connection pool %d/%d len %d is empty, establish new pool member connection to %s", idx+1, p.Size(), p.Len(), addr)
-    }
+    log.Debugf("connection pool %d/%d len %d status: %s, establish new pool member connection to %s",
+        idx+1, p.Size(), p.Len(),
+        map[bool]string{true: "unhealthy", false: "empty"}[pc != nil],
+        addr)
+
     conn, err := p.dial(ctx, addr)
     if err != nil {
         if pc != nil && isHealthy(ctx, pc.conn) {
             log.Debugf("dialing new connection to %s failed,\terror: %v,\tbut existing connection to %s is healthy will keep existing connection", addr, err, pc.addr)
             return nil
         }
-        if pc != nil {
-            if pc.conn != nil {
-                p.eg.Go(safety.RecoverFunc(func() error {
-                    log.Debugf("waiting for invalid connection to %s to be closed...", pc.addr)
-                    err = pc.Close(ctx, p.roccd)
-                    if err != nil {
-                        log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, err)
-                    }
-                    return nil
-                }))
-            }
-        }
+        closeConn(pc)
         return errors.Join(err, errors.ErrInvalidGRPCClientConn(addr))
     }
+
     p.store(idx, &poolConn{
         conn: conn,
         addr: addr,
     })
-    if pc != nil {
-        p.eg.Go(safety.RecoverFunc(func() error {
-            log.Debugf("waiting for old connection to %s to be closed...", pc.addr)
-            err = pc.Close(ctx, p.roccd)
-            if err != nil {
-                log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, err)
-            }
-            return nil
-        }))
-    }
+    closeConn(pc)
     return nil
 }

617-651: Enhance error handling in Do method.

The error handling in the Do method could be improved by:

  1. Adding retries with backoff for transient failures
  2. Better handling of context cancellation
 func (p *pool) Do(ctx context.Context, f func(conn *ClientConn) error) (err error) {
     if p == nil {
         return errors.ErrGRPCClientConnNotFound("*")
     }
+
+    var attempts int
+    maxAttempts := 3
+
+    for attempts < maxAttempts {
+        if ctx.Err() != nil {
+            return ctx.Err()
+        }
+
         idx, pc, ok := p.getHealthyConn(ctx, 0, p.Len())
         if !ok || pc == nil || pc.conn == nil {
+            attempts++
+            if attempts == maxAttempts {
                 return errors.ErrGRPCClientConnNotFound(p.addr)
+            }
+            time.Sleep(time.Duration(attempts) * 100 * time.Millisecond)
+            continue
         }
 
         conn := pc.conn
         err = f(conn)
         if err != nil {
             st, ok := status.FromError(err)
-            if ok && st != nil && st.Code() != codes.Canceled {
+            if ok && st != nil {
+                if st.Code() == codes.Canceled {
+                    return err
+                }
                 if conn != nil {
                     cerr := conn.Close()
                     if cerr != nil {
                         st, ok := status.FromError(cerr)
                         if ok && st != nil && st.Code() != codes.Canceled {
                             log.Warnf("Failed to close connection: %v", cerr)
                         }
                     }
                 }
                 rerr := p.refreshConn(ctx, idx, pc, p.addr)
                 if rerr == nil {
                     if newErr := f(p.load(idx).conn); newErr != nil {
                         return errors.Join(err, newErr)
                     }
                     return nil
                 }
                 err = errors.Join(err, rerr)
+                attempts++
+                if attempts < maxAttempts {
+                    time.Sleep(time.Duration(attempts) * 100 * time.Millisecond)
+                    continue
+                }
             }
         }
+        return err
+    }
     return err
 }

547-615: Optimize health check implementation.

The IsHealthy method could be improved by:

  1. Using atomic operations for counters
  2. Adding metrics for individual connection states
  3. Implementing circuit breaker pattern
+type healthMetrics struct {
+    total     int64
+    healthy   int64
+    unhealthy int64
+    states    map[connectivity.State]int64
+}
+
 func (p *pool) IsHealthy(ctx context.Context) (healthy bool) {
     if p == nil || p.closing.Load() {
         return false
     }
     log.Debug("Checking health...")
-    var cnt, unhealthy int
+    metrics := &healthMetrics{
+        states: make(map[connectivity.State]int64),
+    }
     pl := p.len()
+    atomic.StoreInt64(&metrics.total, int64(pl))
+
     err := p.loop(ctx, func(ctx context.Context, idx int, pc *poolConn) bool {
+        if pc != nil && pc.conn != nil {
+            state := pc.conn.GetState()
+            atomic.AddInt64(&metrics.states[state], 1)
+        }
+
         if pc == nil || !isHealthy(ctx, pc.conn) {
+            atomic.AddInt64(&metrics.unhealthy, 1)
             if p.isIP {
                 // ... rest of the IP handling logic
             }
             // ... rest of the DNS handling logic
         }
-        cnt++
+        atomic.AddInt64(&metrics.healthy, 1)
         return true
     })
     
     mu.Lock()
-    metrics[p.addr] = int64(pl - unhealthy)
+    metrics[p.addr] = atomic.LoadInt64(&metrics.healthy)
     mu.Unlock()
     
-    if cnt == 0 {
+    if atomic.LoadInt64(&metrics.healthy) == 0 {
         return false
     }
     if p.isIP {
-        return unhealthy == 0
+        return atomic.LoadInt64(&metrics.unhealthy) == 0
     }
-    return unhealthy < pl
+    return atomic.LoadInt64(&metrics.unhealthy) < int64(pl)
 }
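Note that the diff above would not compile as written: `atomic.AddInt64(&metrics.states[state], 1)` takes the address of a map value, which Go forbids, and the local `metrics` variable shadows the package-level `metrics` map used later in the function. A compilable sketch of the same counter idea — atomic scalars plus a mutex-guarded per-state map, with a hypothetical string key standing in for `connectivity.State` — could be:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// healthMetrics tracks pool health counters. The scalar counters use
// atomic.Int64; the per-state counts live behind a mutex because map
// values cannot be updated atomically in Go.
type healthMetrics struct {
	healthy   atomic.Int64
	unhealthy atomic.Int64

	mu     sync.Mutex
	states map[string]int64 // keyed by connectivity state name
}

// observe records one connection's state and health.
func (m *healthMetrics) observe(state string, ok bool) {
	m.mu.Lock()
	if m.states == nil {
		m.states = make(map[string]int64)
	}
	m.states[state]++
	m.mu.Unlock()
	if ok {
		m.healthy.Add(1)
	} else {
		m.unhealthy.Add(1)
	}
}

func main() {
	var m healthMetrics
	m.observe("READY", true)
	m.observe("TRANSIENT_FAILURE", false)
	m.observe("READY", true)
	fmt.Println(m.healthy.Load(), m.unhealthy.Load(), m.states["READY"]) // 2 1 2
}
```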

893-901: Consider enhancing metrics with more detailed information.

The current metrics implementation could be improved by adding:

  1. Connection state counts
  2. Connection age
  3. Error counts
  4. Latency metrics
+type ConnectionMetrics struct {
+    HealthyConnections   int64
+    StateMetrics        map[connectivity.State]int64
+    LastErrorTimestamp  int64
+    ErrorCount          int64
+    AverageLatency      int64
+}
+
+var (
+    metricsData map[string]*ConnectionMetrics
+)
+
-func Metrics(context.Context) map[string]int64 {
+func Metrics(context.Context) map[string]*ConnectionMetrics {
     mu.RLock()
     defer mu.RUnlock()
 
-    if len(metrics) == 0 {
+    if len(metricsData) == 0 {
         return nil
     }
-    return maps.Clone(metrics)
+    return maps.Clone(metricsData)
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd9a346 and 551c681.

📒 Files selected for processing (2)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/net.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/net/net.go
⏰ Context from checks skipped due to timeout of 90000ms (109)
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
🔇 Additional comments (1)
internal/net/grpc/pool/pool.go (1)

92-95: LGTM! Global metrics map is properly synchronized.

The implementation uses a mutex to protect concurrent access to the metrics map, which is a good practice for thread safety.

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 5, 2025
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 9726362 to 341a593 on February 5, 2025 02:38
Signed-off-by: kpango <[email protected]>
Signed-off-by: kpango <[email protected]>
Signed-off-by: kpango <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/observability/metrics/grpc/grpc.go (1)

83-103: Consider adding error metrics for connection failures.

While the current implementation tracks healthy connections, it would be valuable to also track connection failures to help diagnose issues.

Consider adding a counter for failed connections:

+const (
+    poolConnFailedMetricsName        = "server_pool_conn_failed"
+    poolConnFailedMetricsDescription = "Count of failed pool connections by target address"
+)

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
+    failedConn, err := m.Int64Counter(
+        poolConnFailedMetricsName,
+        metrics.WithDescription(poolConnFailedMetricsDescription),
+        metrics.WithUnit(metrics.Dimensionless),
+    )
+    if err != nil {
+        return err
+    }
internal/net/grpc/pool/pool.go (1)

92-95: Consider using sync.Map for metrics.

The current implementation uses a regular map with mutex, which could be replaced with sync.Map for better performance in concurrent scenarios.

-var (
-    mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
-)
+var metrics sync.Map
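One consequence of the `sync.Map` suggestion worth noting: the existing `Metrics` function returns `maps.Clone(metrics)`, a point-in-time copy, whereas `sync.Map.Range` only guarantees a weakly consistent view. A minimal sketch of what the accessor might become under that change (the `snapshot` helper is hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// metrics holds healthy-connection counts keyed by target address,
// stored in a sync.Map instead of a mutex-guarded map.
var metrics sync.Map

// snapshot copies the current contents into a plain map for callers.
// Range is weakly consistent: concurrent writes during iteration may
// or may not be reflected, unlike a clone taken under a lock.
func snapshot() map[string]int64 {
	out := make(map[string]int64)
	metrics.Range(func(k, v any) bool {
		out[k.(string)] = v.(int64)
		return true
	})
	return out
}

func main() {
	metrics.Store("10.0.0.1:8081", int64(3))
	metrics.Store("10.0.0.2:8081", int64(5))
	s := snapshot()
	fmt.Println(len(s), s["10.0.0.1:8081"]) // 2 3
}
```

Whether that relaxation is acceptable depends on how the observability layer consumes the values; for gauge-style metrics it is usually fine.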
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 551c681 and 56a7e41.

⛔ Files ignored due to path filters (31)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (70)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile (5 hunks)
  • Makefile.d/build.mk (2 hunks)
  • Makefile.d/functions.mk (3 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • Makefile.d/tools.mk (2 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • example/client/go.mod (2 hunks)
  • go.mod (12 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (18 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (3 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • rust/rust-toolchain (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/RUST_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (59)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/HELM_VERSION
  • versions/KUBECTL_VERSION
  • versions/KIND_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/BUF_VERSION
  • versions/USEARCH_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • versions/OPERATOR_SDK_VERSION
  • versions/TELEPRESENCE_VERSION
  • versions/CMAKE_VERSION
  • versions/actions/CODECOV_CODECOV_ACTION
  • versions/NGT_VERSION
  • versions/FAISS_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/GOLANGCILINT_VERSION
  • versions/PROTOBUF_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • versions/K3S_VERSION
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • versions/RUST_VERSION
  • versions/PROMETHEUS_STACK_VERSION
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • versions/actions/GITHUB_ISSUE_METRICS
  • apis/swagger/v1/vald/flush.swagger.json
  • k8s/index/operator/deployment.yaml
  • k8s/gateway/gateway/lb/deployment.yaml
  • apis/swagger/v1/vald/index.swagger.json
  • internal/test/mock/grpc/grpc_client_mock.go
  • .github/ISSUE_TEMPLATE/bug_report.md
  • .github/PULL_REQUEST_TEMPLATE.md
  • internal/test/mock/grpc_testify_mock.go
  • Makefile.d/build.mk
  • Makefile.d/k3d.mk
  • apis/swagger/v1/mirror/mirror.swagger.json
  • internal/backoff/backoff.go
  • k8s/manager/index/deployment.yaml
  • k8s/discoverer/deployment.yaml
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • versions/YQ_VERSION
  • internal/net/grpc/pool/pool_test.go
  • versions/DOCKER_VERSION
  • rust/rust-toolchain
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • internal/net/net.go
  • pkg/gateway/mirror/service/mirror.go
  • Makefile.d/tools.mk
  • k8s/gateway/gateway/mirror/deployment.yaml
  • apis/swagger/v1/vald/remove.swagger.json
  • Makefile.d/functions.mk
  • apis/swagger/v1/vald/update.swagger.json
  • example/client/go.mod
  • go.mod
🧰 Additional context used
📓 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-01-22T06:26:59.130Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
🔇 Additional comments (39)
apis/swagger/v1/vald/upsert.swagger.json (2)

17-18: Enhanced /upsert Documentation: Detailed Summary and Description

The updated summary now offers a comprehensive overview of the Upsert RPC, including a clear description of its update/insert behavior and a detailed status code table. The added description provides valuable troubleshooting tips for common error scenarios. Please verify that the embedded markdown (tables, line breaks, and code formatting with backticks) renders correctly in your Swagger UI output.


50-51: Enhanced MultiUpsert Endpoint Documentation

The changes for the /upsert/multiple endpoint now include a more descriptive summary emphasizing the ability to update or add multiple vectors in a single request, with bold formatting for clarity. The description starts with an HTML notice warning about gRPC message size limitations and then outlines detailed status codes and troubleshooting steps. Ensure that the HTML snippet and markdown content display as intended across all consuming tools.

apis/swagger/v1/vald/insert.swagger.json (3)

18-18: Enhanced Error Handling and Troubleshooting (Description Update)
The updated description for the /insert endpoint now provides a detailed set of common error reasons and resolutions. This added clarity is very beneficial to API users.


50-50: Clear MultiInsert Summary
The summary for the /insert/multiple endpoint is concise and clearly states that multiple vectors can be added in a single request. The use of markdown emphasis appropriately highlights the key point.


51-51: Detailed MultiInsert Description with Warning Note
The new description for the /insert/multiple endpoint adds a useful warning about gRPC message size limitations and provides comprehensive status code and troubleshooting information. Please verify that the HTML encoded elements (e.g., \u003cdiv class="notice"\u003e) render correctly in your Swagger UI environment.

internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric names.

The metric names and descriptions are well-defined and follow observability best practices.


37-39: LGTM! Good struct field naming.

The poolTargetAddrKey field is appropriately named and its purpose is clear from the context.


69-77: LGTM! Well-structured metric view.

The view for pool connection metrics is properly configured with:

  • Clear name and description
  • Appropriate aggregation type (Sum)
internal/net/grpc/pool/pool.go (2)

600-602: LGTM! Thread-safe metrics update.

The metrics update is properly protected by a mutex lock.


894-902: LGTM! Safe metrics snapshot.

The new Metrics function effectively uses maps.Clone for safe access to metrics data.

internal/net/grpc/client.go (2)

1138-1140: LGTM! Good method signature update.

The rangeConns method now properly accepts context and force parameters, improving control over the connection loop.


1124-1133: LGTM! Proper context cancellation handling.

The code correctly handles context cancellation during connection closure.

.gitfiles (2)

161-177: LGTM! Comprehensive API documentation coverage.

The addition of detailed API documentation files for core Vald operations (filter, flush, index, insert, mirror, object, remove, search, update, upsert) improves the project's usability and maintainability.


1970-1974: Integration Testing and Documentation for QBG Algorithm Are Missing

The new QBG algorithm implementation lacks:

  • Test files in the QBG directory
  • README or documentation explaining the algorithm's purpose and usage

Run the following script to verify the missing test files and documentation:

#!/bin/bash
# Description: Check for test files and documentation for QBG algorithm
echo "Checking for tests..."
fd -t f "test" rust/libs/algorithms/qbg/

echo "Checking for documentation..."
fd -t f "README|\.md" rust/libs/algorithms/qbg/
internal/net/grpc/client_test.go (2)

3306-3308: Context parameter added to rangeConns but test cases need update.

The rangeConns method signature has been updated to include a context parameter but the test cases need to be updated accordingly.

The test should verify context handling. Apply this diff:

type args struct {
-    ctx context.Context
+    ctx context.Context
     fn func(addr string, p pool.Conn) bool
}

3110-3110: 🛠️ Refactor suggestion

Context parameter added but test cases need update.

The ConnectedAddrs method now accepts a context parameter but the test cases haven't been fully updated to reflect this change.

The test should verify context cancellation and timeout scenarios. Apply this diff:

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)

Likely invalid or redundant comment.

apis/swagger/v1/vald/object.swagger.json (4)

17-18: LGTM! Documentation improvements for /exists/{id} endpoint.

The documentation has been significantly enhanced with:

  • Clear overview section
  • Comprehensive status code table
  • Detailed troubleshooting guide with common errors and resolutions

47-47: Complete the TODO sections in documentation.

The /object/list endpoint documentation has incomplete sections:

  • Status Code: TODO
  • Troubleshooting: TODO

Would you like me to help generate the missing documentation sections?


77-77: Complete the TODO sections in documentation.

The /object/meta/{id.id} endpoint documentation has incomplete sections:

  • Status Code: TODO
  • Troubleshooting: TODO

Would you like me to help generate the missing documentation sections?


106-107: LGTM! Documentation improvements for /object/{id.id} endpoint.

The documentation has been significantly enhanced with:

  • Clear overview section
  • Comprehensive status code table
  • Detailed troubleshooting guide with common errors and resolutions
apis/swagger/v1/vald/filter.swagger.json (8)

17-17: Enhanced InsertObject Documentation
The updated summary now clearly explains that the endpoint uses the InsertObject RPC with a detailed table of status codes. This added context will help API consumers understand potential responses.


49-49: Improved MultiInsertObject Summary
The description for the /insert/object/multiple endpoint has been expanded to explain that multiple objects can be inserted in one request and now includes clear status code details.


81-81: Clarified SearchObject Summary
The updated summary for /search/object now explains that the SearchObject RPC is used and provides users with a table of status codes. This is a welcome improvement in clarity.


113-113: Enhanced StreamSearchObject Documentation
The revision for /search/object/multiple now emphasizes the bidirectional streaming nature of the RPC along with status codes and troubleshooting information. This extra detail should aid both developers and integrators.


145-145: Clearer UpdateObject Summary
The new summary for /update/object succinctly explains that this endpoint updates a single vector. Including the status code table ensures that consumers know what to expect on error or success.


177-178: MultiUpdateObject Enhanced with Warning Notice
The /update/object/multiple endpoint now includes a summary and a descriptive block notifying users of gRPC message size limitations. This additional troubleshooting section—with potential reasons and guidance—is especially useful.


210-210: UpsertObject Operation Clarified
The summary for /upsert/object has been updated to indicate that it both updates an existing object and inserts a new one, which improves the understanding of its dual role.


241-243: Comprehensive MultiUpsertObject Documentation
The revision in /upsert/object/multiple now details that multiple upsert operations can be handled in one request. The added warning about message size limitations and the structured status code table are both valuable improvements.

Makefile (3)

161-162: Refactored Linker Flags
The adjustment of LDFLAGS to include options like -Wl,--gc-sections and -Wl,-O2 together with the introduction of STATIC_LDFLAGS (which prepends -static) helps optimize the linking process and manage binary size/security more effectively. Please ensure that these removals (e.g. of -static from the dynamic flag set) are fully intended.


170-176: Introduction of COMMON_FLAGS and Compiler Settings
The new COMMON_FLAGS variable consolidates many optimization and visibility options, and its use in CFLAGS, CXXFLAGS, and FFLAGS improves maintainability. The choice of standards (gnu17, gnu++23, f2018) is clearly documented by the variable names. Consider adding a brief comment in the Makefile about the rationale behind these specific flags for future maintainers.


177-190: Architecture-Specific Adjustments
The conditional block for GOARCH correctly applies additional flags for amd64 (with NGT_FLAGS) while leaving arm64 untouched. This ensures that architecture optimizations are only applied where relevant.

apis/swagger/v1/vald/search.swagger.json (8)

17-18: Detailed LinearSearch RPC Overview
The updated summary for /linearsearch now includes a comprehensive status code table and a troubleshooting section. This makes it much clearer what each response code means and which remedies to consider if an error occurs.


50-52: Improved LinearSearchByID Documentation
The /linearsearch/id endpoint now clearly states that the vector must already be indexed and provides precise troubleshooting details. This clarification should help prevent misuse and reduce confusion for API consumers.


83-85: Enhanced MultiLinearSearchByID Summary
The revision for /linearsearch/id/multiple includes both an updated summary and a troubleshooting table, along with a notice about gRPC message size limitations. These additions are very helpful for guiding users through common pitfalls.


116-118: Clear MultiLinearSearch RPC Description
The updated summary and description for /linearsearch/multiple now call out the potential issue with gRPC message sizes along with detailed status codes and troubleshooting tips. This is a very user‑friendly enhancement.


149-151: Refined Search RPC Overview
The /search endpoint summary is now more descriptive and includes the necessary troubleshooting information and status code table. It maintains consistency with the other search endpoints.


182-183: Enhanced SearchByID Documentation
By detailing that the vector must be pre-indexed and offering clear troubleshooting steps, the updated summary for /search/id significantly reduces ambiguity.


214-218: Comprehensive MultiSearchByID Information
The /search/id/multiple update provides an extensive troubleshooting table and clearly states the limitations and remedies. This added level of detail is very helpful for debugging issues when searching by multiple IDs.


248-250: Improved MultiSearch RPC Clarity
The changes in /search/multiple standardize the wording for multi‑vector searches. The combination of summary, status code table, and troubleshooting instructions delivers a consistent and clear documentation experience.

@@ -14,7 +14,8 @@
"paths": {
"/insert": {
"post": {
"summary": "A method to add a new single vector.",
"summary": "Overview\nInset RPC is the method to add a new single vector.\n---\nStatus Code\n| 0 | OK |\n| 1 | CANCELLED |\n| 3 | INVALID_ARGUMENT |\n| 4 | DEADLINE_EXCEEDED |\n| 5 | NOT_FOUND |\n| 13 | INTERNAL |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",

🛠️ Refactor suggestion

Typo in RPC Name (Summary Update)
The summary for the /insert endpoint currently reads "Inset RPC" which appears to be a typographical error. It should likely be "Insert RPC" to match the intended meaning.

Proposed change:

-        "summary": "Overview\nInset RPC is the method to add a new single vector.\n---\nStatus Code\n| 0    | OK                |\n| 1    | CANCELLED         |\n| 3    | INVALID_ARGUMENT  |\n| 4    | DEADLINE_EXCEEDED |\n| 5    | NOT_FOUND         |\n| 13   | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
+        "summary": "Overview\nInsert RPC is the method to add a new single vector.\n---\nStatus Code\n| 0    | OK                |\n| 1    | CANCELLED         |\n| 3    | INVALID_ARGUMENT  |\n| 4    | DEADLINE_EXCEEDED |\n| 5    | NOT_FOUND         |\n| 13   | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",

Comment on lines +1143 to +1162
if force {
cnt++
return fn(addr, p)
}
if p == nil || !p.IsHealthy(ctx) {
pc, err := p.Connect(ctx)
if pc == nil || err != nil || !pc.IsHealthy(ctx) {
if pc != nil && pc.IsIPConn() {
log.Debugf("Failed to re-connect unhealthy connection for %s: %v, trying to disconnect unhealthy", addr, err)
if derr := pc.Disconnect(ctx); derr != nil {
log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
} else {
g.conns.Delete(addr)
}
}
log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
return true
}
p = pc
}

🛠️ Refactor suggestion

Enhance error handling in forced connection checks.

When `force` is true, the loop invokes `fn(addr, p)` without verifying that `p` is non-nil, so the callback can receive a nil connection and panic on first use. An explicit nil guard on the forced path avoids this.

 if force {
+    if p == nil {
+        log.Debugf("Nil connection detected for %s during forced check", addr)
+        return true
+    }
     cnt++
     return fn(addr, p)
 }
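The skip-on-nil behavior the suggestion adds can be sketched in isolation. The `Conn` type and `rangeConns` helper below are illustrative stand-ins, not Vald's actual pool API; the point is only that the forced path must never hand a nil connection to the callback:

```go
// Minimal sketch of the nil guard proposed for the forced path of the
// Range loop. Conn and rangeConns are hypothetical stand-ins.
package main

import "fmt"

type Conn struct{ healthy bool }

// IsHealthy is nil-safe, mirroring how the guard protects the callback.
func (c *Conn) IsHealthy() bool { return c != nil && c.healthy }

// rangeConns visits addr->conn pairs. With force it still skips nil
// entries (the bug the suggestion guards against); without force it
// also skips unhealthy ones.
func rangeConns(
	conns map[string]*Conn, force bool, fn func(addr string, c *Conn) bool,
) (visited, skipped int) {
	for addr, c := range conns {
		if c == nil || (!force && !c.IsHealthy()) {
			skipped++ // never hand a nil/unhealthy conn to fn
			continue
		}
		visited++
		if !fn(addr, c) {
			break
		}
	}
	return visited, skipped
}

func main() {
	conns := map[string]*Conn{
		"10.0.0.1:8081": {healthy: true},
		"10.0.0.2:8081": nil, // would panic inside fn without the guard
		"10.0.0.3:8081": {healthy: false},
	}
	v, s := rangeConns(conns, true, func(string, *Conn) bool { return true })
	fmt.Printf("forced: visited=%d skipped=%d\n", v, s) // visits 2, skips the nil entry
}
```

Without the guard, the forced branch would call `fn` on the nil entry and any method dereference inside the callback would panic, which matches the failure mode the review comment describes.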

Signed-off-by: kpango <[email protected]>