refactor: Update the response queue in the server to reuse response slots #7879
base: main
Conversation
src/grpc/infer_handler.h (Outdated)

```diff
   // Gets the response at the specified index
   ResponseType* GetResponseAt(const uint32_t index)
   {
     std::lock_guard<std::mutex> lock(mtx_);

     // Check if the index is valid for allocated responses
     if (index >= alloc_count_) {
       LOG_ERROR << "[INTERNAL] Attempting to access response which is not yet "
                    "allocated";
       return nullptr;
     }
-    return responses_[index];
+
+    if (index < pop_count_) {
+      LOG_ERROR << "[INTERNAL] Attempting to access a response that has "
+                   "already been removed from the queue.";
+      return nullptr;
+    }
+
+    // Adjust index based on number of popped responses to get actual index in
+    // 'responses_'
+    return responses_[index - pop_count_];
   }

-  // Pops the response from the tail of the queue
+  // Removes the current response from the front of the queue
   void PopResponse()
   {
     std::lock_guard<std::mutex> lock(mtx_);
-    current_index_++;
+
+    // Ensure there are responses in the queue to pop
+    if (responses_.empty()) {
+      LOG_ERROR << "[INTERNAL] No responses in the queue to pop.";
+      return;
+    }
+
+    // Clear and move the current response to the reusable pool
+    auto response = responses_.front();
+    response->Clear();
+    reusable_pool_.push_back(response);
+    responses_.pop_front();
+    pop_count_++;
   }
```
Just checking my understanding:

Previously, we just incremented `current_index_`/`pop_count_` when a response was popped from the `responses_` queue, while the `index` always referred to the actual location in the `responses_` queue. Now a popped response is removed from the `responses_` queue, changing the indices of the remaining elements, so an offset (`pop_count_`) is necessary to access the elements correctly, given that callers are unaware of the change when providing the `index`.

I assume the response is fully written to gRPC and there will be no more access to the response object by the time `PopResponse()` is called? In other words, the response object can be safely reused/modified after it is popped? The previous implementation kept the popped response objects intact until the queue was destroyed in `~ResponseQueue()`.
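To make that offset bookkeeping concrete, here is a toy illustration (hypothetical values, not the actual Triton code):

```cpp
#include <deque>
#include <iostream>

// Toy model of the new indexing scheme: callers keep using a stable
// logical index, while popped responses shrink the physical deque.
int main()
{
  std::deque<int> responses = {10, 11, 12, 13};  // logical indices 0..3
  std::size_t pop_count = 0;

  // Pop the front response (logical index 0).
  responses.pop_front();
  ++pop_count;

  // A caller still asks for logical index 2; the physical slot has shifted.
  std::size_t logical = 2;
  std::cout << responses[logical - pop_count] << "\n";  // prints 12
  return 0;
}
```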
Yes, @kthui. Once we call `PopResponse()`, the response object will no longer be needed.
cc: @tanmayv25
Force-pushed from a9ab1e8 to 32490e1.
Force-pushed from 1948b34 to 596925a.

The PR was automatically closed on a forced rebase with main, and reopened with a new commit.
```diff
-  std::vector<ResponseType*> responses_;
+  // Stores responses that need to be written. The front of the queue indicates
+  // the current response, while the back indicates the last allocated response.
+  std::deque<ResponseType*> responses_;
```
Can we use smart pointers (`std::unique_ptr`) instead of raw pointers?
Just a suggestion, if it's not a lot of work.
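For illustration, a rough sketch of what `std::unique_ptr` ownership could look like here (assumed member names; not a drop-in patch):

```cpp
#include <deque>
#include <memory>
#include <vector>

template <typename ResponseType>
class ResponseQueueSketch {
  // Owning containers: the queue and the reuse pool hold unique_ptrs,
  // so responses are freed automatically when the queue is destroyed.
  std::deque<std::unique_ptr<ResponseType>> responses_;
  std::vector<std::unique_ptr<ResponseType>> reusable_pool_;

 public:
  // Callers that only read/write the response receive a raw, non-owning
  // pointer; ownership stays inside the queue.
  ResponseType* Front() { return responses_.front().get(); }

  void PopResponse()
  {
    if (responses_.empty()) return;
    auto response = std::move(responses_.front());
    responses_.pop_front();
    response->Clear();
    reusable_pool_.push_back(std::move(response));
  }
};
```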
```diff
@@ -1438,6 +1444,14 @@ TritonParser::Parse(int argc, char** argv)
       case OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE:
         lgrpc_options.infer_allocation_pool_size_ = ParseOption<int>(optarg);
         break;
+      case OPTION_GRPC_MAX_RESPONSE_POOL_SIZE:
+        lgrpc_options.max_response_pool_size_ = ParseOption<int>(optarg);
+        if (lgrpc_options.max_response_pool_size_ <= 0) {
```
Do we also need an upper limit in this check? The code below suggests the max is `{INT_MAX}`.
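A two-sided check could look like the sketch below; `kMaxResponsePoolCap` and the error handling are made up for illustration, and since the flag parses into an `int`, `INT_MAX` is already the implicit ceiling:

```cpp
#include <stdexcept>

// Hypothetical upper bound purely for illustration; the PR itself only
// rejects values <= 0.
constexpr int kMaxResponsePoolCap = 1 << 20;

int ValidateResponsePoolSize(int value)
{
  // Reject non-positive values (as the PR does) and, additionally,
  // anything above the illustrative cap.
  if (value <= 0 || value > kMaxResponsePoolCap) {
    throw std::invalid_argument(
        "--grpc-max-response-pool-size out of range");
  }
  return value;
}
```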
```diff
@@ -127,6 +127,45 @@ for trial in $TRIALS; do

   kill $SERVER_PID
   wait $SERVER_PID

+  SERVER_ARGS="--model-repository=$MODELDIR --grpc-max-response-pool-size=1"
```
Can you add a test plan to the description explaining what we are trying to test here? Why have we set `--grpc-max-response-pool-size` to only `1`?

Also, can we add a test to confirm that the memory footprint decreases when using `--grpc-max-response-pool-size` vs. not using it?

Can we also update the docs?
```diff
@@ -536,6 +537,11 @@ TritonParser::SetupOptions()
       "allocated for reuse. As long as the number of in-flight requests "
       "doesn't exceed this value there will be no allocation/deallocation of "
       "request/response objects."});
+  grpc_options_.push_back(
```
Why do we need an extra argument for this? Why not reuse `grpc-infer-allocation-pool-size` above?
@tanmayv25, we use the `grpc-infer-allocation-pool-size` parameter as the limit for the state bucket and state reuse, with a default value of only 8. In contrast, for the response queue threshold, we need a higher maximum value by default to achieve better performance (the current behavior).

If we applied the same option to the response queue threshold, we would always need to specify `grpc-infer-allocation-pool-size` to get good performance by default. A major drawback is that if this value is set too high, it also affects the size of the state bucket, resulting in states not being deleted. For example, if we set it to 200, it will keep 200 states alive for reuse.

Please correct me if my understanding is wrong. Thank you.
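Roughly, the two limits govern different resources. A hypothetical sketch of the two knobs (field names taken from the diffs above; defaults from this discussion):

```cpp
#include <climits>

// Hypothetical sketch of the two independent limits (field names from the
// diff hunks above; defaults are illustrative).
struct GrpcOptionsSketch {
  // Caps how many state objects the state bucket keeps alive for reuse;
  // raising it keeps more states resident (e.g. 200 keeps 200 states).
  int infer_allocation_pool_size_ = 8;

  // Caps how many response protobufs a request's response queue may keep
  // allocated at once; independent of the state bucket.
  int max_response_pool_size_ = INT_MAX;
};
```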
Considering the above, I think it may be better to maintain a separate flag for controlling the response queue limit without affecting the state bucket.
What does the PR do?

The current response queue allocates memory for each response. This PR enhances the response queue by reusing response slots across multiple responses within the same request once they have been written (completed) to the network, which may help reduce active memory utilization.

- In the `PopResponse()` function, we clear the response content and return it to the reusable pool.
- In the `AllocateResponse()` function, we check for an available response in the reusable pool; if one is present, we use it, otherwise we allocate a new response.
- Introduced a new option (`--grpc-max-response-pool-size`) to limit the number of active response protobuf allocations in the gRPC response queue.
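As a mental model of the reuse cycle described above, a self-contained sketch (not the actual `ResponseQueue` in infer_handler.h, which tracks more state and enforces the pool-size cap):

```cpp
#include <deque>
#include <mutex>
#include <vector>

// Completed responses are cleared and parked in a reusable pool instead
// of being freed, and allocation prefers the pool over `new`.
template <typename ResponseType>
class ReusableQueueSketch {
 public:
  ~ReusableQueueSketch()
  {
    for (auto* r : responses_) delete r;
    for (auto* r : reusable_pool_) delete r;
  }

  ResponseType* AllocateResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    if (!reusable_pool_.empty()) {
      responses_.push_back(reusable_pool_.back());  // reuse a slot
      reusable_pool_.pop_back();
    } else {
      responses_.push_back(new ResponseType());  // fresh allocation
    }
    return responses_.back();
  }

  void PopResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    if (responses_.empty()) return;
    ResponseType* response = responses_.front();
    responses_.pop_front();
    response->Clear();                  // drop payload, keep the allocation
    reusable_pool_.push_back(response);
  }

 private:
  std::mutex mtx_;
  std::deque<ResponseType*> responses_;
  std::vector<ResponseType*> reusable_pool_;
};
```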
Checklist

<commit_type>: <Title>

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)