
refactor: Update the response queue in the server to reuse response slots #7879

Open

pskiran1 wants to merge 8 commits into main from spolisetty_dlis_7657

Conversation

pskiran1
Member

@pskiran1 pskiran1 commented Dec 13, 2024

What does the PR do?

The current response queue allocates memory for each response.
This PR aims to enhance the response queue by reusing response slots across multiple responses within the same request once they have been written (completed) to the network. This may help reduce active memory utilization.

  • In the PopResponse() function, we clear the response content and return the slot to the reusable pool.
  • In the AllocateResponse() function, we check the reusable pool for an available response; if one is present we reuse it, otherwise we allocate a new response.
  • Introduces a configurable threshold (the --grpc-max-response-pool-size option) to limit the number of active response protobuf allocations in the gRPC response queue. (See the sketch after this list.)
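
A minimal self-contained sketch of the reuse pattern described above, assuming the member names from the diff (mtx_, responses_, reusable_pool_); ResponseType here is a stand-in for the actual gRPC response protobuf, and the destructor is illustrative, not the PR's code:

#include <deque>
#include <mutex>
#include <vector>

// Stand-in for the gRPC response protobuf held by each slot.
struct ResponseType {
  void Clear() {}
};

class ResponseQueue {
 public:
  ~ResponseQueue()
  {
    for (auto r : responses_) delete r;
    for (auto r : reusable_pool_) delete r;
  }

  // Reuse a pooled slot when one is available; otherwise allocate fresh.
  ResponseType* AllocateResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    ResponseType* response = nullptr;
    if (!reusable_pool_.empty()) {
      response = reusable_pool_.back();
      reusable_pool_.pop_back();
    } else {
      response = new ResponseType();
    }
    responses_.push_back(response);
    return response;
  }

  // Once the front response is written to the network, clear its contents
  // and return the slot to the pool for the next response of this request.
  void PopResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    if (responses_.empty()) {
      return;
    }
    ResponseType* response = responses_.front();
    responses_.pop_front();
    response->Clear();
    reusable_pool_.push_back(response);
  }

 private:
  std::mutex mtx_;
  std::deque<ResponseType*> responses_;
  std::vector<ResponseType*> reusable_pool_;
};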

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID: 21564131

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@pskiran1 pskiran1 requested review from tanmayv25 and kthui December 13, 2024 17:38
Comment on lines 221 to 260
 // Gets the response at the specified index
 ResponseType* GetResponseAt(const uint32_t index)
 {
   std::lock_guard<std::mutex> lock(mtx_);

   // Check if the index is valid for allocated responses
   if (index >= alloc_count_) {
     LOG_ERROR << "[INTERNAL] Attempting to access response which is not yet "
                  "allocated";
     return nullptr;
   }
-  return responses_[index];
+  if (index < pop_count_) {
+    LOG_ERROR << "[INTERNAL] Attempting to access a response that has "
+                 "already been removed from the queue.";
+    return nullptr;
+  }
+
+  // Adjust index based on number of popped responses to get actual index in
+  // 'responses_'
+  return responses_[index - pop_count_];
 }

-// Pops the response from the tail of the queue
+// Removes the current response from the front of the queue
 void PopResponse()
 {
   std::lock_guard<std::mutex> lock(mtx_);
-  current_index_++;
+
+  // Ensure there are responses in the queue to pop
+  if (responses_.empty()) {
+    LOG_ERROR << "[INTERNAL] No responses in the queue to pop.";
+    return;
+  }
+
+  // Clear and move the current response to the reusable pool
+  auto response = responses_.front();
+  response->Clear();
+  reusable_pool_.push_back(response);
+  responses_.pop_front();
+  pop_count_++;
 }
kthui
Contributor

Just checking my understanding:
Previously, we just incremented current_index_/pop_count_ when a response was popped, and the index always referred to the actual location in the responses_ queue. Now a popped response is removed from the responses_ queue, which changes the indices of the remaining elements, so an "offset" (pop_count_) is needed to access the elements correctly, given that callers are unaware of the change when providing an index.

I assume the response is fully written to gRPC and there will be no more access to the response object by the time PopResponse() is called? In other words, the response object can be safely reused/modified after it is popped? The previous implementation kept the popped response objects intact until ~ResponseQueue() ran.
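
For concreteness, a small self-contained illustration of the index adjustment being discussed (the values and int elements are made up; the real queue holds response pointers):

#include <cassert>
#include <cstddef>
#include <deque>

int main()
{
  // Five responses were allocated (logical indices 0..4) and the first two
  // have been popped, so only the slots for 2, 3, 4 remain in the deque.
  std::deque<int> responses = {2, 3, 4};
  const std::size_t pop_count = 2;

  // Callers are unaware of the pops and still use logical indices, so
  // GetResponseAt(3) must subtract pop_count to find the physical slot.
  const std::size_t logical_index = 3;
  assert(responses[logical_index - pop_count] == 3);
  return 0;
}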

pskiran1
Member Author

Yes, @kthui. Once we call PopResponse(), the response object will no longer be needed.
cc: @tanmayv25

@pskiran1 pskiran1 requested a review from indrajit96 January 3, 2025 10:10
@pskiran1 pskiran1 closed this Jan 20, 2025
@pskiran1 pskiran1 force-pushed the spolisetty_dlis_7657 branch from 1948b34 to 596925a on January 20, 2025 06:11
@pskiran1
Member Author

The PR was automatically closed by the forced rebase onto main; reopened with a new commit.

@pskiran1 pskiran1 reopened this Jan 20, 2025
@pskiran1 pskiran1 requested a review from kthui January 24, 2025 16:14
@pskiran1 pskiran1 marked this pull request as ready for review February 3, 2025 16:46
-std::vector<ResponseType*> responses_;
+// Stores responses that need to be written. The front of the queue indicates
+// the current response, while the back indicates the last allocated response.
+std::deque<ResponseType*> responses_;
Contributor

Can we use smart pointers (std::unique_ptr) instead of raw pointers? Just a suggestion, if it's not a lot of work.
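
A rough sketch of what the suggestion could look like, assuming the queue owns the responses exclusively (names mirror the diff; everything else is illustrative, not the PR's actual code):

#include <deque>
#include <memory>
#include <mutex>
#include <vector>

struct ResponseType {  // stand-in for the response protobuf
  void Clear() {}
};

class ResponseQueue {
 public:
  // With unique_ptr the default destructor frees every slot, so no manual
  // deletes are needed in ~ResponseQueue().
  ResponseType* AllocateResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    if (!reusable_pool_.empty()) {
      responses_.push_back(std::move(reusable_pool_.back()));
      reusable_pool_.pop_back();
    } else {
      responses_.push_back(std::make_unique<ResponseType>());
    }
    return responses_.back().get();
  }

  // Ownership moves from the queue into the pool instead of a raw pointer
  // being copied between containers.
  void PopResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    if (responses_.empty()) {
      return;
    }
    auto response = std::move(responses_.front());
    responses_.pop_front();
    response->Clear();
    reusable_pool_.push_back(std::move(response));
  }

 private:
  std::mutex mtx_;
  std::deque<std::unique_ptr<ResponseType>> responses_;
  std::vector<std::unique_ptr<ResponseType>> reusable_pool_;
};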

@@ -1438,6 +1444,14 @@ TritonParser::Parse(int argc, char** argv)
 case OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE:
   lgrpc_options.infer_allocation_pool_size_ = ParseOption<int>(optarg);
   break;
+case OPTION_GRPC_MAX_RESPONSE_POOL_SIZE:
+  lgrpc_options.max_response_pool_size_ = ParseOption<int>(optarg);
+  if (lgrpc_options.max_response_pool_size_ <= 0) {
Contributor

Do we also need an upper limit in this check?
The code below suggests the max is {INT_MAX};
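
One possible shape for the bound check, sketched as a hypothetical standalone helper rather than the parser's actual code (the option is stored as an int, so INT_MAX is the natural ceiling):

#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical validator: reject non-positive values and values that would
// overflow the int-typed option before assigning it.
int ValidateMaxResponsePoolSize(long long value)
{
  if (value <= 0 || value > INT_MAX) {
    throw std::invalid_argument(
        "--grpc-max-response-pool-size must be in [1, " +
        std::to_string(INT_MAX) + "]");
  }
  return static_cast<int>(value);
}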

@@ -127,6 +127,45 @@ for trial in $TRIALS; do

 kill $SERVER_PID
 wait $SERVER_PID

+SERVER_ARGS="--model-repository=$MODELDIR --grpc-max-response-pool-size=1"
Contributor

@indrajit96 indrajit96 commented Feb 4, 2025

Can you add a test plan to the description explaining what we are trying to test here?
Why have we set --grpc-max-response-pool-size to only 1?
Also, can we add a test to confirm that the memory footprint decreases when using --grpc-max-response-pool-size vs. not using it?

@indrajit96
Contributor

Can we also update the docs for --grpc-max-response-pool-size?

@@ -536,6 +537,11 @@ TritonParser::SetupOptions()
"allocated for reuse. As long as the number of in-flight requests "
"doesn't exceed this value there will be no allocation/deallocation of "
"request/response objects."});
grpc_options_.push_back(
tanmayv25
Contributor

Why do we need an extra argument for this? Why not reuse grpc-infer-allocation-pool-size above?

Member Author

@pskiran1 pskiran1 commented Feb 5, 2025

@tanmayv25, we use the grpc-infer-allocation-pool-size parameter as the limit for the state bucket and state reuse, and its default value is only 8. In contrast, the response queue threshold needs a much higher maximum value to achieve good performance by default (the current behavior).
If we reused the same option for the response queue threshold, we would always need to raise grpc-infer-allocation-pool-size to get good performance by default. The major drawback is that setting this value too high also grows the state bucket, resulting in states not being deleted. For example, setting it to 200 keeps 200 states alive for reuse.
Please correct me if my understanding is wrong. Thank you.

Member Author

@pskiran1 pskiran1 commented Feb 5, 2025

Considering the above, I think it may be better to maintain a separate flag for controlling the response queue limit without affecting the state bucket.

Labels
None yet

5 participants