refactor: Update the response queue in the server to reuse response slots #7879
base: main
Conversation
src/grpc/infer_handler.h (Outdated)

```diff
   // Gets the response at the specified index
   ResponseType* GetResponseAt(const uint32_t index)
   {
     std::lock_guard<std::mutex> lock(mtx_);

     // Check if the index is valid for allocated responses
     if (index >= alloc_count_) {
       LOG_ERROR << "[INTERNAL] Attempting to access response which is not yet "
                    "allocated";
       return nullptr;
     }
-    return responses_[index];
+
+    if (index < pop_count_) {
+      LOG_ERROR << "[INTERNAL] Attempting to access a response that has "
+                   "already been removed from the queue.";
+      return nullptr;
+    }
+
+    // Adjust index based on number of popped responses to get actual index in
+    // 'responses_'
+    return responses_[index - pop_count_];
   }

-  // Pops the response from the tail of the queue
+  // Removes the current response from the front of the queue
   void PopResponse()
   {
     std::lock_guard<std::mutex> lock(mtx_);
-    current_index_++;
+
+    // Ensure there are responses in the queue to pop
+    if (responses_.empty()) {
+      LOG_ERROR << "[INTERNAL] No responses in the queue to pop.";
+      return;
+    }
+
+    // Clear and move the current response to the reusable pool
+    auto response = responses_.front();
+    response->Clear();
+    reusable_pool_.push_back(response);
+    responses_.pop_front();
+    pop_count_++;
   }
```
Just checking my understanding:

Previously, we just incremented `current_index_`/`pop_count_` when a response was popped from the `responses_` queue, while the `index` always referred to the actual location in the `responses_` queue. Now a popped response is removed from the `responses_` queue, changing the indices of the remaining elements, so an offset (`pop_count_`) is necessary to access the elements correctly, given that callers are unaware of the change when providing the `index`.

I assume the response is fully written to gRPC and there will be no more access to the response object by the time `PopResponse()` is called? In other words, the response object can be safely reused/modified after it is popped? The previous implementation kept the popped response objects intact until the queue was destroyed in `~ResponseQueue()`.
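To make that offset bookkeeping concrete, here is a toy illustration (hypothetical values, not the actual Triton code):

```cpp
#include <deque>
#include <iostream>

// Toy model of the new indexing scheme: callers keep using a stable
// logical index, while popped responses shrink the physical deque.
int main()
{
  std::deque<int> responses = {10, 11, 12, 13};  // logical indices 0..3
  std::size_t pop_count = 0;

  // Pop the front response (logical index 0).
  responses.pop_front();
  ++pop_count;

  // A caller still asks for logical index 2; the physical slot has shifted.
  std::size_t logical = 2;
  std::cout << responses[logical - pop_count] << "\n";  // prints 12
  return 0;
}
```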
Yes, @kthui. Once we call `PopResponse()`, the response object will no longer be needed.
cc: @tanmayv25
Force-pushed from a9ab1e8 to 32490e1.
Force-pushed from 1948b34 to 596925a.

The PR was automatically closed on a forced rebase with main, and reopened with a new commit.
```diff
-  std::vector<ResponseType*> responses_;
+  // Stores responses that need to be written. The front of the queue indicates
+  // the current response, while the back indicates the last allocated response.
+  std::deque<ResponseType*> responses_;
```
Can we use smart pointers (`std::unique_ptr`) instead of raw pointers?
Just a suggestion, if it's not a lot of work.
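For illustration, a rough sketch of what `std::unique_ptr` ownership could look like here (assumed member names; not a drop-in patch):

```cpp
#include <deque>
#include <memory>
#include <vector>

template <typename ResponseType>
class ResponseQueueSketch {
  // Owning containers: the queue and the reuse pool hold unique_ptrs,
  // so responses are freed automatically when the queue is destroyed.
  std::deque<std::unique_ptr<ResponseType>> responses_;
  std::vector<std::unique_ptr<ResponseType>> reusable_pool_;

 public:
  // Callers that only read/write the response receive a raw, non-owning
  // pointer; ownership stays inside the queue.
  ResponseType* Front() { return responses_.front().get(); }

  void PopResponse()
  {
    if (responses_.empty()) return;
    auto response = std::move(responses_.front());
    responses_.pop_front();
    response->Clear();
    reusable_pool_.push_back(std::move(response));
  }
};
```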
```diff
@@ -1438,6 +1444,14 @@ TritonParser::Parse(int argc, char** argv)
       case OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE:
         lgrpc_options.infer_allocation_pool_size_ = ParseOption<int>(optarg);
         break;
+      case OPTION_GRPC_MAX_RESPONSE_POOL_SIZE:
+        lgrpc_options.max_response_pool_size_ = ParseOption<int>(optarg);
+        if (lgrpc_options.max_response_pool_size_ <= 0) {
```
Do we also need an upper limit in this check? The code below suggests the max is `{INT_MAX}`.
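A two-sided check could look like the sketch below; `kMaxResponsePoolCap` and the error handling are made up for illustration, and since the flag parses into an `int`, `INT_MAX` is already the implicit ceiling:

```cpp
#include <stdexcept>

// Hypothetical upper bound purely for illustration; the PR itself only
// rejects values <= 0.
constexpr int kMaxResponsePoolCap = 1 << 20;

int ValidateResponsePoolSize(int value)
{
  // Reject non-positive values (as the PR does) and, additionally,
  // anything above the illustrative cap.
  if (value <= 0 || value > kMaxResponsePoolCap) {
    throw std::invalid_argument(
        "--grpc-max-response-pool-size out of range");
  }
  return value;
}
```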
```diff
@@ -127,6 +127,45 @@ for trial in $TRIALS; do

   kill $SERVER_PID
   wait $SERVER_PID

+  SERVER_ARGS="--model-repository=$MODELDIR --grpc-max-response-pool-size=1"
```
Can you add a test plan to the description explaining what we are trying to test here? Why have we set `--grpc-max-response-pool-size` to only `1`?

Also, can we add a test to confirm that the memory footprint decreases when using `--grpc-max-response-pool-size` vs. not using it?

Can we also update the docs?
```diff
@@ -536,6 +537,11 @@ TritonParser::SetupOptions()
       "allocated for reuse. As long as the number of in-flight requests "
       "doesn't exceed this value there will be no allocation/deallocation of "
       "request/response objects."});
+  grpc_options_.push_back(
```
Why do we need an extra argument for this? Why not reuse `grpc-infer-allocation-pool-size` above?
@tanmayv25, we use the `grpc-infer-allocation-pool-size` parameter as the limit for the state bucket and state reuse, with a default value of only 8. In contrast, for the response queue threshold, we need a higher maximum value by default to achieve better performance (the current behavior).

If we applied the same option to the response queue threshold, we would always need to specify `grpc-infer-allocation-pool-size` to get good performance by default. A major drawback is that if this value is set too high, it also affects the size of the state bucket, resulting in states not being deleted. For example, if we set it to 200, it will keep 200 states alive for reuse.

Please correct me if my understanding is wrong. Thank you.
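Roughly, the two limits govern different resources. A hypothetical sketch of the two knobs (field names taken from the diffs above; defaults from this discussion):

```cpp
#include <climits>

// Hypothetical sketch of the two independent limits (field names from the
// diff hunks above; defaults are illustrative).
struct GrpcOptionsSketch {
  // Caps how many state objects the state bucket keeps alive for reuse;
  // raising it keeps more states resident (e.g. 200 keeps 200 states).
  int infer_allocation_pool_size_ = 8;

  // Caps how many response protobufs a request's response queue may keep
  // allocated at once; independent of the state bucket.
  int max_response_pool_size_ = INT_MAX;
};
```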
Considering the above, I think it may be better to maintain a separate flag for controlling the response queue limit without affecting the state bucket.
What does the PR do?

The current response queue allocates memory for each response. This PR enhances the response queue by reusing response slots across multiple responses within the same request once they have been written (completed) to the network, which may help reduce active memory utilization.

- In the `PopResponse()` function, we clear the response content and return it to the reusable pool.
- In the `AllocateResponse()` function, we check for an available response in the reusable pool; if one is present, we use it, otherwise we allocate a new response.
- Introduced a new option (`--grpc-max-response-pool-size`) to limit the number of active response protobuf allocations in the gRPC response queue.
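As a mental model of the reuse cycle described above, a self-contained sketch (not the actual `ResponseQueue` in infer_handler.h, which tracks more state and enforces the pool-size cap):

```cpp
#include <deque>
#include <mutex>
#include <vector>

// Completed responses are cleared and parked in a reusable pool instead
// of being freed, and allocation prefers the pool over `new`.
template <typename ResponseType>
class ReusableQueueSketch {
 public:
  ~ReusableQueueSketch()
  {
    for (auto* r : responses_) delete r;
    for (auto* r : reusable_pool_) delete r;
  }

  ResponseType* AllocateResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    if (!reusable_pool_.empty()) {
      responses_.push_back(reusable_pool_.back());  // reuse a slot
      reusable_pool_.pop_back();
    } else {
      responses_.push_back(new ResponseType());  // fresh allocation
    }
    return responses_.back();
  }

  void PopResponse()
  {
    std::lock_guard<std::mutex> lock(mtx_);
    if (responses_.empty()) return;
    ResponseType* response = responses_.front();
    responses_.pop_front();
    response->Clear();                  // drop payload, keep the allocation
    reusable_pool_.push_back(response);
  }

 private:
  std::mutex mtx_;
  std::deque<ResponseType*> responses_;
  std::vector<ResponseType*> reusable_pool_;
};
```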
Checklist

<commit_type>: <Title>

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)