Input and output buffers are requested through the Triton APIs for every new request and are released once the output is sent. Keeping pre-allocated input and output buffers would avoid that per-request allocation and should reduce query response time. The difficulty is sizing them: if we know the maximum batch size we can allocate for it, but when that maximum is large the buffers would hold a significant amount of memory even while idle.
One option is to keep pre-allocated input and output buffers for small batches only, and to request new buffers for larger batch sizes.
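
To make the small-batch idea concrete, here is a minimal sketch in plain Python. It does not use any Triton API: `BufferPool`, `bytes_per_item`, and `small_batch_limit` are hypothetical names chosen for illustration, and a `bytearray` stands in for whatever buffer the backend would really allocate. It assumes a fixed per-item size and a single pooled buffer, so a second concurrent small request falls back to a fresh allocation.

```python
class BufferPool:
    """Hypothetical pool: one pre-allocated buffer reused for small batches."""

    def __init__(self, bytes_per_item: int, small_batch_limit: int) -> None:
        self.bytes_per_item = bytes_per_item
        self.small_batch_limit = small_batch_limit
        # Allocated once, sized for the largest batch we treat as "small".
        self._pooled = bytearray(bytes_per_item * small_batch_limit)
        self._in_use = False

    def acquire(self, batch_size: int) -> memoryview:
        """Return a buffer of exactly batch_size * bytes_per_item bytes."""
        needed = batch_size * self.bytes_per_item
        if batch_size <= self.small_batch_limit and not self._in_use:
            # Small batch and the pooled buffer is free: hand out a view of it.
            self._in_use = True
            return memoryview(self._pooled)[:needed]
        # Large batch (or pooled buffer busy): allocate a new buffer.
        return memoryview(bytearray(needed))

    def release(self, buf: memoryview) -> None:
        """Give the buffer back; only the pooled buffer is kept for reuse."""
        if buf.obj is self._pooled:
            self._in_use = False
        buf.release()


pool = BufferPool(bytes_per_item=4, small_batch_limit=8)
small = pool.acquire(2)    # served from the pre-allocated buffer
large = pool.acquire(16)   # exceeds the limit, so freshly allocated
pool.release(small)
pool.release(large)
```

With this split, the memory held while idle is bounded by `bytes_per_item * small_batch_limit` rather than by the maximum batch size, at the cost of still allocating per request for large batches.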