Add prefix cache stats to usage #2018
base: main
Conversation
// match prefix cache
for (int i = 0; i < sequences.size(); i++) {
    if (!sequences[i]->prompt.empty() && sequences[i]->blocks.empty()) {
        auto& seq = const_cast<Sequence&>(*sequences[i]);
        block_trie_->match(seq);
        seq.cache_len = seq.blocks.size() * block_seq_len_;
        if (rank_ == 0) {
Printing for each sequence may be costly. Could we do a complete count and then print only once? What is your suggestion?
When tp > 1, the count will be the same on every device, so we don't need to print it on all of them.
My point is that the statistics could be calculated and printed once per batch, not once per sequence; it has nothing to do with tp.
Do you mean we can calculate and log an average hit_rate for a batch of sequences in one iteration?
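Something along these lines, as a rough sketch only: the matching loop mirrors the diff above, while the accumulator names, the hit-rate formula, and the commented-out logging call are assumptions for illustration, not the PR's actual code.

// Sketch: aggregate prefix-cache stats over the whole batch and log once.
size_t batch_prompt_tokens = 0;
size_t batch_cached_tokens = 0;
for (int i = 0; i < sequences.size(); i++) {
    if (!sequences[i]->prompt.empty() && sequences[i]->blocks.empty()) {
        auto& seq = const_cast<Sequence&>(*sequences[i]);
        block_trie_->match(seq);
        seq.cache_len = seq.blocks.size() * block_seq_len_;
        batch_prompt_tokens += seq.prompt.size();   // total prompt tokens in this batch
        batch_cached_tokens += seq.cache_len;       // tokens served from the prefix cache
    }
}
// Log once per iteration instead of once per sequence (and only on rank 0,
// since every rank would report the same numbers when tp > 1).
if (rank_ == 0 && batch_prompt_tokens > 0) {
    float hit_rate = static_cast<float>(batch_cached_tokens) / batch_prompt_tokens;
    // e.g. TM_LOG_INFO("prefix cache hit rate: %f", hit_rate);  // logging macro assumed
}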
cc @josephrocca
@zhyncs For my use case, what would be most useful is simply adding the number of cached prefix tokens to the API response, e.g. in the last EventStream object. Otherwise I need to write my own parser for the logs just to get that info for each request.
E.g. something like this would be great as the final EventStream item:
{
  "id": "15964",
  "object": "text_completion",
  "created": 1720934754,
  "model": "repo/modelname",
  "choices": [{"index": 0, "text": "\n", "logprobs": null, "finish_reason": "stop"}],
  "request_summary": {
    "input_tokens": 2542,
    "output_tokens": 230,
    "prefix_cached_tokens": 1740,
    "prefill_time": 0.4,
    "decoding_time": 2.3
  }
}
I've added some other example info there that would be useful to me, just as an example of how request_summary (or usage or similar) could be extended in the future - a couple inspired by this issue:
There is precedent for adding extra info to the final EventStream object - e.g. the (full) generated_text in the final TGI EventStream item, and usage in the final OpenAI API EventStream item:

An optional field that will only be present when you set stream_options: {"include_usage": true} in your request. When present, it contains a null value except for the last chunk which contains the token usage statistics for the entire request.
The reason this would be beneficial for me is that I collect statistics at my load balancing server, keyed by GPU type, hosting provider, etc. - and if I only needed to inspect the request_summary field of the final EventStream object, that would make things very easy for me.
However, if this is not possible, I will certainly benefit from the current logging approach. Thanks for your work on this @ispobock!
Hi @josephrocca Adding prefix_cached_tokens to the usage of the final EventStream item is technically feasible, but it will involve API changes. Currently, only non-streaming responses have a usage field, which contains request-level information such as prompt_tokens, completion_tokens, and total_tokens. @AllentDan may help confirm this. Whether to add prefix_cached_tokens to usage and whether to attach usage to the streaming response still need to be discussed.
cc: @lvhan028 @lzhangzz @irexyc @zhyncs
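For illustration only, reusing the token counts from the earlier example, a final streaming chunk carrying such a usage field might look like the following. This is a hypothetical shape for discussion, not the current API:

{
  "id": "15964",
  "object": "text_completion",
  "created": 1720934754,
  "model": "repo/modelname",
  "choices": [],
  "usage": {
    "prompt_tokens": 2542,
    "completion_tokens": 230,
    "total_tokens": 2772,
    "prefix_cached_tokens": 1740
  }
}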
@ispobock I think the streaming response returns the "usage" when the last token is generated.
@zhulinJulia24 The unit-test and pr_ete_test failed because
@lvhan028 @AllentDan Could you help review this PR? Do you have any suggestions for the API changes mentioned in #2018 (comment)?
Hi, @ispobock
Force-pushed from b1d261c to bdb3082
@lvhan028 @josephrocca The server:
request:
response:
That's great @ispobock! Thank you! I use
@josephrocca Added usage to the final stream response, please check again.
Is there a plan to release this?
Force-pushed from 1e960c3 to fa59c4c
Hi @lvhan028, sorry for the delay. This PR is ready for review.
Motivation
#1942
Modification
Log prefix cache statistics.
Checklist