-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Model features: token usage #109
base: main
Are you sure you want to change the base?
Conversation
@@ -268,6 +268,13 @@ class ChatNVIDIA(BaseChatModel): | |||
top_p: Optional[float] = Field(None, description="Top-p for distribution sampling") | |||
seed: Optional[int] = Field(None, description="The seed for deterministic results") | |||
stop: Optional[Sequence[str]] = Field(None, description="Stop words (cased)") | |||
stream_usage: bool = Field( | |||
False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the langchain user's expectation is that usage details are returned by default
@@ -381,18 +388,38 @@ def _generate( | |||
response = self._client.get_req(payload=payload, extra_headers=extra_headers) | |||
responses, _ = self._client.postprocess(response) | |||
self._set_callback_out(responses, run_manager) | |||
parsed_response = self._custom_postprocess(responses, streaming=False) | |||
parsed_response = self._custom_postprocess( | |||
responses, streaming=False, stream_usage=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about relying on streaming=False to imply stream_usage=False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if it is streaming but do not want stream usage details?
"output_tokens": token_usage.get("completion_tokens", 0), | ||
"total_tokens": token_usage.get("total_tokens", 0), | ||
} | ||
if (streaming and stream_usage) or not streaming: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about always returning tokens usage if its in the response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense, because does not matter parameter at this point
@@ -441,3 +443,121 @@ def test_stop( | |||
assert isinstance(token.content, str) | |||
result += f"{token.content}|" | |||
assert all(target not in result for target in targets) | |||
|
|||
|
|||
def test_ai_endpoints_stream_token_usage() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all integration tests should accept and use a model and mode
_test_stream(llm.stream("Hello", stream_usage=False), expect_usage=False) | ||
|
||
|
||
async def test_ai_endpoints_astream_token_usage() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all integration tests should accept and use a model and mode
@@ -294,14 +294,39 @@ def test_stream_usage_metadata( | |||
) | |||
|
|||
llm = ChatNVIDIA(api_key="BOGUS") | |||
response = reduce(add, llm.stream("IGNROED")) | |||
response = reduce(add, llm.stream("IGNROED", stream_usage=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default should be True
@mattf when I was checking if it works for all models, I figured out that 28 models out of 66 models supported by chatNVIDIA do not support stream_option={"include_usage"} parameter.
Out of 38 which are supported there is inconsistency in response as some models return total token_usage at the end where as some return with each token. How should we handle this above 2 things? |
...
returning token usage should be the default. those endpoints will need to be fixed. |
Token Usage:
Variants to use these options:
llm = ChatNVIDIA(temperature=0, max_tokens=5, stream_usage=True)
llm.stream("IGNROED", stream_usage=True)
llm.stream("Hello", stream_options={"include_usage": True})
cc: @sumitkbh