
python[patch]: evaluate local mode #1224

Merged
merged 29 commits into main from bagatur/local_mode on Nov 22, 2024

Conversation

baskaryan (Contributor):

Adds an upload_results flag that avoids tracing any runs (target or evaluator) or creating an experiment in LangSmith:

from langsmith import evaluate

results = evaluate(
    lambda x: x,  # target: identity function over the example inputs
    data="Sample Dataset 3",
    evaluators=[lambda inputs: {"score": 1, "key": "correct"}],
    upload_results=False,  # run locally; send nothing to LangSmith
)

baskaryan requested a review from hinthornw, November 18, 2024 20:58
@@ -1201,14 +1201,14 @@ def my_run(foo: str):

my_run(foo="bar", langsmith_extra={"parent": headers, "client": mock_client})
mock_calls = _get_calls(mock_client)
assert len(mock_calls) == 1
baskaryan (Contributor Author):

@hinthornw not sure i understand this behavior, do we need/want to maintain it?

hinthornw (Collaborator):

Ya. At one point what was happening was we'd create a RunTree from headers without the overriding client, and then use that, meaning the user configuration was broken
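
For context, a rough sketch of the fixed behavior; build_parent is a made-up helper name, not the actual SDK internals:

from typing import Mapping, Optional

from langsmith import Client
from langsmith.run_trees import RunTree

def build_parent(headers: Mapping, client: Optional[Client] = None) -> Optional[RunTree]:
    # Reconstruct the parent run from distributed-tracing headers...
    run_tree = RunTree.from_headers(headers)
    # ...but keep honoring a user-supplied client override instead of
    # silently falling back to the default client (the bug described above).
    if run_tree is not None and client is not None:
        run_tree.client = client
    return run_tree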

baskaryan marked this pull request as ready for review, November 19, 2024 00:25
baskaryan changed the title from "rfc: evaluate local mode" to "python[patch]: evaluate local mode", Nov 19, 2024
@@ -675,7 +685,7 @@ async def _arun_evaluators(
             **current_context,
             "project_name": "evaluators",
             "metadata": metadata,
-            "enabled": True,
+            "enabled": "local" if not self._upload_results else True,
Reviewer (Contributor):

What does local mean here?

baskaryan (Contributor Author):

means we're tracing (i.e. tracking intermediate steps) but not sending anything to LangSmith
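
Roughly, assuming tracing_context accepts the "local" value this PR threads through, a sketch of the difference:

from langsmith.run_helpers import tracing_context

# "local" keeps the tracer active, so nested/intermediate runs are still
# captured in memory, but nothing is exported to the LangSmith backend.
with tracing_context(enabled="local", project_name="evaluators"):
    result = my_evaluator(inputs)  # my_evaluator and inputs are placeholders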

Reviewer (Contributor):

In that case, where do they get stored?

baskaryan (Contributor Author):

we're keeping all the runs in memory currently (on the ExperimentManager)
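
So the results are still consumable from the returned object; a small usage sketch, assuming the usual row shape (run, example, evaluation_results):

from langsmith import evaluate

results = evaluate(
    lambda x: x,
    data="Sample Dataset 3",
    evaluators=[lambda inputs: {"score": 1, "key": "correct"}],
    upload_results=False,
)

# Nothing was sent to LangSmith, but each row is still held in memory:
for row in results:
    for res in row["evaluation_results"]["results"]:
        print(res.key, res.score)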

(Resolved review threads on python/langsmith/evaluation/_runner.py and python/langsmith/run_helpers.py not shown.)
@@ -83,6 +83,7 @@ async def aevaluate(
     client: Optional[langsmith.Client] = None,
     blocking: bool = True,
     experiment: Optional[Union[schemas.TracerSession, str, uuid.UUID]] = None,
+    upload_results: bool = True,
Reviewer (Contributor):

flyby comment: the fact that we have to update so many call sites feels pretty problematic

hinthornw (Collaborator):

wdym call sites?

Our data model for evals is indeed too complicated lol: tracer sessions, feedback, runs in multiple sessions, ay caramba

python/langsmith/evaluation/_runner.py
Comment on lines +786 to +796
if self._upload_results:
for result in flattened_results:
feedback = result.dict(exclude={"target_run_id"})
evaluator_info = feedback.pop("evaluator_info", None)
await aitertools.aio_to_thread(
self.client.create_feedback,
**feedback,
run_id=None,
project_id=project_id,
source_info=evaluator_info,
)
Reviewer (Contributor):

A couple of comments here:

  1. What's the rationale for awaiting in the for loop rather than gathering the results with some max concurrency?
  2. In most cases now (when multipart is enabled) we enqueue the feedback to the TracingQueue for asynchronous background processing, so does it really make sense to keep running this in an executor? cc @hinthornw for this too

baskaryan (Contributor Author):

fwiw the only change here was the condition on L786 (the rest is just indentation)

hinthornw (Collaborator):

Ya we can asyncio.gather()
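
Something along these lines, as a sketch; upload_feedback and max_concurrency are hypothetical names, and the create_feedback call mirrors the snippet above:

import asyncio

async def upload_feedback(client, flattened_results, project_id, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def create_one(result):
        feedback = result.dict(exclude={"target_run_id"})
        evaluator_info = feedback.pop("evaluator_info", None)
        async with semaphore:
            # create_feedback is synchronous, so run it in a worker thread
            await asyncio.to_thread(
                client.create_feedback,
                **feedback,
                run_id=None,
                project_id=project_id,
                source_info=evaluator_info,
            )

    # fan out with bounded concurrency instead of awaiting one at a time
    await asyncio.gather(*(create_one(r) for r in flattened_results))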

hinthornw (Collaborator):

re: background processing, it doesn't get put in the background queue because we lack a trace_id in this case

project_id=project_id,
source_info=evaluator_info,
)
if self._upload_results:
Reviewer (Contributor):

Same comment as above here

baskaryan enabled auto-merge (squash), November 22, 2024 18:06
baskaryan disabled auto-merge, November 22, 2024 18:16
hinthornw (Collaborator) left a review:

Think this looks good. Today I lean more toward upload_results=False not enforcing local mode, but rather just not creating a project and then using the user's environment setup to decide whether or not to trace

baskaryan merged commit 6cf7c9b into main, Nov 22, 2024
10 checks passed
baskaryan deleted the bagatur/local_mode branch, November 22, 2024 18:51