
Commit 2420c62

Authored by ojaffe, johny-b, JunShern, and thesofakillers
Updates on existing solvers and bugged tool eval (#1506)
@JunShern will review this

- Wrap solvers with completion functions for compatibility with pre-solver Evals. This means you can execute all evals using solvers. [49fd9ef](49fd9ef)
- Add context length information for gpt-4-turbo-preview and gpt-4-0125-preview. [9a0ab1c](9a0ab1c)
- Move the OpenAI and Together solvers into a providers/ subdirectory. [063bf4f](063bf4f)
- Update the default task descriptions for bugged tools. We added more information when using Gemini and open-source models, since they got confused. [0523dd4](0523dd4)
- Modify the default solver chain-of-thought prompt, as well as other custom chain-of-thought prompts used in some evals. The default CoTSolver prompts were misleading in some cases; we observed GeminiSolver working hard to arrive at a final answer for the whole eval when it was in fact supposed to give only a response for the next turn. [287f3cf](287f3cf)

---------

Co-authored-by: johny-b <[email protected]>
Co-authored-by: Chan Jun Shern <[email protected]>
Co-authored-by: Giulio Starace <[email protected]>
1 parent d9d2f5f commit 2420c62

39 files changed: +533 −283 lines
New file (+73)

@@ -0,0 +1,73 @@
+from typing import Any, Union
+
+from evals.api import CompletionFn, CompletionResult
+from evals.prompt.base import OpenAICreateChatPrompt
+from evals.solvers.nested.cot_solver import CoTSolver
+from evals.solvers.solver import Solver, SolverSpec, create_solver
+from evals.task_state import Message, TaskState
+
+
+class SolverCompletionFnResult(CompletionResult):
+    def __init__(self, msg):
+        self.msg = msg
+
+    def get_completions(self):
+        return [self.msg]
+
+
+class SolverCompletionFn(CompletionFn):
+    """
+    Wraps a solver into a completion function, such that the completion function's
+    __call__ method calls the internal solver's _solve method, mapping the input
+    completion function `prompt` to the solver's `task_state` input.
+
+    Useful for using Solvers with eval.Eval classes, which would normally require a CompletionFn.
+
+    Current limitations:
+    - Stateful solvers are not supported: Solver state is not maintained between
+      calls.
+    - Prompts with more than `role` and `content` keys are not supported.
+    """
+
+    def __init__(self, solver: Union[SolverSpec, Solver], registry: Any = None):
+        if isinstance(solver, Solver):
+            self.solver = solver
+        else:
+            self.solver = create_solver(solver)
+
+    def __call__(
+        self, prompt: Union[str, OpenAICreateChatPrompt], **kwargs
+    ) -> SolverCompletionFnResult:
+        # We have this check here rather than in __init__, since the solver may be
+        # unwrapped and used in a SolverEval.
+        if isinstance(self.solver, CoTSolver):
+            if self.solver.interaction_cache is not None:
+                raise ValueError(
+                    "`CoTSolver` with persistent memory is incompatible with "
+                    "CompletionFn-based `Eval` classes. "
+                    "Please set `CoTSolver(persistent_memory=False)` or update the eval to a `SolverEval`."
+                )
+
+        if isinstance(prompt, str):
+            prompt = [{"role": "system", "content": prompt}]
+        elif isinstance(prompt, list):
+            assert prompt[0]["role"] == "system", "Unexpected prompt role ordering"
+        else:
+            raise ValueError(
+                f"Unexpected prompt type: "
+                f"string or OpenAICreateChatPrompt expected, got {type(prompt)}"
+            )
+
+        assert set(prompt[0].keys()) == {"role", "content"}, (
+            "Unexpected keys in prompt: "
+            f"expected exactly {{'role', 'content'}}, got {set(prompt[0].keys())}"
+        )
+        task_state = TaskState(
+            prompt[0]["content"],
+            [Message(msg["role"], msg["content"]) for msg in prompt[1:]],
+        )
+
+        # Use a copy of the solver to avoid task state surviving across samples.
+        pure_solver = self.solver.copy()
+
+        result = pure_solver(task_state, **kwargs)
+        return SolverCompletionFnResult(result.output)
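To make the wrapper concrete, here is a minimal usage sketch. It assumes the new file lives at evals/solvers/utils.py (the filename is not visible in this capture) and that a SolverSpec is a dict of class/args keys in the same shape the solver registry YAML uses, as `create_solver` suggests:

from evals.solvers.utils import SolverCompletionFn  # assumed location of the new file

# A solver spec in the same shape as the registry YAML entries below.
solver_spec = {
    "class": "evals.solvers.providers.openai.openai_solver:OpenAISolver",
    "args": {"completion_fn_options": {"model": "gpt-3.5-turbo"}},
}

# Wrap it so a CompletionFn-based Eval can call it like any completion function.
completion_fn = SolverCompletionFn(solver_spec)

# A bare string prompt is converted into a single system message internally.
result = completion_fn("Reply with one word: what color is the sky?")
print(result.get_completions())  # -> ["<the solver's output string>"]

Note that a fresh copy of the solver handles each call, which is why stateful solvers (e.g. a CoTSolver with persistent memory) are rejected.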

evals/elsuite/bluff/strategy_solver.py (+2 −2)

@@ -4,8 +4,8 @@
 from typing import Optional
 
 from evals.elsuite.bluff.bluff.cards import get_bluff_move
+from evals.solvers.memory import PersistentMemoryCache
 from evals.solvers.solver import Solver, SolverResult
-from evals.solvers.utils import PersistentMemoryCache
 from evals.task_state import Message, TaskState
 
 
@@ -28,7 +28,7 @@ def __init__(
 
         # interaction_length=1 to store reasoning step in private memory
         self.interaction_cache = PersistentMemoryCache(interaction_length=1)
-
+
     def _generate_response(self, task_state: TaskState):
         """
         Calls base solver. Modifies task state to remove all non-reasoning messages

evals/elsuite/bugged_tools/task_description.py (+2 −0)

@@ -2,6 +2,7 @@
 You have access to tools that may help you at solving this task. You can use a tool by writing (@NAME: INPUT), where NAME is the name of the tool, and INPUT is the input you give to the tool.
 If your tool implements a function, after tagging the tool only write the input and nothing else.
 You can call multiple tools at the same time. Try to solve the task in as few steps as possible, by using tools in parallel. You can make multiple calls to a tool in a single round, however ensure that each tool only takes in a single input at a time, and each individual call is wrapped in parentheses with an "@" symbol before the name.
+When you call a tool, I will give you the tool's output in the next message. Do not generate the outputs of the tools yourself; I will do it for you. Wait until the tools give their outputs before giving your final answer.
 {bug_description}
 
 You have access to the following tools:
@@ -27,6 +28,7 @@
 You have access to tools that may help you at solving this task. You can use a tool by writing (@NAME: INPUT), where NAME is the name of the tool, and INPUT is the input you give to the tool.
 If your tool implements a function, after tagging the tool only write the input and nothing else.
 You can call multiple tools at the same time. Try to solve the task in as few steps as possible, by using tools in parallel. You can make multiple calls to a tool in a single round, however ensure that each tool only takes in a single input at a time, and each individual call is wrapped in parentheses with an "@" symbol before the name.
+When you call a tool, I will give you the tool's output in the next message. Do not generate the outputs of the tools yourself; I will do it for you. Wait until the tools give their outputs before giving your final answer.
 {bug_description}
 
 You have access to the following tools:

evals/elsuite/function_deduction/prompts.py (+3 −1)

@@ -29,7 +29,7 @@
 new_value = "calc({in_}) = {out}"
 test_input_not_allowed = "Numbers {inputs} are not allowed"
 
-cot_template_first_round = """This is the first round. Think out loud about a general strategy of solving tasks like this."""
+cot_template_first_round = """This is the first round. Think out loud about a general strategy of solving tasks like this. (You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)"""
 cot_template_later_rounds = """\
 This is round {round_ix} out of {num_rounds}.
@@ -40,4 +40,6 @@
 * Do you have any hypothesis on what this function might be doing?
 * If yes, should you try to test it (how?), or just use it to calculate the answer?
 * If not, what additional information should you gather to be able to formulate a hypothesis?
+
+(You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)
 """

evals/elsuite/function_deduction/solvers.py (+1 −1)

@@ -158,7 +158,7 @@ def _few_shot_messages(self, current_state) -> list[Message]:
         (
             "assistant",
             (
-                "Now we have two results where the ouput is the input times 4. It seems that the function multiplies by 4. "
+                "Now we have two results where the output is the input times 4. It seems that the function multiplies by 4. "
                 "I will make the guess now. 17 * 4 = 68, 36 * 4 = 144 and 68 * 4 = 272, so my guess will be 68 144 272."
             ),
         )

evals/elsuite/function_deduction/solvers_test.py (+10 −8)

@@ -85,43 +85,43 @@ def test_base_model_cot_solver():
         Message(role="system", content="I have a hidden function. What is your first action?"),
         Message(
             role="system",
-            content="This is the first round. Think out loud about a general strategy of solving tasks like this.",
+            content="This is the first round. Think out loud about a general strategy of solving tasks like this. (You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)",
         ),
         Message(
             role="assistant",
             content="I don't know anything about the function yet. I should start with gathering some information, e.g. by asking about the return value for 40.",
         ),
         Message(
             role="system",
-            content="Given the above reasoning, the answer in the format requested by the question is:",
+            content="Given the above reasoning, your response in the format requested by the instructions is:",
         ),
         Message(role="assistant", content="40"),
         Message(role="system", content="calc(40) = 160"),
         Message(
             role="system",
-            content="This is round 2 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n",
+            content="This is round 2 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n\n(You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)",
         ),
         Message(
             role="assistant",
             content="Perhaps this function does multiplication by 4? That's not certain yet, so let's try some other number, e.g. 52.",
         ),
         Message(
             role="system",
-            content="Given the above reasoning, the answer in the format requested by the question is:",
+            content="Given the above reasoning, your response in the format requested by the instructions is:",
         ),
         Message(role="assistant", content="52"),
         Message(role="system", content="calc(52) = 204"),
         Message(
             role="system",
-            content="This is round 3 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\ncalc(52) = 204\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n",
+            content="This is round 3 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\ncalc(52) = 204\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n\n(You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)",
         ),
         Message(
             role="assistant",
-            content="Now we have two results where the ouput is the input times 4. It seems that the function multiplies by 4. I will make the guess now. 17 * 4 = 68, 36 * 4 = 144 and 68 * 4 = 272, so my guess will be 68 144 272.",
+            content="Now we have two results where the output is the input times 4. It seems that the function multiplies by 4. I will make the guess now. 17 * 4 = 68, 36 * 4 = 144 and 68 * 4 = 272, so my guess will be 68 144 272.",
         ),
         Message(
             role="system",
-            content="Given the above reasoning, the answer in the format requested by the question is:",
+            content="Given the above reasoning, your response in the format requested by the instructions is:",
         ),
         Message(role="assistant", content="68 144 272"),
         Message(role="system", content="Correct guess!"),
@@ -130,7 +130,9 @@ def test_base_model_cot_solver():
             content="I now have a new function. Forget about the previous one, we start again.",
         ),
     ]
-    assert solver_private_memory[: len(expected_few_shot_msgs)] == expected_few_shot_msgs
+    for i in range(len(expected_few_shot_msgs)):
+        assert solver_private_memory[i].role == expected_few_shot_msgs[i].role
+        assert solver_private_memory[i].content.strip() == expected_few_shot_msgs[i].content.strip()
     assert (
         solver_private_memory[len(expected_few_shot_msgs) + 0].content == cot_template_first_round
     )

evals/elsuite/hr_ml_agent_bench/solvers/baseline.py (+1 −1)

@@ -6,7 +6,7 @@
 import tiktoken
 
 from evals.registry import Registry, n_ctx_from_model_name
-from evals.solvers.openai_solver import OpenAISolver
+from evals.solvers.providers.openai.openai_solver import OpenAISolver
 from evals.solvers.solver import Solver, SolverResult
 from evals.task_state import Message, TaskState

evals/elsuite/make_me_say/eval.py (+8 −13)

@@ -1,31 +1,26 @@
 import numpy as np
 
 import evals
-from evals.api import CompletionFn, DummyCompletionFn
+from evals.api import DummyCompletionFn
 from evals.elsuite.make_me_say.autoeval import run as run_auto_eval
 from evals.elsuite.make_me_say.core import Game
 from evals.record import RecorderBase
 
 
 class MakeMeSay(evals.Eval):
-    def __init__(
-        self,
-        completion_fns: list[CompletionFn],
-        *args,
-        **kwargs,
-    ):
-        super().__init__(completion_fns, *args, **kwargs)
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
 
-        if len(completion_fns) == 1 and isinstance(completion_fns[0], DummyCompletionFn):
-            completion_fn = completion_fns[0]
-            completion_fns = [completion_fn for _ in range(3)]
+        if len(self.completion_fns) == 1 and isinstance(self.completion_fns[0], DummyCompletionFn):
+            completion_fn = self.completion_fns[0]
+            self.completion_fns = [completion_fn for _ in range(3)]
 
-        assert len(completion_fns) == 3, "MakeMeSay only supports three completion fns"
+        assert len(self.completion_fns) == 3, "MakeMeSay only supports three completion fns"
         (
             self.manipulator_completion_fn,
             self.manipulatee_completion_fn,
             self.judge_completion_fn,
-        ) = completion_fns
+        ) = self.completion_fns
 
     def eval_sample(self, sample: dict, rng) -> None:
         del rng

evals/eval.py (+4 −4)

@@ -8,7 +8,7 @@
 import random
 from multiprocessing.pool import ThreadPool
 from pathlib import Path
-from typing import Any, Awaitable, Callable, Dict, List, Optional, Tuple
+from typing import Any, Awaitable, Callable, Dict, List, Optional, Tuple, Union
 
 from tqdm import tqdm
 
@@ -18,7 +18,7 @@
 from .record import RecorderBase
 from .registry import Registry
 from .solvers.solver import Solver
-from .solvers.utils import maybe_wrap_with_solver
+from .solvers.utils import maybe_wrap_with_compl_fn, maybe_wrap_with_solver
 
 logger = logging.getLogger(__name__)
 
@@ -55,7 +55,7 @@ class Eval(abc.ABC):
 
     def __init__(
         self,
-        completion_fns: list[CompletionFn],
+        completion_fns: list[Union[CompletionFn, Solver]],
         eval_registry_path: Path,
         seed: int = 20220722,
         name: str = "no_name_eval.default",
@@ -66,7 +66,7 @@ def __init__(
         if len(splits) < 2:
             raise ValueError(f"Eval name must at least have <base_eval>.<split>. Got name {name}")
 
-        self.completion_fns = completion_fns
+        self.completion_fns = [maybe_wrap_with_compl_fn(fn) for fn in completion_fns]
         self.eval_registry_path = eval_registry_path
         self.seed = seed
         self.name = name
evals/registry.py

+2
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,8 @@ def n_ctx_from_model_name(model_name: str) -> Optional[int]:
6363
"gpt-4-32k": 32768,
6464
"gpt-4-base": 8192,
6565
"gpt-4-1106-preview": 128_000,
66+
"gpt-4-turbo-preview": 128_000,
67+
"gpt-4-0125-preview": 128_000,
6668
}
6769

6870
# first, look for an exact match
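With these two entries, both preview aliases resolve through the exact-match lookup. A small illustration, using only names visible in this commit (baseline.py above imports the same function):

from evals.registry import n_ctx_from_model_name

# Both new model names now map to a 128k-token context window.
assert n_ctx_from_model_name("gpt-4-turbo-preview") == 128_000
assert n_ctx_from_model_name("gpt-4-0125-preview") == 128_000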

evals/registry/solvers/already_said_that.yaml (+6 −6)

@@ -14,15 +14,15 @@ already_said_that/cot/gpt-3.5-turbo:
     args:
       persistent_memory: False
       cot_solver:
-        class: evals.solvers.openai_solver:OpenAISolver
+        class: evals.solvers.providers.openai.openai_solver:OpenAISolver
         args:
           completion_fn_options:
             model: gpt-3.5-turbo
             extra_options:
               temperature: 1
               max_tokens: 512
       extract_solver:
-        class: evals.solvers.openai_solver:OpenAISolver
+        class: evals.solvers.providers.openai.openai_solver:OpenAISolver
         args:
           completion_fn_options:
             model: gpt-3.5-turbo
@@ -35,15 +35,15 @@ already_said_that/cot/gpt-4-turbo-preview:
     args:
       persistent_memory: False
      cot_solver:
-        class: evals.solvers.openai_solver:OpenAISolver
+        class: evals.solvers.providers.openai.openai_solver:OpenAISolver
         args:
           completion_fn_options:
             model: gpt-4-turbo-preview
             extra_options:
               temperature: 1
               max_tokens: 512
       extract_solver:
-        class: evals.solvers.openai_solver:OpenAISolver
+        class: evals.solvers.providers.openai.openai_solver:OpenAISolver
         args:
           completion_fn_options:
             model: gpt-4-turbo-preview
@@ -59,7 +59,7 @@ already_said_that/cot_hhh/gpt-4-base:
     class: evals.solvers.nested.hhh_solver:HHHSolver
     args:
       solver:
-        class: evals.solvers.openai_solver:OpenAISolver
+        class: evals.solvers.providers.openai.openai_solver:OpenAISolver
         args:
           completion_fn_options:
             model: gpt-4-base
@@ -70,7 +70,7 @@ already_said_that/cot_hhh/gpt-4-base:
     class: evals.solvers.nested.hhh_solver:HHHSolver
     args:
       solver:
-        class: evals.solvers.openai_solver:OpenAISolver
+        class: evals.solvers.providers.openai.openai_solver:OpenAISolver
         args:
           completion_fn_options:
             model: gpt-4-base