You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Updates on existing solvers and bugged tool eval (#1506)
@JunShern will review this
Wrap solvers with completion functions for compatibility with pre-solver
Evals. This means you can execute all evals using solvers.
[49fd9ef](49fd9ef)
Add context length information about gpt-4-turbo-preview and
gpt-4-0125-preview.
[9a0ab1c](9a0ab1c)
Move oai and together solvers into providers / subdir
[063bf4f](063bf4f)
Update the default task descriptions for bugged tools. We added more
information when using gemini + OS models, since they got confused.
[0523dd4](0523dd4)
Modified the default solver chain-of-thought prompt, as well as other
custom chain-of-thought prompts used in some evals. The default
CoTSolver prompts were a bit misleading in some cases; we observed
GeminiSolver working too hard to arrive at a final answer for the whole
eval when it's in fact supposed to give just a response for the next
turn.
[287f3cf](287f3cf)
---------
Co-authored-by: johny-b <[email protected]>
Co-authored-by: Chan Jun Shern <[email protected]>
Co-authored-by: Giulio Starace <[email protected]>
Copy file name to clipboardExpand all lines: evals/elsuite/bugged_tools/task_description.py
+2
Original file line number
Diff line number
Diff line change
@@ -2,6 +2,7 @@
2
2
You have access to tools that may help you at solving this task. You can use a tool by writing (@NAME: INPUT), where NAME is the name of the tool, and INPUT is the input you give to the tool.
3
3
If your tool implements a function, after tagging the tool only write the input and nothing else.
4
4
You can call multiple tools at the same time. Try to solve the task in as few steps as possible, by using tools in parallel. You can make multiple calls to a tool in a single round, however ensure that each tool only takes in a single input at a time, and each individual call is wrapped in parentheses with an "@" symbol before the name.
5
+
When you call a tool, I will give you the tools output in the next message. Do not generate the outputs of the tools yourself, I will do it for you. Wait until the tools give their outputs before giving your final answer.
5
6
{bug_description}
6
7
7
8
You have access to the following tools:
@@ -27,6 +28,7 @@
27
28
You have access to tools that may help you at solving this task. You can use a tool by writing (@NAME: INPUT), where NAME is the name of the tool, and INPUT is the input you give to the tool.
28
29
If your tool implements a function, after tagging the tool only write the input and nothing else.
29
30
You can call multiple tools at the same time. Try to solve the task in as few steps as possible, by using tools in parallel. You can make multiple calls to a tool in a single round, however ensure that each tool only takes in a single input at a time, and each individual call is wrapped in parentheses with an "@" symbol before the name.
31
+
When you call a tool, I will give you the tools output in the next message. Do not generate the outputs of the tools yourself, I will do it for you. Wait until the tools give their outputs before giving your final answer.
Copy file name to clipboardExpand all lines: evals/elsuite/function_deduction/prompts.py
+3-1
Original file line number
Diff line number
Diff line change
@@ -29,7 +29,7 @@
29
29
new_value="calc({in_}) = {out}"
30
30
test_input_not_allowed="Numbers {inputs} are not allowed"
31
31
32
-
cot_template_first_round="""This is the first round. Think out loud about a general strategy of solving tasks like this."""
32
+
cot_template_first_round="""This is the first round. Think out loud about a general strategy of solving tasks like this. (You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)"""
33
33
cot_template_later_rounds="""\
34
34
This is round {round_ix} out of {num_rounds}.
35
35
@@ -40,4 +40,6 @@
40
40
* Do you have any hypothesis on what this function might be doing?
41
41
* If yes, should you try to test it (how?), or just use it to calculate the answer?
42
42
* If not, what additional information should you gather to be able to formulate a hypothesis?
43
+
44
+
(You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)
Message(role="system", content="I have a hidden function. What is your first action?"),
86
86
Message(
87
87
role="system",
88
-
content="This is the first round. Think out loud about a general strategy of solving tasks like this.",
88
+
content="This is the first round. Think out loud about a general strategy of solving tasks like this. (You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)",
89
89
),
90
90
Message(
91
91
role="assistant",
92
92
content="I don't know anything about the function yet. I should start with gathering some information, e.g. by asking about the return value for 40.",
93
93
),
94
94
Message(
95
95
role="system",
96
-
content="Given the above reasoning, the answer in the format requested by the question is:",
96
+
content="Given the above reasoning, your response in the format requested by the instructions is:",
97
97
),
98
98
Message(role="assistant", content="40"),
99
99
Message(role="system", content="calc(40) = 160"),
100
100
Message(
101
101
role="system",
102
-
content="This is round 2 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n",
102
+
content="This is round 2 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n\n(You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)",
103
103
),
104
104
Message(
105
105
role="assistant",
106
106
content="Perhaps this function does multiplication by 4? That's not certain yet, so let's try some other number, e.g. 52.",
107
107
),
108
108
Message(
109
109
role="system",
110
-
content="Given the above reasoning, the answer in the format requested by the question is:",
110
+
content="Given the above reasoning, your response in the format requested by the instructions is:",
111
111
),
112
112
Message(role="assistant", content="52"),
113
113
Message(role="system", content="calc(52) = 204"),
114
114
Message(
115
115
role="system",
116
-
content="This is round 3 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\ncalc(52) = 204\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n",
116
+
content="This is round 3 out of 10.\n\nSummary of the information you have already gathered:\ncalc(40) = 160\ncalc(52) = 204\n\nThink out loud about the following questions:\n* Do you have any hypothesis on what this function might be doing?\n* If yes, should you try to test it (how?), or just use it to calculate the answer?\n* If not, what additional information should you gather to be able to formulate a hypothesis?\n\n(You now have an opportunity to reason privately before submitting an action. Your next response will not be evaluated.)",
117
117
),
118
118
Message(
119
119
role="assistant",
120
-
content="Now we have two results where the ouput is the input times 4. It seems that the function multiplies by 4. I will make the guess now. 17 * 4 = 68, 36 * 4 = 144 and 68 * 4 = 272, so my guess will be 68 144 272.",
120
+
content="Now we have two results where the output is the input times 4. It seems that the function multiplies by 4. I will make the guess now. 17 * 4 = 68, 36 * 4 = 144 and 68 * 4 = 272, so my guess will be 68 144 272.",
121
121
),
122
122
Message(
123
123
role="system",
124
-
content="Given the above reasoning, the answer in the format requested by the question is:",
124
+
content="Given the above reasoning, your response in the format requested by the instructions is:",
0 commit comments