UKGovernmentBEIS · jjallaire · Feb 5, 2025 · Jan 30, 2025 · Jan 31, 2025 · Jan 31, 2025
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -18,6 +18,7 @@
 - OpenAI: Map some additional 400 status codes to `content_filter` stop reason.
 - Anthropic: Handle 413 status code (Payload Too Large) and map to `model_length` StopReason.
 - Tasks: Log sample with error prior to raising task-ending exception.
+- Python: Enhance tool prompt to emphasise that code runs as a standalone script rather than in a notebook.
 - Computer: Various improvements to image including desktop, python, and VS Code configuration.
 - Bugfix: Don't download full log from S3 for header_only reads.
 

diff --git a/src/inspect_ai/tool/_tools/_execute.py b/src/inspect_ai/tool/_tools/_execute.py
@@ -74,8 +74,39 @@ async def execute(code: str) -> str:
         """
         Use the python function to execute Python code.
 
-        The python function will only return you the stdout of the script,
-        so make sure to use print to see the output.
+        The Python tool executes single-run Python scripts. Important notes:
+        1. Each execution is independent - no state is preserved between runs
+        2. You must explicitly use print() statements to see any output
+        3. Simply writing expressions (like in notebooks) will not display results
+        4. The script cannot accept interactive input during execution
+        5. Return statements alone won't produce visible output
+        6. All variables and imports are cleared between executions
+        7. Standard output (via print()) is the only way to see results
+
+        Examples:
+          INCORRECT (notebook style):
+          x = 5
+          x * 2           # Won't show anything
+          return x * 2    # SyntaxError outside a function
+          [1, 2, 3]       # Won't show anything
+
+          CORRECT:
+          x = 5
+          print(x * 2)    # Will show: 10
+          result = x * 2
+          print(result)   # Will show: 10
+          print([1, 2, 3])  # Will show: [1, 2, 3]
+
+          INCORRECT (assuming previous imports persist):
+          # First run:
+          import numpy as np
+          # Second run:
+          arr = np.array([1, 2, 3])  # This will fail - numpy not imported in this run
+
+          CORRECT (each run is self-contained):
+          import numpy as np
+          arr = np.array([1, 2, 3])
+          print(arr)  # Will show: [1 2 3]
 
         Args:
           code (str): The python code to execute.
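
The behaviour the new prompt describes can be reproduced locally with a minimal sketch. It assumes each run is a fresh interpreter that receives the code on stdin; `run_script` is a hypothetical helper written for this illustration, not the `inspect_ai` implementation, and `math` stands in for `numpy` so the sketch needs only the standard library.

```python
import subprocess
import sys


def run_script(code: str) -> subprocess.CompletedProcess:
    """Hypothetical helper: run `code` as a script in a fresh interpreter."""
    return subprocess.run(
        [sys.executable, "-"],  # "-" tells Python to read the program from stdin
        input=code,
        capture_output=True,
        text=True,
        timeout=30,
    )


# A bare expression is evaluated but never echoed when run as a script.
bare = run_script("x = 5\nx * 2\n")

# print() is what actually writes to stdout.
printed = run_script("x = 5\nprint(x * 2)\n")

# Imports do not survive between runs: each run starts a new process.
first = run_script("import math\n")
second = run_script("print(math.pi)\n")

# A top-level `return` is rejected at compile time, so nothing reaches stdout.
top_level_return = run_script("x = 5\nreturn x * 2\n")

print(repr(bare.stdout))
print(repr(printed.stdout))
print(second.returncode != 0, "NameError" in second.stderr)
print(repr(top_level_return.stdout), "SyntaxError" in top_level_return.stderr)
```

In this sketch the failures only show up on stderr and in the exit code. Whether the real tool also returns stderr to the model is not visible in this diff.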