job chat: Add a prompt testing process #108
Comments
For the record, I would be happy with a manual test process which goes something like this:
We may also need to factor in drift from the LLM end itself: as, e.g., Anthropic updates its models, I don't know how tightly we can version lock, so we may see some natural variance.
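For what it's worth, the tightest lock the Anthropic API seems to offer is pinning a dated snapshot id rather than a floating alias; the model names below are illustrative, not a recommendation:

```python
# Assumption: pinning a dated snapshot id holds behaviour fixed as far as the API allows,
# while an alias silently tracks model updates and so can introduce drift on its own.
PINNED_MODEL = "claude-3-5-sonnet-20241022"   # dated snapshot
FLOATING_MODEL = "claude-3-5-sonnet-latest"   # alias that follows new releases
```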
I'm not really keen on using notebooks for this; I want a more formal process with better script support in the repo. Here's a bit more detail on the experience I'm looking for:
The idea here is that I can edit the prompt locally and re-run my questions. I can then commit the changes, so that in a PR or git compare I can see the prompt changes and the resulting output. All the questions and answers must be checked in so that we can compare against last time.
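A minimal sketch of what that re-run script could look like, assuming a Python repo, the Anthropic SDK, and a hypothetical `prompts/` + `prompt_tests/` layout (the paths, model id, and token limit are placeholders, not decisions):

```python
#!/usr/bin/env python3
"""Re-run the checked-in test questions against the current prompt and write the
answers back into the repo, so a PR diff shows prompt changes next to output changes."""
from pathlib import Path

import anthropic  # assumes the Anthropic SDK; swap in whichever client the project uses

PROMPT_FILE = Path("prompts/job_chat.txt")      # hypothetical path to the prompt under test
QUESTIONS_DIR = Path("prompt_tests/questions")  # hypothetical: one question per .txt file
ANSWERS_DIR = Path("prompt_tests/answers")      # answers are committed alongside the questions

client = anthropic.Anthropic()                  # reads ANTHROPIC_API_KEY from the environment
prompt = PROMPT_FILE.read_text()

ANSWERS_DIR.mkdir(parents=True, exist_ok=True)
for question_file in sorted(QUESTIONS_DIR.glob("*.txt")):
    question = question_file.read_text()
    reply = client.messages.create(
        model="claude-3-5-sonnet-20241022",     # pin a dated snapshot to reduce drift
        max_tokens=1024,
        system=prompt,
        messages=[{"role": "user", "content": question}],
    )
    answer = reply.content[0].text
    (ANSWERS_DIR / question_file.name).write_text(answer)
    print(f"updated {question_file.name}")
```

Running it after a prompt edit and committing the result would give exactly the question/answer diff described above.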
The thinking with this is that the assistant is too immature to justify an expensive prompt versioning (or testing) process. Better to just tweak the prompt ad hoc for now and watch for changes live.
New prompts should be tested to evaluate their performance and minimise unexpected issues in production. This will likely involve accumulating generated test datasets targeting different issues, as well as using LLM-based evaluation to check whether each test passed (true/false) and produce an overall score.
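A rough sketch of what that LLM-based pass/fail scoring could look like; the judge model, judge prompt, and the `prompt_tests/cases.jsonl` format are assumptions for illustration only:

```python
"""Score a batch of test cases with an LLM judge: each case gets a PASS/FAIL verdict,
and the verdicts are aggregated into a simple score."""
import json
from pathlib import Path

import anthropic

JUDGE_MODEL = "claude-3-5-sonnet-20241022"  # assumed judge model
client = anthropic.Anthropic()              # reads ANTHROPIC_API_KEY from the environment

def judge(question: str, answer: str, criterion: str) -> bool:
    """Ask the judge model whether the answer meets the criterion; expects PASS or FAIL."""
    verdict = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
                f"Criterion: {criterion}\n"
                "Reply with exactly PASS or FAIL."
            ),
        }],
    )
    return verdict.content[0].text.strip().upper().startswith("PASS")

# Hypothetical dataset: one JSON object per line with question, answer, criterion fields.
cases = [
    json.loads(line)
    for line in Path("prompt_tests/cases.jsonl").read_text().splitlines()
    if line.strip()
]
results = [judge(c["question"], c["answer"], c["criterion"]) for c in cases]
print(f"score: {sum(results)}/{len(results)} passed")
```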