Clean up evaluate Khoj helpfulness post based on feedback by Khoj
debanjum committed Nov 23, 2024
1 parent 85d593a commit a60ede6
Showing 1 changed file with 21 additions and 20 deletions: src/content/posts/evaluate-khoj-quality.md
author: debanjum
description: "A deep dive into how we implemented an automated evaluation harness and Khoj's excellent performance on modern factuality and reasoning benchmarks."
heroImage: /eval-khoj-quality.webp
pubDate: 2024-11-22
keywords: ["agent eval", "automated llm benchmark"]
keywords: ["agent eval", "automated llm benchmark", "research mode"]
---

Khoj is an open, personal AI that can gather information from your documents and the web to generate accurate answers, paint images, visualize data, and create documents for you.
Additionally, as agent capabilities increase, we need more widespread testing to
We selected two primary benchmarks for evaluation:

1. **Google's [FRAMES](https://huggingface.co/datasets/google/frames-benchmark)**: This is the primary evaluation benchmark we tested against. It tests:
- Multi-hop reasoning: Requires retrieval from multiple sources and reasoning over them.
- Temporal reasoning: Requires reasoning about time.
- Tabular reasoning: Requires reasoning on data in tables.

These align well with our retrieval and reasoning goals for Khoj. The benchmark was released in September 2024 by Google. It is a public, reasonably challenging dataset for modern AI agents[^2].

2. **OpenAI's [SimpleQA](https://openai.com/index/introducing-simpleqa/)**: This is a recently released evaluation benchmark.
- It evaluates the ability of large language models to give correct and truthful answers.
- It was created as a challenging Q&A benchmark for modern LLMs. Top models like o1-preview and the latest Claude 3.5 Sonnet only get ~40% of answers correct.

These qualities match our helpfulness goals for Khoj. The benchmark was released by OpenAI in October 2024. It is open-source and challenging for current state-of-the-art LLMs.
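
As a rough illustration of the setup (not the exact code our eval script uses), here is a minimal sketch of drawing a random FRAMES subset with the Hugging Face `datasets` library. The `test` split and the `Prompt`/`Answer` column names are assumptions that may differ from the dataset's actual schema.

```python
# Sketch: draw a random subset of FRAMES questions for an eval run.
# Assumes a "test" split with "Prompt" and "Answer" columns.
from datasets import load_dataset

frames = load_dataset("google/frames-benchmark", split="test")
sample = frames.shuffle(seed=42).select(range(200))  # 200-question random subset

for row in sample.select(range(3)):  # peek at a few question/answer pairs
    print(row["Prompt"][:80], "->", row["Answer"])
```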

#### Evaluated Modes
Khoj can be interacted with in a few different modes. The 3 main ones from the lens of the evaluations are:
- **General**: This is like a closed-book exam. No retrieval is allowed. The agent can't access external information, only the LLM's existing *general* knowledge.
- **Default**: This is like an open-book exam. Single-shot retrieval is allowed. The agent can search for information online and run calculations in a [code sandbox](/posts/ai-with-code-execution).
- **Research**: This is like a take-home exam. Iterative retrieval is permitted. The agent can do deeper research for a bit longer with the same web search and code tools.

You can chat with Khoj in any of the 3 modes using a slash command like `/research`. Default mode doesn't require a slash command. Research mode was released at the start of November and is still in beta.
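
To make the modes concrete, here is a hedged sketch of exercising them against a self-hosted Khoj server. The `/api/chat` endpoint, the `q` parameter, the default port, and the response shape shown here are assumptions based on a typical self-hosted setup, not a documented contract.

```python
# Sketch: query a self-hosted Khoj server in each mode via slash commands.
# Endpoint, parameter names, port, and response shape are assumptions.
import requests

KHOJ_URL = "http://localhost:42110"  # assumed default self-hosted address

def ask(question: str, mode: str = "default") -> str:
    # General and research modes are selected with a slash command prefix;
    # default mode needs no prefix.
    prefix = "" if mode == "default" else f"/{mode} "
    response = requests.get(
        f"{KHOJ_URL}/api/chat",
        params={"q": prefix + question},
        timeout=600,  # research mode iterates, so it can take a while
    )
    response.raise_for_status()
    return response.json().get("response", "")

print(ask("Who won the 2022 Fields Medal?", mode="research"))
```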

#### Evaluation Harness

We developed an evaluation script to quiz Khoj on different benchmarks[^6]. It allows you to:
- Configure the sample size, randomization, and target benchmark.
The eval is automatically run on every release using a Github [workflow](https:/
4. Grades the responses using gemini-1.5-pro-002 as the LLM judge.
5. Publishes the scores and a downloadable report for verification.
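
Putting the pieces together, the grade-and-aggregate loop can be sketched as below, reusing the `sample` and `ask` helpers from the earlier sketches. The judge prompt and response parsing here are simplified illustrations, not the exact prompt or code the harness uses.

```python
# Sketch: grade answers with an LLM judge and aggregate accuracy.
# Reuses `sample` and `ask` from the sketches above; the judge prompt is illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-pro-002")

def grade(question: str, expected: str, actual: str) -> bool:
    prompt = (
        "You are grading an AI assistant's answer.\n"
        f"Question: {question}\nExpected answer: {expected}\nGiven answer: {actual}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = judge.generate_content(prompt).text.strip().upper()
    return verdict.startswith("CORRECT")

results = [
    grade(row["Prompt"], row["Answer"], ask(row["Prompt"], mode="research"))
    for row in sample
]
accuracy = 100 * sum(results) / len(results)
print(f"Accuracy: {accuracy:.2f}% on {len(results)} questions")
```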

Using a public evaluation workflow provides transparency at multiple levels. It creates an audit trail to inspect the setup, reasoning traces and detailed results of Khoj's performance across time and code changes. You can see the raw logs from a recent eval workflow run [here](https://github.com/khoj-ai/khoj/actions/runs/11963916969/job/33355284137#step:8:38398).

### Results

These runs evaluate Khoj with gemini-1.5-flash-002 on a 200-question random subset of the target benchmark[^5]. This results in error margins of ~6% at reasonable costs ($5 across the 3 modes and 2 benchmarks).
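
The ~6% figure follows from the standard binomial margin-of-error approximation at a 95% confidence level, using the worst-case proportion p = 0.5; a quick back-of-the-envelope check:

```python
# 95% margin of error for a 200-question sample at worst-case p = 0.5.
import math

n, p, z = 200, 0.5, 1.96
margin = z * math.sqrt(p * (1 - p) / n)
print(f"±{margin:.1%}")  # ≈ ±6.9%
```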

| Benchmark | General (%) | Default (%) | Research (%) | Baseline |
|-----------|------|---------|---------|----------|
| [FRAMES](https://huggingface.co/datasets/google/frames-benchmark) | [27.14](https://github.com/khoj-ai/khoj/actions/runs/11941817410/attempts/1#summary-33287504889) | [42.00](https://github.com/khoj-ai/khoj/actions/runs/11944716303/attempts/1#summary-33296136909) | [63.5](https://github.com/khoj-ai/khoj/actions/runs/11945673147/attempts/1#summary-33298733849) | 26.3% (flash-1.5-001) |
| [SimpleQA](https://openai.com/index/introducing-simpleqa/) | [10.00](https://github.com/khoj-ai/khoj/actions/runs/11963066702/attempts/1#summary-33352767460) | [84.00](https://github.com/khoj-ai/khoj/actions/runs/11963354200/attempts/1#summary-33353634493) | [86.00](https://github.com/khoj-ai/khoj/actions/runs/11963916969/attempts/1#summary-33355284137) | 43.5% (o1 preview) |

The graphs below visualize the improvements across the 3 modes on the evaluated benchmarks:

![](/khoj-on-frames.webp)

![](/khoj-on-simpleqa.webp)
Khoj upgrades small hosted LLMs into AI agents that perform at or beyond the cap

#### Impact of Code Interpreter Tool
Khoj can [run code](/posts/ai-with-code-execution). This ability results in notable accuracy improvements:
- Default mode accuracy **without** code tool: 35.68%.
- Default mode accuracy **with** code tool: 42.00%.
- Net relative improvement: ~**18%** from 35.68% to 42.00%.
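
In other words, the code tool adds 6.32 percentage points of accuracy; relative to the 35.68% baseline, (42.00 − 35.68) / 35.68 ≈ 17.7%, which rounds to the ~18% quoted above.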

### Future Work
- Add the ability to efficiently test retrieval across internal and external knowledge. Our current eval only measures retrieval from the internet, not from your documents.
