The latest version of this repository is now at https://github.com/athina-ai/athina-evals
Athina is an LLM output testing SDK + observability platform that helps you write tests and monitor your app in production.
Reliability of output is one of the biggest challenges for people trying to use LLM apps in production.
Since LLM outputs are non-deterministic, it’s very hard to measure how good the output is.
Eyeballing the responses from an LLM can work in development, but it’s not a great solution.
In production, it’s virtually impossible to eyeball thousands of responses, which means you have very little visibility into how well your LLM is performing.
- Do you know when your LLM app is hallucinating?
- How do you know how well it's really performing?
- Do you know how often it’s producing a critically bad output?
- How do you know what your users are seeing?
- How do you measure how good your LLM responses are? And if you can’t measure it, how do you improve the accuracy?
If these sound like problems to you (today or in the future), please reach out to us at [email protected]. We’d love to hear more!
pip install magik
See https://docs.magiklabs.app for instructions on how to write and run tests.
Who is this product meant for?
- If you're in the early stages of building an LLM app:
- If you have an LLM app in production:
Test-driven development can speed up your development very nicely, and can help you engineer your prompts to be more robust.
For example, assuming your prompt looks like this:
Create some marketing copy for a tweet of less than 280 characters for my app {app_name}.
My app helps people generate sales emails using AI.
Make sure the marketing copy contains a complete and valid link to my app.
Here is the link to my app: https://magiklabs.app.
You can write tests like this:
from magik.evaluators import (
    contains_none,
    contains_link,
    contains_valid_link,
    is_positive_sentiment,
    length_less_than,
)

# Local context - this is used as the "ground truth" data that you can compare against in your tests
test_context = {}

# Define tests here
def define_tests(context: dict):
    return [
        {
            "description": "output contains a link",
            "eval": contains_link(),
            "prompt_vars": {
                "app_name": "Uber",
            },
            "failure_labels": ["bad_response_format"],
        },
        {
            "description": "output contains a valid link",
            "eval": contains_valid_link(),
            "prompt_vars": {
                "app_name": "Magik",
            },
            "failure_labels": ["bad_response_format"],
        },
        {
            "description": "output sentiment is positive",
            "eval": is_positive_sentiment(),
            "prompt_vars": {
                "app_name": "Lyft",
            },
            "failure_labels": ["negative_sentiment"],
        },
        {
            "description": "output length is less than 280 characters",
            "eval": length_less_than(280),
            "prompt_vars": {
                "app_name": "Facebook",
            },
            "failure_labels": ["negative_sentiment", "critical"],
        },
        {
            "description": "output does not contain hashtags",
            "eval": contains_none(['#']),
            "prompt_vars": {
                "app_name": "Datadog",
            },
            "failure_labels": ["bad_response_format"],
        },
    ]
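To make the shape of these test definitions more concrete, here is a minimal, self-contained sketch of how evaluators of this kind could work. It does not import the magik package and is not its implementation: `length_less_than` and `contains_none` below are hypothetical stand-ins written from their names alone, and `run_tests` is an illustrative helper. The sketch assumes each evaluator is a callable that takes the response string and returns a boolean.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for illustration only, not the magik implementations.
Evaluator = Callable[[str], bool]


def length_less_than(max_chars: int) -> Evaluator:
    """Pass when the response is shorter than max_chars characters."""
    return lambda response: len(response) < max_chars


def contains_none(forbidden: List[str]) -> Evaluator:
    """Pass when none of the forbidden substrings appear in the response."""
    return lambda response: not any(item in response for item in forbidden)


def run_tests(tests: List[Dict], response: str) -> List[Dict]:
    """Apply each test's evaluator to one response and collect the failures."""
    failures = []
    for test in tests:
        if not test["eval"](response):
            failures.append(
                {
                    "description": test["description"],
                    "failure_labels": test["failure_labels"],
                }
            )
    return failures


tests = [
    {
        "description": "output length is less than 280 characters",
        "eval": length_less_than(280),
        "failure_labels": ["critical"],
    },
    {
        "description": "output does not contain hashtags",
        "eval": contains_none(["#"]),
        "failure_labels": ["bad_response_format"],
    },
]

# A made-up response that violates the hashtag rule.
response = "Write better sales emails with AI! #sales https://magiklabs.app"
failures = run_tests(tests, response)
for failure in failures:
    print(failure["description"], failure["failure_labels"])
```

Returning a factory function for each evaluator (for example `length_less_than(280)`) keeps the test definition declarative: the threshold is fixed when the test is defined, and the same evaluator can then be applied to any number of responses.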
You can use our evaluation & monitoring platform to:
- Observe the prompt-response pairs in production, and analyze response times, cost, token usage, etc. for different prompts and date ranges.
- Evaluate your production responses against your own tests to get a quantifiable understanding of how well your LLM app is performing.
  - For example, you can run the tests you defined against the LLM responses you are getting in production to measure how your app is performing with real data.
- Filter by failure labels, severity, prompt, etc. to identify different types of errors that are occurring in your LLM outputs.
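As a rough illustration of what filtering and quantifying by failure label could look like, the sketch below aggregates a handful of evaluation results in plain Python. The record layout (`prompt`, `failure_labels`) and the helper names are assumptions made for this example; they are not the platform's data model or API.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical evaluation results; the field names are assumptions.
results = [
    {"prompt": "tweet_copy", "failure_labels": []},
    {"prompt": "tweet_copy", "failure_labels": ["bad_response_format"]},
    {"prompt": "tweet_copy", "failure_labels": ["negative_sentiment", "critical"]},
    {"prompt": "sales_email", "failure_labels": ["critical"]},
]


def count_failure_labels(results: List[Dict]) -> Counter:
    """Count how often each failure label occurs across all results."""
    counts: Counter = Counter()
    for result in results:
        counts.update(result["failure_labels"])
    return counts


def filter_by_label(results: List[Dict], label: str) -> List[Dict]:
    """Keep only the results that carry the given failure label."""
    return [r for r in results if label in r["failure_labels"]]


def failure_rate(results: List[Dict]) -> float:
    """Fraction of results that failed at least one test."""
    if not results:
        return 0.0
    failed = sum(1 for r in results if r["failure_labels"])
    return failed / len(results)


label_counts = count_failure_labels(results)
critical = filter_by_label(results, "critical")
rate = failure_rate(results)

print(label_counts.most_common())
print(len(critical), "critical failures")
print(f"failure rate: {rate:.0%}")
```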
See https://magiklabs.app for more details, or contact us at [email protected]
Soon, you will also be able to:
- Fail bad outputs before they get to your users.
  - For example, if the LLM response contains sensitive information like PII, you can detect that in real time and cut it off before it reaches the end user.
- Set up alerts to notify you about critical errors in production.
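The idea of failing a bad output before it reaches the user can be sketched as a small guard function. This is an assumption-laden illustration, not the planned feature: `contains_email` and `guard_response` are hypothetical names, and a single email regex stands in for real PII detection, which would need to cover far more cases.

```python
import re

# Simplified email pattern used as a stand-in for PII detection (assumption).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

FALLBACK_MESSAGE = "Sorry, I can't share that information."


def contains_email(response: str) -> bool:
    """Return True when the response appears to contain an email address."""
    return EMAIL_PATTERN.search(response) is not None


def guard_response(response: str, fallback: str = FALLBACK_MESSAGE) -> str:
    """Return the fallback instead of the response when an email is detected."""
    if contains_email(response):
        return fallback
    return response


safe = guard_response("Our app helps you write sales emails faster.")
blocked = guard_response("You can reach Jane at jane.doe@example.com.")
print(safe)
print(blocked)
```

Running the check on the full response before returning it keeps the guard simple; with streamed responses, the same check would have to run on a buffered prefix instead.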
Contact us at [email protected] to get access to our LLM observability platform where you can run the tests you've defined here against your LLM responses in production.