A Python library for generating completions and evaluating NaniDAO models using different datasets, prompts, and configuration files.
- Set up environment variables in `.env`:

```
NANI_API_KEY=your_nani_api_key
NANI_BASE_URL=http://nani.ooo/api/chat
GEMINI_API_KEY=your_gemini_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
OPENAI_API_KEY=your_openai_api_key
```
If these are not set, you can specify individual API keys and base URLs via CLI arguments.
- Install dependencies:

```bash
# Using uv (recommended)
git clone https://github.com/NaniDAO/evals.git
cd evals
uv pip install -e .

# Or using pip
git clone https://github.com/NaniDAO/evals.git
cd evals
pip install -e .
```
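To confirm the installation, the entry point should respond to the standard help flag (a sanity check, assuming the usual argparse behavior):

```bash
# Verify the CLI is on PATH and list its flags
nanidao-evals --help
```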
This directory contains previous completions and evaluations run on different LLMs using datasets from `nanidao_evals/data/datasets`. Inspect the individual metadata for details. It is not part of the `nanidao-evals` package.
Generate completions using default settings:

```bash
# Using environment variables from .env
nanidao-evals

# Or passing credentials via CLI
nanidao-evals \
  --providers nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-api-keys "nani:your-api-key"
```
Evaluate existing completions:

```bash
# Using environment variables
nanidao-evals --evaluation-judge nani --evaluate-file out/completions.json

# Or passing credentials via CLI
nanidao-evals \
  --evaluation-judge nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-api-keys "nani:your-api-key" \
  --evaluate-file out/completions.json
```
```bash
# List available behaviors (default dataset: JBB)
nanidao-evals --list-behaviors

# List categories in a specific dataset
nanidao-evals --completions-dataset NANI --list-categories

# Show prompts that match specific criteria
nanidao-evals --show-prompts --dataset-category Hardware --completions-dataset NANI

# Generate completions from a specific dataset
nanidao-evals --completions-dataset NANI

# Generate with multiple configurations
nanidao-evals \
  --providers nani \
  --config-file configs/multi_temp.json \
  --completions-dataset NANI
```
Example `configs/multi_temp.json`:

```json
[
  {
    "temperature": 0.7,
    "max_tokens": 1000,
    "top_p": 1.0
  },
  {
    "temperature": 0.9,
    "max_tokens": 1500,
    "top_p": 0.9
  }
]
```
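Each object in the array is one generation configuration, and the run produces completions for every entry. A quick sanity check of such a file is plain JSON loading (a minimal sketch, independent of the package itself):

```python
import json

# Load the list of generation configs and show what each run will use
with open("configs/multi_temp.json") as f:
    configs = json.load(f)

for i, cfg in enumerate(configs):
    print(f"config {i}: temperature={cfg['temperature']}, "
          f"max_tokens={cfg['max_tokens']}, top_p={cfg['top_p']}")
```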
```bash
# Generate and evaluate completions using Gemini
nanidao-evals --evaluation-judge gemini

# Evaluate an existing completions file
nanidao-evals --evaluation-judge anthropic --evaluate-file out/completions.json
```
Datasets for generating completions are found in `data/datasets/jailbreaks_datasets.json`.

Prompts for generating evaluations are found in `data/prompts/eval_prompts.json`.
Generate completions using a specific provider and model:

```bash
nanidao-evals \
  --providers nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-models "nani:deepseek-r1-qwen-2.5-32B-ablated" \
  --provider-api-keys "nani:your-api-key" \
  --completions-dataset NANI
```
```python
from nanidao_evals.generators.completions import CompletionGenerator

# Configure providers
provider_configs = {
    "nani": {
        "base_url": "https://nani.ooo/api/chat",
        "api_key": "your-api-key",
        "model": "deepseek-r1-qwen-2.5-32B-ablated"
    }
}

# Create generator
generator = CompletionGenerator(
    providers=["nani"],
    provider_configs=provider_configs
)

# Generate completions
results = generator.generate_completions(
    dataset_path="data/datasets/nani_dataset.json",
    categories=["Hardware"],
    behaviors=["Engineering"]
)
```
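The return type of `generate_completions` is not documented here; assuming it is JSON-serializable (the CLI itself writes completions to JSON), the results from the example above could be saved for later evaluation with `--evaluate-file`:

```python
import json

# Persist the results for later evaluation
# (assumes `results` from the example above is JSON-serializable)
with open("out/completions.json", "w") as f:
    json.dump(results, f, indent=2)
```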
```python
from nanidao_evals.apis.analyzer import create_handler

# Initialize handler
handler = create_handler(
    provider="nani",
    model="deepseek-r1-qwen-2.5-32B-ablated",
    base_url="https://nani.ooo/api/chat",
    api_key="your-api-key"
)

# Generate response
response = handler.generate_response("Your prompt here")
```
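Since `generate_response` takes a single prompt, batching is just a loop over the call shown above (a sketch; the prompt strings are placeholders):

```python
# Batch several prompts through the handler created above
prompts = ["First prompt", "Second prompt"]
responses = [handler.generate_response(p) for p in prompts]
for prompt, reply in zip(prompts, responses):
    print(f"{prompt!r} -> {reply!r}")
```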
Generate completions for specific categories/behaviors:

```bash
nanidao-evals \
  --completions-dataset NANI \
  --dataset-category Hardware \
  --dataset-behavior Engineering
```
Evaluate with a specific model and prompt:

```bash
nanidao-evals \
  --evaluation-judge anthropic \
  --provider-models "anthropic:claude-3-5-sonnet-20241022" \
  --evaluation-prompt eval0_system_prompt
```
Generate and evaluate in one run with filters:

```bash
nanidao-evals \
  --completions-dataset NANI \
  --evaluation-judge gemini \
  --dataset-category Hardware \
  --dataset-source Original
```
- `--providers`: List of providers to use (e.g., nani, gemini, anthropic)
- `--provider-urls`: Base URLs for providers (format: `provider:url`)
- `--provider-models`: Model names for providers (format: `provider:model`)
- `--provider-api-keys`: API keys for providers (format: `provider:key`)
- `--list-behaviors`: Show available behaviors
- `--list-categories`: Show available categories
- `--list-sources`: Show available sources
- `--show-prompts`: Display prompts matching filters
- `--completions-dataset`: Select dataset (default: JBB)
- `--dataset-category`: Filter by categories
- `--dataset-behavior`: Filter by behaviors
- `--dataset-source`: Filter by sources
- `--output-dir`: Output directory (default: `out`)
- `--config-file`: Custom model configuration file (supports multiple configs)
- `--evaluation-judge`: Judge provider (gemini/anthropic/openai/nani)
- `--evaluation-prompt`: Evaluation prompt (default: `eval0_system_prompt`)
- `--evaluate-file`: Existing completions file to evaluate
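These flags compose; for example, a single run that generates filtered completions, evaluates them, and writes everything to a custom directory (using only flags documented above):

```bash
nanidao-evals \
  --providers nani \
  --completions-dataset NANI \
  --dataset-category Hardware \
  --evaluation-judge gemini \
  --output-dir results
```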
| Provider | Model |
|---|---|
| gemini | gemini-2.0-flash-exp |
| anthropic | claude-3-5-sonnet-20241022 |
| openai | gpt-4o-mini-2024-07-18 |
| nani | NaniDAO/deepseek-r1-qwen-2.5-32B-ablated |
| huggingface | tgi |
```
/apis       - LLM provider implementations
/data
  /configs  - Model configurations
  /datasets - Input datasets
  /prompts  - Evaluation prompts
/generators - Core generation/evaluation logic
```
Results are saved with timestamps:

```
out/YYYYMMDD_HHMMSS_completions.json   # For completions
out/YYYYMMDD_HHMMSS_eval_provider.json # For evaluations
```
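Because the `YYYYMMDD_HHMMSS` prefix sorts lexicographically in chronological order, the newest completions file can be located without parsing dates (a minimal sketch assuming the naming scheme above):

```python
from pathlib import Path

# Timestamped names sort chronologically, so the last one is the newest
candidates = sorted(Path("out").glob("*_completions.json"))
latest = candidates[-1] if candidates else None
print(latest)
```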
Previous evaluation results are available in `data/info/old_evals/`.
All providers can be configured either through environment variables in `.env` or via CLI arguments.
Custom HuggingFace endpoint:

```bash
nanidao-evals \
  --providers huggingface \
  --provider-urls "huggingface:https://your-endpoint" \
  --provider-models "huggingface:your-model" \
  --provider-api-keys "huggingface:your-key"
```
Nani provider with a specific model:

```bash
nanidao-evals \
  --providers nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-models "nani:NaniDAO/deepseek-r1-qwen-2.5-32B-ablated"
```
Multiple providers in a single run:

```bash
nanidao-evals \
  --providers nani huggingface \
  --provider-urls "nani:https://nani.ooo/api/chat" "huggingface:https://your-endpoint" \
  --provider-models "nani:model1" "huggingface:model2"
```