
feat(extra): aggregation prompts tuning #83

Merged · 13 commits · Sep 24, 2024
62 changes: 56 additions & 6 deletions extra/prompt_tuning/README.md
@@ -2,31 +2,81 @@

This folder contains scripts for prompt tuning and evaluation. The following prompts (programs) are used in dbally:

- `FILTERING_ASSESSOR` - assesses whether a question requires filtering.
- `FilteringAssessor` - assesses whether a question requires filtering.
- `AggregationAssessor` - assesses whether a question requires aggregation.

All evaluations are run on a dev split of the [BIRD](https://bird-bench.github.io/) dataset. For now, one configuration is available to run the suite against the `superhero` database.
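
To inspect the data locally, the split can be pulled straight from the Hugging Face Hub; a minimal sketch, assuming the dataset path from `config/data/superhero.yaml` and a `db_id` column on each record:

```python
from datasets import load_dataset

# Assumption: the BIRD dev split is hosted under the path used in
# config/data/superhero.yaml and each row carries a "db_id" column.
dataset = load_dataset("deepsense-ai/bird-iql", split="dev")
superhero = dataset.filter(lambda row: row["db_id"] == "superhero")
print(len(superhero), "examples for the superhero database")
```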

## Usage

### Train new prompts

Tune the `filtering-assessor` prompt on the baseline signature using the [COPRO](https://dspy-docs.vercel.app/docs/deep-dive/teleprompter/signature-optimizer#how-copro-works) optimizer on the `superhero` database with `gpt-3.5-turbo`:

```bash
python train.py prompt/type=filtering-assessor prompt/signature=baseline prompt/program=predict
```

Change optimizer to [MIPRO](https://dspy-docs.vercel.app/docs/cheatsheet#mipro):

```bash
python train.py prompt/type=filtering-assessor prompt/signature=baseline prompt/program=predict optimizer=mipro
```

Train multiple prompts:

```bash
python train.py --multirun \
prompt/type=filtering-assessor \
prompt/signature=baseline \
prompt/program=predict,cot
```

Tweak optimizer params to get different results:

```bash
python train.py \
optimizer=copro \
optimizer.params.breadth=2 \
optimizer.params.depth=3 \
optimizer.params.init_temperature=1.0
```
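
These overrides patch the fields defined in `config/optimizer/copro.yaml` (included later in this diff), so the effective optimizer config would look roughly like:

```yaml
# Effective COPRO config after the overrides above (illustrative).
name: COPRO
params:
  breadth: 2
  depth: 3
  init_temperature: 1.0
compile:
```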

### Evaluate prompts

Run evaluation of the filtering assessor baseline on the `superhero` database with `gpt-3.5-turbo`:

```bash
python evaluate.py program=filtering-assessor-baseline
python evaluate.py prompt/type=filtering-assessor prompt/signature=baseline prompt/program=predict
```

Test multiple programs:
Test multiple prompts:

```bash
python evaluate.py --multirun \
prompt/type=filtering-assessor \
prompt/signature=baseline \
prompt/program=predict,cot
```

```bash
python evaluate.py --multirun program=filtering-assessor-baseline,filtering-assessor-cot
python evaluate.py --multirun \
prompt/type=aggregation-assessor \
prompt/signature=baseline \
prompt/program=predict,cot
```

Compare prompt performance on multiple LLMs:

```bash
python evaluate.py --multirun program=filtering-assessor-baseline llm=gpt-3.5-turbo,claude-3.5-sonnet
python evaluate.py --multirun \
prompt/type=filtering-assessor \
prompt/signature=baseline \
prompt/program=predict \
llm=gpt-3.5-turbo,claude-3.5-sonnet
```
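
The `llm` configs are not part of this diff, but given how `train.py` reads them (`config.llm.provider` and `config.llm.model_name`, resolved via `dspy.__dict__[provider]`), a plausible shape is:

```yaml
# Hypothetical config/llm/gpt-3.5-turbo.yaml -- not included in this diff.
# provider must name a dspy LM client class, e.g. dspy.OpenAI.
provider: OpenAI
model_name: gpt-3.5-turbo
```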

### Log to Neptune
#### Log to Neptune

Before running the evaluation with Neptune, configure the following environment variables:

2 changes: 1 addition & 1 deletion extra/prompt_tuning/config/data/superhero.yaml
@@ -1,4 +1,4 @@
path: "micpst/bird-iql"
path: "deepsense-ai/bird-iql"
split: "dev"
db_ids: ["superhero"]
difficulties: ["simple", "moderate", "challenging"]
@@ -1,7 +1,8 @@
defaults:
  - data: superhero
  - llm: gpt-3.5-turbo
  - program: filtering-assessor-baseline
  - prompt: prompt
  - _self_

num_threads: 32
neptune: False
6 changes: 6 additions & 0 deletions extra/prompt_tuning/config/optimizer/copro.yaml
@@ -0,0 +1,6 @@
name: COPRO
params:
  breadth: 4
  depth: 15
  init_temperature: 1.5
compile:
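
Note that the bare `compile:` key parses to `None` in YAML, which is why `train.py` (below) expands it with an `or {}` guard; a quick illustration:

```python
from omegaconf import OmegaConf

config = OmegaConf.create("compile:")  # a bare key parses to None
assert config.compile is None
kwargs = config.compile or {}          # the guard used in train.py
assert kwargs == {}
```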
9 changes: 9 additions & 0 deletions extra/prompt_tuning/config/optimizer/mipro.yaml
@@ -0,0 +1,9 @@
name: MIPRO
params:
  num_candidates: 3
  init_temperature: 1.4

compile:
  max_bootstrapped_demos: 3
  max_labeled_demos: 0
  num_trials: 10
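
Unlike COPRO, MIPRO takes additional arguments at `compile` time; `train.py` (below) forwards this `compile:` section as keyword arguments, so the call is roughly equivalent to:

```python
# Sketch of the expanded MIPRO compile call, using the values from mipro.yaml
# and num_threads from train.yaml.
compiled_program = optimizer.compile(
    student=program,
    trainset=dataset,
    eval_kwargs={"num_threads": 32, "display_progress": True},
    max_bootstrapped_demos=3,
    max_labeled_demos=0,
    num_trials=10,
)
```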

This file was deleted.

This file was deleted.

1 change: 1 addition & 0 deletions extra/prompt_tuning/config/prompt/program/cot.yaml
@@ -0,0 +1 @@
id: CoT
1 change: 1 addition & 0 deletions extra/prompt_tuning/config/prompt/program/coth.yaml
@@ -0,0 +1 @@
id: CoTH
1 change: 1 addition & 0 deletions extra/prompt_tuning/config/prompt/program/predict.yaml
@@ -0,0 +1 @@
id: Predict
8 changes: 8 additions & 0 deletions extra/prompt_tuning/config/prompt/prompt.yaml
@@ -0,0 +1,8 @@
defaults:
  - type: filtering-assessor
  - signature: baseline
  - program: predict
  - _self_

num_threads: 32
neptune: False
1 change: 1 addition & 0 deletions extra/prompt_tuning/config/prompt/signature/baseline.yaml
@@ -0,0 +1 @@
id: Baseline
1 change: 1 addition & 0 deletions extra/prompt_tuning/config/prompt/signature/optimized.yaml
@@ -0,0 +1 @@
id: Optimized
@@ -0,0 +1 @@
id: AggregationAssessor
@@ -0,0 +1 @@
id: FilteringAssessor
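
These `type` ids are concatenated with the `signature` and `program` ids to form registry keys, as done in `evaluate.py` and `train.py` below:

```python
# How the prompt config ids compose into registry keys (see evaluate.py / train.py).
type_id, signature_id, program_id = "FilteringAssessor", "Baseline", "CoT"

signature_name = f"{type_id}{signature_id}"  # -> "FilteringAssessorBaseline"
program_name = f"{type_id}{program_id}"      # -> "FilteringAssessorCoT"

# signature = SIGNATURES[signature_name]
# program = PROGRAMS[program_name](signature)
```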
8 changes: 8 additions & 0 deletions extra/prompt_tuning/config/train.yaml
@@ -0,0 +1,8 @@
defaults:
  - data: superhero
  - llm: gpt-3.5-turbo
  - prompt: prompt
  - optimizer: copro
  - _self_

num_threads: 32
43 changes: 15 additions & 28 deletions extra/prompt_tuning/evaluate.py
@@ -1,6 +1,5 @@
import asyncio
import logging
from enum import Enum
from pathlib import Path

import dspy
@@ -9,45 +8,32 @@
from dspy.evaluate import Evaluate
from neptune.utils import stringify_unsupported
from omegaconf import DictConfig
from tuning.loaders import IQLGenerationDataLoader
from tuning.metrics import filtering_assess_acc
from tuning import DATALOADERS, METRICS
from tuning.programs import PROGRAMS
from tuning.signatures import SIGNATURES
from tuning.utils import save, serialize_results

logging.getLogger("httpx").setLevel(logging.ERROR)
logging.getLogger("anthropic").setLevel(logging.ERROR)
log = logging.getLogger(__name__)


class EvaluationType(Enum):
    """
    Enum representing the evaluation type.
    """

    FILTERING_ASSESSOR = "FILTERING_ASSESSOR"


EVALUATION_DATALOADERS = {
    EvaluationType.FILTERING_ASSESSOR.value: IQLGenerationDataLoader,
}

EVALUATION_METRICS = {
    EvaluationType.FILTERING_ASSESSOR.value: filtering_assess_acc,
}


async def evaluate(config: DictConfig) -> None:
    """
    Function running evaluation for all datasets and evaluation tasks defined in hydra config.

    Args:
        config: Hydra configuration.
    """
    log.info("Starting evaluation: %s", config.program.name)
    signature_name = f"{config.prompt.type.id}{config.prompt.signature.id}"
    program_name = f"{config.prompt.type.id}{config.prompt.program.id}"

    log.info("Starting evaluation: %s(%s) program", program_name, signature_name)

    dataloader = EVALUATION_DATALOADERS[config.program.type](config)
    metric = EVALUATION_METRICS[config.program.type]
    program = PROGRAMS[config.program.name]()
    dataloader = DATALOADERS[config.prompt.type.id](config)
    metric = METRICS[config.prompt.type.id]
    signature = SIGNATURES[signature_name]
    program = PROGRAMS[program_name](signature)

    dataset = await dataloader.load()

@@ -57,7 +43,7 @@ async def evaluate(config: DictConfig) -> None:
    evaluator = Evaluate(
        devset=dataset,
        metric=metric,
        num_threads=32,
        num_threads=config.num_threads,
        display_progress=True,
        return_outputs=True,
    )
@@ -75,8 +61,9 @@ async def evaluate(config: DictConfig) -> None:
        run = neptune.init_run()
        run["sys/tags"].add(
            [
                config.program.type,
                config.program.name,
                config.prompt.type.id,
                config.prompt.signature.id,
                config.prompt.program.id,
                *config.data.db_ids,
                *config.data.difficulties,
            ]
@@ -86,7 +73,7 @@ async def evaluate(config: DictConfig) -> None:
run["evaluation/results.json"].upload(results_file.as_posix())


@hydra.main(config_path="config", config_name="config", version_base="3.2")
@hydra.main(config_path="config", config_name="evaluate", version_base="3.2")
def main(config: DictConfig) -> None:
"""
Function running evaluation for all datasets and evaluation tasks defined in hydra config.
72 changes: 72 additions & 0 deletions extra/prompt_tuning/train.py
@@ -0,0 +1,72 @@
import asyncio
import logging
from pathlib import Path

import dspy
import dspy.teleprompt
import hydra
from omegaconf import DictConfig
from tuning import DATALOADERS, METRICS
from tuning.programs import PROGRAMS
from tuning.signatures import SIGNATURES

logging.getLogger("httpx").setLevel(logging.ERROR)
logging.getLogger("anthropic").setLevel(logging.ERROR)
log = logging.getLogger(__name__)


async def train(config: DictConfig) -> None:
    """
    Function running training for all datasets and training tasks defined in hydra config.

    Args:
        config: Hydra configuration.
    """
    signature_name = f"{config.prompt.type.id}{config.prompt.signature.id}"
    program_name = f"{config.prompt.type.id}{config.prompt.program.id}"

    log.info("Starting training: %s(%s) program with %s optimizer", program_name, signature_name, config.optimizer.name)

    dataloader = DATALOADERS[config.prompt.type.id](config)
    metric = METRICS[config.prompt.type.id]
    signature = SIGNATURES[signature_name]
    program = PROGRAMS[program_name](signature)

    dataset = await dataloader.load()

    lm = dspy.__dict__[config.llm.provider](model=config.llm.model_name)
    dspy.settings.configure(lm=lm)

    optimizer = dspy.teleprompt.__dict__[config.optimizer.name](metric=metric, **config.optimizer.params)
    compiled_program = optimizer.compile(
        student=program,
        trainset=dataset,
        eval_kwargs={
            "num_threads": config.num_threads,
            "display_progress": True,
        },
        **(config.optimizer.compile or {}),
    )

    log.info("Training finished. Saving compiled program...")

    output_dir = Path(hydra.core.hydra_config.HydraConfig.get().runtime.output_dir)
    program_file = output_dir / f"{program.__class__.__name__}Optimized.json"
    compiled_program.save(program_file)

    log.info("Compiled program saved under directory: %s", output_dir)


@hydra.main(config_path="config", config_name="train", version_base="3.2")
def main(config: DictConfig) -> None:
"""
Function running evaluation for all datasets and evaluation tasks defined in hydra config.

Args:
config: Hydra configuration.
"""
asyncio.run(train(config))


if __name__ == "__main__":
main() # pylint: disable=no-value-for-parameter
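
Since the optimizer output is saved as plain JSON, the tuned program can later be restored onto a fresh instance; a sketch, with the registry keys and the output path shown here purely as illustrative assumptions:

```python
# Sketch: restoring a program compiled by train.py. The registry keys and the
# file path below are illustrative assumptions, not values from this diff.
from tuning.programs import PROGRAMS
from tuning.signatures import SIGNATURES

program = PROGRAMS["FilteringAssessorPredict"](SIGNATURES["FilteringAssessorBaseline"])
program.load("outputs/FilteringAssessorPredictOptimized.json")  # hydra output dir
```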
24 changes: 24 additions & 0 deletions extra/prompt_tuning/tuning/__init__.py
@@ -0,0 +1,24 @@
from enum import Enum

from .loaders import IQLGenerationDataLoader
from .metrics import aggregation_assess_acc, filtering_assess_acc


class ProgramType(Enum):
    """
    Program types.
    """

    FILTERING_ASSESSOR = "FilteringAssessor"
    AGGREGATION_ASSESSOR = "AggregationAssessor"


DATALOADERS = {
    ProgramType.FILTERING_ASSESSOR.value: IQLGenerationDataLoader,
    ProgramType.AGGREGATION_ASSESSOR.value: IQLGenerationDataLoader,
}

METRICS = {
    ProgramType.FILTERING_ASSESSOR.value: filtering_assess_acc,
    ProgramType.AGGREGATION_ASSESSOR.value: aggregation_assess_acc,
}
5 changes: 3 additions & 2 deletions extra/prompt_tuning/tuning/loaders.py
@@ -1,16 +1,17 @@
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List
from typing import Iterable, List

import dspy.datasets
from dspy import Example
from omegaconf import DictConfig


class DataLoader(ABC):
    """
    Data loader.
    """

    def __init__(self, config: Dict) -> None:
    def __init__(self, config: DictConfig) -> None:
        self.config = config

    @abstractmethod
4 changes: 2 additions & 2 deletions extra/prompt_tuning/tuning/metrics/__init__.py
@@ -1,3 +1,3 @@
from .iql import filtering_assess_acc
from .iql import aggregation_assess_acc, filtering_assess_acc

__all__ = ["filtering_assess_acc"]
__all__ = ["aggregation_assess_acc", "filtering_assess_acc"]