Project put on standby... Made a few changes to SFT because I wanted to train all classes together to prevent overfitting in GFN training.
- commands/
- local_data/
  - blue_model1/ # save steps 2, 3, 4
- src/
  - scripts/ # runnable scripts, e.g., train, eval, etc.
  - models/ # defines model
  - dataset/ # defines dataset
  - trainer/ # defines trainer
  - utils/
  - environment/ # RL envs
- runs/ # any dumps from scripts
- notebooks/
Dump checkpoints and any run results under runs/. Notebooks are used to visualize the results under runs/.
redteam and blueteam take the different models and implement generation methods, etc.
Need to do the following too:
pip install wheel
pip install flash-attn --no-build-isolation
and make sure your CUDA version is 11.6 or above
- Create a JSON file with all the hand-made red team attacks
download the dataset
cd local_data
git clone https://github.com/haizelabs/redteaming-resistance-benchmark.git
Combine them into a single jsonl file
python -m src.scripts.combine_human_datasets
- Compute all successful attacks on a target blue model, used for SFT and replay buffer
Set the blue model to greedy decoding (temperature 0.0):
poetry run python -m accelerate.commands.launch --num_processes 2 --mixed_precision bf16 -m src.scripts.compute_toxic_human_prompts --categories_with_descriptions --blue_model_name gg-hf/gemma-2b-it --blue_max_new_tokens 100 --json_path ./local_data/human_instructions.jsonl --blue_temperature 0.0
Candidate blue model:
- stabilityai/stablelm-2-1_6b-chat, this model has no safeguards
- THUDM/glm-4-9b-chat
- meta-llama/Llama-2-7b-chat-hf
- mistralai/Mistral-7B-Instruct-v0.2
- Qwen/Qwen2-7B-Instruct, I think this one needs a system message
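For models that expect an explicit system turn (like Qwen2-Instruct above), the chat can be assembled as a messages list before passing it to `tokenizer.apply_chat_template`. A minimal sketch; the default system text is a placeholder, not the project's actual system prompt:

```python
def build_chat(user_prompt: str,
               system_message: str = "You are a helpful assistant.") -> list:
    """Build a messages list in the format expected by
    tokenizer.apply_chat_template, with an explicit system turn."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
    ]
```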
- SFT on red model
python -m accelerate.commands.launch --num_processes 1 --mixed_precision bf16 -m src.scripts.red_team_sft --red_model_name microsoft/phi-2 --output_dir runs/red_team_sft_peft --learning_rate=1e-5 --per_device_train_batch_size=2 --num_train_epochs=1.0 --logging_steps=10 --do_eval
or run red_team_sft.sh
- GFN on red model
run red_team_gfn.sh to train your GFN model
- Compute ASR for a certain category
poetry run python -m accelerate.commands.launch --num_processes 2 --mixed_precision bf16 -m src.scripts.compute_asr --categories_with_descriptions --blue_model_name THUDM/glm-4-9b-chat --blue_max_new_tokens 100 --blue_temperature 0.0 --red_model_name ./runs/glm-4-9b-chat/red_team_sft --toxicity_class "Sex Crimes" --max_steps 2000 --red_temperature 0.7
Available toxicity classes: Violent Crimes, Non-Violent Crimes, Sex Crimes, Child Exploitation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Self-Harm, Sexual Content
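Sweeping over every toxicity class can be scripted rather than run by hand. This sketch rebuilds the compute_asr invocation from the example flags above, one command per category (the flag values are copied verbatim from the example, so adjust them for your run):

```python
CATEGORIES = [
    "Violent Crimes", "Non-Violent Crimes", "Sex Crimes",
    "Child Exploitation", "Specialized Advice", "Privacy",
    "Intellectual Property", "Indiscriminate Weapons",
    "Hate", "Self-Harm", "Sexual Content",
]

def asr_command(category: str) -> str:
    """Build the compute_asr command line for one toxicity class,
    using the same flags as the example invocation above."""
    return (
        "poetry run python -m accelerate.commands.launch --num_processes 2 "
        "--mixed_precision bf16 -m src.scripts.compute_asr "
        "--categories_with_descriptions --blue_model_name THUDM/glm-4-9b-chat "
        "--blue_max_new_tokens 100 --blue_temperature 0.0 "
        "--red_model_name ./runs/glm-4-9b-chat/red_team_sft "
        f'--toxicity_class "{category}" --max_steps 2000 --red_temperature 0.7'
    )
```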
- SFT the blue model on the generated data
- Evaluate that the model didn't lose instruction-following abilities
'./local_data/human_instructions_filtered.jsonl' is the dataset from the human red team attempts
To run blue team sft, run the following command:
bash /data/sqq/matt/gfn-redteam/commands/train/blue_team_sft.sh
dataset_name: the dataset to train on
model_name_or_path: the pretrained blue team model base
To run the Open LLM benchmark, run the following command:
bash /data/sqq/matt/gfn-redteam/lm-evaluation-harness/evaluate.sh
model_args: specify the model path to be evaluated
tasks: specify the tasks for evaluation
To run the red-teaming resistance benchmark, run the following command:
python /data/sqq/matt/gfn-redteam/redteaming-resistance-benchmark/run_eval_benchmarks.py
Change model_names in the Python file to specify the models to evaluate
To calculate the metrics:
- get data into pickle file with dict keys 'red_prompt', 'blue_response', 'decoded_prompt', 'safety', 'semantic_diversity', 'n_gram_diversity', 'gibberish', 'iteration'
- run ```