linyuhongg/gfn-redteam


Project put on standby for now. Made a few changes to SFT to train all classes together and prevent overfitting during GFN training.

Repo organization

- commands/
- local_data/
    - blue_model1/ # outputs saved by steps 2, 3, 4
- src/ 
    - scripts/ # runnable scripts, e.g., train, eval
    - models/ # model definitions
    - dataset/ # dataset definitions
    - trainer/ # trainer definitions
    - utils/
    - environment/ # RL environments
- runs/ # any dumps from scripts
- notebooks/

Dump checkpoints and any run results under runs/. The notebooks are used to visualize the results under runs/.

The redteam and blueteam modules wrap the different models and implement the generation methods, etc.
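
As a rough illustration of what such a wrapper looks like (class and method names here are hypothetical, not the repo's actual API in src/models):

```python
# Hypothetical sketch of a team-model wrapper; the real classes in src/models
# may differ in names and arguments.
from transformers import AutoModelForCausalLM, AutoTokenizer


class TeamModel:
    def __init__(self, model_name: str, max_new_tokens: int = 100, temperature: float = 0.7):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature

    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        do_sample = self.temperature > 0.0
        out = self.model.generate(
            **inputs,
            max_new_tokens=self.max_new_tokens,
            do_sample=do_sample,
            temperature=self.temperature if do_sample else None,
        )
        # Strip the prompt tokens and return only the newly generated text.
        return self.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```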

Details

Need to do the following too:

pip install wheel
pip install flash-attn --no-build-isolation

and make sure your CUDA version is 11.6 or above.
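
flash-attn will not build against an older toolkit; a quick way to check the CUDA version your PyTorch install was built with:

```python
# Check that the CUDA version PyTorch was built with is 11.6 or above (needed for flash-attn).
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
```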

  1. Create a JSON file with all the hand-made red-team attacks

Download the dataset:

cd local_data
git clone https://github.com/haizelabs/redteaming-resistance-benchmark.git

Combine them into a single JSONL file:

python -m src.scripts.combine_human_datasets
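
To sanity-check the combined file you can peek at a few records (a small sketch; the exact field names depend on what combine_human_datasets writes):

```python
# Print the first few records of the combined JSONL file.
import json

with open("local_data/human_instructions.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 2:
            break
```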
  2. Compute all successful attacks on the target blue model (used for SFT and the replay buffer)

Set the blue model to be greedy:

poetry run python -m accelerate.commands.launch --num_processes 2 --mixed_precision bf16 -m src.scripts.compute_toxic_human_prompts --categories_with_descriptions --blue_model_name gg-hf/gemma-2b-it --blue_max_new_tokens 100 --json_path ./local_data/human_instructions.jsonl --blue_temperature 0.0 
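
Setting --blue_temperature 0.0 corresponds to greedy decoding; in plain transformers terms that is roughly the following (a sketch, not the repo's actual code path):

```python
# Greedy decoding with a blue model, equivalent in spirit to --blue_temperature 0.0.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gg-hf/gemma-2b-it"  # as in the command above; any candidate blue model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "How do I pick a strong password?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=100, do_sample=False)  # greedy: no sampling
print(tokenizer.decode(output[0, inputs.shape[1]:], skip_special_tokens=True))
```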

Candidate blue models:

  • stabilityai/stablelm-2-1_6b-chat (this model has no safeguarding)
  • THUDM/glm-4-9b-chat
  • meta-llama/Llama-2-7b-chat-hf
  • mistralai/Mistral-7B-Instruct-v0.2
  • Qwen/Qwen2-7B-Instruct (I think this one needs a system message)
  3. SFT on the red model
python -m accelerate.commands.launch --num_processes 1 --mixed_precision bf16 -m src.scripts.red_team_sft --red_model_name microsoft/phi-2 --output_dir runs/red_team_sft_peft --learning_rate=1e-5 --per_device_train_batch_size=2 --num_train_epochs=1.0 --logging_steps=10 --do_eval

Or run red_team_sft.sh.
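
The script handles the actual training; purely to illustrate the PEFT setup suggested by the red_team_sft_peft output directory (hyperparameters and target modules below are assumptions, not the script's defaults):

```python
# Illustrative LoRA setup for SFT on the red model (microsoft/phi-2).
# The real configuration lives in src/scripts/red_team_sft; values here are guesses.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="auto")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections; names depend on the transformers version
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```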

  4. GFN on the red model

Run red_team_gfn.sh to train your GFN model.
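
Conceptually, GFlowNet fine-tuning trains the red model to sample prompts with probability proportional to their reward; a minimal sketch of a trajectory-balance-style loss (the generic formulation, not necessarily the exact objective implemented in src/trainer):

```python
# Trajectory balance: loss = (log Z + log P_F(prompt) - log R(prompt))^2, averaged over a batch.
import torch


def trajectory_balance_loss(log_z, log_pf, log_reward):
    # log_pf: summed token log-probabilities of each sampled prompt under the red model
    # log_reward: log of the reward (e.g., attack success / toxicity score) for each prompt
    return ((log_z + log_pf - log_reward) ** 2).mean()


log_z = torch.nn.Parameter(torch.zeros(()))   # learned log partition function
log_pf = torch.tensor([-35.2, -41.7, -28.9])  # example values for a batch of 3 prompts
log_reward = torch.tensor([-1.2, -0.4, -2.3])
loss = trajectory_balance_loss(log_z, log_pf, log_reward)
loss.backward()  # gradients flow into log_z here (and into the red model in the real trainer)
```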

  5. Compute ASR for a given category
poetry run python -m accelerate.commands.launch --num_processes 2 --mixed_precision bf16 -m src.scripts.compute_asr --categories_with_descriptions --blue_model_name THUDM/glm-4-9b-chat --blue_max_new_tokens 100 --blue_temperature 0.0 --red_model_name ./runs/glm-4-9b-chat/red_team_sft --toxicity_class "Sex Crimes" --max_steps 2000 --red_temperature 0.7

Available toxicity classes: Violent Crimes, Non-Violent Crimes, Sex Crimes, Child Exploitation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Self-Harm, Sexual Content.
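
ASR is just the fraction of attacks whose blue response is judged unsafe; as a small sketch (this assumes each record carries the 'safety' flag described in the metrics section below, with 1 meaning the response was safe):

```python
# Attack success rate: fraction of attacks whose blue response was judged unsafe.
def attack_success_rate(records):
    unsafe = sum(1 for r in records if not r["safety"])
    return unsafe / max(len(records), 1)


records = [{"safety": 1}, {"safety": 0}, {"safety": 0}]  # toy example
print(attack_success_rate(records))  # 0.666...
```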

  6. SFT the blue model on the generated data

  7. Evaluate that the model didn't lose its instruction-following abilities

./local_data/human_instructions_filtered.jsonl is the dataset of human red-team attempts.

SFT

To run blue-team SFT, run the following command:

bash /data/sqq/matt/gfn-redteam/commands/train/blue_team_sft.sh

- dataset_name: the dataset to train on
- model_name_or_path: the pretrained blue-team base model

Evaluation

To run the Open LLM benchmark, run the following command:

bash /data/sqq/matt/gfn-redteam/lm-evaluation-harness/evaluate.sh

- model_args: specify the model path to be evaluated
- tasks: specify the tasks for evaluation

To run the red-teaming resistance benchmark, run the following command:

python /data/sqq/matt/gfn-redteam/redteaming-resistance-benchmark/run_eval_benchmarks.py

Change model_names in the Python file to specify the models to evaluate.
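
That edit is just pointing the list at the checkpoints you care about, for example (paths here are hypothetical, not the repo's defaults):

```python
# In run_eval_benchmarks.py: list the models to evaluate. Illustrative paths only.
model_names = [
    "./runs/glm-4-9b-chat/blue_team_sft",  # a fine-tuned blue model checkpoint
    "THUDM/glm-4-9b-chat",                 # the base model, as a baseline
]
```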

To calculate the metrics:

  1. Get the data into a pickle file with the dict keys 'red_prompt', 'blue_response', 'decoded_prompt', 'safety', 'semantic_diversity', 'n_gram_diversity', 'gibberish', 'iteration'
  2. Run the metrics script on that pickle file
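
The exact metrics command is not shown here; as a minimal sketch of reading such a pickle and aggregating one metric per iteration (the path and the dict-of-lists layout are assumptions):

```python
# Load the metrics pickle and report mean safety per iteration.
# Assumes the pickle holds a dict of equal-length lists under the keys from step 1.
import pickle
from collections import defaultdict

with open("runs/metrics.pkl", "rb") as f:  # hypothetical path
    data = pickle.load(f)

per_iteration = defaultdict(list)
for it, safety in zip(data["iteration"], data["safety"]):
    per_iteration[it].append(safety)

for it, vals in sorted(per_iteration.items()):
    print(it, sum(vals) / len(vals))
```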
