CS224R Final Project
There are three main steps to the RLHF training process:
- Supervised fine-tuning of the base LLM to create the SFT LLM:
./scripts/supervised_finetuning.sh <SFT_MODEL_NAME>
- Reward modeling using dialog pairs from the StackExchange dataset and the SFT LLM to create the reward model (RM); see the pairwise-loss sketch after this list:
./scripts/reward_modeling.sh <RM_MODEL_NAME>
- RL fine-tuning of the SFT LLM with the reward model:
./scripts/rl_training.sh <SFT_MODEL_NAME> <RM_MODEL_NAME> <NUM_TRAINING_EXAMPLES>
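
For reference, the reward modeling stage typically optimizes a standard pairwise ranking loss: the reward model should score the chosen (accepted) answer above the rejected one for the same question. The snippet below is an illustrative, self-contained sketch of that loss, not the project's training code; the random tensors stand in for reward-model outputs.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores):
    # Bradley-Terry style pairwise loss: maximize sigmoid(r_chosen - r_rejected)
    # for each (chosen, rejected) answer pair from the same question.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative only: random scalar scores standing in for reward-model outputs.
chosen = torch.randn(8)
rejected = torch.randn(8)
print(pairwise_reward_loss(chosen, rejected))
```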
LoRA layers were used at all stages to reduce memory requirements. At each stage, the PEFT adapter layers were merged with the base model using:
python merge_peft_adapter.py --adapter_model_name=XXX --base_model_name=YYY --output_name=ZZZ
Note that this script requires peft>=0.3.0.
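
As a rough illustration of what the merge step does (not the contents of merge_peft_adapter.py itself), the sketch below loads a base model, applies a trained LoRA adapter with the peft library, folds the adapter weights into the base weights via merge_and_unload, and saves the result. The XXX/YYY/ZZZ names mirror the placeholders in the command above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder names, matching the command above; substitute real paths or hub IDs.
adapter_model_name = "XXX"
base_model_name = "YYY"
output_name = "ZZZ"

base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_model_name)

# Fold the LoRA weights into the base weights and drop the adapter wrappers.
model = model.merge_and_unload()

model.save_pretrained(output_name)
AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_name)
```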
To evaluate the bias of fine-tuned and debiased GPT-Neo models, run:
python self_debiasing.py --models <MODEL_1> <MODEL_2> ... --modes default debiased
For LLaMA models, run:
python self_debiasing_llama.py --models <MODEL_1> <MODEL_2> ... --modes default debiased
To evaluate the perplexity of fine-tuned and debiased models, run:
python eval_perplexity.py --models <MODEL_1> <MODEL_2> ... --modes default debiased
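
For context, perplexity for a causal LM is the exponential of the average per-token negative log-likelihood. The sketch below shows one common way to compute it with Hugging Face transformers; it is illustrative only (the checkpoint name and text are placeholders), not the contents of eval_perplexity.py.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    # Perplexity = exp(mean per-token negative log-likelihood over the text).
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Example checkpoint only; substitute one of the fine-tuned or debiased models.
name = "EleutherAI/gpt-neo-125m"
model = AutoModelForCausalLM.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)
print(perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog."))
```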