The code has been verified on Python 3.8.19.
$ conda create -n icu python=3.8
$ conda activate icu
# Install the correct torch version depending on CUDA version from https://pytorch.org/
$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
$ pip install -r requirements.txt
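After installing, you can optionally confirm that PyTorch sees your GPU before launching any training job (this sanity check is not part of the original setup instructions):
# optional: check the installed torch version and CUDA visibility
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"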
Training scripts can be found under the scripts directory.
With one GPU ("cuda:0"), the command below starts training of GPT-NEO-125m.
# train-125m.sh
$ python ./train_neo_meanwhile_update.py \
--exp "exp0" --model_name "EleutherAI/gpt-neo-125m" \
--tokenizer_name "EleutherAI/gpt-neo-125m" \
--gpt2_name "openai-community/gpt2" \
--bert_name "google-bert/bert-base-uncased" \
--prefix_length 200 --suffix_length 200 --target_length 200 \
--device "cuda:0" --batch_size 8 --num_workers 8 --lr 5e-6 \
--uw 1.0 --lw 0.5 --kl 1.0 --f1 0.3 --bleu 0.01 --acc 0.5994 \
--el 0.0499 --dir "result/test"
With three GPUs (0, 1, 2), the command below starts training of GPT-NEO-1.3B.
# train-1_3b.sh
$ deepspeed --include localhost:0,1,2 \
./train_neo_meanwhile_update_deepspeed.py \
--deepspeed_config ./config/deepspeed3.json \
--exp "exp0" --model_name "EleutherAI/gpt-neo-1.3B" \
--tokenizer_name "EleutherAI/gpt-neo-1.3B" \
--gpt2_name "openai-community/gpt2" \
--bert_name "google-bert/bert-base-uncased" \
--prefix_length 200 --suffix_length 200 --target_length 200 \
--batch_size 4 --num_workers 8 --lr 5e-6 \
--uw 1.0 --lw 0.5 --kl 1.0 --f1 0.3 --bleu 0.01 --acc 0.5994 \
--el 0.0499 --dir "result/test"
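The DeepSpeed config files referenced above (config/deepspeed3.json and config/deepspeed6.json) are not reproduced in this README. The snippet below is only a minimal sketch of what such a config typically contains (per-GPU micro batch size, gradient accumulation, fp16, ZeRO stage); the actual keys and values in the repository may differ, so check the files themselves.
# illustrative only; inspect config/deepspeed3.json for the real settings
$ cat ./config/deepspeed3.json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}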
With six GPUs (0, 1, 2, 3, 4, 5), the command below starts training of GPT-NEO-2.7B.
# train-2_7b.sh
$ deepspeed --include localhost:0,1,2,3,4,5 \
./train_neo_meanwhile_update_deepspeed.py \
--deepspeed_config ./config/deepspeed6.json \
--exp "exp0" --model_name "EleutherAI/gpt-neo-2.7B" \
--tokenizer_name "EleutherAI/gpt-neo-2.7B" \
--gpt2_name "openai-community/gpt2" \
--bert_name "google-bert/bert-base-uncased" \
--prefix_length 200 --suffix_length 200 --target_length 200 \
--batch_size 4 --num_workers 8 --lr 5e-6 \
--uw 1.0 --lw 0.5 --kl 1.0 --f1 0.3 --bleu 0.01 --acc 0.5994 \
--el 0.0499 --dir "result/test1"
You can test the trained model on the downstream tasks using the command below.
# valid.sh
$ python ./valid.py \
--model_name "./result/test/EleutherAI/gpt-neo-125m_exp0_lr5e-06_uw1.0_lw0.5_kl1.0_epoch19_updateboth" \
--tokenizer_name "EleutherAI/gpt-neo-125m" \
--prefix_length 512 --suffix_length 512 --device "cuda:0" \
--batch_size 32 --num_workers 48 \
--dir "./result/test" --cache "./.cache"
The original model can be evaluated using the command below.
# eval.sh
$ python ./eval.py --exp "all" \
--model_name "EleutherAI/gpt-neo-125m" \
--tokenizer_name "EleutherAI/gpt-neo-125m" \
--gpt2_name "openai-community/gpt2" \
--bert_name "google-bert/bert-base-uncased" \
--prefix_length 200 --suffix_length 200 --target_length 200 \
--device "cuda:0" --batch_size 8 --num_workers 8 \
--dir "./result/test"
The related code is in the evaluation directory. test.ipynb is more convenient than api.py.
- Fill in your GPT-4 API key in the code.
- Use convert.py to convert the result files produced in the Evaluating step above. (Rearranging the files according to the code may be necessary.)
- Run the code inside the evaluation directory (see the sketch below).
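The exact invocation of these scripts is not documented here, so the following is only a sketch under the assumption that convert.py and api.py run without extra arguments; check the scripts themselves for options.
# run from inside the evaluation directory (argument-free invocation is an assumption)
$ cd evaluation
$ python convert.py   # rearrange earlier results into the expected layout
$ python api.py       # query GPT-4; test.ipynb does the same interactively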
The files should be rearranged into the tree structure below:
evaluation
│   125mneo.csv              # the results of gpt-neo-125m on all
│   125mopt.csv              # the results of opt-125m on all
│   api.py
│   convert.py
│   lm_extraction_128_0.csv
│   prompt.py
│   test.ipynb
│
├───125m-0
│       and.csv              # the results of KUMPR on 0
│       neo.csv              # generated
│       opt.csv              # generated
│       ours.csv             # our results
│       results.json         # generated
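As a hedged sketch, the per-experiment folder can be assembled as follows; the source paths are placeholders and should point to wherever your result CSVs actually live (neo.csv, opt.csv, and results.json are generated later by the evaluation code):
# placeholders: replace <your-results-dir> with the real location of your CSVs
$ mkdir -p evaluation/125m-0
$ cp <your-results-dir>/ours.csv evaluation/125m-0/
$ cp <your-results-dir>/and.csv evaluation/125m-0/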
The target data can be downloaded from this link.
The validation datasets we used are open source and can be downloaded from their original releases.
First, place train_dataset.npy under the datasets directory. Then run data_prep.py; this completes the data conversion and the KNN sampling process.
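For example (the argument-free call to data_prep.py is an assumption; check the script for options):
# place the raw data, then run the preparation script
$ mkdir -p datasets
$ mv train_dataset.npy datasets/
$ python ./data_prep.py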
The data used in the 5 runs in our paper is under the directory datasets/exp/exp{0/1/2/3/4}, respectively.
Our codebase is based on the following repo. Thanks for open-sourcing!