AttributeError: 'OpenAIGPTTokenizerFast' object has no attribute 'added_tokens_encoder' #37

Open
wants to merge 60 commits into
base: master
ce418af
Simpler version - fine-tuning LM
vered1986 Feb 15, 2020
17f386f
Requirements
vered1986 Feb 15, 2020
9a69018
ignoring idea files
vered1986 Feb 15, 2020
75b2c2a
Simple version by fine-tuning GPT
vered1986 Feb 26, 2020
fd24bb5
Changes in repo structure
vered1986 Feb 27, 2020
9198813
Download script
vered1986 Feb 27, 2020
de92e16
CometModel class + run download in setup
vered1986 Feb 27, 2020
aff5f5d
Download script in setup
vered1986 Feb 27, 2020
62cba0c
Add requirements
vered1986 Feb 27, 2020
9b1d8f5
Data directory
vered1986 Feb 27, 2020
9f3dc4c
Subprocess
vered1986 Feb 27, 2020
a3fd132
Correct script directory
vered1986 Feb 27, 2020
a33e6a5
Correct script directory
vered1986 Feb 27, 2020
552a237
Try another way to execute sh
vered1986 Feb 27, 2020
ebb879e
...
vered1986 Feb 27, 2020
1523652
...
vered1986 Feb 27, 2020
2e06f2a
Move to parent dir
vered1986 Feb 27, 2020
d9bd151
...
vered1986 Feb 27, 2020
0f8553e
Make the script executable
vered1986 Feb 27, 2020
2f4fc0a
Empty args
vered1986 Feb 27, 2020
3437404
Typo in gdown command
vered1986 Feb 27, 2020
39ec4e8
Typo in file name
vered1986 Feb 27, 2020
1dc0fa2
Download script works
vered1986 Feb 27, 2020
9352550
Remove the hyphen from the package name
vered1986 Feb 27, 2020
f2bcc53
Rename the package
vered1986 Feb 27, 2020
6c82f3f
Install config files
vered1986 Feb 27, 2020
5b6cf49
Get rid of the config
vered1986 Feb 27, 2020
42643c2
Get rid of the config
vered1986 Feb 27, 2020
43fdd61
Beam search is working
vered1986 Feb 27, 2020
8aa3072
transformers version
vered1986 Feb 27, 2020
28263b5
Add develop install option
vered1986 Feb 28, 2020
9fb15a7
...
vered1986 Feb 28, 2020
b781428
Change the package name to comet2 to avoid name collision
vered1986 Feb 28, 2020
b3def9f
Revert the change
vered1986 Feb 28, 2020
729075f
Change the name
vered1986 Feb 28, 2020
fb6d1c7
Comet data dir as an argument
vered1986 Feb 28, 2020
9ad1384
Install options
vered1986 Feb 28, 2020
3155b2d
Use environment variable instead
vered1986 Feb 28, 2020
5fbbefc
Change the default model name
vered1986 Feb 28, 2020
0aafba9
Fix data directory
vered1986 Feb 28, 2020
b062291
Compute both micro and macro perplexity
vered1986 Feb 28, 2020
fb7fef6
Compute BLEU
vered1986 Feb 28, 2020
fb29b64
Add option to continue training
vered1986 Feb 29, 2020
1610173
Training the model for two more epochs
vered1986 Feb 29, 2020
c12e946
Change pretrained model file
vered1986 Feb 29, 2020
991607f
Fix issues with installation script
vered1986 Feb 29, 2020
4013a57
Wrong zip format
vered1986 Feb 29, 2020
799c70d
Set env variable
vered1986 Feb 29, 2020
bc30e82
Set env variable
vered1986 Feb 29, 2020
2becf1f
Default data dir in /usr/local/
vered1986 Feb 29, 2020
f22c1e2
Pass argument to sh
vered1986 Feb 29, 2020
4ab4b96
Remove -c
vered1986 Feb 29, 2020
8bbe780
Fix download path
vered1986 Feb 29, 2020
c8d3ccf
...
vered1986 Feb 29, 2020
a349903
Default model dir
vered1986 Feb 29, 2020
131aee5
with torch.no_grad()
vered1986 Feb 29, 2020
6470a6b
Remove the specific transformers version
vered1986 Feb 29, 2020
8084508
Installation dir
vered1986 Mar 2, 2020
8578789
Update the examples
vered1986 Mar 2, 2020
9424de8
Update link to pretrained model
vered1986 Sep 19, 2021
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@

.DS_Store

.idea/
287 changes: 166 additions & 121 deletions README.md
@@ -1,151 +1,196 @@
To run a generation experiment (either conceptnet or atomic), follow these instructions:
This repository contains a new version of COMET trained on ATOMIC.

For the original version see: [atcbosselut/comet-commonsense](https://github.com/atcbosselut/comet-commonsense).

<h1>First Steps</h1>
### Changes from previous version

First, clone the repo:
1. Variable length input

```
git clone https://github.com/atcbosselut/comet-commonsense.git
```

Then run the setup scripts to acquire the pretrained model files from OpenAI, as well as the ATOMIC and ConceptNet datasets

```
bash scripts/setup/get_atomic_data.sh
bash scripts/setup/get_conceptnet_data.sh
bash scripts/setup/get_model_files.sh
```

Then install dependencies (assuming you already have Python 3.6 and PyTorch >= 1.0):

```
pip install tensorflow
pip install ftfy==5.1
conda install -c conda-forge spacy
python -m spacy download en
pip install tensorboardX
pip install tqdm
pip install pandas
pip install ipython
```
<h1> Making the Data Loaders </h1>

Run the following scripts to pre-initialize a data loader for ATOMIC or ConceptNet:

```
python scripts/data/make_atomic_data_loader.py
python scripts/data/make_conceptnet_data_loader.py
```

For the ATOMIC KG, if you'd like to make a data loader for only a subset of the relation types, comment out any relations in lines 17-25.

For ConceptNet if you'd like to map the relations to natural language analogues, set ```opt.data.rel = "language"``` in line 26. If you want to initialize unpretrained relation tokens, set ```opt.data.rel = "relation"```

<h1> Setting the ATOMIC configuration files </h1>

Open ```config/atomic/changes.json``` and set which categories you want to train, as well as any other details you find important. Check ```src/data/config.py``` for a description of different options. Variables you may want to change: batch_size, learning_rate, categories. See ```config/default.json``` and ```config/atomic/default.json``` for default settings of some of these variables.

<h1> Setting the ConceptNet configuration files </h1>

Open ```config/conceptnet/changes.json``` and set any changes to the default configuration that you may want to vary in this experiment. Check ```src/data/config.py``` for a description of different options. Variables you may want to change: batch_size, learning_rate, etc. See ```config/default.json``` and ```config/conceptnet/default.json``` for default settings of some of these variables.

<h1> Running the ATOMIC experiment </h1>

<h3> Training </h3>
For whichever experiment # you set in ```config/atomic/changes.json``` (e.g., 0, 1, 2, etc.), run:

```
python src/main.py --experiment_type atomic --experiment_num #
```

<h3> Evaluation </h3>

Once you've trained a model, run the evaluation script:
### Installation

```
python scripts/evaluate/evaluate_atomic_generation_model.py --split $DATASET_SPLIT --model_name /path/to/model/file
```
Define the `COMET_DATA_DIR` environment variable; otherwise, the data will be saved in `~/.comet-data`.
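
The fallback described here can be sketched as follows. This is only an illustration of the documented behaviour, not the package's actual code: `resolve_comet_data_dir` is a hypothetical helper name, and only the `COMET_DATA_DIR` variable and the `~/.comet-data` default come from the README text.

```python
import os
from pathlib import Path


def resolve_comet_data_dir(env=None):
    """Hypothetical helper: pick the directory where COMET data would be stored.

    Uses COMET_DATA_DIR when it is set and non-empty, otherwise ~/.comet-data.
    """
    env = os.environ if env is None else env
    configured = env.get("COMET_DATA_DIR")
    if configured:
        # Expand a leading "~" so that values like "~/data" also work.
        return Path(configured).expanduser()
    return Path.home() / ".comet-data"


if __name__ == "__main__":
    print(resolve_comet_data_dir())
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.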

<h3> Generation </h3>
Install the repository. This will also download the ATOMIC dataset and the pre-trained COMET model:

Once you've trained a model, run the generation script for the type of decoding you'd like to do:

```
python scripts/generate/generate_atomic_beam_search.py --beam 10 --split $DATASET_SPLIT --model_name /path/to/model/file
python scripts/generate/generate_atomic_greedy.py --split $DATASET_SPLIT --model_name /path/to/model/file
python scripts/generate/generate_atomic_topk.py --k 10 --split $DATASET_SPLIT --model_name /path/to/model/file
```
pip install git+https://github.com/vered1986/comet-commonsense.git
```

<h1> Running the ConceptNet experiment </h1>

<h3> Training </h3>
### Using a pre-trained model

For whichever experiment # you set in ```config/conceptnet/changes.json``` (e.g., 0, 1, 2, etc.), run:
The installation comes with a pre-trained model based on GPT.

```
python src/main.py --experiment_type conceptnet --experiment_num #
```
>>> from comet2.comet_model import PretrainedCometModel

Development and Test set tuples are automatically evaluated and generated with greedy decoding during training
>>> comet_model = PretrainedCometModel(device=1)

<h3> Generation </h3>
>>> comet_model.predict("PersonX asked PersonY what they thought of the demo", "xWant", num_beams=5)
['to listen to persony', 'to see what they think', 'to see what persony thinks', 'to see if persony likes it', "to listen to persony's response"]

If you want to generate with a larger beam size, run the generation script

```
python scripts/generate/generate_conceptnet_beam_search.py --beam 10 --split $DATASET_SPLIT --model_name /path/to/model/file
>>> comet_model.predict("PersonX went to the grocery store", "xEffect", p=0.9, num_samples=5)
['personx gets something to eat', 'buys the food', 'makes a purchase', 'bought groceries', 'they bought some snacks']
```

<h3> Classifying Generated Tuples </h3>

To run the classifier from Li et al., 2016 on your generated tuples to evaluate correctness, first download the pretrained model from:
The performance of the pre-trained model is:

```
wget https://ttic.uchicago.edu/~kgimpel/comsense_resources/ckbc-demo.tar.gz
tar -xvzf ckbc-demo.tar.gz
```
* **Micro perplexity**: 11.87 (original model: 11.14)
* **BLEU-2**: 14.43 (original model: 15.10)

then run the following script on the generations file, which should be in .pickle format:
You can also specify a different model path `model_name_or_path` when you create `PretrainedCometModel`.

```
bash scripts/classify/classify.sh /path/to/generations_file/without/pickle/extension
```
If you use this classification script, you'll also need Python 2.7 installed.

<h1> Playing Around in Interactive Mode </h1>
### Training

First, download the pretrained models from the following link:
Run `python -m comet2.train` with the following arguments:

```
https://drive.google.com/open?id=1FccEsYPUHnjzmX-Y5vjCBeyRt1pLo8FB
```

Then untar the file:

```
tar -xvzf pretrained_models.tar.gz
```

Then run the following script to interactively generate arbitrary ATOMIC event effects:

```
python scripts/interactive/atomic_single_example.py --model_file pretrained_models/atomic_pretrained_model.pickle
```

Or run the following script to interactively generate arbitrary ConceptNet tuples:

```
python scripts/interactive/conceptnet_single_example.py --model_file pretrained_models/conceptnet_pretrained_model.pickle
```

<h1> Bug Fixes </h1>

<h3>Beam Search </h3>

In BeamSampler in `sampler.py`, there was a bug that made the scoring function for each beam candidate slightly different from normalized loglikelihood. Only sequences decoded with beam search are affected by this. It's been fixed in the repository, and seems to have little discernible impact on the quality of the generated sequences. If you'd like to replicate the exact paper results, however, you'll need to use the buggy beam search from before, by setting `paper_results = True` in Line 251 of `sampler.py`
usage: train.py [-h] [--train_file TRAIN_FILE] --out_dir OUT_DIR
[--adam_epsilon ADAM_EPSILON] [--device DEVICE] [--do_eval]
[--do_lower_case] [--do_train]
[--eval_batch_size EVAL_BATCH_SIZE]
[--eval_data_file EVAL_DATA_FILE] [--eval_during_train]
[--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--learning_rate LEARNING_RATE]
[--logging_steps LOGGING_STEPS]
[--max_input_length MAX_INPUT_LENGTH]
[--max_output_length MAX_OUTPUT_LENGTH]
[--max_grad_norm MAX_GRAD_NORM] [--max_steps MAX_STEPS]
[--model_name_or_path MODEL_NAME_OR_PATH]
[--model_type MODEL_TYPE]
[--num_train_epochs NUM_TRAIN_EPOCHS] [--overwrite_cache]
[--overwrite_out_dir] [--save_steps SAVE_STEPS]
[--save_total_limit SAVE_TOTAL_LIMIT] [--seed SEED]
[--train_batch_size TRAIN_BATCH_SIZE]
[--warmup_steps WARMUP_STEPS] [--weight_decay WEIGHT_DECAY]

<h1> References </h1>
optional arguments:
-h, --help show this help message and exit
--train_file TRAIN_FILE
The input training CSV file.
--out_dir OUT_DIR Out directory for checkpoints.
--adam_epsilon ADAM_EPSILON
Epsilon for Adam optimizer.
--device DEVICE GPU number or 'cpu'.
--do_eval Whether to run eval on the dev set.
--do_lower_case Set this flag if you are using an uncased model.
--do_train Whether to run training.
--eval_batch_size EVAL_BATCH_SIZE
Batch size for evaluation.
--eval_data_file EVAL_DATA_FILE
Validation file
--eval_during_train Evaluate at each train logging step.
--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
Steps before backward pass.
--learning_rate LEARNING_RATE
The initial learning rate for Adam.
--logging_steps LOGGING_STEPS
Log every X updates steps.
--max_input_length MAX_INPUT_LENGTH
Maximum input event length in words.
--max_output_length MAX_OUTPUT_LENGTH
Maximum output event length in words.
--max_grad_norm MAX_GRAD_NORM
Max gradient norm.
--max_steps MAX_STEPS
If > 0: total number of training steps to perform.
--model_name_or_path MODEL_NAME_OR_PATH
LM checkpoint for initialization.
--model_type MODEL_TYPE
The LM architecture to be fine-tuned.
--num_train_epochs NUM_TRAIN_EPOCHS
Number of training epochs to perform.
--overwrite_cache Overwrite the cached data.
--overwrite_out_dir Overwrite the output directory.
--save_steps SAVE_STEPS
Save checkpoint every X updates steps.
--save_total_limit SAVE_TOTAL_LIMIT
Maximum number of checkpoints to keep
--seed SEED Random seed for initialization.
--train_batch_size TRAIN_BATCH_SIZE
Batch size for training.
--warmup_steps WARMUP_STEPS
Linear warmup over warmup_steps.
--weight_decay WEIGHT_DECAY
Weight decay if we apply some.
```

### Evaluation

The training script can be used to evaluate with perplexity.
Use the `--do_eval` flag and set `--eval_data_file` to the validation set.
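
The README reports a micro perplexity, and one of the commits mentions computing both micro and macro perplexity. The sketch below shows one common way to define the two, assuming that "micro" means pooling all tokens before exponentiating and "macro" means averaging per-example perplexities. The definitions actually used in `comet2.train` are not visible in this diff, and `micro_macro_perplexity` is a hypothetical name.

```python
import math


def micro_macro_perplexity(examples):
    """Compute (micro, macro) perplexity under the assumed definitions.

    examples: list of (summed negative log-likelihood in nats, token count).
    """
    total_nll = sum(nll for nll, _ in examples)
    total_tokens = sum(n for _, n in examples)

    # Micro: every token has equal weight, regardless of which example it is in.
    micro = math.exp(total_nll / total_tokens)

    # Macro: every example has equal weight, regardless of its length.
    macro = sum(math.exp(nll / n) for nll, n in examples) / len(examples)
    return micro, macro


if __name__ == "__main__":
    # Two toy examples: 2 tokens at log(2) nats each, and 1 token at log(8) nats.
    print(micro_macro_perplexity([(2 * math.log(2), 2), (math.log(8), 1)]))
```

Under these definitions the two numbers diverge when example lengths vary, so it matters which one is being compared against the original model.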


To get BLEU scores, run `python -m comet2.evaluate` with the following arguments:

```
usage: evaluate.py [-h] [--in_file IN_FILE]
[--model_name_or_path MODEL_NAME_OR_PATH]
[--num_samples NUM_SAMPLES] [--device DEVICE]
[--max_length MAX_LENGTH] [--do_lower_case]

optional arguments:
-h, --help show this help message and exit
--in_file IN_FILE CSV ATOMIC file
--model_name_or_path MODEL_NAME_OR_PATH
Pre-trained COMET model
--num_samples NUM_SAMPLES
how many texts to generate
--device DEVICE GPU number or 'cpu'.
```

### Generation

To run an interactive script for single predictions: `python -m comet2.interactive`

```
usage: interactive.py [-h] [--model_name_or_path MODEL_NAME_OR_PATH]
[--sampling_algorithm SAMPLING_ALGORITHM]
[--device DEVICE] [--max_length MAX_LENGTH]
[--do_lower_case]

optional arguments:
-h, --help show this help message and exit
--model_name_or_path MODEL_NAME_OR_PATH
Pre-trained COMET model
--sampling_algorithm SAMPLING_ALGORITHM
--device DEVICE GPU number or 'cpu'.
--max_length MAX_LENGTH
Maximum text length
--do_lower_case Set this flag if you are using an uncased model.
```

To generate predictions for a dataset, run `python -m comet2.predict` with the following arguments:

```
usage: predict.py [-h] --out_file OUT_FILE [--in_file IN_FILE]
[--model_name_or_path MODEL_NAME_OR_PATH]
[--max_length MAX_LENGTH] [--k K] [--p P]
[--num_beams NUM_BEAMS] [--num_samples NUM_SAMPLES]
[--device DEVICE] [--do_lower_case]

optional arguments:
-h, --help show this help message and exit
--out_file OUT_FILE jsonl file with input+output events.
--in_file IN_FILE CSV ATOMIC file
--model_name_or_path MODEL_NAME_OR_PATH
Pre-trained COMET model
--max_length MAX_LENGTH
Maximum text length
--k K k for top k sampling
--p P p for nucleus sampling
--num_beams NUM_BEAMS
number of beams in beam search
--num_samples NUM_SAMPLES
how many texts to generate
--device DEVICE GPU number or 'cpu'.
--do_lower_case Set this flag if you are using an uncased model.
```
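
For readers unfamiliar with the `--k` and `--p` options, the sketch below shows what top-k and nucleus (top-p) filtering conventionally do to a next-token distribution before sampling. It is a plain-Python illustration under that assumption; the decoding code in `comet2.predict` is not shown in this diff, and both function names are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest set of most probable token ids whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]
    print(top_k_filter(probs, 2))      # only token ids 0 and 1 survive
    print(nucleus_filter(probs, 0.9))  # token ids 0, 1 and 2 survive
```

Top-k always keeps a fixed number of candidates, while nucleus sampling keeps more candidates when the distribution is flat and fewer when it is peaked.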


### References

Please cite this repository using the following reference:

Empty file added comet2/__init__.py
Empty file.
38 changes: 38 additions & 0 deletions comet2/atomic.py
@@ -0,0 +1,38 @@
import json
import logging
import pandas as pd


logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    datefmt='%m/%d/%Y %H:%M:%S', level=logging.INFO)

logger = logging.getLogger(__name__)

CATEGORIES = ["oReact", "oEffect", "oWant", "xAttr", "xEffect", "xIntent", "xNeed", "xReact", "xWant"]


def get_atomic_categories():
    """
    Return the names of ATOMIC categories
    """
    return CATEGORIES


def load_atomic_data(in_file, categories):
    """
    Load ATOMIC data from the CSV file
    :param in_file: CSV file
    :param categories: list of ATOMIC categories
    :return: dictionary from e1 (lowercased, with '___' replaced by '<blank>')
        to a dictionary from category to a list of lowercased e2s
    """
    df = pd.read_csv(in_file, index_col=0)
    df.iloc[:, :len(categories)] = df.iloc[:, :len(categories)].apply(lambda col: col.apply(json.loads))
    df = df.groupby("event").agg({cat: "sum" for cat in categories})

    examples = {row.name.lower().replace('___', '<blank>'): {
        cat: [e2.lower() for e2 in set(row[cat])] for cat in categories if len(row[cat]) > 0}
        for _, row in df.iterrows()}

    return examples
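
To make the shape of the returned dictionary concrete, here is a pandas-free sketch of the same grouping applied to in-memory rows. It is an illustration rather than a drop-in replacement: `group_atomic_rows` is a hypothetical name, and unlike `load_atomic_data` it deduplicates after lowercasing and sorts each list so the output is deterministic.

```python
import json
from collections import defaultdict


def group_atomic_rows(rows, categories):
    """Hypothetical stand-in for load_atomic_data that works on a list of dicts.

    Each row has an "event" string and, per category, a JSON-encoded list of e2s.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for cat in categories:
            # Rows sharing an event have their e2 lists concatenated.
            merged[row["event"]][cat].extend(json.loads(row[cat]))

    return {
        event.lower().replace("___", "<blank>"): {
            cat: sorted({e2.lower() for e2 in by_cat[cat]})
            for cat in categories
            if by_cat[cat]  # categories with no e2s are omitted
        }
        for event, by_cat in merged.items()
    }


if __name__ == "__main__":
    rows = [
        {"event": "PersonX buys ___", "xWant": '["to eat"]', "xEffect": "[]"},
        {"event": "PersonX buys ___", "xWant": '["To eat", "to cook"]', "xEffect": "[]"},
    ]
    print(group_atomic_rows(rows, ["xWant", "xEffect"]))
```

The result maps each normalized event to a dictionary of non-empty categories, which is the structure the docstring of `load_atomic_data` describes.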
