Many NLG models are currently evaluated with n-gram overlap metrics such as BLEU and ROUGE, but these metrics do not capture semantics, let alone speaker intentions. People use language to communicate, and if we want NLG models to communicate effectively with people, we should evaluate them by how well they communicate. We illustrate how this communication-based evaluation would work using the color reference game scenario from Monroe et al., 2017. We collected color reference game captions of varying quality and investigated how well models that use the captions to play the reference game can distinguish between captions of different quality, compared to n-gram overlap metrics.
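As a rough, purely illustrative sketch of the contrast (not the pipeline in this repository), an n-gram overlap metric scores a caption against a reference string, whereas a communication-based metric scores it by how well a listener can use it to pick out the target color; `listener_prob` below is a hypothetical stand-in for such a trained listener model.

```python
# Conceptual sketch only; `listener_prob` is a hypothetical stand-in for a
# trained listener model, not part of this repository's API.

def ngram_overlap_score(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference caption."""
    ngrams = lambda toks: [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    return sum(g in ref for g in cand) / len(cand) if cand else 0.0

def communication_score(caption, context_colors, target_index, listener_prob):
    """Probability that the listener picks the target color after reading the caption."""
    probs = listener_prob(caption, context_colors)  # one probability per context color
    return probs[target_index]
```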
We have released the data we collected; it can be found in data/csv/clean_data.csv. The code to recreate the plots in the paper using the data and pretrained models can be found in this Jupyter notebook.
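For instance, assuming pandas is available, the released CSV can be loaded and inspected like this (the column layout is whatever the released file defines):

```python
import pandas as pd

# Load the released caption data and take a quick look at its structure.
df = pd.read_csv("data/csv/clean_data.csv")
print(df.shape)
print(df.columns.tolist())
print(df.head())
```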
caption_featurizers.py contains code to process captions with an appropriate tokenizer into a format expected by the models. color_featurizers.py is a similar featurizer for the color inputs.
evaluation.py contains performance metric code for all models.
example_experiments.py contains examples of experiments that can be run with models such as the Literal Listener.
experiment.py contains code for model evaluation and the feature handler class that interfaces between the Monroe data, the feature functions, and the models (a sketch of how these pieces fit together is given after the directory listing below).
baseline_listener_samples, literal_listener_samples, and imaginative_listener_samples contain ten sets of sampled model parameters, trained with the optimal hyperparameters, for the Baseline, Literal, and Imaginative Listener models, respectively.
data contains all the data used in the project, including the Monroe data and the synthetic data.
model contains all other saved model parameters for the models experimented with over the course of the project.
notebooks contains the Jupyter notebooks for the experiments and the scripts used to explore the data, generate models, run models, sample models, score models, and perform other tasks.
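To make the structure of the listing above concrete, here is a self-contained, hypothetical sketch of the workflow it describes; the class names (SimpleCaptionFeaturizer, SimpleColorFeaturizer, FeatureHandler, RandomListener) and the accuracy helper are illustrative stand-ins, not the classes defined in this repository.

```python
# Hypothetical, self-contained sketch of the workflow described above; all
# names here are illustrative stand-ins, not this repository's actual API.
import random

class SimpleCaptionFeaturizer:
    """Tokenize a caption and map tokens to integer ids."""
    def __init__(self):
        self.vocab = {}

    def featurize(self, caption):
        return [self.vocab.setdefault(tok, len(self.vocab))
                for tok in caption.lower().split()]

class SimpleColorFeaturizer:
    """Pass color triples through unchanged."""
    def featurize(self, colors):
        return [list(c) for c in colors]

class FeatureHandler:
    """Turn raw (caption, colors, target) rows into model-ready examples."""
    def __init__(self, caption_featurizer, color_featurizer):
        self.caption_featurizer = caption_featurizer
        self.color_featurizer = color_featurizer

    def to_examples(self, rows):
        return [(self.caption_featurizer.featurize(caption),
                 self.color_featurizer.featurize(colors),
                 target) for caption, colors, target in rows]

class RandomListener:
    """Baseline stand-in for a trained listener: guesses a color at random."""
    def predict(self, caption_ids, color_feats):
        return random.randrange(len(color_feats))

def accuracy(model, examples):
    """Fraction of examples where the model picks the target color."""
    correct = sum(model.predict(cap, cols) == target
                  for cap, cols, target in examples)
    return correct / len(examples)

rows = [("the dark blue one", [(250, 80, 30), (250, 80, 70), (120, 60, 50)], 0)]
handler = FeatureHandler(SimpleCaptionFeaturizer(), SimpleColorFeaturizer())
print(accuracy(RandomListener(), handler.to_examples(rows)))
```

In the actual code, the featurizers in caption_featurizers.py and color_featurizers.py, the feature handler in experiment.py, and the metrics in evaluation.py play the roles sketched above, with trained listener models in place of the random baseline.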