Create a Python virtual environment, activate it, and install the dependencies.

```bash
python3 -m virtualenv lm-gen-env
source lm-gen-env/bin/activate
pip install -r requirements.txt
```
Download colorlessgreenRNNs from Gulordava et al. (2018). Then download the full English vocabulary and the English language model. The vocabulary should be saved to `colorlessgreenRNNs/src/data/lm` (you may have to create the folder), and the model should be saved to `colorlessgreenRNNs/src/models`.
Our models are available at https://huggingface.co/sathvik-n/augmented-rnns.
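If you want to use the retrained models directly, one way to fetch them is to clone the Hugging Face repository (a sketch; it assumes git-lfs is installed, and the exact file names depend on the repository contents):

```bash
# Clone the Hugging Face model repo; git-lfs is needed to pull the model weights.
git lfs install
git clone https://huggingface.co/sathvik-n/augmented-rnns
```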
Make sure you are in the root directory (not one of the subdirectories) for each of these datasets.

Vocabulary:

```bash
mkdir data/lm-data
cd data/lm-data
wget https://dl.fbaipublicfiles.com/colorless-green-rnns/training-data/English/vocab.txt
```
Pretrained English model:

```bash
mkdir models
cd models
wget https://dl.fbaipublicfiles.com/colorless-green-rnns/best-models/English/hidden650_batch128_dropout0.2_lr20.0.pt
```
The file `grammars.py` contains text-based specifications for the different CFGs, based on appendices from Lan et al. Example sentences generated by each CFG are stored as JSON in `grammar_outputs/sentence_lists`. 2x2 tuples containing each sentence type and the surprisal effect for each CFG are located in `grammar_outputs/tuples`, also in JSON format.

The results for the pretrained model are in `grammar_outputs/experiment1/grnn`, and the augmented models' results are in `grammar_outputs/experiment2/grnn`.

Wilcox et al.'s stimuli are listed in `data/wilcox_csv`, and the corresponding outputs are in `grammar_outputs/wilcox_replication`.
To augment the training data, run `python augment_with_dependency.py --data_dir $DATA --dependency_name $DEPENDENCY --augmenting_data $CFG_DIR`, where `$DATA` points to the LM's training data, already split into training and validation sets (it should contain `train.txt` and `valid.txt`). If you downloaded the data from the Gulordava et al. repo, this split is already done for you. The script creates a folder named `$DEPENDENCY` containing the augmented training data. `$CFG_DIR` should be set to a CSV in `grammar_outputs/revised_training`. Set these environment variables appropriately before running the command.
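For example (the paths and dependency name below are placeholders, not fixed values from this repo; substitute your own):

```bash
# Hypothetical paths -- point these at your own copies of the data.
export DATA=colorlessgreenRNNs/src/data/lm                          # must contain train.txt and valid.txt
export DEPENDENCY=topicalization                                    # name of the output folder to create
export CFG_DIR=grammar_outputs/revised_training/topicalization.csv  # a CSV from revised_training
python augment_with_dependency.py --data_dir $DATA --dependency_name $DEPENDENCY --augmenting_data $CFG_DIR
```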
There are scripts to retrain the model in a cluster environment; I modified `retrain_grnn.sh` to train RNNs on clefting and topicalization at the same time.
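On a SLURM-style cluster (an assumption; adapt the submission command to your scheduler), retraining can be launched with something like:

```bash
# Submit the retraining job; retrain_grnn.sh handles clefting and topicalization together.
sbatch retrain_grnn.sh
```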
Change the path in `surprisal.py` from the pretrained RNN model to the model you want to use, make a directory in `grammar_outputs`, modify `cfg_sentence_generation.py` accordingly, and then run it.
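A minimal sketch of those steps (the directory name is just an example, and it assumes `surprisal.py` and `cfg_sentence_generation.py` have already been edited as described above):

```bash
# Example output directory; use whatever name your edited scripts expect.
mkdir -p grammar_outputs/retrained_topicalization
python cfg_sentence_generation.py
```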
To evaluate the model on Wilcox et al.'s stimuli, run `run_wilcox_replication.py`.
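Assuming the script needs no additional arguments, this is simply:

```bash
python run_wilcox_replication.py
```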
Graphs for the paper are in `pretrained_comparisons.ipynb`. Analyses of the Wilcox et al. stimuli are based on `comparison_wilcox.ipynb`, which we used to verify that our surprisal computation was working correctly. Plots and statistical analyses for the pretrained RNN are in `simple_cfg_analysis.ipynb`; the same plots and measures for the retrained models are in `retraining_analysis.ipynb`.
If you use this implementation, please cite us: Katherine Howitt, Sathvik Nair, Allison Dods, and Robert Melvin Hopkins (2024). Generalizations across Filler-Gap Dependencies in Neural Language Models. Conference on Computational Natural Language Learning (CoNLL 2024).
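For convenience, a BibTeX entry consistent with the citation above (the citation key and booktitle wording are inferred, not copied from an official source):

```bibtex
@inproceedings{howitt2024generalizations,
  title     = {Generalizations across Filler-Gap Dependencies in Neural Language Models},
  author    = {Howitt, Katherine and Nair, Sathvik and Dods, Allison and Hopkins, Robert Melvin},
  booktitle = {Proceedings of the Conference on Computational Natural Language Learning (CoNLL)},
  year      = {2024}
}
```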