Create a Python virtual environment, activate it, and install the dependencies.

```bash
python3 -m virtualenv lm-gen-env
source lm-gen-env/bin/activate
pip install -r requirements.txt
```
Download colorlessgreenRNNs from Gulordava et al. (2018). Then download the full English vocabulary and the English language model. The vocabulary should be saved to `colorlessgreenRNNs/src/data/lm` (you may have to create the folder), and the model should be saved to `colorlessgreenRNNs/src/models`.
Our models are available at https://huggingface.co/sathvik-n/augmented-rnns.
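If you want to use the retrained models directly, one way to fetch them is to clone the Hugging Face repository (a sketch; it assumes git-lfs is installed, and the exact file names depend on the repository contents):

```bash
# Clone the Hugging Face model repo; git-lfs is needed to pull the model weights.
git lfs install
git clone https://huggingface.co/sathvik-n/augmented-rnns
```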
Make sure you are in the root directory (not one of the subdirectories) for each of these datasets.

Vocabulary:

```bash
mkdir data/lm-data
cd data/lm-data
wget https://dl.fbaipublicfiles.com/colorless-green-rnns/training-data/English/vocab.txt
```
Pretrained English model:

```bash
mkdir models
cd models
wget https://dl.fbaipublicfiles.com/colorless-green-rnns/best-models/English/hidden650_batch128_dropout0.2_lr20.0.pt
```
The file `grammars.py` contains text-based specifications for the different CFGs, based on appendices from Lan et al. Example sentences generated by each CFG are stored as JSON in `grammar_outputs/sentence_lists`. 2x2 tuples containing each sentence type and the surprisal effect for each CFG are located in `grammar_outputs/tuples`, also in JSON format.

The results for the pretrained model are in `grammar_outputs/experiment1/grnn`, and the augmented models' results are in `grammar_outputs/experiment2/grnn`.

Wilcox et al.'s stimuli are listed in `data/wilcox_csv`, and the corresponding outputs are in `grammar_outputs/wilcox_replication`.
To augment the training data, run `python augment_with_dependency.py --data_dir $DATA --dependency_name $DEPENDENCY --augmenting_data $CFG_DIR`, where `$DATA` points to the LM's training data, already split into training and validation sets (it should contain `train.txt` and `valid.txt`). If you downloaded the data from the Gulordava et al. repo, this split is already done for you. The script creates a folder named `$DEPENDENCY` containing the augmented training data. `$CFG_DIR` should be set to a CSV in `grammar_outputs/revised_training`. Set these environment variables appropriately before running the command.
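For example (the paths and dependency name below are placeholders, not fixed values from this repo; substitute your own):

```bash
# Hypothetical paths -- point these at your own copies of the data.
export DATA=colorlessgreenRNNs/src/data/lm                          # must contain train.txt and valid.txt
export DEPENDENCY=topicalization                                    # name of the output folder to create
export CFG_DIR=grammar_outputs/revised_training/topicalization.csv  # a CSV from revised_training
python augment_with_dependency.py --data_dir $DATA --dependency_name $DEPENDENCY --augmenting_data $CFG_DIR
```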
There are scripts to retrain the model in a cluster environment; I modified `retrain_grnn.sh` to train RNNs on clefting and topicalization at the same time.
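On a SLURM-style cluster (an assumption; adapt the submission command to your scheduler), retraining can be launched with something like:

```bash
# Submit the retraining job; retrain_grnn.sh handles clefting and topicalization together.
sbatch retrain_grnn.sh
```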
Change the path in `surprisal.py` from the pretrained RNN model to the model you want to use, make a directory in `grammar_outputs`, modify `cfg_sentence_generation.py` accordingly, and then run it.
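A minimal sketch of those steps (the directory name is just an example, and it assumes `surprisal.py` and `cfg_sentence_generation.py` have already been edited as described above):

```bash
# Example output directory; use whatever name your edited scripts expect.
mkdir -p grammar_outputs/retrained_topicalization
python cfg_sentence_generation.py
```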
To evaluate the model on Wilcox et al.'s stimuli, run `run_wilcox_replication.py`.
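Assuming the script needs no additional arguments, this is simply:

```bash
python run_wilcox_replication.py
```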
Graphs for the paper are in `pretrained_comparisons.ipynb`. Analyses of the Wilcox et al. stimuli are based on `comparison_wilcox.ipynb`, which we used to verify that our surprisal computation was working correctly. Plots and statistical analyses for the pretrained RNN are in `simple_cfg_analysis.ipynb`; the same plots and measures for the retrained models are in `retraining_analysis.ipynb`.
If you use this implementation, please cite us: Katherine Howitt, Sathvik Nair, Allison Dods, and Robert Melvin Hopkins (2024). Generalizations across Filler-Gap Dependencies in Neural Language Models. Conference on Computational Natural Language Learning (CoNLL 2024).
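For convenience, a BibTeX entry consistent with the citation above (the citation key and booktitle wording are inferred, not copied from an official source):

```bibtex
@inproceedings{howitt2024generalizations,
  title     = {Generalizations across Filler-Gap Dependencies in Neural Language Models},
  author    = {Howitt, Katherine and Nair, Sathvik and Dods, Allison and Hopkins, Robert Melvin},
  booktitle = {Proceedings of the Conference on Computational Natural Language Learning (CoNLL)},
  year      = {2024}
}
```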