This repository contains all the code used for my Master's thesis on Contextualized Lexical Simplification.
There are the following folders in the structure:

- scripts: folder containing the used scripts
- datasets: folder containing the used datasets
- results: folder containing the produced results
- models: folder where the used embedding models can be stored
- tests: test examples
- media: folder containing media files (icons, video)
To run the lexical simplification pipeline, follow these steps:

- Clone this repository:

  git clone https://github.com/Amsterdam-Internships/Readability-Lexical-Simplification

- Install all dependencies:

  pip install -r requirements.txt
Simplifications can be generated for English and Dutch; each language requires a number of additional files.

Steps needed for running the model for English:
- Download a word embedding model from fastText (https://fasttext.cc/docs/en/english-vectors.html) and store it in the models folder as crawl-300d-2M-subword.vec
- Download the BenchLS, NNSeval and lex.mturk datasets from https://simpatico-project.com/?page_id=109 and store them in the datasets folder
Then the model can be run as follows, with --eval_dir pointing to one of the downloaded English datasets (adjust the path to where you stored it):
python3 BERT_for_LS.py --model bert-large-uncased-whole-word-masking --eval_dir ../datasets/BenchLS.txt
Steps needed for running the model for Dutch:
- Download the word embedding model from https://dumps.wikimedia.org/nlwiki/20160501/ and store it in the models folder as wikipedia-320.txt
Then the model can be run as follows:
python3 BERT_for_LS.py --model GroNLP/bert-base-dutch-cased --eval_dir ../datasets/Dutch/dutch_data.txt
Additional arguments can be passed:

Argument | Type or Action | Description | Default |
---|---|---|---|
--model | str | the name of the model used for generating the predictions: a path to a folder or a Hugging Face model name | - |
--eval_dir | str | path to the file with the to-be-simplified sentences | - |
--analysis | Bool | whether or not to output all the generated candidates and the reason for their removal | False |
--ranking | Bool | whether or not to perform ranking of the generated candidates | False |
--evaluation | Bool | whether or not to perform an evaluation of the generated candidates | True |
--num_selections | int | the number of candidates to generate | 10 |
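For example, the following call (a hedged example, assuming the boolean arguments are passed as explicit values) generates 20 candidates per target word, ranks them, and outputs the analysis of removed candidates:

python3 BERT_for_LS.py --model GroNLP/bert-base-dutch-cased --eval_dir ../datasets/Dutch/dutch_data.txt --num_selections 20 --ranking True --analysis True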
Requirements for fine-tuning

English:
- Download the simple-regular aligned Wikipedia corpus
- Download the Simple Wikipedia corpus

Dutch:
- Download the Wablieft corpus
- Download domain-specific data

Fine-tuning can be done in three ways:
- Masked language modelling:
python3 only_mlm.py --nr_sents 10000 --epochs 2 --model_directory ../models/MLM_model --seed 3 --language nl --level simple
- Masked language modelling and next sentence prediction:
python3 mlm_nsp.py --nr_sents 10000 --epochs 2 --model_directory ../models/MLM_model --seed 3 --language nl
- Masked language modelling and simplification prediction:
python3 finetuning.py --nr_sents 10000 --epochs 2 --model_directory ../models/MLM_model --seed 3
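Since --model also accepts a path to a folder, a model fine-tuned into ../models/MLM_model can afterwards be used to generate simplifications, for example:

python3 BERT_for_LS.py --model ../models/MLM_model --eval_dir ../datasets/Dutch/dutch_data.txt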
Notebooks for analyses
This code is based on the LSBert pipeline: https://github.com/qiang2100/BERT-LS
The file "dutch frequencies" is the processed version of SUBTLEX NL (http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nl)