This repository contains all the code used for my Master's thesis on Contextualized Lexical Simplification.
There are the following folders in the structure:

- scripts: folder containing the used scripts
- datasets: folder containing the used datasets
- results: folder containing the produced results
- models: folder where the used embedding models can be stored
- tests: test examples
- media: folder containing media files (icons, video)
To run the lexical simplification pipeline, follow these steps:

- Clone this repository:

  git clone https://github.com/Amsterdam-Internships/Readability-Lexical-Simplification

- Install all dependencies:

  pip install -r requirements.txt
Simplifications can be generated for English and Dutch; each language requires a number of additional files.

Steps needed for running the model for English:
- Download a word embedding model from fastText (https://fasttext.cc/docs/en/english-vectors.html) and store it in the models folder as crawl-300d-2M-subword.vec
- Download the BenchLS, NNSeval and lex.mturk datasets from https://simpatico-project.com/?page_id=109 and store them in the datasets folder
Then the model can be run as follows, with --eval_dir pointing to one of the downloaded English datasets (adjust the path to where you stored it):
python3 BERT_for_LS.py --model bert-large-uncased-whole-word-masking --eval_dir ../datasets/BenchLS.txt
Steps needed for running the model for Dutch:
- Download the word embedding model from https://dumps.wikimedia.org/nlwiki/20160501/ and store it in the models folder as wikipedia-320.txt
Then the model can be run as follows:
python3 BERT_for_LS.py --model GroNLP/bert-base-dutch-cased --eval_dir ../datasets/Dutch/dutch_data.txt
Additional arguments can be passed:

Argument | Type or Action | Description | Default |
---|---|---|---|
--model | str | the name of the model used for generating the predictions: a path to a folder or a Hugging Face model name | - |
--eval_dir | str | path to the file with the to-be-simplified sentences | - |
--analysis | Bool | whether or not to output all the generated candidates and the reason for their removal | False |
--ranking | Bool | whether or not to perform ranking of the generated candidates | False |
--evaluation | Bool | whether or not to perform an evaluation of the generated candidates | True |
--num_selections | int | the number of candidates to generate | 10 |
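For example, the following call (a hedged example, assuming the boolean arguments are passed as explicit values) generates 20 candidates per target word, ranks them, and outputs the analysis of removed candidates:

python3 BERT_for_LS.py --model GroNLP/bert-base-dutch-cased --eval_dir ../datasets/Dutch/dutch_data.txt --num_selections 20 --ranking True --analysis True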
Requirements for fine-tuning

English:
- Download the simple-regular aligned Wikipedia corpus
- Download the Simple Wikipedia corpus

Dutch:
- Download the Wablieft corpus
- Download domain-specific data

Fine-tuning can be done in three ways:
- Masked language modelling:
python3 only_mlm.py --nr_sents 10000 --epochs 2 --model_directory ../models/MLM_model --seed 3 --language nl --level simple
- Masked language modelling and next sentence prediction:
python3 mlm_nsp.py --nr_sents 10000 --epochs 2 --model_directory ../models/MLM_model --seed 3 --language nl
- Masked language modelling and simplification prediction:
python3 finetuning.py --nr_sents 10000 --epochs 2 --model_directory ../models/MLM_model --seed 3
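Since --model also accepts a path to a folder, a model fine-tuned into ../models/MLM_model can afterwards be used to generate simplifications, for example:

python3 BERT_for_LS.py --model ../models/MLM_model --eval_dir ../datasets/Dutch/dutch_data.txt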
Notebooks for analyses
This code is based on the LSBert pipeline: https://github.com/qiang2100/BERT-LS
The file "dutch frequencies" is the processed version of SUBTLEX NL (http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nl)