A collection of scripts for working with the Swedish Learner Language corpora, SweLL-gold and SweLL-pilot.
SweLL-pilot has individual files for each essay with the following format:
essay id: XXXXXX
metadata: XXXXXX
source: XXXXXX [text as transcribed from the original essays]
target: XXXXXX [source text normalised]
svala graph: XXXXXX
The dataloaders script has two functions at the moment:
read_swell_file takes the path to one of these files and returns a dictionary with one entry for each of the items in the format above.
For more details on what the dictionary keys contain, see its documentation here.
read_swell_directory takes the path to the unzipped SweLL folder and returns a list containing the read_swell_file dictionary for each essay in the folder.
Note that it has only been tested with SweLL-pilot as of 2024-04-15.
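The two loaders described above can be illustrated with a minimal sketch. This is not the code in the dataloaders script: the helper names `parse_essay_text` and `parse_essay_directory` are hypothetical, and the sketch assumes that each field starts on its own line with its label followed by a colon, with any following lines belonging to the same field.

```python
from pathlib import Path

# Field labels taken from the file layout described above. The exact on-disk
# syntax (for example how multi-line values are stored) is an assumption.
FIELDS = ("essay id", "metadata", "source", "target", "svala graph")


def parse_essay_text(text):
    """Split the text of one essay file into a dict keyed by field label."""
    essay = {}
    current = None
    for line in text.splitlines():
        for field in FIELDS:
            prefix = field + ":"
            if line.startswith(prefix):
                # A new field starts here; keep whatever follows the label.
                current = field
                essay[current] = line[len(prefix):].strip()
                break
        else:
            # Continuation line: append it to the field currently being read.
            if current is not None:
                essay[current] = (essay[current] + "\n" + line).strip()
    return essay


def parse_essay_directory(directory):
    """Parse every regular file directly inside `directory`, sorted by name."""
    return [
        parse_essay_text(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    ]
```

For actual work with the corpus, prefer read_swell_file and read_swell_directory, whose returned keys are described in their own documentation.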
extract_sentence_pairs.py is a script to extract sentence-correction (also known as original-target) pairs from SweLL-gold XML files.
Alignments are obtained simply by discarding essays whose original and target versions do not have the same number of sentences.
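The filtering step can be sketched as follows. The function name `pair_sentences` and the input representation (a dict from essay ID to a list of sentences) are assumptions made for illustration, as is pairing the surviving sentences by position; the script itself may organise this differently.

```python
def pair_sentences(original_essays, target_essays):
    """Pair sentences by position for essays with matching sentence counts.

    Both arguments map an essay ID to its list of sentences (an assumed
    representation). Essays that are missing from the target side, or whose
    sentence counts differ, are skipped entirely.
    """
    pairs = []
    for essay_id, org_sents in original_essays.items():
        trg_sents = target_essays.get(essay_id)
        if trg_sents is None or len(org_sents) != len(trg_sents):
            continue  # cannot be aligned one-to-one, so drop the whole essay
        for org, trg in zip(org_sents, trg_sents):
            pairs.append((essay_id, org, trg))
    return pairs
```

The trade-off of this approach is that any essay in which a correction splits or merges sentences is lost, in exchange for not needing a sentence aligner.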
To facilitate further processing, anonymization placeholders such as "X-stad" and "X-land-gen" are replaced with naïve pseudonyms such as "Berlin" and "Tysklands" (cf. placeholder_map).
Both essay metadata and token-level error labels are retained. The output can be:
- a TSV file, where error labels are gathered into a single field
- a CoNLL-U file, where they are stored in each token's MISC field.
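To illustrate the second option, here is a sketch of how token-level labels could be written into the MISC column (the tenth and last column of a CoNLL-U token line). The function name `to_conllu`, the MISC key `CorrectionLabel`, and the comma used to join several labels are all assumptions made for this example; check the script's actual output for the key and separator it uses.

```python
def to_conllu(tokens, labels_per_token, sent_id=None):
    """Render one sentence as CoNLL-U, storing error labels in MISC.

    `tokens` is a list of word forms and `labels_per_token` a parallel list
    of label lists. Columns other than ID, FORM and MISC are left as "_".
    """
    lines = []
    if sent_id is not None:
        lines.append(f"# sent_id = {sent_id}")
    for index, (form, labels) in enumerate(zip(tokens, labels_per_token), start=1):
        # "CorrectionLabel" is a hypothetical key name, not the script's own.
        misc = ("CorrectionLabel=" + ",".join(labels)) if labels else "_"
        # ID, FORM, then LEMMA..DEPS (7 empty columns), then MISC.
        columns = [str(index), form] + ["_"] * 7 + [misc]
        lines.append("\t".join(columns))
    # A blank line terminates each sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


# The labels below are illustrative only.
print(to_conllu(["jag", "bor", "i", "berlin"], [["O-Cap"], [], [], ["O-Cap"]], sent_id="1"))
```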
python extract_sentence_pairs.py PATH/TO/sourceSweLL.xml PATH/TO/targetSweLL.xml [--format=tsv|conllu] [--outfile=OUTFILE]
Note that when --format=conllu, the program will write two files, named org-OUTFILE and trg-OUTFILE.