SweLL-scripts

A collection of scripts for working with the Swedish Learner Language corpora, SweLL-gold and SweLL-pilot.

SweLL-pilot has individual files for each essay with the following format:

essay id: XXXXXX
metadata: XXXXXX

source: XXXXXX [text as transcribed from the original essays]

target: XXXXXX [source text normalised]

svala graph: XXXXXX

The dataloaders script has two functions at the moment:

read_swell_file takes the path to one of these files and returns a dictionary with one entry for each item in the format above. For details on what each dictionary key contains, see its documentation.

read_swell_directory takes the path to the unzipped SweLL folder and returns a list of the dictionaries produced by read_swell_file, one for each essay in the folder.

Note that, as of 2024-04-15, it has only been tested with SweLL-pilot.
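To make the per-essay layout more concrete, here is a minimal sketch of how such a file could be split into labelled fields. It is not the repository's read_swell_file: the function name parse_labelled_fields is hypothetical, and the sketch assumes that every field starts on its own line as "label: value" and that unlabelled lines continue the previous field.

```python
# Illustrative sketch only, NOT the repository's read_swell_file.
# Assumption: each field starts on its own line as "<label>: <value>",
# and any non-empty unlabelled line continues the previous field.
LABELS = ("essay id", "metadata", "source", "target", "svala graph")


def parse_labelled_fields(text):
    """Split SweLL-pilot-style essay text into a {label: value} dict."""
    chunks = {}
    current = None
    for line in text.splitlines():
        # Find the label (if any) that this line starts with.
        label = next((l for l in LABELS if line.startswith(l + ":")), None)
        if label is not None:
            current = label
            chunks[current] = [line[len(label) + 1:].strip()]
        elif current is not None and line.strip():
            chunks[current].append(line.strip())
    # Join multi-line values, skipping empty fragments.
    return {k: "\n".join(p for p in v if p) for k, v in chunks.items()}
```

The actual keys and value types returned by read_swell_file may differ, so its own documentation remains the reference.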

extract_sentence_pairs.py is a script that extracts sentence-correction (also known as original-target) pairs from SweLL-gold XML files.

Alignment is kept simple: essays whose original and target versions do not have the same number of sentences are filtered out.
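The filtering step can be sketched roughly as follows. This is an illustration rather than the script's actual code: the function name pair_sentences and the input shape (one pair of sentence lists per essay) are assumptions, as is the idea that the remaining sentences are paired by position.

```python
# Illustrative sketch only; names and data shapes are assumptions.
def pair_sentences(essays):
    """Pair sentences by position, skipping essays with mismatched counts.

    essays: iterable of (original_sentences, target_sentences) tuples,
    one tuple per essay, each element being a list of sentences.
    """
    pairs = []
    for org_sents, trg_sents in essays:
        if len(org_sents) != len(trg_sents):
            continue  # sentence counts differ: drop the whole essay
        pairs.extend(zip(org_sents, trg_sents))
    return pairs
```

The trade-off of this approach is that a single merged or split sentence discards the entire essay, in exchange for not needing a sentence aligner.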

To facilitate further processing, anonymization placeholders such as "X-stad" and "X-land-gen" are replaced with naïve pseudonyms such as "Berlin" and "Tysklands" (cf. placeholder_map).
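As a rough illustration of the replacement, assuming it operates on tokens and that placeholder_map is a plain dictionary (both are assumptions; only the two entries below are taken from this README):

```python
# Illustrative sketch only. The two entries come from the examples in this
# README; the script's real placeholder_map presumably covers more.
PLACEHOLDER_MAP = {
    "X-stad": "Berlin",
    "X-land-gen": "Tysklands",
}


def pseudonymize(tokens, mapping=PLACEHOLDER_MAP):
    """Replace known anonymization placeholders; leave other tokens as-is."""
    return [mapping.get(token, token) for token in tokens]
```

For example, pseudonymize(["Jag", "bor", "i", "X-stad"]) gives ["Jag", "bor", "i", "Berlin"].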

Both essay metadata and token-level error labels are retained. The output can be:

  • a TSV file, where error labels are gathered into a single field
  • a CoNLL-U file, where they are stored in each token's MISC field.
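For readers unfamiliar with CoNLL-U, the sketch below shows where a token's labels would sit in the standard 10-column line. The MISC key name ErrorLabels and the comma separator are hypothetical; the script may use a different key or encoding.

```python
# Illustrative sketch only. "ErrorLabels" is a hypothetical MISC key, not
# necessarily the one written by extract_sentence_pairs.py.
def conllu_token_line(index, form, labels):
    """Format one token as a 10-column CoNLL-U line with labels in MISC."""
    misc = ("ErrorLabels=" + ",".join(labels)) if labels else "_"
    # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    columns = [str(index), form] + ["_"] * 7 + [misc]
    return "\t".join(columns)
```

Unannotated columns are filled with "_", which is the CoNLL-U convention for unspecified values.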

Usage

python extract_sentence_pairs.py PATH/TO/sourceSweLL.xml PATH/TO/targetSweLL.xml [--format=tsv|conllu] [--outfile=OUTFILE] 

Note that when --format=conllu, the program will write two files, named org-OUTFILE and trg-OUTFILE.