Commit

rename and document
harisont committed Apr 18, 2024
1 parent 3c56e50 commit 849e653
Showing 3 changed files with 43 additions and 14 deletions.
6 changes: 6 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
## Contributing
If you have any SweLL-related scripts, you are welcome to add your code, along with a few lines in the [README](README.md) saying:

- what the name of your script is
- what it does
- how to run it
40 changes: 26 additions & 14 deletions README.md
@@ -1,17 +1,9 @@
# SweLL-scripts
A collection of scripts for working with the [Swedish Learner Language corpus](https://spraakbanken.gu.se/resurser/swell-gold).
A collection of scripts for working with the Swedish Learner Language corpora, [SweLL-gold](https://spraakbanken.gu.se/resurser/swell-gold) and [SweLL-pilot](https://spraakbanken.gu.se/resurser/swell-pilot).

## Contributing
If you have any SweLL-related scripts, you are welcome to add your code and add a few lines to this README saying:
## [dataloaders.py](dataloaders.py)

- what the name of your script is
- what it does
- how to run it


## Dataloaders

Swell has individual files for each essay with the following format:
SweLL-pilot has individual files for each essay with the following format:

```
essay id: XXXXXX
@@ -24,11 +16,31 @@ target: XXXXXX [source text normalised]
svala graph: XXXXXX
```

The dataloaders folder has two functions at the moment:
The dataloaders script has two functions at the moment:

```read_swell_file``` takes the path to one of these files as input and returns a dictionary with an entry for each of the items in the aforementioned format.
For more details on what the dictionary's keys contain, see its documentation [here](https://github.com/spraakbanken/SweLL-scripts/blob/222a2d2e407dceed8985f9e23acdd49b97ee3b83/dataloaders.py#L85).

```read_swell_directory``` takes as an input the path to the unzipped Swell folder.
```read_swell_directory``` takes as input the path to the unzipped SweLL folder.
It will return a list with the dictionaries from the ```read_swell_file``` for each essay in the folder.
Note that it has only been tested with Swell-Pilot as of 2024-04-15.

Note that it has only been tested with SweLL-pilot as of 2024-04-15.
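As a rough illustration of the kind of dictionary `read_swell_file` produces, a minimal parser for the `key: value` format shown above could look like the sketch below. The parsing logic and the example values are assumptions for illustration, not the script's actual implementation:

```python
# Hypothetical sketch of parsing one essay file in the "key: value"
# format shown above; the real read_swell_file in dataloaders.py may
# differ (e.g. in how multi-line values are handled).
def parse_essay(text):
    '''Return a dict with one entry per "key: value" line.'''
    entries = {}
    for line in text.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            entries[key] = value.strip()
    return entries

example = "essay id: A1\ntarget: ett exempel\nsvala graph: {}"
print(parse_essay(example))
```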

## [extract_sentence_pairs.py](extract_sentence_pairs.py)
A script to extract sentence-correction (also known as original-target) pairs from SweLL-gold XML files.

Alignments are obtained by simply filtering out essays where the original and target do not have the same number of sentences.

To facilitate further processing, anonymization placeholders such as "X-stad" and "X-land-gen" are replaced with naïve pseudonyms such as "Berlin" and "Tysklands" (cf. `placeholder_map`).
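The replacement step can be sketched as follows. The two entries are the README's own examples; the full `placeholder_map` in the script is not reproduced here:

```python
# Sketch of the pseudonymization step; the two mappings below are the
# README's examples, not the full placeholder_map from the script.
placeholder_map = {"X-stad": "Berlin", "X-land-gen": "Tysklands"}

def pseudonymize(sentence):
    # Replace longer placeholders first, so that a shorter entry
    # (e.g. a hypothetical "X-land") never partially matches inside
    # a longer one like "X-land-gen".
    for placeholder in sorted(placeholder_map, key=len, reverse=True):
        sentence = sentence.replace(placeholder, placeholder_map[placeholder])
    return sentence

print(pseudonymize("Jag bor i X-stad."))  # → "Jag bor i Berlin."
```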

Both essay metadata and token-level error labels are retained.
The output can be:
- a TSV file, where error labels are gathered into a single field
- a CoNLL-U file, where they are stored in each token's MISC field.

### Usage
```
python extract_sentence_pairs.py PATH/TO/sourceSweLL.xml PATH/TO/targetSweLL.xml [--format=tsv|conllu] [--outfile=OUTFILE]
```

Note that when `--format=conllu`, the program will write two files, named `org-OUTFILE` and `trg-OUTFILE`.
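For the CoNLL-U output, downstream code would read the error labels back out of each token's MISC column. A stdlib-only sketch of that step follows; the key name `label` is an assumption for illustration, not necessarily the one the script writes:

```python
# Hypothetical sketch of recovering values from a CoNLL-U MISC field,
# which uses "key=value|key=value" syntax ("_" means empty). The key
# name used for the error labels here is an assumption.
def parse_misc(misc):
    '''Parse a CoNLL-U MISC field into a dict.'''
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))

print(parse_misc("label=O|SpaceAfter=No"))
```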
11 changes: 11 additions & 0 deletions extract_swell_sentence_pairs.py → extract_sentence_pairs.py
@@ -113,6 +113,17 @@ def pair_up(source_dict: dict, target_dict: dict):
    return paired_up

def tokenlist(word_label_pairs, metadata):
    '''
    Build a conllu.TokenList out of the sentence-level information extracted
    from SweLL.
    Args:
        word_label_pairs (list): a list of word-correction label pairs.
        metadata (dict): metadata of the essay the sentence belongs to, directly in SweLL format.
    Returns:
        a conllu.TokenList representing the sentence.
    '''
    tokens = []
    for i, (word,label) in enumerate(word_label_pairs):
        tokens.append(conllu.Token(
