Commit

rename and document
harisont committed Apr 18, 2024
1 parent 3c56e50 commit 849e653
Showing 3 changed files with 43 additions and 14 deletions.
6 changes: 6 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
## Contributing
If you have any SweLL-related scripts, you are welcome to add your code, along with a few lines in the [README](README.md) saying:

- what the name of your script is
- what it does
- how to run it
40 changes: 26 additions & 14 deletions README.md
@@ -1,17 +1,9 @@
# SweLL-scripts
A collection of scripts for working with the [Swedish Learner Language corpus](https://spraakbanken.gu.se/resurser/swell-gold).
A collection of scripts for working with the Swedish Learner Language corpora, [SweLL-gold](https://spraakbanken.gu.se/resurser/swell-gold) and [SweLL-pilot](https://spraakbanken.gu.se/resurser/swell-pilot).

## Contributing
If you have any SweLL-related scripts, you are welcome to add your code and add a few lines to this README saying:
## [dataloaders.py](dataloaders.py)

- what the name of your script is
- what it does
- how to run it


## Dataloaders

Swell has individual files for each essay with the following format:
SweLL-pilot has individual files for each essay with the following format:

```
essay id: XXXXXX
@@ -24,11 +16,31 @@ target: XXXXXX [source text normalised]
svala graph: XXXXXX
```

The dataloaders folder has two functions at the moment:
The dataloaders script has two functions at the moment:

```read_swell_file``` takes the path to one of these files as input and returns a dictionary with an entry for each of the items in the aforementioned format.
For more details on what the dictionary's keys contain, see its documentation [here](https://github.com/spraakbanken/SweLL-scripts/blob/222a2d2e407dceed8985f9e23acdd49b97ee3b83/dataloaders.py#L85).

```read_swell_directory``` takes as an input the path to the unzipped Swell folder.
```read_swell_directory``` takes as input the path to the unzipped SweLL folder.
It will return a list with the dictionaries from the ```read_swell_file``` for each essay in the folder.
Note that it has only been tested with Swell-Pilot as of 2024-04-15.

Note that it has only been tested with SweLL-pilot as of 2024-04-15.
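As a rough illustration of the kind of dictionary `read_swell_file` produces, a minimal parser for the `key: value` format shown above could look like the sketch below. The parsing logic and the example values are assumptions for illustration, not the script's actual implementation:

```python
# Hypothetical sketch of parsing one essay file in the "key: value"
# format shown above; the real read_swell_file in dataloaders.py may
# differ (e.g. in how multi-line values are handled).
def parse_essay(text):
    '''Return a dict with one entry per "key: value" line.'''
    entries = {}
    for line in text.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            entries[key] = value.strip()
    return entries

example = "essay id: A1\ntarget: ett exempel\nsvala graph: {}"
print(parse_essay(example))
```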

## [extract_sentence_pairs.py](extract_sentence_pairs.py)
A script to extract sentence-correction (also known as original-target) pairs from SweLL-gold XML files.

Alignments are obtained by simply filtering out essays where the original and target do not have the same number of sentences.

To facilitate further processing, anonymization placeholders such as "X-stad" and "X-land-gen" are replaced with naïve pseudonyms such as "Berlin" and "Tysklands" (cf. `placeholder_map`).
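The replacement step can be sketched as follows. The two entries are the README's own examples; the full `placeholder_map` in the script is not reproduced here:

```python
# Sketch of the pseudonymization step; the two mappings below are the
# README's examples, not the full placeholder_map from the script.
placeholder_map = {"X-stad": "Berlin", "X-land-gen": "Tysklands"}

def pseudonymize(sentence):
    # Replace longer placeholders first, so that a shorter entry
    # (e.g. a hypothetical "X-land") never partially matches inside
    # a longer one like "X-land-gen".
    for placeholder in sorted(placeholder_map, key=len, reverse=True):
        sentence = sentence.replace(placeholder, placeholder_map[placeholder])
    return sentence

print(pseudonymize("Jag bor i X-stad."))  # → "Jag bor i Berlin."
```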

Both essay metadata and token-level error labels are retained.
The output can be:
- a TSV file, where error labels are gathered into a single field
- a CoNLL-U file, where they are stored in each token's MISC field.

### Usage
```
python extract_sentence_pairs.py PATH/TO/sourceSweLL.xml PATH/TO/targetSweLL.xml [--format=tsv|conllu] [--outfile=OUTFILE]
```

Note that when `--format=conllu`, the program will write two files, named `org-OUTFILE` and `trg-OUTFILE`.
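For the CoNLL-U output, downstream code would read the error labels back out of each token's MISC column. A stdlib-only sketch of that step follows; the key name `label` is an assumption for illustration, not necessarily the one the script writes:

```python
# Hypothetical sketch of recovering values from a CoNLL-U MISC field,
# which uses "key=value|key=value" syntax ("_" means empty). The key
# name used for the error labels here is an assumption.
def parse_misc(misc):
    '''Parse a CoNLL-U MISC field into a dict.'''
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))

print(parse_misc("label=O|SpaceAfter=No"))
```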
11 changes: 11 additions & 0 deletions extract_swell_sentence_pairs.py → extract_sentence_pairs.py
@@ -113,6 +113,17 @@ def pair_up(source_dict: dict, target_dict: dict):
    return paired_up

def tokenlist(word_label_pairs, metadata):
    '''
    Build a conllu.TokenList out of the sentence-level information extracted
    from SweLL.
    Args:
        word_label_pairs (list): a list of word-correction label pairs.
        metadata (dict): metadata of the essay the sentence belongs to, directly in SweLL format.
    Returns:
        a conllu.TokenList representing the sentence.
    '''
    tokens = []
    for i, (word,label) in enumerate(word_label_pairs):
        tokens.append(conllu.Token(
