This is the repository for the paper Detecting Personal Identifiable Information in Swedish Learner Essays (Szawerna et al., CALD-pseudo-WS 2024), in which we investigate the possibility of a) using Swedish BERT for detecting PIIs in L2 learner essays and b) using a simple IOB annotation to signify the PII vs. not-PII difference. Out of respect for the privacy of the data subjects and due to legal concerns, we are unable to share the original data. You can apply for access to the already pseudonymized SweLL data here.
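To illustrate the annotation scheme, here is a minimal sketch of what single-class IOB tagging might look like on a made-up sentence, in a two-column token/tag layout (the tokens and tag names are illustrative assumptions, not an excerpt from the actual data):

```
Jag O
heter O
Anna B-PII
Svensson I-PII
och O
bor O
i O
Lund B-PII
. O
```

With a single PII class, the tagset reduces to three labels (B, I, O), which matches the three class weights used in the loss function below.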
The token classification with Transformers is based on this example. Please make sure that you have this code saved; in our case we kept it in a subfolder of this repository called `bert` (which in this repository only contains our custom file for running everything). Once you have the code, the following steps need to be carried out in order to enable the weighted loss function option:
- Locate the `run_ner.py` file.
- Replace lines 247 to 255 (initializing a `Trainer()` object) with the following:
```python
# make sure these imports are present at the top of run_ner.py
import torch
from torch import nn

weighted = True  # change to False for the unweighted loss

# custom trainer to be used with weighted CrossEntropyLoss
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (here: 3 labels with different weights)
        if weighted:
            loss_fct = nn.CrossEntropyLoss(
                weight=torch.tensor(
                    [12.64419148, 167.90310078, 0.34305829],
                    device=model.device,
                )
            )
        else:
            loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# Initialize our Trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
```
- The weights in the weighted loss can be altered if needed. Unfortunately, switching between the weighted and unweighted loss has to be done in the file itself.
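As a side note, here is a minimal sketch of one way such class weights can be derived, using the common "balanced" inverse-frequency heuristic; the label names and counts below are hypothetical placeholders, not the actual SweLL statistics:

```python
import torch

# Hypothetical label counts -- replace with the counts from your own
# training data, in the same order as the model's label list.
label_counts = {"B-PII": 1_000, "I-PII": 75, "O": 36_000}

total = sum(label_counts.values())
num_labels = len(label_counts)

# "Balanced" weighting: total / (num_labels * count), so that rare
# labels receive proportionally larger weights.
weights = torch.tensor(
    [total / (num_labels * count) for count in label_counts.values()]
)
print(weights)  # pass as nn.CrossEntropyLoss(weight=weights.to(model.device))
```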
Once this preparation is done, and if you have the appropriate SweLL-pilot files, you can do the following to re-run the experiments:
- In the main folder, run `python3 reannotate_iob.py [INPUT/SWELL FOLDER] [OUTPUT FOLDER] [optional flags]`
- `cd ./data/`
- `sh preproc.sh` (optionally also `cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$" | sort | uniq > labels.txt`, needed only once if you don't have the labels file)
- `cd ../bert/`
- `sh run_iob.sh` (note: if you want to toggle between the weighted and unweighted CrossEntropyLoss, you have to do it manually in `run_ner.py`; the same goes for changing or adding settings in `run_iob.sh`)
- `cd ../`
- `python3 analyze_output.py [OUTPUT FOLDER] [MODEL NAME]` for each of the custom-trained models.
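After training, one quick way to sanity-check a fine-tuned model is the Transformers token-classification pipeline. A minimal sketch, assuming the model was saved to a hypothetical `./bert/output/` directory (the path and the example sentence are placeholders):

```python
from transformers import pipeline

# Hypothetical path: point this at wherever your run_iob.sh
# configuration saved the fine-tuned model.
tagger = pipeline("token-classification", model="./bert/output/")

# Illustrative Swedish sentence; each prediction is a dict with the
# predicted tag, confidence score, and character offsets.
for pred in tagger("Jag heter Anna och bor i Lund."):
    print(pred["word"], pred["entity"], round(pred["score"], 3))
```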
This code is released under the CRAPL academic-strength open source license.