
SloBENCH official evaluation scripts

This is an accompanying repository that contains the evaluation scripts used by the evaluation leaderboards in the SloBENCH tool - https://slobench.cjvt.si.

Submission evaluation methodology

The SloBENCH tool expects the user to upload a submission.zip file whose contents follow the rules of the specific leaderboard.

Note: the zip file must not contain any files that are not expected by the system (e.g., __MACOSX). To make sure your submission is sound, you may create it with the zip command from the command line, for example: zip submission.zip ./*.txt
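As an optional sanity check before uploading (not part of the official tooling), a small Python snippet along the following lines can list the archive and flag unexpected entries; the assumption that a submission consists of top-level .txt files is only an illustration and depends on the leaderboard:

import zipfile

# Hypothetical pre-upload check: flag archive members that the leaderboard
# probably does not expect, such as macOS metadata folders.
with zipfile.ZipFile("submission.zip") as archive:
    unexpected = [name for name in archive.namelist()
                  if name.startswith("__MACOSX") or not name.endswith(".txt")]

if unexpected:
    print("Unexpected entries in submission.zip:", unexpected)
else:
    print("submission.zip contains only .txt files")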

The uploaded submission file is automatically extracted along with the reference_dataset.zip file.

The run.py script unzips the ground truth data into the /data-reference path and the submitted data into /data-submission.

Then it runs the task's corresponding eval.py evaluation script, which compares the contents of the previously mentioned paths and returns a dictionary of metricName:metricScore pairs, for example:

{
    'overall': 88.2,
    'metric1': 100.0,
    'metric2': 32.1,
    'metric3': 123.33
}
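For orientation, an eval.py script roughly follows the pattern sketched below; the file layout, metric, and function names are assumptions for illustration only - consult the actual script of your target leaderboard:

import os

# Hypothetical eval.py sketch: compare files in /data-reference and
# /data-submission line by line and report a simple accuracy-style score.
REFERENCE_DIR = "/data-reference"
SUBMISSION_DIR = "/data-submission"

def evaluate():
    correct, total = 0, 0
    for file_name in os.listdir(REFERENCE_DIR):
        reference_path = os.path.join(REFERENCE_DIR, file_name)
        submission_path = os.path.join(SUBMISSION_DIR, file_name)
        with open(reference_path) as ref_file, open(submission_path) as sub_file:
            for ref_line, sub_line in zip(ref_file, sub_file):
                total += 1
                correct += ref_line.strip() == sub_line.strip()
    accuracy = 100.0 * correct / total if total else 0.0
    # run.py expects a dictionary of metricName:metricScore pairs
    return {'overall': accuracy, 'accuracy': accuracy}

if __name__ == '__main__':
    print(evaluate())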

The run.py script returns the evaluation results (or possibly an error) along with some runtime metadata in a Task Submission Evaluation Object, which is passed to the SloBENCH Web Server.
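The exact schema of this object is defined by the SloBENCH backend; as a rough, assumed illustration (the field names below are placeholders, not the official format), it bundles the metric dictionary with status and timing information, e.g.:

{
    'results': {'overall': 88.2, 'metric1': 100.0},
    'error': None,
    'metadata': {'evaluation_time_s': 12.4}
}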

Leaderboard evaluation

Each evaluation script is packaged into its own container - see the specific Dockerfile of your target leaderboard.

When a new task evaluation script is added to this repository, a Docker image is built and pushed to the slobench/eval:[TASK_NAME] Docker repository.

Compiling and running an evaluation locally

Build the Docker image from the root directory of this repository, cloned to your machine, as follows:

docker buildx build --platform linux/amd64 -t eval:TASK_NAME -f evaluation_scripts/TASK_NAME/Dockerfile .

Test your evaluation as follows:

docker run -it --name eval-container --rm \
-v $PWD/DATA_WITH_LABELS.zip:/ground_truth.zip \
-v $PWD/YOUR_SYSTEM_OUTPUT_DATA.zip:/submission.zip \
eval:TASK_NAME ground_truth.zip submission.zip

Change TASK_NAME accordingly and provide paths to your sample of ground truth/reference data (i.e., DATA_WITH_LABELS.zip) and your system's output for that reference data (i.e., YOUR_SYSTEM_OUTPUT_DATA.zip).

For more information, check the README file of the selected leaderboard.

Pushing an image to DockerHub

This repository is accompanied by the Docker Hub repository https://hub.docker.com/r/slobench/eval. Images are pushed from local builds using the following commands:

docker login
docker tag eval:TASK_NAME slobench/eval:TASK_NAME_VERSION
docker push slobench/eval:TASK_NAME_VERSION

Currently supported tasks

This repository supports the following tasks:

  • eval_question_answering: Evaluation of selected SuperGLUE-like QA tasks.
  • eval_sequence_tagging_conllu: General CoNLL-U-based evaluation tasks.
  • eval_sequence_tagging_tab: General sequence labelling evaluation tasks.
  • eval_conll2002: CoNLL 2002 NER evaluation.
  • eval_summarization: Text summarization evaluation.
  • eval_translation_en: Machine translation evaluation (English target).
  • eval_translation_sl: Machine translation evaluation (Slovene target).
  • eval_sequence_pair_classification: Sequence pair classification evaluation.
  • eval_speech_recognition: Automatic speech recognition evaluation.

The SloBENCH tool was developed as a [Clarin.si 2021 project](https://www.clarin.si/info/storitve/projekti).