This repository contains two datasets:
NLU-Data-Home-Domain-Annotated-All-de-en.csv: A labeled multi-domain (21 domains) German and English dataset with 25K user utterances for human-robot interaction.
NLU-Data-Home-Domain-similarity-de.csv: German sentence pairs from NLU-Data-Home-Domain-Annotated-All-de-en.csv with semantic similarity scores.
The NLU-Data-Home-Domain-Annotated-All-de-en.csv dataset was collected and annotated for evaluating NLU services and platforms. The detailed paper on this dataset can be found at arXiv.org: "Benchmarking Natural Language Understanding Services for building Conversational Agents".
The dataset builds on the annotated data of the xliuhw/NLU-Evaluation-Data repository. We added a column (answer_de) by translating the texts in the answer column into German. The translation was made with DeepL.
The NLU-Data-Home-Domain-similarity-de.csv dataset contains 1,127 German sentence pairs, each with a similarity score. The similarity score follows the approach in the STSbenchmark dataset and the SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Cross-lingual Focused Evaluation paper. The dataset can be used to train and improve SentenceTransformer models like our Cross English & German RoBERTa for Sentence Embeddings.
The following table describes our label scheme in detail:
similarity_score | German explanation | English explanation |
---|---|---|
5 | völlig gleichwertig | fully equivalent |
4 | völlig gleichwertig - aber einige unwichtige Details unterscheiden sich | completely equivalent - but some unimportant details differ |
3 | ungefähr gleichwertig - aber einige wichtige Details unterscheiden sich | roughly equivalent - but some important details differ |
2 | nicht gleichwertig - haben aber sehr ähnliche gemeinsame Details | not equivalent - but have very similar common details |
1 | nicht gleichwertig - betreffen aber dasselbe Thema | not equivalent - but concern the same subject |
0 | völlig unähnlich | completely dissimilar |
The annotated dataset can be loaded with pandas:
import pandas as pd
df = pd.read_csv("NLU-Data-Home-Domain-Annotated-All-de-en.csv")
The similarity dataset can be loaded with pandas:
import pandas as pd
df = pd.read_csv("NLU-Data-Home-Domain-similarity-de.csv")
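Training setups for sentence embedding models commonly expect similarity labels in the range 0 to 1 rather than 0 to 5. The following is a minimal sketch of that rescaling, not part of the original repository. It uses a small in-memory DataFrame with made-up rows instead of the CSV file. The column names sentence_a and sentence_b are hypothetical, and similarity_score is assumed from the table header of the label scheme, so check df.columns on the real file first.

```python
import pandas as pd

# Made-up example rows; "sentence_a" and "sentence_b" are hypothetical column
# names and may differ from the real CSV.
df = pd.DataFrame(
    {
        "sentence_a": ["schalte das Licht ein", "wie wird das Wetter", "spiele Musik"],
        "sentence_b": ["mach das Licht an", "regnet es morgen", "stelle einen Wecker"],
        "similarity_score": [5, 3, 0],
    }
)

# Map the 0-5 annotation scale onto [0, 1].
df["label"] = df["similarity_score"] / 5.0

print(df[["similarity_score", "label"]])
```

Dividing by 5 keeps the ordering of the label scheme intact: a score of 5 (fully equivalent) becomes 1.0 and a score of 0 (completely dissimilar) stays 0.0.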
The original dataset contains some NaN values in the answer column. This means that there are also NaN values in the translations (answer_de column). With df loaded from NLU-Data-Home-Domain-Annotated-All-de-en.csv, the affected rows can be selected as follows:
answer_nan = df[df["answer"].isnull()]
answer_de_nan = df[df["answer_de"].isnull()]
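To work only with complete English-German pairs, the rows with missing values can be dropped instead of selected. This is a minimal sketch using a small in-memory DataFrame with made-up rows in place of the CSV file. Only the answer and answer_de column names come from the dataset description; everything else is illustrative.

```python
import pandas as pd

# Tiny stand-in for the annotated CSV; the rows are made up for illustration.
df = pd.DataFrame(
    {
        "answer": ["turn on the light", None, "play some jazz"],
        "answer_de": ["schalte das Licht ein", None, "spiele etwas Jazz"],
    }
)

# Keep only rows where both the English text and its German translation exist.
df_clean = df.dropna(subset=["answer", "answer_de"]).reset_index(drop=True)

# Aligned (English, German) pairs.
pairs = list(zip(df_clean["answer"], df_clean["answer_de"]))
print(pairs)
```

Passing both columns to subset drops a row if either value is missing, which also covers any case where only one of the two columns is empty.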
Copyright (c) the authors of xliuhw/NLU-Evaluation-Data
Copyright (c) 2022 Philip May
All data is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You should have received a copy of the license along with this dataset. If not, see http://creativecommons.org/licenses/by/4.0/.