feat: Added Hindi benchmark. #1882
base: main
Conversation
I've added a couple of questions, and things you should consider, but this is a good start :D
mteb/benchmarks/benchmarks.py (outdated)
@@ -1276,3 +1276,45 @@ def load_results(
    year={2024}
}""",
)

MTEB_INDIC = Benchmark(
Would this not overwrite the Indic benchmark above? I suppose this should have a different variable name.
Yes, I didn't see this. I will update it.
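For reference, a minimal sketch of what the rename could look like (the variable name MTEB_HIN is hypothetical; anything that does not rebind MTEB_INDIC works), written as it would appear inside mteb/benchmarks/benchmarks.py, where Benchmark and get_tasks are already in scope:

# Hypothetical variable name for the new Hindi benchmark; the existing
# MTEB_INDIC binding for the Indic benchmark stays untouched.
MTEB_HIN = Benchmark(
    name="MTEB(Hindi, beta)",
    tasks=get_tasks(
        languages=["hin"],
        exclusive_language_filter=True,
    ),
)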
name="MTEB(Hindi)", | ||
tasks=get_tasks( | ||
languages=["hin"], | ||
tasks=[ |
Are you not adding any novel tasks to the benchmark? How will this benchmark be different from selecting Hindi as the language on MTEB(Indic)?
I would also advise you to think about making the benchmark as zero-shot as possible: if many models are trained on a certain dataset, you should probably avoid adding it to the benchmark, as the performance might not be representative of the models' ability to generalize.
You can get the models trained on the benchmark's tasks like this:
from collections import defaultdict

import mteb

model_metas = mteb.get_model_metas()
tasks = mteb.get_benchmark("MTEB(Hindi)").tasks
task_names = [task.metadata.name for task in tasks]

models_trained_on_tasks = defaultdict(list)
for model_meta in model_metas:
    if model_meta.training_datasets is not None:
        for training_dataset in model_meta.training_datasets:
            if training_dataset in task_names:
                models_trained_on_tasks[model_meta.name].append(training_dataset)

print(models_trained_on_tasks)
# And then this would print something like:
# {
#     "Model1": ["task_1", "task_2", ...],
#     ...
# }
Yes, I completely agree with you. I will remove those datasets. Thanks for pointing it out.
mteb/benchmarks/benchmarks.py (outdated)
"WikipediaRerankingMultilingual", | ||
], | ||
exclusive_language_filter=True, | ||
eval_splits=["test", "validation", "dev"], |
Are you sure you want to use all of these splits for all tasks? Perhaps narrowing it down might be beneficial for certain tasks, if you think model trainers might use the dev and validation splits for validating their models.
I thought each dataset has either a test, a validation, or a dev split.
Yes, and for each task you should select only the splits it actually needs.
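As a rough sketch of what per-task split selection could look like inside the Benchmark(...) definition (the task-to-split assignments below are purely illustrative, and XNLIV2 is only an example task, not necessarily part of the final list), several get_tasks() calls can be concatenated so each group only evaluates on the split it needs:

tasks=get_tasks(
    tasks=["WikipediaRerankingMultilingual"],  # illustrative: tasks evaluated on "test"
    languages=["hin"],
    exclusive_language_filter=True,
    eval_splits=["test"],
) + get_tasks(
    tasks=["XNLIV2"],  # illustrative: a task evaluated on "validation"
    languages=["hin"],
    exclusive_language_filter=True,
    eval_splits=["validation"],
),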
eval_splits=["test", "validation", "dev"], | ||
), | ||
description="The Hindi Leaderboard benchmark extends the MTEB framework by incorporating Hindi-specific datasets and tasks derived from existing MTEB data. It evaluates text embedding models on a variety of tasks, including text classification, semantic similarity, and information retrieval, with a focus on Hindi language performance. This benchmark aims to provide a standardized evaluation platform to advance research and innovation in Hindi NLP, leveraging pre-existing datasets to ensure consistency and comparability across models.", | ||
reference=None, |
Do you not have a technical report yet? How are you progressing with the paper?
Right now I don't have a technical report, because we have not added any new model or dataset for any task. But we are planning to work on a new model, so we decided to describe this benchmark in that paper.
So is it possible to add the citation later?
If the benchmark is subject to change in the near future, I'm wondering whether it's worth merging it now. In any case, I would add a beta marker in the name to suggest that it will change in the future (name="MTEB(Hindi, beta)").
What do you think @KennethEnevoldsen?
Def. add the beta marker.
mteb/benchmarks/benchmarks.py (outdated)
        exclusive_language_filter=True,
        eval_splits=["test"],
    ) + get_tasks(
        tasks=["HindiDiscourseClassification", "SentimentAnalysisHindi"],
        eval_splits=["train"],
    ),
Tasks should not use train splits for evaluation. For classification tasks, the train split is automatically used during training.
When I checked the eval_splits for the task, it only had 'train' in /tasks/Classification/hin/HindiDiscourseClassification.py.
So if I set eval_splits='test', will it still use the train split? Or is validation not possible for this task?
I think yes, the train split is used, because it seems that this dataset has only one split.
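As a side note, a quick way to check which splits a task actually defines before choosing eval_splits (a sketch; it assumes mteb.get_task is available in your mteb version, otherwise mteb.get_tasks(tasks=[...])[0] works the same way):

import mteb

# Inspect the splits declared in the task's metadata.
task = mteb.get_task("HindiDiscourseClassification")
print(task.metadata.eval_splits)  # e.g. ["train"] if the dataset only ships a train split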
…they only have train split.
@SaileshP97 seems like this PR might have gotten a bit stale - are you still working on this?
Code Quality
- Run make lint to maintain consistent style.

Documentation

Testing
- Run make test-with-coverage.
- Run make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist
- Reason for dataset addition: ...
- Run the tasks using the mteb -m {model_name} -t {task_name} command, e.g. with:
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- If the dataset is large, consider using self.stratified_subsampling() under dataset_transform().
- Run tests locally using make test.
- Format the code using make lint.

Adding a model checklist
- Ensure the model can be loaded with mteb.get_model(model_name, revision) and mteb.get_model_meta(model_name, revision).