
feat: Added Hindi benchmark. #1882

Open

SaileshP97 wants to merge 8 commits into main
Conversation

@SaileshP97 commented Jan 27, 2025

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command (a Python sketch is shown after this checklist).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
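For reference, a minimal Python sketch of running the two baseline models listed above (equivalent to the mteb -m ... -t ... CLI call); the task name and output folder below are just examples taken from this PR:

import mteb

# The two baseline models named in the checklist above.
for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = mteb.get_model(model_name)
    # "WikipediaRerankingMultilingual" is one of the tasks referenced in this PR.
    tasks = mteb.get_tasks(tasks=["WikipediaRerankingMultilingual"], languages=["hin"])
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(model, output_folder="results")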

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision) (a short sketch follows this checklist)
  • I have tested the implementation works on a representative set of tasks.
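For example, a quick sanity check of both loaders could look like this (a sketch; the optional revision argument is omitted):

import mteb

# Load the model implementation and its metadata via the two entry points above.
model = mteb.get_model("intfloat/multilingual-e5-small")
meta = mteb.get_model_meta("intfloat/multilingual-e5-small")
print(meta.name, meta.revision)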

@Samoed changed the title from Added Hindi benchmark. to feat: Added Hindi benchmark. on Jan 28, 2025
@x-tabdeveloping (Collaborator) left a comment

I've added a couple of questions and things you should consider, but this is a good start :D

@@ -1276,3 +1276,45 @@ def load_results(
year={2024}
}""",
)

MTEB_INDIC = Benchmark(
Collaborator:
Would this not overwrite the Indic benchmark above? I suppose this should have a different variable name.

Author:
Yes, I didn't see this. I will update it.

name="MTEB(Hindi)",
tasks=get_tasks(
languages=["hin"],
tasks=[
Collaborator:
Are you not adding any novel tasks to the benchmark? How will this benchmark be different from selecting Hindi as the language on MTEB(Indic)?
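One way to inspect that overlap (a sketch; it assumes the existing Indic benchmark is registered under the name "MTEB(Indic)" and mirrors the exclusive_language_filter flag used in this PR):

import mteb

# Tasks selectable with a Hindi-only language filter
hindi_tasks = {t.metadata.name for t in mteb.get_tasks(languages=["hin"], exclusive_language_filter=True)}
# Tasks already covered by the existing Indic benchmark
indic_tasks = {t.metadata.name for t in mteb.get_benchmark("MTEB(Indic)").tasks}

print(sorted(hindi_tasks & indic_tasks))  # Hindi tasks already in MTEB(Indic)
print(sorted(hindi_tasks - indic_tasks))  # Hindi tasks that would be new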

Collaborator:
I would also advise you to think about making the benchmark as zero-shot as possible. Meaning: if many models train on a certain dataset, you should probably avoid adding it to the benchmark, as the performance might not be representative of a model's ability to generalize.
You can get these for a given task by saying:

from collections import defaultdict

import mteb

model_metas = mteb.get_model_metas()
tasks = mteb.get_benchmark("MTEB(Hindi)").tasks
task_names = [task.metadata.name for task in tasks]

models_trained_on_tasks = defaultdict(list)
for model_meta in model_metas:
    if model_meta.training_datasets is not None:
        for training_dataset in model_meta.training_datasets:
            if training_dataset in task_names:
                models_trained_on_tasks[model_meta.name].append(training_dataset)

print(models_trained_on_tasks)
# And then this would print something like:
# {
#     "Model1": ["task_1", "task_2", ...],
#     ...
# }

Author:
Yes, I completely agree with you. I will remove those datasets. Thanks for pointing it out.

"WikipediaRerankingMultilingual",
],
exclusive_language_filter=True,
eval_splits=["test", "validation", "dev"],
Collaborator:
Are you sure you want to use all of these splits for all tasks? Perhaps narrowing it down might be beneficial for certain tasks, if you think model trainers might use the dev and validation splits for validating their models.

Author:
I thought each dataset has either a test, validation, or dev split.

Collaborator:
Yes, and for each task you should select the splits that are needed.
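A sketch of what per-task split selection could look like, by concatenating several get_tasks calls (the second task name is a hypothetical placeholder, and the split choices are only illustrative):

from mteb import get_tasks

tasks = get_tasks(
    tasks=["WikipediaRerankingMultilingual"],
    languages=["hin"],
    eval_splits=["test"],
) + get_tasks(
    tasks=["SomeHindiClassificationTask"],  # hypothetical task name
    eval_splits=["validation"],
)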

eval_splits=["test", "validation", "dev"],
),
description="The Hindi Leaderboard benchmark extends the MTEB framework by incorporating Hindi-specific datasets and tasks derived from existing MTEB data. It evaluates text embedding models on a variety of tasks, including text classification, semantic similarity, and information retrieval, with a focus on Hindi language performance. This benchmark aims to provide a standardized evaluation platform to advance research and innovation in Hindi NLP, leveraging pre-existing datasets to ensure consistency and comparability across models.",
reference=None,
Collaborator:
Do you not have a technical report yet? How are you progressing with the paper?

Author:
Right now I don't have a technical report, because we have not added any new model or dataset for any task. But we are planning to work on a new model, so we decided to describe this benchmark in that paper.

So is it possible to add the citation later?

Collaborator:
If the benchmark is subject to change in the near future, I'm wondering if it's worth it to merge it now. In any case I would add a beta marker in the name to suggest that it will change in the future. (name="MTEB(Hindi, beta)").
What do you think @KennethEnevoldsen ?

Contributor:
def. add the beta marker

exclusive_language_filter=True,
eval_splits=["test"],
) + get_tasks(tasks=["HindiDiscourseClassification",
"SentimentAnalysisHindi"], eval_splits=["train"]),
Collaborator:
Tasks should not use train splits for evaluation. For classification tasks, the train split is automatically used for training.

Author:
When I checked the eval_splits for the task in /tasks/Classification/hin/HindiDiscourseClassification.py, it only had 'train'.
So if I set eval_splits='test', will it still use the train split? Or is validation not possible on this task?

Collaborator:
I think yes, the train split is used, because it seems that this dataset has only one split.
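A quick way to check which splits a task actually declares (a sketch, assuming the current TaskMetadata layout):

import mteb

task = mteb.get_tasks(tasks=["HindiDiscourseClassification"])[0]
print(task.metadata.eval_splits)  # e.g. ["train"] if that is the only split the dataset ships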

@KennethEnevoldsen (Contributor) commented:
@SaileshP97 seems like this PR might have gotten a bit stale - are you still working on this?
