
[v2] Merge MIEB and MTEB tasks #2078

Open
Tracked by #1791
Samoed opened this issue Feb 16, 2025 · 6 comments

@Samoed (Collaborator) commented Feb 16, 2025

  1. Since in v2, we merged Retrieval and Reranking tasks, and MultiChoice appears to be a Reranking task with a single correct answer, can we merge it with the Retrieval task?
  2. If we have MultiChoiceAny2Any, do we still need MultiChoiceAny2Text?
  3. Can we create a shared AbsTaskClassification base and derive AbsTaskTextClassification and AbsTaskImageClassification from it, since they share the same logic in the evaluate function?
  4. Zero-shot classification seems more similar to ImageTextSimilarity—should we combine them?
  5. I think we could introduce Any2AnyPairClassification and Any2AnySTS.

Ref:

@Samoed Samoed added the v2 Issues and PRs related to `v2` branch label Feb 16, 2025
@Samoed Samoed added this to the v2.0.0 milestone Feb 16, 2025
@isaac-chung (Collaborator) commented Feb 19, 2025

Thanks for starting this! I've updated the issue to use question numbering. Tagging @gowitheflow-1998 while I take a deeper look into the AbsTasks. A few points to begin with:

  1. I don't think so. See its docstrings.
  2. Yes. See their respective docstrings.
  3. It's a good first suggestion. I'd prefer not to add another layer of abstraction; I'd rather work towards the proposal originally in this thread, where we create e.g. an AbsTaskAnyClassification that can handle any modalities while reusing the evaluate parts (rough sketch after this list).
  4. I was not able to find ImageTextSimilarity. Do you have a link, or do you mean ImageTextPairClassification?
  5. Agreed!
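
Here's that sketch for 3 — the class and method names below are made up for illustration (not the actual MTEB API), and the nearest-centroid probe is only a stand-in for the real shared evaluate logic:

```python
from abc import ABC, abstractmethod
from typing import Any

import numpy as np


class AbsTaskAnyClassification(ABC):
    """Hypothetical shared base: subclasses only decide how inputs are encoded."""

    @abstractmethod
    def encode_inputs(self, model: Any, inputs: list[Any]) -> np.ndarray:
        """Return one embedding per input (text, image, or both)."""

    def evaluate(self, model: Any, inputs: list[Any], labels: list[int]) -> dict[str, float]:
        # Shared evaluation logic lives here. A trivial nearest-centroid probe is
        # used (and fit/scored on the same data) purely to keep the sketch short.
        embeddings = self.encode_inputs(model, inputs)
        classes = sorted(set(labels))
        centroids = np.stack(
            [embeddings[[i for i, y in enumerate(labels) if y == c]].mean(axis=0) for c in classes]
        )
        predictions = [
            classes[int(np.argmin(np.linalg.norm(centroids - e, axis=1)))] for e in embeddings
        ]
        return {"accuracy": float(np.mean([p == y for p, y in zip(predictions, labels)]))}


class TextClassificationSketch(AbsTaskAnyClassification):
    def encode_inputs(self, model: Any, inputs: list[str]) -> np.ndarray:
        return np.asarray(model.encode(inputs))          # text encoder


class ImageClassificationSketch(AbsTaskAnyClassification):
    def encode_inputs(self, model: Any, inputs: list[Any]) -> np.ndarray:
        return np.asarray(model.encode_images(inputs))   # placeholder image-encoding call
```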

@Samoed (Collaborator, Author) commented Feb 19, 2025

  1. I see the docstring, but I can't find any differences in the code, except that each query has a single relevant document.
  2. I'm not sure I understand the difference between the docstrings. Also, it's a bit confusing that we have both AbsTaskAny2TextMultipleChoice and AbsTaskAny2AnyMultiChoice. Why do we need Any2Text when we have Any2Any?
  3. Yes, ImageTextPairClassification

@isaac-chung (Collaborator) commented:

  1. Hmm I see what you mean. It does seem like Any2AnyMC and Any2AnyRetrieval share a lot of code. The main diff is that the MC evaluator requires discounting qrels (see this section and this section). If there's a nice way to combine the two while maintaining this capability, I'd be happy to see that. Would love @gowitheflow-1998 's opinion as well.
  2. Other than the shared name, these two classes and their evaluators have very little code in common; e.g. check out nyu-visionx/CV-Bench for an example task. If the naming is confusing: this PR aims to align the task types with the paper, where the evaluated capabilities are used instead of the AbsTask name. So AbsTaskAny2TextMC -> VisionCentric.
  3. Similar to 2, the PR changes the ImageTextPairClassification task type to CompositionalityEvaluation. Indeed, at a very high level, both tasks involve comparing images to captions. But this is separate from zero-shot image classification, and I'd prefer to keep them as is. The biggest differences can be seen in the evaluators. For the AbsTasks, the biggest difference to me is how the datasets are formatted (a toy illustration follows the quote below). A similar argument could be made about BitextMining being similar to Reranking / Retrieval. Below is a quote from the MIEB paper for context:

Compositionality Evaluation: Albeit retrieval, clustering, and zero-shot classification tasks assess alignment, they
mostly require embeddings to be distinguishable on more or less a coarse-grained level, e.g., an image of a guitar to be
closer to its textual counterpart to that of a piano. Vision-Language Compositionality assesses the ability to infer whether the composition of a given set of elements aligns for an image and a text. Compositionality Evaluation provides a lens to inspect more fine-grained information encoded in the embeddings. Typically hosted by our image text pair classification implementation, it requires distinguishing ground truth and the hard negatives with small perturbations on the inputs, e.g., word order shuffling in ARObenchmark [53].
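
To make the dataset-format point concrete, a toy illustration (field names invented, not the actual dataset schemas):

```python
# Zero-shot image classification: one image, one label from a fixed class set;
# class names are typically scored via prompts like "a photo of a {class}".
zero_shot_example = {
    "image": "<image of a guitar>",
    "label": "guitar",
}

# Image-text pair classification (compositionality): one image, the ground-truth
# caption plus hard negatives built by small perturbations such as word-order shuffling.
pair_classification_example = {
    "image": "<image of a guitar on a chair>",
    "captions": [
        "a guitar on a chair",   # ground truth
        "a chair on a guitar",   # hard negative: shuffled word order
    ],
    "label": 0,                  # index of the correct caption
}
```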

@Samoed (Collaborator, Author) commented Feb 19, 2025

  1. So, perhaps we could change the evaluator for Any2AnyMC and keep the rest of the code the same as in Any2AnyRetrieval?
  2. I understand. Maybe we should reconsider the naming, though, as it is not clear exactly what VisionCentric means.

@gowitheflow-1998 (Contributor) commented:

Thanks for starting this! Any2AnyMC and Any2TextMC were implemented with completely different logic.

Any2TextMC supports datasets without turning them into retrieval format (query, corpus, qrels), e.g. tasks like https://huggingface.co/datasets/nyu-visionx/CV-Bench. At evaluation time, the unique candidate choices are first collected through a set operation and encoded, and their embeddings are compared with each query's embedding for scoring. However, it is harder to do that set operation on-the-fly on images (e.g. for tasks like https://huggingface.co/datasets/BLINK-Benchmark/BLINK), which is why we implemented Any2AnyMC with different logic.
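
Roughly, that flow looks like this (function and parameter names are illustrative, not the actual evaluator code):

```python
import numpy as np


def any2text_mc_accuracy(model, queries, choices_per_query, answer_per_query):
    """Illustrative only. `model.encode` is assumed to map a list of inputs
    (texts here, or queries of any modality) to a 2D array of embeddings."""
    # 1. Collect the unique textual candidates across all questions and encode them once.
    unique_choices = sorted({c for choices in choices_per_query for c in choices})
    choice_index = {c: i for i, c in enumerate(unique_choices)}
    choice_emb = np.asarray(model.encode(unique_choices))

    # 2. Encode the queries and, for each one, pick the most similar of *its own* choices.
    query_emb = np.asarray(model.encode(queries))
    correct = 0
    for q_emb, choices, answer in zip(query_emb, choices_per_query, answer_per_query):
        candidate_emb = choice_emb[[choice_index[c] for c in choices]]
        scores = candidate_emb @ q_emb  # dot-product similarity; cosine would also work
        correct += int(choices[int(np.argmax(scores))] == answer)
    return correct / len(queries)
```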

Any2AnyMC requires datasets to be pre-made in retrieval format. The difference from Any2AnyRetrieval is how they treat qrels. At evaluation time, only documents under the same query are considered (the ones that were candidates in the original multiple-choice format), and the others are masked out. With this, recall@1 attained by Any2AnyMC is the same as accuracy in the original multiple-choice format. When making a dataset compatible with Any2AnyMC, the original choices that are not the ground truth also have to be put in the qrels, but with a relevance score of 0; e.g. the qrels of https://huggingface.co/datasets/JamieSJS/blink-it2i-multi (BLINK multiple-choice) are twice as large as those of https://huggingface.co/datasets/JamieSJS/blink-it2i (BLINK retrieval). Any2AnyMC can further support non-typical retrieval tasks like r-Oxford and r-Paris (e.g. https://huggingface.co/datasets/JamieSJS/r-oxford-hard-multi), which require masking out different images in the corpus for every query. Such cases can't be supported by Any2AnyRetrieval.
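
A toy view of the qrels difference (ids invented):

```python
# Retrieval-style qrels (blink-it2i): only the ground-truth document appears.
retrieval_qrels = {
    "q1": {"doc_correct": 1},
}

# Multiple-choice-style qrels (blink-it2i-multi): every original choice appears,
# with distractors at relevance 0 — roughly doubling the qrels size here.
multichoice_qrels = {
    "q1": {"doc_correct": 1, "doc_distractor": 0},
}

# At evaluation time, only the keys of multichoice_qrels["q1"] are scored for q1
# (everything else in the corpus is masked out), so recall@1 over this candidate
# set equals accuracy in the original multiple-choice format.
```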

I personally think keeping them as is (instead of combining them with Any2AnyRetrieval) will be less confusing for people who try to add more MC tasks.

@Samoed (Collaborator, Author) commented Feb 19, 2025

All of the qrels logic is hidden in the evaluator. We could make the evaluator a parameter of the class; then the logic in AbsTaskRetrieval/Any2AnyMC would be the same, and a different evaluator could be created for r-Oxford.
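
Something like this, for illustration (none of these class names come from the current code):

```python
import numpy as np


class FullCorpusEvaluator:
    """Scores each query against every document (roughly what Any2AnyRetrieval does now)."""

    def __init__(self, queries, corpus, qrels):
        # queries/corpus: {id: content}; qrels: {query_id: {doc_id: relevance}}
        self.queries, self.corpus, self.qrels = queries, corpus, qrels

    def candidates(self, qid):
        return list(self.corpus)  # every document is a candidate

    def __call__(self, model):
        doc_emb = dict(zip(self.corpus, np.asarray(model.encode(list(self.corpus.values())))))
        hits = []
        for qid, query in self.queries.items():
            q_emb = np.asarray(model.encode([query]))[0]
            best = max(self.candidates(qid), key=lambda d: float(doc_emb[d] @ q_emb))
            hits.append(int(self.qrels[qid].get(best, 0) > 0))
        return {"recall_at_1": float(np.mean(hits))}


class CandidateMaskingEvaluator(FullCorpusEvaluator):
    """Only this query's qrels entries (the original MC choices) are scored."""

    def candidates(self, qid):
        return list(self.qrels[qid])


class AbsTaskRetrievalSketch:
    """Dataset loading/formatting would stay identical; only the evaluator is swapped."""

    evaluator_cls = FullCorpusEvaluator

    def evaluate(self, model, queries, corpus, qrels):
        return self.evaluator_cls(queries, corpus, qrels)(model)


class Any2AnyMultiChoiceSketch(AbsTaskRetrievalSketch):
    evaluator_cls = CandidateMaskingEvaluator


# An r-Oxford-style task could plug in yet another evaluator that masks a
# different subset of the corpus for every query.
```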
