
[v2] Merge MIEB and MTEB tasks #2078

Open
Tracked by #1791
Samoed opened this issue Feb 16, 2025 · 6 comments

@Samoed (Collaborator) commented Feb 16, 2025

  1. Since in v2, we merged Retrieval and Reranking tasks, and MultiChoice appears to be a Reranking task with a single correct answer, can we merge it with the Retrieval task?
  2. If we have MultiChoiceAny2Any, do we still need MultiChoiceAny2Text?
  3. Can we create a shared AbsTaskClassification base and derive AbsTaskTextClassification and AbsTaskImageClassification from it, since they share the same logic in the evaluate function?
  4. Zero-shot classification seems more similar to ImageTextSimilarity—should we combine them?
  5. I think we could introduce Any2AnyPairClassification and Any2AnySTS.

Ref:

@Samoed Samoed added the v2 Issues and PRs related to `v2` branch label Feb 16, 2025
@Samoed Samoed added this to the v2.0.0 milestone Feb 16, 2025
@isaac-chung (Collaborator) commented Feb 19, 2025

Thanks for starting this! I've updated the issue to use question numbering. Tagging @gowitheflow-1998 while I take a deeper look into the AbsTasks. A few points to begin with:

  1. I don't think so. See its docstrings.
  2. Yes. See their respective docstrings.
  3. It's a good first suggestion. I'd prefer not to add another layer of abstraction; I'd rather work towards the proposal originally in this thread, where we create e.g. an AbsTaskAnyClassification that can handle any modalities while reusing the evaluate parts (rough sketch after this list).
  4. I was not able to find ImageTextSimilarity. Do you have a link, or do you mean ImageTextPairClassification?
  5. Agreed!
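
Here's that sketch for 3 — the class and method names below are made up for illustration (not the actual MTEB API), and the nearest-centroid probe is only a stand-in for the real shared evaluate logic:

```python
from abc import ABC, abstractmethod
from typing import Any

import numpy as np


class AbsTaskAnyClassification(ABC):
    """Hypothetical shared base: subclasses only decide how inputs are encoded."""

    @abstractmethod
    def encode_inputs(self, model: Any, inputs: list[Any]) -> np.ndarray:
        """Return one embedding per input (text, image, or both)."""

    def evaluate(self, model: Any, inputs: list[Any], labels: list[int]) -> dict[str, float]:
        # Shared evaluation logic lives here. A trivial nearest-centroid probe is
        # used (and fit/scored on the same data) purely to keep the sketch short.
        embeddings = self.encode_inputs(model, inputs)
        classes = sorted(set(labels))
        centroids = np.stack(
            [embeddings[[i for i, y in enumerate(labels) if y == c]].mean(axis=0) for c in classes]
        )
        predictions = [
            classes[int(np.argmin(np.linalg.norm(centroids - e, axis=1)))] for e in embeddings
        ]
        return {"accuracy": float(np.mean([p == y for p, y in zip(predictions, labels)]))}


class TextClassificationSketch(AbsTaskAnyClassification):
    def encode_inputs(self, model: Any, inputs: list[str]) -> np.ndarray:
        return np.asarray(model.encode(inputs))          # text encoder


class ImageClassificationSketch(AbsTaskAnyClassification):
    def encode_inputs(self, model: Any, inputs: list[Any]) -> np.ndarray:
        return np.asarray(model.encode_images(inputs))   # placeholder image-encoding call
```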

@Samoed (Collaborator, Author) commented Feb 19, 2025

  1. I see the docstring, but I can't find any differences in the code, except that each query has a single relevant document.
  2. I'm not sure I understand the difference between the docstrings. Also, it's a bit confusing that we have both AbsTaskAny2TextMultipleChoice and AbsTaskAny2AnyMultiChoice. Why do we need Any2Text when we have Any2Any?
  3. Yes, ImageTextPairClassification

@isaac-chung (Collaborator) commented:

  1. Hmm I see what you mean. It does seem like Any2AnyMC and Any2AnyRetrieval share a lot of code. The main diff is that the MC evaluator requires discounting qrels (see this section and this section). If there's a nice way to combine the two while maintaining this capability, I'd be happy to see that. Would love @gowitheflow-1998 's opinion as well.
  2. Other than the shared name, these two classes and their evaluators have very little code in common; e.g. check out nyu-visionx/CV-Bench for an example task. If the naming is confusing: this PR aims to align the task types with the paper, where the evaluated capabilities are used instead of the AbsTask name. So AbsTaskAny2TextMC -> VisionCentric.
  3. Similar to 2, the PR changes the ImageTextPairClassification task type to CompositionalityEvaluation. Indeed, at a very high level, both tasks involve comparing images to captions. But this is separate from zero-shot image classification, and I'd prefer to keep them as is. The biggest differences can be seen in the evaluators. For the AbsTasks, the biggest difference to me is how the datasets are formatted (a toy illustration follows the quote below). A similar argument could be made about BitextMining being similar to Reranking / Retrieval. Below is a quote from the MIEB paper for context:

Compositionality Evaluation: Albeit retrieval, clustering, and zero-shot classification tasks assess alignment, they
mostly require embeddings to be distinguishable on more or less a coarse-grained level, e.g., an image of a guitar to be
closer to its textual counterpart to that of a piano. Vision-Language Compositionality assesses the ability to infer whether the composition of a given set of elements aligns for an image and a text. Compositionality Evaluation provides a lens to inspect more fine-grained information encoded in the embeddings. Typically hosted by our image text pair classification implementation, it requires distinguishing ground truth and the hard negatives with small perturbations on the inputs, e.g., word order shuffling in ARObenchmark [53].
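
To make the dataset-format point concrete, a toy illustration (field names invented, not the actual dataset schemas):

```python
# Zero-shot image classification: one image, one label from a fixed class set;
# class names are typically scored via prompts like "a photo of a {class}".
zero_shot_example = {
    "image": "<image of a guitar>",
    "label": "guitar",
}

# Image-text pair classification (compositionality): one image, the ground-truth
# caption plus hard negatives built by small perturbations such as word-order shuffling.
pair_classification_example = {
    "image": "<image of a guitar on a chair>",
    "captions": [
        "a guitar on a chair",   # ground truth
        "a chair on a guitar",   # hard negative: shuffled word order
    ],
    "label": 0,                  # index of the correct caption
}
```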

@Samoed (Collaborator, Author) commented Feb 19, 2025

  1. So, perhaps we could change the evaluator for Any2AnyMC and keep the rest of the code the same as in Any2AnyRetrieval?
  2. I understand. Maybe we should reconsider the naming, though, as it is not clear exactly what VisionCentric means.

@gowitheflow-1998 (Contributor) commented:

Thanks for starting this! Any2AnyMC and Any2TextMC were implemented with completely different logic.

Any2TextMC supports datasets without turning them into retrieval format (query, corpus, qrels), e.g. tasks like https://huggingface.co/datasets/nyu-visionx/CV-Bench. At evaluation time, the unique candidate choices are first collected through a set operation and encoded, and their embeddings are compared with each query's embedding for scoring. However, it is harder to do that set operation on-the-fly on images (e.g. for tasks like https://huggingface.co/datasets/BLINK-Benchmark/BLINK), which is why we implemented Any2AnyMC with different logic.
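
Roughly, that flow looks like this (function and parameter names are illustrative, not the actual evaluator code):

```python
import numpy as np


def any2text_mc_accuracy(model, queries, choices_per_query, answer_per_query):
    """Illustrative only. `model.encode` is assumed to map a list of inputs
    (texts here, or queries of any modality) to a 2D array of embeddings."""
    # 1. Collect the unique textual candidates across all questions and encode them once.
    unique_choices = sorted({c for choices in choices_per_query for c in choices})
    choice_index = {c: i for i, c in enumerate(unique_choices)}
    choice_emb = np.asarray(model.encode(unique_choices))

    # 2. Encode the queries and, for each one, pick the most similar of *its own* choices.
    query_emb = np.asarray(model.encode(queries))
    correct = 0
    for q_emb, choices, answer in zip(query_emb, choices_per_query, answer_per_query):
        candidate_emb = choice_emb[[choice_index[c] for c in choices]]
        scores = candidate_emb @ q_emb  # dot-product similarity; cosine would also work
        correct += int(choices[int(np.argmax(scores))] == answer)
    return correct / len(queries)
```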

Any2AnyMC requires datasets to be pre-made in retrieval format. The difference from Any2AnyRetrieval is how they treat qrels. At evaluation time, only documents under the same query are considered (the ones that were candidates in the original multiple-choice format), and the others are masked out. With this, recall@1 attained by Any2AnyMC is the same as accuracy in the original multiple-choice format. When making a dataset compatible with Any2AnyMC, the original choices that are not the ground truth also have to be put in the qrels, but with a relevance score of 0; e.g. the qrels of https://huggingface.co/datasets/JamieSJS/blink-it2i-multi (BLINK multiple-choice) are twice as large as those of https://huggingface.co/datasets/JamieSJS/blink-it2i (BLINK retrieval). Any2AnyMC can further support non-typical retrieval tasks like r-Oxford and r-Paris (e.g. https://huggingface.co/datasets/JamieSJS/r-oxford-hard-multi), which require masking out different images in the corpus for every query. Such cases can't be supported by Any2AnyRetrieval.
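
A toy view of the qrels difference (ids invented):

```python
# Retrieval-style qrels (blink-it2i): only the ground-truth document appears.
retrieval_qrels = {
    "q1": {"doc_correct": 1},
}

# Multiple-choice-style qrels (blink-it2i-multi): every original choice appears,
# with distractors at relevance 0 — roughly doubling the qrels size here.
multichoice_qrels = {
    "q1": {"doc_correct": 1, "doc_distractor": 0},
}

# At evaluation time, only the keys of multichoice_qrels["q1"] are scored for q1
# (everything else in the corpus is masked out), so recall@1 over this candidate
# set equals accuracy in the original multiple-choice format.
```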

I personally think keeping them as is (instead of combining them with Any2AnyRetrieval) will be less confusing for people who try to add more MC tasks.

@Samoed (Collaborator, Author) commented Feb 19, 2025

All of the qrels logic is hidden in the evaluator. We could make the evaluator a parameter of the class; then the logic in AbsTaskRetrieval/Any2AnyMC would be the same, and a different evaluator could be created for r-Oxford.
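
Something like this, for illustration (none of these class names come from the current code):

```python
import numpy as np


class FullCorpusEvaluator:
    """Scores each query against every document (roughly what Any2AnyRetrieval does now)."""

    def __init__(self, queries, corpus, qrels):
        # queries/corpus: {id: content}; qrels: {query_id: {doc_id: relevance}}
        self.queries, self.corpus, self.qrels = queries, corpus, qrels

    def candidates(self, qid):
        return list(self.corpus)  # every document is a candidate

    def __call__(self, model):
        doc_emb = dict(zip(self.corpus, np.asarray(model.encode(list(self.corpus.values())))))
        hits = []
        for qid, query in self.queries.items():
            q_emb = np.asarray(model.encode([query]))[0]
            best = max(self.candidates(qid), key=lambda d: float(doc_emb[d] @ q_emb))
            hits.append(int(self.qrels[qid].get(best, 0) > 0))
        return {"recall_at_1": float(np.mean(hits))}


class CandidateMaskingEvaluator(FullCorpusEvaluator):
    """Only this query's qrels entries (the original MC choices) are scored."""

    def candidates(self, qid):
        return list(self.qrels[qid])


class AbsTaskRetrievalSketch:
    """Dataset loading/formatting would stay identical; only the evaluator is swapped."""

    evaluator_cls = FullCorpusEvaluator

    def evaluate(self, model, queries, corpus, qrels):
        return self.evaluator_cls(queries, corpus, qrels)(model)


class Any2AnyMultiChoiceSketch(AbsTaskRetrievalSketch):
    evaluator_cls = CandidateMaskingEvaluator


# An r-Oxford-style task could plug in yet another evaluator that masks a
# different subset of the corpus for every query.
```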
