Why does opencompass exist if lm-harness already exists and is more widely used? #1428

brando90 · 2024-08-16T19:53:37Z

brando90
Aug 16, 2024

Why does OpenCompass exist if lm-harness already exists and is more widely used? Ref: https://github.com/EleutherAI/lm-evaluation-harness

tonysy · 2024-08-17T08:47:42Z

tonysy
Aug 17, 2024
Maintainer

Why does PyTorch exist if TensorFlow already exists? And why are new models released if Llama already exists?

1 reply

brando90 Aug 19, 2024
Author

@tonysy

Thank you for your response and for raising this point. I appreciate the comparison to PyTorch and TensorFlow, which indeed introduced fundamentally different paradigms with their dynamic and static graph implementations. However, I think the situation here is distinct and warrants different consideration.

In the case of PyTorch and TensorFlow, both frameworks were testing specific, conceptually novel ideas—dynamic vs. static graphs—each bringing unique benefits and trade-offs. They were exploring different ways to build and optimize models, which justified their independent existence and development.

However, when it comes to evaluation benchmarks, the scenario is quite different. Evaluation is a critical component, much like cryptography, where correctness and consistency are paramount. Re-implementing benchmarks without clear, conceptual innovation can lead to confusion and inconsistency, which are detrimental to the validity and comparability of results. Just as one wouldn’t typically re-implement cryptographic algorithms due to their complexity and the risks involved, it is crucial that evaluation benchmarks adhere to established, standardized methods to maintain integrity and reliability.

Given your question, I looked up the difference between Stanford's HELM and EleutherAI's LM-Harness. Based on what I found, the difference in focus and scope between these two frameworks doesn’t justify having both implementations, though I admit that is only my opinion. While PyTorch vs. TensorFlow represented non-obvious but big, important bets on fundamentally different paradigms, the same doesn't apply here. However, I concede I might be wrong, which is why I'm asking you. ;)

Moreover, it isn't my job to justify why an evaluation framework that I didn’t write exists. I believe that responsibility lies with those who developed it. This line of questioning seems like an attempt to avoid or deflect from the original question. Just as I wouldn't write the motivations or related work sections for someone else's paper, it shouldn't be my responsibility to justify the existence of these different frameworks. Given your response, my expectation is that there isn't a good reason; otherwise you would have given it instead of deflecting to irrelevant, difficult-to-compare scenarios.

Finally, new models are usually released because of economic or reputational incentives. We can debate whether that is justified if you'd like, but I don't think you are interested in that so much as in avoiding my question.

I hope this clarifies my position and the reasoning behind my query. I'm looking forward to your response.

Best regards,
Brando Miranda

tonysy · 2024-08-22T07:34:00Z

tonysy
Aug 22, 2024
Maintainer

The reason is simple: if one toolkit cannot satisfy our demands, we develop our own tool to facilitate our research and projects.

First, OpenCompass originated from our internal R&D needs. We surveyed the community's implementations last year and found that the existing open-source solutions could not meet those needs, so we developed our own toolkit. We released it to benefit the community, so that every researcher can choose the toolkit that suits their needs. Open-sourcing is a result rather than the reason. What we have released is not only software like OpenCompass, but also evaluation benchmarks and methods, such as MMBench, MathBench, CIBench, Prism, and others.

Some specific points include:

- We needed a high-efficiency toolkit for evaluating models larger than 70B parameters before the vLLM/LMDeploy release. Thus, we built our own system with task division and distributed evaluation capabilities. We also developed a task partitioner and runner to support this goal. https://opencompass.readthedocs.io/zh-cn/latest/user_guides/evaluation.html
- We required flexible prompt engineering support, such as few-shot examples, Chain-of-Thought (CoT), and retriever-based few-shot example construction, which LM-Harness could not satisfy. https://github.com/open-compass/opencompass/blob/main/opencompass/openicl/icl_retriever/init.py
- We needed an all-in-one evaluation system for subjective evaluation. Thus, we developed a subjective evaluation pipeline with a modular design that has high extensibility, allowing us to easily change different judgment LLMs and different judging methods. https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html
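
To make the "task division" point concrete, here is a minimal sketch of size-based partitioning: each (model, dataset) pair is split into shards that could be handed to separate workers. This is an illustration only and not OpenCompass's actual partitioner; `EvalTask`, `partition_tasks`, and the model and dataset names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalTask:
    """One independently runnable unit of work (hypothetical structure)."""
    model: str
    dataset: str
    start: int  # inclusive example index
    stop: int   # exclusive example index


def partition_tasks(models, dataset_sizes, max_examples_per_task):
    """Split every (model, dataset) pair into shards of bounded size."""
    if max_examples_per_task <= 0:
        raise ValueError("max_examples_per_task must be positive")
    tasks = []
    for model, (dataset, size) in product(models, dataset_sizes.items()):
        # Walk the dataset in fixed-size steps; the last shard may be smaller.
        for start in range(0, size, max_examples_per_task):
            stop = min(start + max_examples_per_task, size)
            tasks.append(EvalTask(model, dataset, start, stop))
    return tasks


# Placeholder names and sizes, chosen only for the example.
tasks = partition_tasks(
    models=["model-a", "model-b"],
    dataset_sizes={"dataset-x": 250, "dataset-y": 90},
    max_examples_per_task=100,
)
for task in tasks:
    print(task)
```

Because every shard is self-describing, a runner can dispatch shards to whatever workers are free and merge the per-shard results afterwards.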

Second, the comparison with cryptographic algorithms does not fit LLM evaluation. The basic idea of evaluation is simple: prompt an LLM to generate a response, then check the answer against a reference. Anyone can implement this quickly. Many implementation details in LM-Harness (such as the prompts) are suboptimal; for example, we could not reproduce the reported performance of Llama with LM-Harness. In the AI community, everyone has the right to implement an algorithm or a framework.
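
The basic loop described above (prompt, generate, compare against a reference) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the code of any framework: `toy_model`, `extract_answer`, and `evaluate` are hypothetical names, and the model is a stub that always returns the same string.

```python
import re


def extract_answer(response):
    """Return the last standalone A-D letter in the response, or None."""
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else None


def evaluate(generate, examples):
    """Score a model callable by exact match against reference answers."""
    correct = 0
    for example in examples:
        response = generate(example["prompt"])
        if extract_answer(response) == example["reference"]:
            correct += 1
    return correct / len(examples) if examples else 0.0


def toy_model(prompt):
    """Stand-in for a real LLM call; always gives the same reply."""
    return "The answer is B"


examples = [
    {"prompt": "Q1 ... Options: A, B, C, D", "reference": "B"},
    {"prompt": "Q2 ... Options: A, B, C, D", "reference": "C"},
]
accuracy = evaluate(toy_model, examples)
print(f"accuracy = {accuracy:.2f}")
```

Even in this toy version, the prompt wording and the answer-extraction rule are design choices, and changing either one can change the score a model receives.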

0 replies
