Prerequisite
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
OpenCompass 0.2.3
transformers 4.35.2
GPU A100
Reproduces the problem - code/configuration sample
None
Reproduces the problem - command or script
Reproduces the problem - error message
I reproduced most of the scores using the official configuration, but for a few subsets the gap from the leaderboard is large; please help me analyze it. (A sketch of the kind of configuration used for such a run is included after the questions below.)
There is a big difference between my scores on the subsets below and those on the leaderboard. I have the following questions:
1. The scores for the four L-Eval subsets - Coursera, NarrativeQA, NQ, and Topic Retrieval - differ significantly from the leaderboard, by -9.3, -9.42, -14.29, and 8 points respectively. Could you please tell me the reasons for these errors? What is an acceptable level of error? How can I reduce the error to make the results more accurate?
2. The scores for the four LongBench subsets - NarrativeQA, TREC, LSHT (zh), and Topic Retrieval - also differ significantly, by -10.94, 31.04, and 7.17 points. Could you please tell me the reasons for these errors? What is an acceptable level of error? How can I reduce the error to make the results more accurate?
3. For the TREC and LSHT (zh) subsets of LongBench, I found that the corresponding scores on the LongBench leaderboard (https://github.com/THUDM/LongBench/blob/main/README.md) are not significantly different from my results. Should I rely on the OpenCompass leaderboard or the LongBench leaderboard?
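For reference, below is a minimal sketch of the kind of OpenCompass eval config a LongBench reproduction run like this typically uses. The config file name and the import paths (`.datasets.longbench.longbench`, `.models.hf_internlm.hf_internlm_chat_7b`) are assumptions based on the OpenCompass repository layout, not the exact files behind the numbers above:

```python
# Hypothetical eval config, e.g. configs/eval_longbench.py (name is an assumption).
# The dataset/model import paths are assumptions based on the OpenCompass repo
# layout and should be replaced with whatever was actually used for the run.
from mmengine.config import read_base

with read_base():
    # LongBench dataset definitions shipped with OpenCompass.
    from .datasets.longbench.longbench import longbench_datasets
    # Example HuggingFace model config; swap in the model actually evaluated.
    from .models.hf_internlm.hf_internlm_chat_7b import models

datasets = longbench_datasets
```

Such a config would then be launched with something like `python run.py configs/eval_longbench.py -w outputs/longbench_repro`, where `-w` sets the work directory so the per-subset predictions can be inspected against the leaderboard numbers.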
Other information
No response