
Useful benchmarks that have human scores beyond AI SOTA. - Google Docs #954

Open
1 task
ShellLM opened this issue Nov 18, 2024 · 1 comment
Labels
ai-leaderboards: leaderboards for LLMs and other ML models
llm-benchmarks: testing and benchmarking large language models
llm-evaluation: evaluating large language model performance and behavior through human-written evaluation sets

Comments

ShellLM (Collaborator) commented Nov 18, 2024

Useful benchmarks that have human scores beyond AI SOTA

Full Content

Useful benchmarks that have human scores beyond AI SOTA.

There are a number of important real-world benchmarks where human performance surpasses the current state-of-the-art (SOTA) in AI:

  • SuperGLUE: A broad natural language understanding benchmark on which expert humans outperform the current SOTA AI models.
  • QuALITY: A reading comprehension dataset on which skilled annotators outperform the best AI systems.
  • BIG-bench: A diverse set of tasks that probe the capabilities of large language models, with many subtasks on which humans outperform AI.
  • HotpotQA: A challenging reading comprehension task on which human performance exceeds that of the best AI models.
  • SWAG: A commonsense reasoning task on which human performance is significantly higher than that of SOTA AI.
  • HellaSwag: An extension of SWAG with more challenging examples, on which humans again outclass AI.

These benchmarks suggest that there remain significant gaps between current AI capabilities and human-level performance on many real-world tasks. Closing these gaps will be an important area of research going forward.
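As a minimal illustration of how one of these benchmarks can be inspected programmatically, the sketch below loads HellaSwag from the Hugging Face Hub and scores a trivial always-pick-the-first-ending baseline against the gold labels. It assumes the `datasets` library is installed and that the public `Rowan/hellaswag` dataset exposes `ctx`, `endings`, and `label` fields on its validation split; adjust the dataset ID and field names if they differ in your environment.

```python
# Minimal sketch, assuming the Hugging Face `datasets` library and the
# public "Rowan/hellaswag" dataset with `ctx`, `endings`, and `label`
# fields; these identifiers may need adjusting.
from datasets import load_dataset

# The validation split carries gold labels; the test split does not.
ds = load_dataset("Rowan/hellaswag", split="validation")

# Peek at one example: a context plus four candidate endings.
ex = ds[0]
print(ex["ctx"])
for i, ending in enumerate(ex["endings"]):
    print(f"  [{i}] {ending}")
print("gold ending index:", ex["label"])

# Trivial baseline: always choose ending 0. A real evaluation would
# instead rank the endings by model log-likelihood and compare the
# resulting accuracy against the roughly 95% human accuracy reported
# in the HellaSwag paper.
correct = sum(int(row["label"]) == 0 for row in ds)
print(f"always-pick-first accuracy: {correct / len(ds):.3f}")
```

In practice these benchmarks are usually run through an evaluation harness such as EleutherAI's lm-evaluation-harness rather than hand-rolled loops, which keeps prompting and scoring comparable across models.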

Suggested labels

None

ShellLM added the ai-leaderboards, llm-benchmarks, and llm-evaluation labels on Nov 18, 2024
ShellLM (Collaborator, Author) commented Nov 18, 2024

Related content

#812 similarity score: 0.89
#940 similarity score: 0.87
#953 similarity score: 0.87
#810 similarity score: 0.86
#951 similarity score: 0.86
#706 similarity score: 0.86
