
Useful benchmarks that have human scores beyond AI SOTA. - Google Docs #954

Open
1 task
ShellLM opened this issue Nov 18, 2024 · 1 comment
Labels
ai-leaderboards: leaderboards for LLMs and other ML models
llm-benchmarks: testing and benchmarking large language models
llm-evaluation: evaluating large language model performance and behavior through human-written evaluation sets

Comments

ShellLM (Collaborator) commented Nov 18, 2024

Useful benchmarks that have human scores beyond AI SOTA

Full Content

Useful benchmarks that have human scores beyond AI SOTA.

There are a number of important real-world benchmarks where human performance surpasses the current state-of-the-art (SOTA) in AI:

  • SuperGLUE: A broad natural language understanding benchmark on which expert humans outperform the current SOTA AI models.
  • QuALITY: A reading comprehension dataset on which skilled annotators outperform the best AI systems.
  • BIG-bench: A diverse set of tasks that probe the capabilities of large language models, with many subtasks on which humans outperform AI.
  • HotpotQA: A challenging reading comprehension task on which human performance exceeds that of the best AI models.
  • SWAG: A commonsense reasoning task on which human performance is significantly higher than that of SOTA AI.
  • HellaSwag: An extension of SWAG with more challenging examples, on which humans again outclass AI.

These benchmarks suggest that there remain significant gaps between current AI capabilities and human-level performance on many real-world tasks. Closing these gaps will be an important area of research going forward.
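As a minimal illustration of how one of these benchmarks can be inspected programmatically, the sketch below loads HellaSwag from the Hugging Face Hub and scores a trivial always-pick-the-first-ending baseline against the gold labels. It assumes the `datasets` library is installed and that the public `Rowan/hellaswag` dataset exposes `ctx`, `endings`, and `label` fields on its validation split; adjust the dataset ID and field names if they differ in your environment.

```python
# Minimal sketch, assuming the Hugging Face `datasets` library and the
# public "Rowan/hellaswag" dataset with `ctx`, `endings`, and `label`
# fields; these identifiers may need adjusting.
from datasets import load_dataset

# The validation split carries gold labels; the test split does not.
ds = load_dataset("Rowan/hellaswag", split="validation")

# Peek at one example: a context plus four candidate endings.
ex = ds[0]
print(ex["ctx"])
for i, ending in enumerate(ex["endings"]):
    print(f"  [{i}] {ending}")
print("gold ending index:", ex["label"])

# Trivial baseline: always choose ending 0. A real evaluation would
# instead rank the endings by model log-likelihood and compare the
# resulting accuracy against the roughly 95% human accuracy reported
# in the HellaSwag paper.
correct = sum(int(row["label"]) == 0 for row in ds)
print(f"always-pick-first accuracy: {correct / len(ds):.3f}")
```

In practice these benchmarks are usually run through an evaluation harness such as EleutherAI's lm-evaluation-harness rather than hand-rolled loops, which keeps prompting and scoring comparable across models.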

Suggested labels

None

ShellLM added the ai-leaderboards, llm-benchmarks, and llm-evaluation labels on Nov 18, 2024
ShellLM (Collaborator, Author) commented Nov 18, 2024

Related content

#812 similarity score: 0.89
#940 similarity score: 0.87
#953 similarity score: 0.87
#810 similarity score: 0.86
#951 similarity score: 0.86
#706 similarity score: 0.86
