
[MODEL EVALUATION REQUEST] allenai/OLMo-2-1124-13B #658

Closed
8 tasks done
Rijgersberg opened this issue Dec 16, 2024 · 18 comments
Labels
large model (>8B) This model has more than 8B parameters, requiring more than an RTX 4090 GPU to evaluate. model evaluation request Request to evaluate a model and add it to the leaderboard(s)

Comments

@Rijgersberg

Model ID

allenai/OLMo-2-1124-13B

Model type

Decoder model (e.g., GPT)

Evaluation languages

  • Danish
  • Swedish
  • Norwegian (Bokmål or Nynorsk)
  • Icelandic
  • Faroese
  • German
  • Dutch
  • English

Merged model

Not a merged model

@Rijgersberg Rijgersberg added the model evaluation request Request to evaluate a model and add it to the leaderboard(s) label Dec 16, 2024
@saattrupdan saattrupdan added the large model (>8B) This model has more than 8B parameters, requiring more than an RTX 4090 GPU to evaluate. label Dec 16, 2024
@Mikeriess
Contributor

Tried to run this, but I'm getting the following error:

(app-root) scandeval --model allenai/OLMo-2-1124-13B --evaluate-test-split  --cache-dir cache/scandeval_cache
2025-01-07 08:59:22 ⋅ Benchmarking allenai/OLMo-2-1124-13B on the truncated version of the Swedish sentiment classification dataset SweReC
2025-01-07 08:59:22 ⋅ Loading model...
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 623/623 [00:00<00:00, 7.38MB/s]
2025-01-07 08:59:23 ⋅ The model 'allenai/OLMo-2-1124-13B' could not be loaded. The error was ValueError("Unrecognized configuration class <class 'transformers.models.olmo2.configuration_olmo2.Olmo2Config'> for this kind of AutoModel: AutoModelForSequenceClassification.\nModel type should be one of AlbertConfig, BartConfig, BertConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BloomConfig, CamembertConfig, CanineConfig, LlamaConfig, ConvBertConfig, CTRLConfig, Data2VecTextConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FlaubertConfig, FNetConfig, FunnelConfig, GemmaConfig, Gemma2Config, GlmConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTJConfig, IBertConfig, JambaConfig, JetMoeConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LiltConfig, LlamaConfig, LongformerConfig, LukeConfig, MarkupLMConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MvpConfig, NemotronConfig, NezhaConfig, NystromformerConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PerceiverConfig, PersimmonConfig, PhiConfig, Phi3Config, PhimoeConfig, PLBartConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, SqueezeBertConfig, StableLmConfig, Starcoder2Config, T5Config, TapasConfig, TransfoXLConfig, UMT5Config, XLMConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig, ZambaConfig.").

@saattrupdan
Member

@Mikeriess The reason for that is that they hadn't set the pipeline_tag in their model card, meaning that the framework thought it was an encoder model. I've now made the detection logic a lot more robust by using the architectures parameter in the model configuration instead. Install from source with the following command and try again:

pip uninstall -y -qqq scandeval && pip install -qqq scandeval[all]@git+https://github.com/ScandEval/ScandEval
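The detection idea described above (classify the model from the `architectures` list in its config rather than the model card's `pipeline_tag`) can be sketched roughly as follows. This is a minimal illustration, not ScandEval's actual code; the config dicts are illustrative stand-ins for what `config.json` contains.

```python
# Rough sketch (not ScandEval's implementation) of detecting whether a model
# is generative from the `architectures` field in its Hugging Face config.

def is_generative(config: dict) -> bool:
    """Guess the model type from the config's `architectures` list instead
    of relying on the (possibly missing) `pipeline_tag` in the model card."""
    architectures = config.get("architectures") or []
    return any(arch.endswith("ForCausalLM") for arch in architectures)

# Illustrative config fragments, mirroring the relevant part of config.json:
olmo2_config = {"model_type": "olmo2", "architectures": ["Olmo2ForCausalLM"]}
bert_config = {"model_type": "bert", "architectures": ["BertForMaskedLM"]}

print(is_generative(olmo2_config))  # True: treated as a decoder/generative model
print(is_generative(bert_config))   # False: treated as an encoder model
```

With this check, OLMo-2 is routed to `AutoModelForCausalLM` instead of `AutoModelForSequenceClassification`, which is what the original error complained about.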

@Mikeriess
Contributor

@saattrupdan Using your latest version (ScandEval 14.1.2.dev0), I unfortunately still get the same issue.

@saattrupdan
Member

@Mikeriess The "AutoModelForSequenceClassification" part of your error message should now have changed to "AutoModelForCausalLM", meaning that the framework detects it as a generative model. The last remaining step is updating vLLM manually: pip install -U vllm.

Currently I've set a hard upper bound of <4.5.0 on vLLM, since the newer versions are incredibly slow and take up a lot of memory (>50GB CPU RAM) on the NER task (due to the use of a newer Outlines version). But since you need the newer vLLM version to benchmark this specific model, you can update it manually. I'd advise downgrading again afterwards, however.
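A small sketch of the kind of pre-flight check implied above: verify the installed vLLM version before benchmarking, since (per this thread) OLMo-2 needs a newer vLLM than the pinned one. The `(0, 7, 0)` threshold is an assumption inferred from vllm-0.7.2 working later in this thread, not a documented minimum.

```python
# Hypothetical helper, not part of ScandEval: naive semantic-version check
# to confirm the locally installed vLLM is new enough for Olmo2 support.

def vllm_supports_olmo2(installed: str, minimum: tuple[int, int, int] = (0, 7, 0)) -> bool:
    """Compare a dotted version string against an assumed minimum version.

    The (0, 7, 0) default is a guess based on vllm-0.7.2 resolving the
    'Olmo2ForCausalLM not supported' error reported below.
    """
    parts = tuple(int(p) for p in installed.split(".")[:3])
    return parts >= minimum

print(vllm_supports_olmo2("0.7.2"))  # True: the version that worked in this thread
print(vllm_supports_olmo2("0.6.3"))  # False under this assumed threshold
```

In practice one would read the installed version via `importlib.metadata.version("vllm")` and warn before starting a long benchmark run.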

@Mikeriess
Contributor

Mikeriess commented Feb 10, 2025

Not supported yet, it seems:

The model 'allenai/OLMo-2-1124-13B' could not be loaded. The error was ValueError("Model architectures ['Olmo2ForCausalLM'] are not supported for now. Supported architectures: dict_keys(['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'InternLM2VEForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'FalconMambaForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Florence2ForConditionalGeneration', 'BertModel', 'RobertaModel', 'XLMRobertaModel', 'Gemma2Model', 'LlamaModel', 'MistralModel', 'Qwen2Model', 'Qwen2ForRewardModel', 'Qwen2ForSequenceClassification', 'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM', 'Qwen2VLForConditionalGeneration', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'H2OVLChatModel', 'InternVLChatModel', 'Idefics3ForConditionalGeneration', 'LlavaForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 
'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2AudioForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel'])").

@saattrupdan
Member

Not supported yet it seems: The model 'allenai/OLMo-2-1124-13B' could not be loaded. The error was ValueError("Model architectures ['Olmo2ForCausalLM'] are not supported for now. [...]")

Can you try manually updating vLLM and trying again?

@Mikeriess
Contributor

Updating to vllm-0.7.2 does fix this issue, and the model runs now 👍

@Mikeriess
Contributor

Here you go @saattrupdan - this was surprisingly slow considering its size :-)

{"dataset": "swerec", "task": "sentiment-classification", "dataset_languages": ["sv"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.7880261149342456, "macro_f1": 0.6913845695908963}, {"mcc": 0.8122867359769628, "macro_f1": 0.79640601638623}, {"mcc": 0.8132161020856932, "macro_f1": 0.7690083138726149}, {"mcc": 0.7791363811757237, "macro_f1": 0.721721752230165}, {"mcc": 0.7599527667440482, "macro_f1": 0.6736445170620661}, {"mcc": 0.787517019824895, "macro_f1": 0.769890905294444}, {"mcc": 0.7990225132199692, "macro_f1": 0.7171908103868921}, {"mcc": 0.8022230071528967, "macro_f1": 0.7844517329946209}, {"mcc": 0.7948244363293102, "macro_f1": 0.7716835924467517}, {"mcc": 0.7731771959087927, "macro_f1": 0.7715658875229415}], "total": {"test_mcc": 79.09382273352537, "test_mcc_se": 1.0524323410748568, "test_macro_f1": 74.66948097787622, "test_macro_f1_se": 2.6186736064381573}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "angry-tweets", "task": "sentiment-classification", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.22420266780580206, "macro_f1": 0.3057304019504858}, {"mcc": 0.15911987014233883, "macro_f1": 0.25519664437630263}, {"mcc": 0.1277489222126009, "macro_f1": 0.23912690852603521}, {"mcc": 0.19115058248153202, "macro_f1": 0.27508231832046054}, {"mcc": 0.20989234016792602, "macro_f1": 0.2875273931434639}, {"mcc": 0.10585848883614773, "macro_f1": 0.21003865612214204}, {"mcc": 0.16702875895784153, "macro_f1": 0.2663481491277874}, {"mcc": 0.1680888030641847, "macro_f1": 0.25573446474647765}, {"mcc": 0.15924297186350972, "macro_f1": 0.25602186270248295}, {"mcc": 0.230695170928256, "macro_f1": 0.32567764137748323}], "total": {"test_mcc": 17.430285764601397, "test_mcc_se": 2.4982028002949193, "test_macro_f1": 26.764844403931214, "test_macro_f1_se": 2.048831082811187}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "norec", "task": "sentiment-classification", "dataset_languages": ["nb", "nn", "no"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.2993471577781622, "macro_f1": 0.3640438813158194}, {"mcc": 0.2945988958001068, "macro_f1": 0.3681255761737792}, {"mcc": 0.28549757910671053, "macro_f1": 0.36999679381999373}, {"mcc": 0.31647510898050607, "macro_f1": 0.3773457952901085}, {"mcc": 0.2881823976329526, "macro_f1": 0.3736416243687049}, {"mcc": 0.27727506740832436, "macro_f1": 0.3621405616711035}, {"mcc": 0.2786875944628779, "macro_f1": 0.3565884835511834}, {"mcc": 0.3125438924817862, "macro_f1": 0.37335709818076507}, {"mcc": 0.29203626706498625, "macro_f1": 0.3644173676704581}, {"mcc": 0.30387319018699493, "macro_f1": 0.37172533956463516}], "total": {"test_mcc": 29.485171509034082, "test_mcc_se": 0.8244352167511184, "test_macro_f1": 36.8138252160655, "test_macro_f1_se": 0.3912382047884708}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hotter-and-colder-sentiment", "task": "sentiment-classification", "dataset_languages": ["is"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.11591847624758876, "macro_f1": 0.3024735316551202}, {"mcc": 0.3646154241638809, "macro_f1": 0.43875463077482796}, {"mcc": 0.19131307653171964, "macro_f1": 0.4092565604413502}, {"mcc": 0.19423325562809143, "macro_f1": 0.38981875553887874}, {"mcc": 0.15564576992615117, "macro_f1": 0.3328063299091688}, {"mcc": 0.3330021658211181, "macro_f1": 0.4263323497961447}, {"mcc": 0.3132497411627525, "macro_f1": 0.41958469472888327}, {"mcc": 0.33566394618318246, "macro_f1": 0.5007486079234429}, {"mcc": 0.17813744773479337, "macro_f1": 0.35534641931159844}, {"mcc": 0.3539033791132551, "macro_f1": 0.43835128859431416}], "total": {"test_mcc": 25.356826825125335, "test_mcc_se": 5.866641223303434, "test_macro_f1": 40.13473168673728, "test_macro_f1_se": 3.6017283793797454}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "sb10k", "task": "sentiment-classification", "dataset_languages": ["de"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.4019570289651387, "macro_f1": 0.5883952855294478}, {"mcc": 0.39285802210086856, "macro_f1": 0.5255286517487475}, {"mcc": 0.4534372811656576, "macro_f1": 0.6087539841437157}, {"mcc": 0.46953246081697236, "macro_f1": 0.6462224196787211}, {"mcc": 0.4388287916063435, "macro_f1": 0.6024852594339623}, {"mcc": 0.4555378990869158, "macro_f1": 0.6171731978719389}, {"mcc": 0.3846836293632327, "macro_f1": 0.5042864506249273}, {"mcc": 0.3740722419176343, "macro_f1": 0.536615451408188}, {"mcc": 0.3837433889332082, "macro_f1": 0.5149944739267122}, {"mcc": 0.5820334123120201, "macro_f1": 0.7203867819579225}], "total": {"test_mcc": 43.366841562679916, "test_mcc_se": 3.8821930641857176, "test_macro_f1": 58.648419563242825, "test_macro_f1_se": 4.19690889722785}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "dutch-social", "task": "sentiment-classification", "dataset_languages": ["nl"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.12550509736648535, "macro_f1": 0.33580287382094354}, {"mcc": 0.12664193884323657, "macro_f1": 0.32562160981120397}, {"mcc": 0.1137708742730661, "macro_f1": 0.31068543094834966}, {"mcc": 0.09898500508036555, "macro_f1": 0.2605144399958945}, {"mcc": 0.15880276616071617, "macro_f1": 0.3492035568669507}, {"mcc": 0.14241959051544525, "macro_f1": 0.3804151502599396}, {"mcc": 0.1278181320953084, "macro_f1": 0.3284075050971275}, {"mcc": 0.14662588485755707, "macro_f1": 0.34346217217908787}, {"mcc": 0.12084951547817083, "macro_f1": 0.33978976313103265}, {"mcc": 0.12675818058408475, "macro_f1": 0.2818823437441699}], "total": {"test_mcc": 12.88176985254436, "test_mcc_se": 1.0538904002648088, "test_macro_f1": 32.557848458547, "test_macro_f1_se": 2.1234860755585165}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "sst5", "task": "sentiment-classification", "dataset_languages": ["en"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.6792724358379988, "macro_f1": 0.5924662973323459}, {"mcc": 0.6938195975270671, "macro_f1": 0.5995668383152675}, {"mcc": 0.6751350785212092, "macro_f1": 0.5987606940257705}, {"mcc": 0.6803069372268117, "macro_f1": 0.5921520000497141}, {"mcc": 0.6603450937998095, "macro_f1": 0.588438436712307}, {"mcc": 0.6751740853981213, "macro_f1": 0.5849488549552827}, {"mcc": 0.6723234757439167, "macro_f1": 0.5937965581831627}, {"mcc": 0.673989088234116, "macro_f1": 0.5858155839693355}, {"mcc": 0.6866675853454899, "macro_f1": 0.5970211171246547}, {"mcc": 0.6910111794734349, "macro_f1": 0.598325982267779}], "total": {"test_mcc": 67.88044557107975, "test_mcc_se": 0.6098602204517334, "test_macro_f1": 59.312923629356206, "test_macro_f1_se": 0.33285425097860255}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "fosent", "task": "sentiment-classification", "dataset_languages": ["fo"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.35629696738765804, "macro_f1": 0.5386907365435339}, {"mcc": 0.33227629563993777, "macro_f1": 0.5278863711262849}, {"mcc": 0.2999012887032078, "macro_f1": 0.5097669253655073}, {"mcc": 0.297818185482545, "macro_f1": 0.48308951511370196}, {"mcc": 0.07199380450512194, "macro_f1": 0.32010892228523447}, {"mcc": 0.3240880363720776, "macro_f1": 0.5487656461753652}, {"mcc": 0.2764331550196552, "macro_f1": 0.4724709724176619}, {"mcc": 0.3440331629246512, "macro_f1": 0.535658426918723}, {"mcc": 0.26699568334890883, "macro_f1": 0.4228971976856939}, {"mcc": 0.365376380980719, "macro_f1": 0.4684548943418924}], "total": {"test_mcc": 29.352129603644823, "test_mcc_se": 5.235677847369033, "test_macro_f1": 48.277896079735996, "test_macro_f1_se": 4.304000773374505}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "allocine", "task": "sentiment-classification", "dataset_languages": ["fr"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.952058969424976, "macro_f1": 0.9760011574696852}, {"mcc": 0.9462535318595936, "macro_f1": 0.9731265336115157}, {"mcc": 0.9665130041270075, "macro_f1": 0.9832220295677032}, {"mcc": 0.9619323713351113, "macro_f1": 0.9809378295731417}, {"mcc": 0.9551358351165695, "macro_f1": 0.9775080887605383}, {"mcc": 0.9531491782051432, "macro_f1": 0.9765410002720465}, {"mcc": 0.9648624511791447, "macro_f1": 0.9824077653223382}, {"mcc": 0.9472680601439958, "macro_f1": 0.9735883807216577}, {"mcc": 0.9539876386942309, "macro_f1": 0.9768896652943785}, {"mcc": 0.9598973825553898, "macro_f1": 0.9798956810718729}], "total": {"test_mcc": 95.61058422641162, "test_mcc_se": 0.4328956257866944, "test_macro_f1": 97.80118131664878, "test_macro_f1_se": 0.21642298793256276}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "suc3", "task": "named-entity-recognition", "dataset_languages": ["sv"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.532789274264063, "micro_f1": 0.2783059636992221}, {"micro_f1_no_misc": 0.5617259288853377, "micro_f1": 0.45859872611464975}, {"micro_f1_no_misc": 0.54634487583071, "micro_f1": 0.3110450718177315}, {"micro_f1_no_misc": 0.5806861499364676, "micro_f1": 0.47419668938656273}, {"micro_f1_no_misc": 0.5697909642616318, "micro_f1": 0.5187772925764191}, {"micro_f1_no_misc": 0.4376427592507995, "micro_f1": 0.3519588953114965}, {"micro_f1_no_misc": 0.5240014657383657, "micro_f1": 0.4594968205695328}, {"micro_f1_no_misc": 0.5069972011195523, "micro_f1": 0.4157509157509158}, {"micro_f1_no_misc": 0.5302571860816944, "micro_f1": 0.3912696300239553}, {"micro_f1_no_misc": 0.5595925297113752, "micro_f1": 0.44396240057845265}], "total": {"test_micro_f1_no_misc": 53.49828335079998, "test_micro_f1_no_misc_se": 2.5435018469595203, "test_micro_f1": 41.033624058289384, "test_micro_f1_se": 4.747619479888867}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "dansk", "task": "named-entity-recognition", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.4097723486951693, "micro_f1": 0.3449261553120534}, {"micro_f1_no_misc": 0.42803504380475593, "micro_f1": 0.25421530479896237}, {"micro_f1_no_misc": 0.413696715583508, "micro_f1": 0.2956989247311828}, {"micro_f1_no_misc": 0.425, "micro_f1": 0.31362467866323906}, {"micro_f1_no_misc": 0.4244120940649496, "micro_f1": 0.31651376146788995}, {"micro_f1_no_misc": 0.40781563126252507, "micro_f1": 0.2809022556390977}, {"micro_f1_no_misc": 0.39930354033662213, "micro_f1": 0.33423423423423426}, {"micro_f1_no_misc": 0.40566645202833224, "micro_f1": 0.23726817780394466}, {"micro_f1_no_misc": 0.4139749505603164, "micro_f1": 0.28264094955489616}, {"micro_f1_no_misc": 0.41976679622431984, "micro_f1": 0.25548245614035087}], "total": {"test_micro_f1_no_misc": 41.474435725604984, "test_micro_f1_no_misc_se": 0.5828068151359962, "test_micro_f1": 29.155068983458516, "test_micro_f1_se": 2.2283035765105876}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "norne-nb", "task": "named-entity-recognition", "dataset_languages": ["nb", "no"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.5728177818939569, "micro_f1": 0.48671625929861845}, {"micro_f1_no_misc": 0.5967199327165685, "micro_f1": 0.4376470588235294}, {"micro_f1_no_misc": 0.5730093817656964, "micro_f1": 0.4092087208415917}, {"micro_f1_no_misc": 0.5379023883696781, "micro_f1": 0.49918128654970767}, {"micro_f1_no_misc": 0.5949838589520735, "micro_f1": 0.5083951581413511}, {"micro_f1_no_misc": 0.5552813425468904, "micro_f1": 0.5227323628219485}, {"micro_f1_no_misc": 0.5323061630218687, "micro_f1": 0.3606194690265487}, {"micro_f1_no_misc": 0.530373831775701, "micro_f1": 0.47389240506329117}, {"micro_f1_no_misc": 0.6093003042155584, "micro_f1": 0.5768396133359637}, {"micro_f1_no_misc": 0.6044191019244475, "micro_f1": 0.5108614232209737}], "total": {"test_micro_f1_no_misc": 57.0711408718244, "test_micro_f1_no_misc_se": 1.8841872649347355, "test_micro_f1": 47.86093757123524, "test_micro_f1_se": 3.8347138420153493}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "norne-nn", "task": "named-entity-recognition", "dataset_languages": ["nn"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.5004681647940075, "micro_f1": 0.4434544208361892}, {"micro_f1_no_misc": 0.5560296429373456, "micro_f1": 0.44632681349145237}, {"micro_f1_no_misc": 0.5844129138391939, "micro_f1": 0.3904303859908744}, {"micro_f1_no_misc": 0.48081368469717983, "micro_f1": 0.45165341146923405}, {"micro_f1_no_misc": 0.4864226682408501, "micro_f1": 0.4482905982905982}, {"micro_f1_no_misc": 0.5477001703577512, "micro_f1": 0.48822195079392783}, {"micro_f1_no_misc": 0.5490020183897735, "micro_f1": 0.5091743119266056}, {"micro_f1_no_misc": 0.48958333333333337, "micro_f1": 0.4182555780933063}, {"micro_f1_no_misc": 0.5907290015847861, "micro_f1": 0.5173683368016646}, {"micro_f1_no_misc": 0.48387915714648383, "micro_f1": 0.4462422823078561}], "total": {"test_micro_f1_no_misc": 52.69040755320705, "test_micro_f1_no_misc_se": 2.6823204690367803, "test_micro_f1": 45.594180900017086, "test_micro_f1_se": 2.42529433277038}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mim-gold-ner", "task": "named-entity-recognition", "dataset_languages": ["is"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.29452243325075694, "micro_f1": 0.23855380976516305}, {"micro_f1_no_misc": 0.37039540385265296, "micro_f1": 0.23394948497248483}, {"micro_f1_no_misc": 0.04589261128958238, "micro_f1": 0.04346138921226233}, {"micro_f1_no_misc": 0.36711059640995947, "micro_f1": 0.3185319438151337}, {"micro_f1_no_misc": 0.28650818153751156, "micro_f1": 0.2780612244897959}, {"micro_f1_no_misc": 0.1896361631753032, "micro_f1": 0.18151905675898}, {"micro_f1_no_misc": 0.21932681867535286, "micro_f1": 0.2357100766057749}, {"micro_f1_no_misc": 0.3620196887900921, "micro_f1": 0.3216655227636696}, {"micro_f1_no_misc": 0.4091499870028594, "micro_f1": 0.31936619718309855}, {"micro_f1_no_misc": 0.3647214854111406, "micro_f1": 0.2879464285714286}], "total": {"test_micro_f1_no_misc": 29.092833693952112, "test_micro_f1_no_misc_se": 6.915969637150435, "test_micro_f1": 24.58765134137791, "test_micro_f1_se": 5.252826396554767}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "fone", "task": "named-entity-recognition", "dataset_languages": ["fo"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.4841592201462226, "micro_f1": 0.4819718006673125}, {"micro_f1_no_misc": 0.522633744855967, "micro_f1": 0.5256721138639957}, {"micro_f1_no_misc": 0.5066908596053527, "micro_f1": 0.517649398428955}, {"micro_f1_no_misc": 0.44871641791044775, "micro_f1": 0.4409153679226672}, {"micro_f1_no_misc": 0.5272496831432193, "micro_f1": 0.5228262082821744}, {"micro_f1_no_misc": 0.5709950140906135, "micro_f1": 0.5574250182882224}, {"micro_f1_no_misc": 0.41649694501018325, "micro_f1": 0.4404660414890594}, {"micro_f1_no_misc": 0.4719749216300941, "micro_f1": 0.4677454153182309}, {"micro_f1_no_misc": 0.4343382728666326, "micro_f1": 0.4458504692607824}, {"micro_f1_no_misc": 0.42527842527842524, "micro_f1": 0.44183969636079473}], "total": {"test_micro_f1_no_misc": 48.085335045371586, "test_micro_f1_no_misc_se": 3.1474008241332014, "test_micro_f1": 48.42361529882194, "test_micro_f1_se": 2.695321994552217}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "germeval", "task": "named-entity-recognition", "dataset_languages": ["de"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.5438543247344462, "micro_f1": 0.39153240914200066}, {"micro_f1_no_misc": 0.5372148859543817, "micro_f1": 0.4206773618538324}, {"micro_f1_no_misc": 0.5373635600335853, "micro_f1": 0.39039139929447336}, {"micro_f1_no_misc": 0.5184942716857611, "micro_f1": 0.3592923581980667}, {"micro_f1_no_misc": 0.4257352941176471, "micro_f1": 0.3343694493783304}, {"micro_f1_no_misc": 0.5956850658447743, "micro_f1": 0.4141210901774282}, {"micro_f1_no_misc": 0.5245799626633479, "micro_f1": 0.3898204774117183}, {"micro_f1_no_misc": 0.5095158597662771, "micro_f1": 0.3711934156378601}, {"micro_f1_no_misc": 0.5384193194291987, "micro_f1": 0.3829939264022866}, {"micro_f1_no_misc": 0.5390251672507167, "micro_f1": 0.4624247185126997}], "total": {"test_micro_f1_no_misc": 52.698877114801356, "test_micro_f1_no_misc_se": 2.623676191365399, "test_micro_f1": 39.16816606008696, "test_micro_f1_se": 2.184812639756655}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "conll-nl", "task": "named-entity-recognition", "dataset_languages": ["nl"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.4791800128122998, "micro_f1": 0.4048865619546248}, {"micro_f1_no_misc": 0.5144876325088339, "micro_f1": 0.4819576939029448}, {"micro_f1_no_misc": 0.5745494265428727, "micro_f1": 0.5426187885382662}, {"micro_f1_no_misc": 0.5358730158730158, "micro_f1": 0.5065710872162486}, {"micro_f1_no_misc": 0.5411764705882354, "micro_f1": 0.4591885441527447}, {"micro_f1_no_misc": 0.5350819672131147, "micro_f1": 0.42690909090909096}, {"micro_f1_no_misc": 0.5589980224126565, "micro_f1": 0.4854881266490765}, {"micro_f1_no_misc": 0.5063829787234043, "micro_f1": 0.4647420728821581}, {"micro_f1_no_misc": 0.6053534660260811, "micro_f1": 0.4966261808367071}, {"micro_f1_no_misc": 0.5459811730629978, "micro_f1": 0.48305905130687316}], "total": {"test_micro_f1_no_misc": 53.97064165763512, "test_micro_f1_no_misc_se": 2.2042845811317147, "test_micro_f1": 47.52047198348735, "test_micro_f1_se": 2.4308497622608396}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "conll-en", "task": "named-entity-recognition", "dataset_languages": ["en"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.5498588899341487, "micro_f1": 0.47207646176911544}, {"micro_f1_no_misc": 0.5296183563568251, "micro_f1": 0.47555698028530213}, {"micro_f1_no_misc": 0.6033816425120774, "micro_f1": 0.5098455756189232}, {"micro_f1_no_misc": 0.5361842105263157, "micro_f1": 0.4514361467079099}, {"micro_f1_no_misc": 0.607041775161327, "micro_f1": 0.5330668604651163}, {"micro_f1_no_misc": 0.5656253711842262, "micro_f1": 0.4938182474505911}, {"micro_f1_no_misc": 0.48067661224010333, "micro_f1": 0.44843711543194686}, {"micro_f1_no_misc": 0.6240780664926812, "micro_f1": 0.5595992059740996}, {"micro_f1_no_misc": 0.5742597898758357, "micro_f1": 0.5108347465611457}, {"micro_f1_no_misc": 0.525583588776232, "micro_f1": 0.47101068488351727}], "total": {"test_micro_f1_no_misc": 55.963083030597716, "test_micro_f1_no_misc_se": 2.733424014616663, "test_micro_f1": 49.25682025147668, "test_micro_f1_se": 2.223297473982903}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "eltec", "task": "named-entity-recognition", "dataset_languages": ["fr"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"micro_f1_no_misc": 0.4627726247179744, "micro_f1": 0.44471038774533267}, {"micro_f1_no_misc": 0.4877293831527747, "micro_f1": 0.4737388416026572}, {"micro_f1_no_misc": 0.5491916859122402, "micro_f1": 0.5071002263840296}, {"micro_f1_no_misc": 0.5417534119824197, "micro_f1": 0.4877350776778414}, {"micro_f1_no_misc": 0.5285464971328847, "micro_f1": 0.5072531586335984}, {"micro_f1_no_misc": 0.5304761904761905, "micro_f1": 0.49495599914144667}, {"micro_f1_no_misc": 0.5234778701415411, "micro_f1": 0.49038461538461536}, {"micro_f1_no_misc": 0.528767761472164, "micro_f1": 0.49119793902962644}, {"micro_f1_no_misc": 0.45158544371142795, "micro_f1": 0.4209279368213228}, {"micro_f1_no_misc": 0.4998899889988999, "micro_f1": 0.44625592417061605}], "total": {"test_micro_f1_no_misc": 51.04190857698517, "test_micro_f1_no_misc_se": 2.0743223176419123, "test_micro_f1": 47.642601065910874, "test_micro_f1_se": 1.8221911132897006}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-sv", "task": "linguistic-acceptability", "dataset_languages": ["sv"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.06666351608119601, "macro_f1": 0.438012841392296}, {"mcc": 0.06591863197877496, "macro_f1": 0.35731669904898256}, {"mcc": 0.10250779068072237, "macro_f1": 0.437302986048832}, {"mcc": 0.08494078505191308, "macro_f1": 0.4051173478084737}, {"mcc": 0.10535753440800759, "macro_f1": 0.48124620324568257}, {"mcc": 0.13282857790119534, "macro_f1": 0.4982074532524091}, {"mcc": 0.1485000318949025, "macro_f1": 0.557837549264371}, {"mcc": 0.11016611754427344, "macro_f1": 0.45170699491845684}, {"mcc": 0.21136741275119797, "macro_f1": 0.5821473347989257}, {"mcc": 0.16107141116336404, "macro_f1": 0.4674582022464371}], "total": {"test_mcc": 11.893218094555474, "test_mcc_se": 2.818896340349019, "test_macro_f1": 46.76353612024866, "test_macro_f1_se": 4.157753727006008}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-da", "task": "linguistic-acceptability", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.0, "macro_f1": 0.3382875605815832}, {"mcc": 0.08931727387017328, "macro_f1": 0.4350513819279191}, {"mcc": 4.320060216252923e-05, "macro_f1": 0.33528075300227195}, {"mcc": 0.052510811398618754, "macro_f1": 0.3463447630556903}, {"mcc": 0.0, "macro_f1": 0.33093760209082}, {"mcc": 0.060597232990000374, "macro_f1": 0.3600888765449243}, {"mcc": 0.12215785102772875, "macro_f1": 0.4491496877677604}, {"mcc": 0.04887377472137856, "macro_f1": 0.3485606466253602}, {"mcc": 0.044119563480727805, "macro_f1": 0.3491743858236005}, {"mcc": 0.011650004552213033, "macro_f1": 0.328644368712882}], "total": {"test_mcc": 4.29269712643003, "test_mcc_se": 2.561631898413032, "test_macro_f1": 36.21520026132812, "test_macro_f1_se": 2.6841475862943898}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-nb", "task": "linguistic-acceptability", "dataset_languages": ["nb", "no"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.0, "macro_f1": 0.32830436208592984}, {"mcc": 0.038639465556672824, "macro_f1": 0.3385544921752829}, {"mcc": 0.0, "macro_f1": 0.33202870189171557}, {"mcc": 0.038639465556672824, "macro_f1": 0.3385544921752829}, {"mcc": 0.0, "macro_f1": 0.32653732324893125}, {"mcc": 0.0008470729301956575, "macro_f1": 0.3394284046707548}, {"mcc": 0.0, "macro_f1": 0.33571196886149857}, {"mcc": 0.0, "macro_f1": 0.33506493506493507}, {"mcc": 0.0, "macro_f1": 0.3322464949462015}, {"mcc": 0.0, "macro_f1": 0.3331162487788994}], "total": {"test_mcc": 0.7812600404354131, "test_mcc_se": 1.0071439818458945, "test_macro_f1": 33.39547423899431, "test_macro_f1_se": 0.2700665886142132}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-nn", "task": "linguistic-acceptability", "dataset_languages": ["nn"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.0, "macro_f1": 0.33159268929503916}, {"mcc": 0.0, "macro_f1": 0.32275132275132273}, {"mcc": 0.0, "macro_f1": 0.3324641460234681}, {"mcc": 0.0, "macro_f1": 0.3247609627431586}, {"mcc": 0.0, "macro_f1": 0.3342002600780234}, {"mcc": 0.0, "macro_f1": 0.32520593080724874}, {"mcc": 0.0, "macro_f1": 0.32940406024885394}, {"mcc": 0.0, "macro_f1": 0.3326816552623004}, {"mcc": 0.0, "macro_f1": 0.33202870189171557}, {"mcc": 0.0, "macro_f1": 0.33181076672104404}], "total": {"test_mcc": 0.0, "test_mcc_se": 0.0, "test_macro_f1": 32.969004958221745, "test_macro_f1_se": 0.24723727439354848}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-is", "task": "linguistic-acceptability", "dataset_languages": ["is"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.023096096496170133, "macro_f1": 0.34408844410678263}, {"mcc": 0.0, "macro_f1": 0.33050016345210853}, {"mcc": -4.320060216252923e-05, "macro_f1": 0.3348422325751764}, {"mcc": 0.0, "macro_f1": 0.3380736910148675}, {"mcc": 0.022804323943903155, "macro_f1": 0.34132287274213075}, {"mcc": 0.0, "macro_f1": 0.3247609627431586}, {"mcc": 0.0, "macro_f1": 0.3372168284789644}, {"mcc": 0.0, "macro_f1": 0.34169077467052394}, {"mcc": 0.0, "macro_f1": 0.33289902280130296}, {"mcc": 0.0, "macro_f1": 0.32830436208592984}], "total": {"test_mcc": 0.45857219837910757, "test_mcc_se": 0.5999223395702318, "test_macro_f1": 33.53699354670946, "test_macro_f1_se": 0.38910119999158166}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-fo", "task": "linguistic-acceptability", "dataset_languages": ["fo"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.0, "macro_f1": 0.330718954248366}, {"mcc": 0.0, "macro_f1": 0.3342002600780234}, {"mcc": 0.0, "macro_f1": 0.3236459709379128}, {"mcc": 0.0, "macro_f1": 0.3324641460234681}, {"mcc": -0.0399939795119836, "macro_f1": 0.34028439927225124}, {"mcc": -0.00017331214881234007, "macro_f1": 0.3397011188246873}, {"mcc": 0.0, "macro_f1": 0.35066582117945466}, {"mcc": 0.0, "macro_f1": 0.3333333333333333}, {"mcc": 0.0, "macro_f1": 0.3406310367031552}, {"mcc": -0.002509662664798949, "macro_f1": 0.322107401357048}], "total": {"test_mcc": -0.42676954325594885, "test_mcc_se": 0.7795541460397359, "test_macro_f1": 33.477524419577, "test_macro_f1_se": 0.5267159861802404}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-de", "task": "linguistic-acceptability", "dataset_languages": ["de"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.34703625903850277, "macro_f1": 0.6608848785201116}, {"mcc": 0.24903308286792333, "macro_f1": 0.504478920447806}, {"mcc": 0.2005426008220316, "macro_f1": 0.4504830917874396}, {"mcc": 0.32680018733269883, "macro_f1": 0.6501652884398318}, {"mcc": 0.23973244644436742, "macro_f1": 0.5474406692118869}, {"mcc": 0.2420143231409706, "macro_f1": 0.5287763364560583}, {"mcc": 0.35004081876934684, "macro_f1": 0.6662088971870322}, {"mcc": 0.3135049753372037, "macro_f1": 0.6331557028346245}, {"mcc": 0.31132796721796235, "macro_f1": 0.6388631629953537}, {"mcc": 0.32283611450009064, "macro_f1": 0.6523921723147289}], "total": {"test_mcc": 29.028687754710987, "test_mcc_se": 3.254059709652503, "test_macro_f1": 59.328491201948744, "test_macro_f1_se": 4.837565201400746}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-nl", "task": "linguistic-acceptability", "dataset_languages": ["nl"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.10921650548770742, "macro_f1": 0.3742973376923182}, {"mcc": 0.168760523716376, "macro_f1": 0.4628283741771717}, {"mcc": 0.06768030806236297, "macro_f1": 0.3704968147492664}, {"mcc": 0.12472148420189853, "macro_f1": 0.3990467923664984}, {"mcc": 0.1021897434316433, "macro_f1": 0.37252094364481564}, {"mcc": 0.12773259507711435, "macro_f1": 0.4399212908536019}, {"mcc": 0.20058267572995545, "macro_f1": 0.4880182443387318}, {"mcc": 0.16253017338411396, "macro_f1": 0.45488770850806076}, {"mcc": 0.10982180011751214, "macro_f1": 0.37025764069739664}, {"mcc": 0.16658205219842465, "macro_f1": 0.4564667332302027}], "total": {"test_mcc": 13.398178614071089, "test_mcc_se": 2.4664070117106673, "test_macro_f1": 41.88741880258064, "test_macro_f1_se": 2.8552092932273574}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-en", "task": "linguistic-acceptability", "dataset_languages": ["en"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.5274385989834627, "macro_f1": 0.7573102035035796}, {"mcc": 0.525156700937606, "macro_f1": 0.755067929418342}, {"mcc": 0.5344203771571517, "macro_f1": 0.7659746387101525}, {"mcc": 0.5181792370553321, "macro_f1": 0.7439150700991153}, {"mcc": 0.45990389658536535, "macro_f1": 0.7279903224974225}, {"mcc": 0.4974656417306111, "macro_f1": 0.7434271178011973}, {"mcc": 0.5012860881794868, "macro_f1": 0.7393014276518322}, {"mcc": 0.48481170001501933, "macro_f1": 0.721623739817583}, {"mcc": 0.4779506135695645, "macro_f1": 0.727513844444853}, {"mcc": 0.4794503889272548, "macro_f1": 0.7390896077870026}], "total": {"test_mcc": 50.06063243140855, "test_mcc_se": 1.5532617634356813, "test_macro_f1": 74.21213901731079, "test_macro_f1_se": 0.8820187857745178}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scala-fr", "task": "linguistic-acceptability", "dataset_languages": ["fr"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.11352437893075362, "macro_f1": 0.3763392746165511}, {"mcc": 0.08249405191601564, "macro_f1": 0.35452727261469114}, {"mcc": 0.24629377597547303, "macro_f1": 0.5144798622813765}, {"mcc": 0.27972625394186246, "macro_f1": 0.5256627130622631}, {"mcc": 0.20145756382539864, "macro_f1": 0.4616185661190694}, {"mcc": 0.1439368430285189, "macro_f1": 0.4158022711992816}, {"mcc": 0.06586050086687738, "macro_f1": 0.3460359439802849}, {"mcc": 0.04515410397095211, "macro_f1": 0.34226937819208564}, {"mcc": 0.046783558457289764, "macro_f1": 0.3399877214828282}, {"mcc": 0.17332852311258073, "macro_f1": 0.47185960921881165}], "total": {"test_mcc": 13.985595540257222, "test_mcc_se": 5.182153084840724, "test_macro_f1": 41.48582612767243, "test_macro_f1_se": 4.544000564355944}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scandiqa-da", "task": "reading-comprehension", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 65.59771147689804, "em": 59.09090909090909}, {"f1": 64.35982702815626, "em": 58.45337376800607}, {"f1": 63.67713966936721, "em": 57.65345765345765}, {"f1": 64.51354947558487, "em": 58.371385083713854}, {"f1": 65.12947458908727, "em": 59.98457979953739}, {"f1": 63.693770157398355, "em": 58.16793893129771}, {"f1": 63.27905470017736, "em": 56.864274570982836}, {"f1": 63.29703010316654, "em": 58.490566037735846}, {"f1": 64.81504487077002, "em": 59.28792569659443}, {"f1": 65.59127909976431, "em": 58.48765432098765}], "total": {"test_f1": 64.39538811703702, "test_f1_se": 0.5492158508677327, "test_em": 58.48520649532226, "test_em_se": 0.5345066323208181}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "norquad", "task": "reading-comprehension", "dataset_languages": ["nb", "nn", "no"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 75.31766963159858, "em": 51.96078431372549}, {"f1": 68.63537759274101, "em": 42.764227642276424}, {"f1": 75.29051651579758, "em": 53.288925895087424}, {"f1": 70.41922157870759, "em": 45.95257563368765}, {"f1": 76.66410981654562, "em": 54.92142266335815}, {"f1": 70.12878308805715, "em": 45.02057613168724}, {"f1": 69.34493112291891, "em": 46.134453781512605}, {"f1": 75.17320944792381, "em": 53.04054054054054}, {"f1": 70.12322271841352, "em": 45.59800664451827}, {"f1": 71.02563629817185, "em": 45.115894039735096}], "total": {"test_f1": 72.21226778108756, "test_f1_se": 1.871421970522739, "test_em": 48.379740728612894, "test_em_se": 2.721992312952107}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "scandiqa-sv", "task": "reading-comprehension", "dataset_languages": ["sv"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 63.35934337003853, "em": 57.57575757575758}, {"f1": 64.04491188842074, "em": 57.619408642911296}, {"f1": 64.08818223434974, "em": 58.74125874125874}, {"f1": 65.10894764252266, "em": 59.28462709284627}, {"f1": 64.50782745104476, "em": 58.44255975327679}, {"f1": 62.61483312875318, "em": 56.48854961832061}, {"f1": 64.15884352125852, "em": 57.332293291731666}, {"f1": 65.35256842389991, "em": 58.9622641509434}, {"f1": 63.77971057150621, "em": 57.43034055727554}, {"f1": 64.26031527679127, "em": 57.56172839506173}], "total": {"test_f1": 64.12754835085858, "test_f1_se": 0.4904289746931691, "test_em": 57.94387878193837, "test_em_se": 0.5411945577537112}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "nqii", "task": "reading-comprehension", "dataset_languages": ["is"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 55.15029853330791, "em": 28.923076923076923}, {"f1": 50.556100554579444, "em": 21.339563862928348}, {"f1": 50.574950647294344, "em": 30.218068535825545}, {"f1": 53.81049343029494, "em": 30.79222720478326}, {"f1": 53.09347521742088, "em": 27.899686520376175}, {"f1": 54.399006278202464, "em": 28.505392912172574}, {"f1": 56.295883250887954, "em": 31.076923076923077}, {"f1": 52.03898396740549, "em": 29.141104294478527}, {"f1": 46.8911609624905, "em": 20.783132530120483}, {"f1": 55.420001079430605, "em": 31.325301204819276}], "total": {"test_f1": 52.82303539213145, "test_f1_se": 1.7742264987808445, "test_em": 28.000447706550414, "test_em_se": 2.3746280840656517}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "foqa", "task": "reading-comprehension", "dataset_languages": ["fo"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 60.7074023141971, "em": 44.85294117647059}, {"f1": 57.88665898028273, "em": 42.997542997543}, {"f1": 66.07104163949565, "em": 48.39506172839506}, {"f1": 59.92603843028599, "em": 44.41747572815534}, {"f1": 59.23372410555038, "em": 44.306930693069305}, {"f1": 59.5350197422772, "em": 43.95061728395062}, {"f1": 60.38448669341345, "em": 44.339622641509436}, {"f1": 63.441032301558174, "em": 46.0559796437659}, {"f1": 61.674234573129006, "em": 47.63092269326683}, {"f1": 61.80056244574895, "em": 44.96314496314496}], "total": {"test_f1": 61.06602012259386, "test_f1_se": 1.4506745714153586, "test_em": 45.191023954927104, "test_em_se": 1.045092193300479}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "germanquad", "task": "reading-comprehension", "dataset_languages": ["de"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 68.34894014772323, "em": 39.92424242424242}, {"f1": 69.31371335422416, "em": 39.120545868081884}, {"f1": 69.05658988504432, "em": 38.6946386946387}, {"f1": 68.95846205578441, "em": 40.334855403348556}, {"f1": 68.7397941872209, "em": 39.55281418658443}, {"f1": 70.20515544937014, "em": 40.38167938931298}, {"f1": 66.02506573401126, "em": 36.661466458658346}, {"f1": 69.23865208564013, "em": 38.522012578616355}, {"f1": 67.69546799776792, "em": 35.526315789473685}, {"f1": 66.28413868266294, "em": 38.19444444444444}], "total": {"test_f1": 68.38659795794493, "test_f1_se": 0.8333420722505627, "test_em": 38.69130152374019, "test_em_se": 0.9782157918277787}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "squad", "task": "reading-comprehension", "dataset_languages": ["en"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 91.45167113274051, "em": 83.25757575757575}, {"f1": 90.59002936879004, "em": 81.88021228203185}, {"f1": 91.91701489204031, "em": 84.30458430458431}, {"f1": 91.06903686561782, "em": 82.49619482496195}, {"f1": 91.1621898351397, "em": 81.8041634541249}, {"f1": 90.17356568805138, "em": 80.99236641221374}, {"f1": 91.57828807140301, "em": 82.8393135725429}, {"f1": 90.01032442137473, "em": 80.34591194968553}, {"f1": 90.14983623746723, "em": 81.26934984520123}, {"f1": 90.44142952595796, "em": 81.17283950617283}], "total": {"test_f1": 90.85433860385827, "test_f1_se": 0.41647354121041297, "test_em": 82.0362511909095, "test_em_se": 0.7406679128156624}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "squad-nl", "task": "reading-comprehension", "dataset_languages": ["nl"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 81.87919482196168, "em": 64.46969696969697}, {"f1": 82.5288472783838, "em": 67.85443517816528}, {"f1": 83.13994419911869, "em": 69.61926961926962}, {"f1": 81.54544730502701, "em": 67.8082191780822}, {"f1": 82.63473674906946, "em": 66.22976098689283}, {"f1": 81.76529941786606, "em": 64.27480916030534}, {"f1": 82.00795825364331, "em": 68.17472698907956}, {"f1": 83.12810202644194, "em": 68.78930817610063}, {"f1": 84.22111160975781, "em": 71.74922600619195}, {"f1": 82.0203506196513, "em": 64.50617283950618}], "total": {"test_f1": 82.48709922809209, "test_f1_se": 0.5099395881423843, "test_em": 67.34756251032907, "test_em_se": 1.5296115589873027}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "fquad", "task": "reading-comprehension", "dataset_languages": ["fr"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"f1": 76.27314289385757, "em": 47.72727272727273}, {"f1": 76.32876409141184, "em": 48.74905231235785}, {"f1": 77.13908052323784, "em": 48.562548562548564}, {"f1": 75.99346341099587, "em": 47.1841704718417}, {"f1": 74.31901156361167, "em": 47.72552043176561}, {"f1": 76.93651740225175, "em": 48.16793893129771}, {"f1": 75.37876917620686, "em": 48.361934477379094}, {"f1": 76.03249433317211, "em": 48.42767295597484}, {"f1": 74.161874581732, "em": 46.98142414860681}, {"f1": 77.6297546431045, "em": 48.842592592592595}], "total": {"test_f1": 76.01928726195821, "test_f1_se": 0.7036947724691714, "test_em": 48.07301276116375, "test_em_se": 0.3989664340036843}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "nordjylland-news", "task": "summarization", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.6581600227800664, "rouge_l": 0.2100880710239491}, {"bertscore": 0.6693902244005585, "rouge_l": 0.22203532272036286}, {"bertscore": 0.668086098463391, "rouge_l": 0.22447903883063666}, {"bertscore": 0.6632945349847432, "rouge_l": 0.22129677355831956}, {"bertscore": 0.6685347793536494, "rouge_l": 0.22512214405005776}, {"bertscore": 0.6641817572090076, "rouge_l": 0.21693695270409113}, {"bertscore": 0.67055610002717, "rouge_l": 0.22772240757780998}, {"bertscore": 0.6642858694831375, "rouge_l": 0.2190773044675578}, {"bertscore": 0.6610970982728759, "rouge_l": 0.2084207882902689}, {"bertscore": 0.669665408480796, "rouge_l": 0.22565825315560764}], "total": {"test_bertscore": 66.57251893455395, "test_bertscore_se": 0.2570317225739608, "test_rouge_l": 22.00837056378661, "test_rouge_l_se": 0.406046406812183}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mlsum", "task": "summarization", "dataset_languages": ["de"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.6687178010470234, "rouge_l": 0.22875124037424796}, {"bertscore": 0.6681891617481597, "rouge_l": 0.2369135429102155}, {"bertscore": 0.6680283195892116, "rouge_l": 0.24664157775878381}, {"bertscore": 0.6651630124397343, "rouge_l": 0.23564448464329624}, {"bertscore": 0.6895833615126321, "rouge_l": 0.27944855991678846}, {"bertscore": 0.6644965206069173, "rouge_l": 0.22183273200273257}, {"bertscore": 0.698015172893065, "rouge_l": 0.3000614349784052}, {"bertscore": 0.6659215827821754, "rouge_l": 0.2322475776678516}, {"bertscore": 0.7018044463329716, "rouge_l": 0.30413484643535793}, {"bertscore": 0.6666711766738445, "rouge_l": 0.23382543450899765}], "total": {"test_bertscore": 67.56590555625735, "test_bertscore_se": 0.9122933051456603, "test_rouge_l": 25.19501431196677, "test_rouge_l_se": 1.9021091720737695}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "rrn", "task": "summarization", "dataset_languages": ["is"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.6555542367568705, "rouge_l": 0.1958573357312659}, {"bertscore": 0.6132484664849471, "rouge_l": 0.155146058697117}, {"bertscore": 0.6560917635797523, "rouge_l": 0.19588785557786514}, {"bertscore": 0.6318214571801946, "rouge_l": 0.16019778805827833}, {"bertscore": 0.5798090474854689, "rouge_l": 0.14085190187199995}, {"bertscore": 0.6166203754837625, "rouge_l": 0.16499354404081404}, {"bertscore": 0.6231731372245122, "rouge_l": 0.15970784867301258}, {"bertscore": 0.6043705164920539, "rouge_l": 0.1529564260244863}, {"bertscore": 0.6143289270694368, "rouge_l": 0.1581282201696449}, {"bertscore": 0.6205171144101769, "rouge_l": 0.16825019571283917}], "total": {"test_bertscore": 62.15535042167175, "test_bertscore_se": 1.4075396118696923, "test_rouge_l": 16.519771745573234, "test_rouge_l_se": 1.1010121782872397}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "no-sammendrag", "task": "summarization", "dataset_languages": ["nb", "nn", "no"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.6011693137115799, "rouge_l": 0.12055365326289272}, {"bertscore": 0.6421617012092611, "rouge_l": 0.18143241162599155}, {"bertscore": 0.6526047637598822, "rouge_l": 0.19003436783614647}, {"bertscore": 0.6470405261352425, "rouge_l": 0.1863918312420681}, {"bertscore": 0.6434540492191445, "rouge_l": 0.17994816565677446}, {"bertscore": 0.6302223092934582, "rouge_l": 0.16040018653697768}, {"bertscore": 0.6250589177070651, "rouge_l": 0.16263288831918382}, {"bertscore": 0.6537425579881528, "rouge_l": 0.19226308691260596}, {"bertscore": 0.6477453937404789, "rouge_l": 0.18520152146620028}, {"bertscore": 0.6315948013652815, "rouge_l": 0.1664721236026944}], "total": {"test_bertscore": 63.74794334129547, "test_bertscore_se": 0.9924814359947234, "test_rouge_l": 17.253302364615358, "test_rouge_l_se": 1.3335237765539185}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "wiki-lingua-nl", "task": "summarization", "dataset_languages": ["nl"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.656716868354124, "rouge_l": 0.21999165013150926}, {"bertscore": 0.6619181389833102, "rouge_l": 0.22676176906604928}, {"bertscore": 0.6444969673175365, "rouge_l": 0.1790946604615986}, {"bertscore": 0.6682259013032308, "rouge_l": 0.23807627729643455}, {"bertscore": 0.6545844300562749, "rouge_l": 0.2284775401529831}, {"bertscore": 0.6617104238539468, "rouge_l": 0.22774791870761724}, {"bertscore": 0.6628272258531069, "rouge_l": 0.23181661929468939}, {"bertscore": 0.6568818740925053, "rouge_l": 0.2191619966980627}, {"bertscore": 0.6572890779061709, "rouge_l": 0.22790998469392665}, {"bertscore": 0.646741017801105, "rouge_l": 0.21970744075467985}], "total": {"test_bertscore": 65.71391925521311, "test_bertscore_se": 0.4496837201911244, "test_rouge_l": 22.18745857257551, "test_rouge_l_se": 1.0005745286746859}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "swedn", "task": "summarization", "dataset_languages": ["sv"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.6429856803297298, "rouge_l": 0.1838882871659086}, {"bertscore": 0.6361564673279645, "rouge_l": 0.18367403278067862}, {"bertscore": 0.642668806904112, "rouge_l": 0.19273155351128515}, {"bertscore": 0.6386101857351605, "rouge_l": 0.18989387934187987}, {"bertscore": 0.6463703172921669, "rouge_l": 0.1819674596041507}, {"bertscore": 0.6405705137876794, "rouge_l": 0.18438352975491654}, {"bertscore": 0.6477899598394288, "rouge_l": 0.19715936412697488}, {"bertscore": 0.6352319183642976, "rouge_l": 0.1821804708139641}, {"bertscore": 0.6466198352281936, "rouge_l": 0.1791889771521345}, {"bertscore": 0.6421162800106686, "rouge_l": 0.18795177412088787}], "total": {"test_bertscore": 64.19119964819402, "test_bertscore_se": 0.2684311284996156, "test_rouge_l": 18.63019328372781, "test_rouge_l_se": 0.3442440256991017}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "cnn-dailymail", "task": "summarization", "dataset_languages": ["en"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.7079638028808404, "rouge_l": 0.29571031962272176}, {"bertscore": 0.7077012475056108, "rouge_l": 0.293882743209217}, {"bertscore": 0.707470945228124, "rouge_l": 0.2956770788502099}, {"bertscore": 0.7068240813096054, "rouge_l": 0.28990020421553137}, {"bertscore": 0.7059373986849096, "rouge_l": 0.2914848707393707}, {"bertscore": 0.7099657613725867, "rouge_l": 0.29818082374223276}, {"bertscore": 0.7102013377880212, "rouge_l": 0.2985980667541777}, {"bertscore": 0.7080012591322884, "rouge_l": 0.290787995364006}, {"bertscore": 0.708020893856883, "rouge_l": 0.29279302842120347}, {"bertscore": 0.7061783499084413, "rouge_l": 0.29011342799442175}], "total": {"test_bertscore": 70.78265077667311, "test_bertscore_se": 0.08700753920841263, "test_rouge_l": 29.37128558913093, "test_rouge_l_se": 0.19984090143196015}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "orange-sum", "task": "summarization", "dataset_languages": ["fr"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"bertscore": 0.6693327090761159, "rouge_l": 0.22372647269065526}, {"bertscore": 0.6717667744669598, "rouge_l": 0.23002134180403616}, {"bertscore": 0.6702868483844213, "rouge_l": 0.2251240681910523}, {"bertscore": 0.6456323299789801, "rouge_l": 0.18968641808948747}, {"bertscore": 0.6537523583101574, "rouge_l": 0.2031658489946432}, {"bertscore": 0.6718597604776733, "rouge_l": 0.22581458209492733}, {"bertscore": 0.66972101299325, "rouge_l": 0.2250929258464665}, {"bertscore": 0.6619330382090993, "rouge_l": 0.21685959421752599}, {"bertscore": 0.6713388020871207, "rouge_l": 0.22741982239897374}, {"bertscore": 0.644626070628874, "rouge_l": 0.1970308712871992}], "total": {"test_bertscore": 66.30249704612652, "test_bertscore_se": 0.6818336980951571, "test_rouge_l": 21.63941945614967, "test_rouge_l_se": 0.892248023908958}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "danske-talemaader", "task": "knowledge", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.23024550338106994, "accuracy": 0.4207920792079208}, {"mcc": 0.2628417349801044, "accuracy": 0.44801980198019803}, {"mcc": 0.2704222186831577, "accuracy": 0.4467821782178218}, {"mcc": 0.19321089077754078, "accuracy": 0.3849009900990099}, {"mcc": 0.286753553496851, "accuracy": 0.45544554455445546}, {"mcc": 0.19524147955213117, "accuracy": 0.3886138613861386}, {"mcc": 0.28891228727005114, "accuracy": 0.46163366336633666}, {"mcc": 0.25268624223540903, "accuracy": 0.4368811881188119}, {"mcc": 0.25313525589754143, "accuracy": 0.4381188118811881}, {"mcc": 0.29831792121357337, "accuracy": 0.47029702970297027}], "total": {"test_mcc": 25.3176708748743, "test_mcc_se": 2.292395574454722, "test_accuracy": 43.51485148514852, "test_accuracy_se": 1.796141849422643}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "danish-citizen-tests", "task": "knowledge", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.48445205797589735, "accuracy": 0.650390625}, {"mcc": 0.4946992873104822, "accuracy": 0.65625}, {"mcc": 0.482021450051285, "accuracy": 0.64453125}, {"mcc": 0.4826071303566352, "accuracy": 0.6484375}, {"mcc": 0.5010195342943258, "accuracy": 0.658203125}, {"mcc": 0.37539001198203575, "accuracy": 0.578125}, {"mcc": 0.5148285395626604, "accuracy": 0.673828125}, {"mcc": 0.4817191872982408, "accuracy": 0.646484375}, {"mcc": 0.4789040119511219, "accuracy": 0.646484375}, {"mcc": 0.5027287453313285, "accuracy": 0.65625}], "total": {"test_mcc": 47.98369956114013, "test_mcc_se": 2.3884992615943688, "test_accuracy": 64.58984375, "test_accuracy_se": 1.5691162222278203}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mmlu-no", "task": "knowledge", "dataset_languages": ["nb", "nn", "no"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.22759944784042077, "accuracy": 0.39892578125}, {"mcc": 0.23296003568432894, "accuracy": 0.400390625}, {"mcc": 0.23814793315547997, "accuracy": 0.408203125}, {"mcc": 0.205707043899422, "accuracy": 0.3798828125}, {"mcc": 0.22783350359753085, "accuracy": 0.404296875}, {"mcc": 0.24047316545908148, "accuracy": 0.41357421875}, {"mcc": 0.2514585770624814, "accuracy": 0.42578125}, {"mcc": 0.25044589710374376, "accuracy": 0.4287109375}, {"mcc": 0.23701661467320265, "accuracy": 0.4189453125}, {"mcc": 0.2609400395902346, "accuracy": 0.42724609375}], "total": {"test_mcc": 23.725822580659266, "test_mcc_se": 0.9565584603246958, "test_accuracy": 41.0595703125, "test_accuracy_se": 0.9565472951707572}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mmlu-sv", "task": "knowledge", "dataset_languages": ["sv"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.25991790708435564, "accuracy": 0.42724609375}, {"mcc": 0.22555947270005844, "accuracy": 0.40380859375}, {"mcc": 0.2512775424406984, "accuracy": 0.4111328125}, {"mcc": 0.25246116732368823, "accuracy": 0.412109375}, {"mcc": 0.27203567243014437, "accuracy": 0.44140625}, {"mcc": 0.25589371089285573, "accuracy": 0.42529296875}, {"mcc": 0.24672162856526886, "accuracy": 0.4267578125}, {"mcc": 0.259835221929494, "accuracy": 0.43212890625}, {"mcc": 0.2766332074114603, "accuracy": 0.4453125}, {"mcc": 0.27064243365057844, "accuracy": 0.439453125}], "total": {"test_mcc": 25.70977964428603, "test_mcc_se": 0.9153268487164629, "test_accuracy": 42.646484375, "test_accuracy_se": 0.8607608085379896}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mmlu-de", "task": "knowledge", "dataset_languages": ["de"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.302584074932262, "accuracy": 0.47412109375}, {"mcc": 0.33586927396979555, "accuracy": 0.4970703125}, {"mcc": 0.29338047949987894, "accuracy": 0.46826171875}, {"mcc": 0.30455823412967004, "accuracy": 0.4794921875}, {"mcc": 0.3054098727157912, "accuracy": 0.47705078125}, {"mcc": 0.2903633286616759, "accuracy": 0.4677734375}, {"mcc": 0.325434240247899, "accuracy": 0.4892578125}, {"mcc": 0.30859350080412135, "accuracy": 0.474609375}, {"mcc": 0.31784483416358733, "accuracy": 0.486328125}, {"mcc": 0.302661474102483, "accuracy": 0.47119140625}], "total": {"test_mcc": 30.86699313227165, "test_mcc_se": 0.870333006267996, "test_accuracy": 47.8515625, "test_accuracy_se": 0.5961313778261518}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mmlu-nl", "task": "knowledge", "dataset_languages": ["nl"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.26816589139577013, "accuracy": 0.4521484375}, {"mcc": 0.28531540425563656, "accuracy": 0.462890625}, {"mcc": 0.2732692894315277, "accuracy": 0.4521484375}, {"mcc": 0.2637933420424804, "accuracy": 0.4453125}, {"mcc": 0.2904833989028748, "accuracy": 0.466796875}, {"mcc": 0.2638508628674907, "accuracy": 0.447265625}, {"mcc": 0.26783382298584374, "accuracy": 0.44970703125}, {"mcc": 0.2787281015764299, "accuracy": 0.45751953125}, {"mcc": 0.28559936485154286, "accuracy": 0.4619140625}, {"mcc": 0.26104328871883953, "accuracy": 0.443359375}], "total": {"test_mcc": 27.380827670284365, "test_mcc_se": 0.6546136453067823, "test_accuracy": 45.390625, "test_accuracy_se": 0.4963047554050134}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mmlu", "task": "knowledge", "dataset_languages": ["en"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.4918444530571616, "accuracy": 0.619140625}, {"mcc": 0.49244618648329, "accuracy": 0.6201171875}, {"mcc": 0.5151531008502934, "accuracy": 0.63623046875}, {"mcc": 0.525489001083972, "accuracy": 0.64306640625}, {"mcc": 0.5156120052831399, "accuracy": 0.63623046875}, {"mcc": 0.48090815672871084, "accuracy": 0.60986328125}, {"mcc": 0.5220844549167285, "accuracy": 0.64111328125}, {"mcc": 0.5382317362867154, "accuracy": 0.65185546875}, {"mcc": 0.4969750810979705, "accuracy": 0.62109375}, {"mcc": 0.535221265006509, "accuracy": 0.6494140625}], "total": {"test_mcc": 51.13965440794491, "test_mcc_se": 1.2242595332716182, "test_accuracy": 63.28125, "test_accuracy_se": 0.8880874146441115}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "mmlu-fr", "task": "knowledge", "dataset_languages": ["fr"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.36305897769151596, "accuracy": 0.52197265625}, {"mcc": 0.33862981758957966, "accuracy": 0.50146484375}, {"mcc": 0.2948591271301557, "accuracy": 0.4677734375}, {"mcc": 0.33879439027236563, "accuracy": 0.501953125}, {"mcc": 0.3598318010003803, "accuracy": 0.51904296875}, {"mcc": 0.3618941781049231, "accuracy": 0.51904296875}, {"mcc": 0.36530872517040885, "accuracy": 0.521484375}, {"mcc": 0.3475587928632904, "accuracy": 0.509765625}, {"mcc": 0.370025436591492, "accuracy": 0.52685546875}, {"mcc": 0.34725346198009965, "accuracy": 0.5078125}], "total": {"test_mcc": 34.87214708394212, "test_mcc_se": 1.3605291449838512, "test_accuracy": 50.97167968750001, "test_accuracy_se": 1.0630708026371254}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "arc-is", "task": "knowledge", "dataset_languages": ["is"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.0941600653055537, "accuracy": 0.3134765625}, {"mcc": 0.1039458044045476, "accuracy": 0.3056640625}, {"mcc": 0.10686428593722826, "accuracy": 0.326171875}, {"mcc": 0.07941421399176701, "accuracy": 0.3056640625}, {"mcc": 0.07115215242268488, "accuracy": 0.2998046875}, {"mcc": 0.09087605280987249, "accuracy": 0.3115234375}, {"mcc": 0.08469546393795264, "accuracy": 0.2998046875}, {"mcc": 0.05616392108314611, "accuracy": 0.2861328125}, {"mcc": 0.0850910793268674, "accuracy": 0.3037109375}, {"mcc": 0.11044592112764288, "accuracy": 0.3349609375}], "total": {"test_mcc": 8.828089603472629, "test_mcc_se": 1.0432613941963458, "test_accuracy": 30.869140625000004, "test_accuracy_se": 0.860949955561871}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hellaswag-da", "task": "common-sense-reasoning", "dataset_languages": ["da"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.17186094604025098, "accuracy": 0.3662109375}, {"mcc": 0.08774171137823115, "accuracy": 0.3125}, {"mcc": 0.15560307162843037, "accuracy": 0.34814453125}, {"mcc": 0.16688088698978548, "accuracy": 0.36962890625}, {"mcc": 0.06597630847588974, "accuracy": 0.29248046875}, {"mcc": 0.15862765085986827, "accuracy": 0.3671875}, {"mcc": 0.0998567771846819, "accuracy": 0.326171875}, {"mcc": 0.11027556178924362, "accuracy": 0.326171875}, {"mcc": 0.1478274784557862, "accuracy": 0.35546875}, {"mcc": 0.18782902013576463, "accuracy": 0.39013671875}], "total": {"test_mcc": 13.524794129379325, "test_mcc_se": 2.5435288985129403, "test_accuracy": 34.541015625, "test_accuracy_se": 1.873990030181431}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hellaswag-no", "task": "common-sense-reasoning", "dataset_languages": ["nb", "nn", "no"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.1518408449720295, "accuracy": 0.35009765625}, {"mcc": 0.11408063791942609, "accuracy": 0.33056640625}, {"mcc": 0.16706846781093426, "accuracy": 0.36279296875}, {"mcc": 0.13311096507613, "accuracy": 0.33154296875}, {"mcc": 0.18915772964915537, "accuracy": 0.36572265625}, {"mcc": 0.15807175511048402, "accuracy": 0.35546875}, {"mcc": 0.1035712297906596, "accuracy": 0.31787109375}, {"mcc": 0.19015857690852905, "accuracy": 0.38134765625}, {"mcc": 0.13435404390708722, "accuracy": 0.35205078125}, {"mcc": 0.12621606694158616, "accuracy": 0.33544921875}], "total": {"test_mcc": 14.676303180860211, "test_mcc_se": 1.8412732559656568, "test_accuracy": 34.8291015625, "test_accuracy_se": 1.1977173764961309}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hellaswag-sv", "task": "common-sense-reasoning", "dataset_languages": ["sv"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.11182821279186345, "accuracy": 0.3115234375}, {"mcc": 0.10219303743305719, "accuracy": 0.3251953125}, {"mcc": 0.09746154752738746, "accuracy": 0.30810546875}, {"mcc": 0.08577701583255963, "accuracy": 0.2900390625}, {"mcc": 0.11651390407622295, "accuracy": 0.32958984375}, {"mcc": 0.11738605974439345, "accuracy": 0.3232421875}, {"mcc": 0.21754539589611774, "accuracy": 0.41064453125}, {"mcc": 0.12212536387052651, "accuracy": 0.330078125}, {"mcc": 0.11946222091218069, "accuracy": 0.32080078125}, {"mcc": 0.12081206130343404, "accuracy": 0.330078125}], "total": {"test_mcc": 12.111048193877432, "test_mcc_se": 2.224241111295528, "test_accuracy": 32.79296875, "test_accuracy_se": 1.9634670026835621}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "winogrande-is", "task": "common-sense-reasoning", "dataset_languages": ["is"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.004941842614223843, "accuracy": 0.5301339285714286}, {"mcc": 0.004958943592137308, "accuracy": 0.5189732142857143}, {"mcc": 0.00592157577263504, "accuracy": 0.5401785714285714}, {"mcc": 0.054415457616894695, "accuracy": 0.5613839285714286}, {"mcc": 0.0011711812587321569, "accuracy": 0.5011160714285714}, {"mcc": 0.04943733723111712, "accuracy": 0.5256696428571429}, {"mcc": -0.00950075968941823, "accuracy": 0.5502232142857143}, {"mcc": 0.0663830341254975, "accuracy": 0.5502232142857143}, {"mcc": -0.039795835440026676, "accuracy": 0.5424107142857143}, {"mcc": 0.024598804046590483, "accuracy": 0.5178571428571429}], "total": {"test_mcc": 1.6253158112838322, "test_mcc_se": 2.017379911031024, "test_accuracy": 53.381696428571445, "test_accuracy_se": 1.138551196971397}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hellaswag-de", "task": "common-sense-reasoning", "dataset_languages": ["de"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.22723860623518735, "accuracy": 0.41455078125}, {"mcc": 0.218883127609746, "accuracy": 0.4052734375}, {"mcc": 0.18878991097276246, "accuracy": 0.38916015625}, {"mcc": 0.1947491918568087, "accuracy": 0.39111328125}, {"mcc": 0.161270298571091, "accuracy": 0.35107421875}, {"mcc": 0.2517044023401158, "accuracy": 0.43359375}, {"mcc": 0.2555600110634416, "accuracy": 0.43603515625}, {"mcc": 0.21358697981237007, "accuracy": 0.39599609375}, {"mcc": 0.24830691689009654, "accuracy": 0.43505859375}, {"mcc": 0.22261708502544109, "accuracy": 0.412109375}], "total": {"test_mcc": 21.827065303770603, "test_mcc_se": 1.8712646272846059, "test_accuracy": 40.6396484375, "test_accuracy_se": 1.6319526558888138}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hellaswag-nl", "task": "common-sense-reasoning", "dataset_languages": ["nl"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.2081510330181242, "accuracy": 0.40185546875}, {"mcc": 0.17360514723873957, "accuracy": 0.37109375}, {"mcc": 0.19950941466584166, "accuracy": 0.3955078125}, {"mcc": 0.16994368737415036, "accuracy": 0.376953125}, {"mcc": 0.19213942469704406, "accuracy": 0.3935546875}, {"mcc": 0.1612565793317289, "accuracy": 0.36376953125}, {"mcc": 0.17669657384168708, "accuracy": 0.37939453125}, {"mcc": 0.16967266845962867, "accuracy": 0.369140625}, {"mcc": 0.1987753756137386, "accuracy": 0.39501953125}, {"mcc": 0.1551024635057757, "accuracy": 0.35888671875}], "total": {"test_mcc": 18.048523677464587, "test_mcc_se": 1.1128583290416532, "test_accuracy": 38.0517578125, "test_accuracy_se": 0.9333494661844226}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hellaswag", "task": "common-sense-reasoning", "dataset_languages": ["en"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.4853583040469098, "accuracy": 0.59814453125}, {"mcc": 0.4498581668826491, "accuracy": 0.57177734375}, {"mcc": 0.40837392594994737, "accuracy": 0.53125}, {"mcc": 0.41795545729389477, "accuracy": 0.53515625}, {"mcc": 0.4616778882015442, "accuracy": 0.5732421875}, {"mcc": 0.4765551005733945, "accuracy": 0.587890625}, {"mcc": 0.3749497206096992, "accuracy": 0.515625}, {"mcc": 0.4311773977270226, "accuracy": 0.55908203125}, {"mcc": 0.44795039823223964, "accuracy": 0.56396484375}, {"mcc": 0.44989312009832605, "accuracy": 0.5771484375}], "total": {"test_mcc": 44.037494796156274, "test_mcc_se": 2.057324756881195, "test_accuracy": 56.13281249999999, "test_accuracy_se": 1.633707130892523}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "hellaswag-fr", "task": "common-sense-reasoning", "dataset_languages": ["fr"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"mcc": 0.2919049534979691, "accuracy": 0.46435546875}, {"mcc": 0.24388068128625925, "accuracy": 0.43212890625}, {"mcc": 0.2529813061590214, "accuracy": 0.435546875}, {"mcc": 0.26010114674242196, "accuracy": 0.44482421875}, {"mcc": 0.17879940992431081, "accuracy": 0.38037109375}, {"mcc": 0.2869599036078522, "accuracy": 0.45751953125}, {"mcc": 0.3096020533567047, "accuracy": 0.4853515625}, {"mcc": 0.23671234246367912, "accuracy": 0.4228515625}, {"mcc": 0.29875503491200683, "accuracy": 0.4736328125}, {"mcc": 0.28279959603238936, "accuracy": 0.4609375}], "total": {"test_mcc": 26.424964279826145, "test_mcc_se": 2.3997228941684527, "test_accuracy": 44.5751953125, "test_accuracy_se": 1.8651854598825328}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
{"dataset": "speed", "task": "speed", "dataset_languages": ["ab", "aa", "af", "sq", "am", "ar", "an", "hy", "as", "av", "ae", "ay", "az", "bm", "ba", "eu", "be", "bn", "bi", "bs", "br", "bg", "my", "ca", "ch", "ce", "ny", "zh", "cu", "cv", "kw", "co", "cr", "hr", "cs", "da", "dv", "nl", "dz", "en", "eo", "et", "ee", "fo", "fj", "fi", "fr", "fy", "ff", "gd", "gl", "lg", "ka", "de", "el", "kl", "gn", "gu", "ht", "ha", "he", "hz", "hi", "ho", "hu", "is", "io", "ig", "id", "ia", "ie", "iu", "ik", "ga", "it", "ja", "kn", "kr", "ks", "kk", "km", "ki", "rw", "ky", "kv", "kg", "ko", "kj", "ku", "lo", "la", "lv", "li", "ln", "lt", "lu", "lb", "mk", "mg", "ms", "ml", "mt", "gv", "mi", "mr", "mh", "mn", "na", "nv", "nd", "nr", "ng", "ne", "no", "nb", "nn", "ii", "oc", "oj", "or", "om", "os", "pi", "ps", "fa", "pl", "pt", "pa", "qu", "ro", "rm", "rn", "ru", "se", "sm", "sg", "sa", "sc", "sr", "sn", "sd", "si", "sk", "sl", "so", "st", "es", "su", "sw", "ss", "sv", "tl", "ty", "tg", "ta", "tt", "te", "th", "bo", "ti", "to", "ts", "tn", "tr", "tk", "tw", "ug", "uk", "ur", "uz", "ve", "vi", "vo", "wa", "cy", "wo", "xh", "yi", "yo", "za", "zu"], "model": "allenai/OLMo-2-1124-13B", "results": {"raw": [{"test_speed": 271.21999999999997, "test_speed_short": 31.2}, {"test_speed": 542.85, "test_speed_short": 59.25}, {"test_speed": 803.91, "test_speed_short": 111.94}, {"test_speed": 1079.04, "test_speed_short": 136.79999999999998}, {"test_speed": 1330.29, "test_speed_short": 166.41}, {"test_speed": 1582.9599999999998, "test_speed_short": 217.73999999999998}, {"test_speed": 1875.62, "test_speed_short": 250.88}, {"test_speed": 2098.1400000000003, "test_speed_short": 267.67}, {"test_speed": 2391.4900000000002, "test_speed_short": 299.52}, {"test_speed": 2663.7999999999997, "test_speed_short": 325.55}], "total": {"test_speed": 1463.9319999999998, "test_speed_se": 495.780852313657, "test_speed_short": 186.69599999999997, "test_speed_short_se": 62.85626044518327}}, "num_model_parameters": 13716198400, "max_sequence_length": 4096, "vocabulary_size": 100352, "merge": false, "generative": true, "generative_type": "base", "few_shot": true, "validation_split": false, "scandeval_version": "15.0.0"}
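Each line above is a self-contained JSON record, so the results file can be processed as JSONL. A minimal sketch, using a shortened, made-up record with the same field names as the output above:

```python
import json
import statistics

# One record per line: each result is a JSON object with the raw per-iteration
# scores and the aggregated totals. (Shortened, hypothetical record; the real
# records above have ten raw entries each.)
line = '{"dataset": "danish-citizen-tests", "results": {"raw": [{"mcc": 0.48}, {"mcc": 0.5}], "total": {"test_mcc": 49.0}}}'

record = json.loads(line)
raw_mcc = [r["mcc"] for r in record["results"]["raw"]]

# Totals are reported on a 0-100 scale, the raw scores on 0-1.
mean_mcc = 100 * statistics.mean(raw_mcc)
print(round(mean_mcc, 1))  # → 49.0
```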

@saattrupdan
Member

Thanks @Mikeriess! Results are live now.

I've noticed some of my evaluations have become slow as well; I'm not sure whether it's due to an update in one of the packages. Let me know if it persists for other models 🙂

@Mikeriess
Contributor

Will do 👌 I was using 8x H100s for this (nvidia-smi showed full utilization across all GPUs during benchmarking), yet it took about as long as a 70B model did in previous evaluations 🤔

I could try comparing benchmark times across vLLM versions on e.g. a Llama-3.1-8B?
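Roughly what I have in mind for the timing (a sketch; for the real comparison you would run the same full scandeval command once per installed vLLM version and compare the durations):

```python
import subprocess
import sys
import time

def time_command(argv: list[str]) -> float:
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start

# Trivial demonstration with a no-op command; in practice, argv would be the
# scandeval invocation (e.g. ["scandeval", "--model", "meta-llama/Llama-3.1-8B", ...]).
elapsed = time_command([sys.executable, "-c", "pass"])
print(elapsed > 0)  # → True
```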

@saattrupdan
Member

@Mikeriess Yes, that would be helpful! Also check whether it is due to particular tasks. For instance, the NER task is sometimes the culprit, as that's the only task where we're using structured generation.

@Mikeriess
Contributor

Decided to test with gemma-2-2b instead in the interest of time :-) Currently running the baseline, will probably know more tomorrow 👍

@Mikeriess
Contributor

It ran fine with the base scandeval install, but with the latest vLLM version the kernel died after the first 8 benchmarks; re-running now to see if it was a coincidence

@Mikeriess
Contributor

That's the third time the kernel has died now, again after completing the allocine dataset (dataset number 8). I see NER is the next task.

This was the sequence I ran:

pip install -U scandeval[all]
pip install -U wheel && FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation
pip install --upgrade vllm

scandeval --evaluate-test-split --clear-model-cache --trust-remote-code -m google/gemma-2-2b --cache-dir cache/scandeval_cache --only-allow-safetensors

It seems like the latest scandeval and vLLM versions cause some stability issues on NER. However, I am able to resume after it has died, and I can see manually that this process is extremely slow. Unfortunately, my time measurements were ruined when the kernel kept dying.

@saattrupdan
Member

saattrupdan commented Feb 13, 2025

@Mikeriess Yeah, the crashing is a known issue with Outlines (dottxt-ai/outlines#1351), and happens because newer versions of vLLM use newer versions of Outlines, which hit the bug. That's the reason I added the vLLM upper bound in the first place, and I see it hasn't been fixed yet.

One solution could be to change the structured generation backend to XGrammar, which is super fast and reliable, but unfortunately it is currently missing features that we're using in the benchmark (mlc-ai/xgrammar#192, mlc-ai/xgrammar#131, mlc-ai/xgrammar#104), causing significantly worse evaluation results. They say they're working on it, though.

So we're at an annoying point where we can't really use the new versions of vLLM, as the structured generation packages aren't good enough yet. Outlines is the lesser evil at the moment, since it is feature-complete but just requires a million GBs of memory (causing your crash). This is only relevant for newer model architectures, however, as the older ones are still supported by the older vLLM versions.
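For context, the NER output that structured generation has to enforce is roughly a JSON object with one array of strings per entity type, along these lines (an illustrative sketch, not the exact schema used in the benchmark):

```python
import json

# Roughly the shape of output that the structured generation backend has to
# enforce for NER: a JSON object with one array of strings per entity type.
# (Illustrative sketch; entity-type names here are hypothetical.)
ner_schema = {
    "type": "object",
    "properties": {
        "person": {"type": "array", "items": {"type": "string"}},
        "location": {"type": "array", "items": {"type": "string"}},
        "organization": {"type": "array", "items": {"type": "string"}},
        "miscellaneous": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["person", "location", "organization", "miscellaneous"],
}

# A conforming model output:
output = json.loads(
    '{"person": ["Anna"], "location": ["Oslo"], '
    '"organization": [], "miscellaneous": []}'
)
missing = [key for key in ner_schema["required"] if key not in output]
print(missing)  # → []
```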

The joy of relying on external packages, eh? 🤷

@Mikeriess
Contributor

@saattrupdan Haha, yeah, it's great. Your hands are tied as I see it; I only see waiting as an option, unless we re-evaluate a ton of models 😅

In the meantime, we can focus on fine-tunes of existing architectures in the various languages, I guess. I still think those are somewhat interesting to see.

@alessiodallapiazza

alessiodallapiazza commented Mar 2, 2025

Hello,
I was following your discussion about the current limitation of xgrammar in handling specific properties. I was wondering if creating a custom grammar could help us work around this issue. For instance, something similar to what's demonstrated in https://github.com/ggml-org/llama.cpp/blob/master/examples/json_schema_to_grammar.py.

What do you think about this approach?

@saattrupdan
Member


Hi there! I would be wary of building our own custom grammars, as they would be another component that we would need to maintain over the years. Since the XGrammar team already seem to be working on this, it would probably be more beneficial to contribute directly to that repo, to ensure the longevity of the feature.
