
Fetch model results from HF #43

Merged: 21 commits, Nov 15, 2024

Conversation

@Samoed (Contributor) commented Nov 8, 2024

I've implemented a base version of retrieving results from the model card. This PR is a work in progress and is open for discussion. Currently, I’ve parsed the results for intfloat/multilingual-e5-large as a demo. During integration with other results, I ran into some issues, particularly with loading results using TaskResult. External results don’t have evaluation_time and mteb_version, which TaskResult requires.

@KennethEnevoldsen @x-tabdeveloping
Ref: embeddings-benchmark/mteb#1405, embeddings-benchmark/mteb#1373
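
(For context, a rough sketch of the kind of result payload involved here, assuming placeholder values for the fields that external model cards don't provide; the keys beyond evaluation_time and mteb_version are illustrative, not the exact TaskResult schema.)

```python
# Illustrative only: a result entry reconstructed from a model card.
# evaluation_time and mteb_version are not published on the Hub, so they are
# filled with placeholders; the remaining keys are assumptions, not the exact
# schema that TaskResult validates.
external_result = {
    "task_name": "SomeTask",          # hypothetical task name
    "mteb_version": "0.0.0",          # placeholder: real version unknown
    "evaluation_time": 0,             # placeholder: real runtime unknown
    "scores": {
        "test": [
            {"main_score": 0.86, "hf_subset": "default", "languages": ["eng-Latn"]},
        ],
    },
}
```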

@x-tabdeveloping

This is awesome! You're doing god's work here :D Thanks

@x-tabdeveloping

I think adding something invalid for the evaluation_time attribute, like -1, would make quite a bit of sense. Otherwise we can also make it possible in the mteb library to allow None, though I'd prefer not to do this. What do you think, @KennethEnevoldsen?

@KennethEnevoldsen (Contributor) left a comment

This looks really good! We should probably do this as a one-time thing only (don't support it going forward).

Comment on lines 4 to 5
"evaluation_time": 0,
"mteb_version": "0.0.0",

Hmm, maybe we should make it clear that this is NA.


(Potentially by wrapping the original object to allow for None.)

@Samoed (Contributor, Author) commented Nov 8, 2024

You mean that we should fetch results only one time for each model and not check them for updates?

@KennethEnevoldsen (Contributor)

You mean that we should fetch results only one time for each model and not check them for updates?

Either that or we should update the create_meta CLI to also include MTEB version and eval time.

I would rather allow people to create a MTEB folder in the repo with the results.

1000 evaluations are also a bit hard to reason about:

[screenshot of the model card's evaluation results omitted]

I would much rather have:

score on MTEB(eng): 0.86
score on MTEB(multilingual): 0.71
...

@Samoed (Contributor, Author) commented Nov 8, 2024

I would much rather have:
score on MTEB(eng): 0.86
score on MTEB(multilingual): 0.71

Do you mean that we should remove the per-task results from mteb create_meta and keep only the average result for each benchmark? And all other results should live in the model's repo?

@isaac-chung (Contributor)

Hey @Samoed wow thanks for working on this!

I think adding something invalid for the evaluation_time attribute like -1 would make quite a bit of sense. Otherwise we can also make it possible in the mteb library to allow None, though I'd prefer not to do this. What do you think

Re: evaluation_time: using -1 likely makes it easier for us to keep track of the results that need attention down the line. I'd keep away from using None; right now having the field check is good.

we should update the create_meta CLI to also include MTEB version and eval time.

Let's do that for sure.

I would rather allow people to create a MTEB folder in the repo with the results.

I think this could be a great main option going forward, instead of PRs to the results repo (i.e. both are accepted, but the MTEB folder is preferred). People's results should be generated with the MTEB version and eval time (see above). For time to be meaningful, we might also want to track hardware info, right? Every time we scan for the MTEB folders, we can validate the results objects as well.

I'm all for automating as much as we can :P

@isaac-chung (Contributor) commented Nov 8, 2024

Slightly related to the eval time topic: maybe we could start enforcing kg_co2_emissions to be filled in as well, as @x-tabdeveloping suggested? CodeCarbon may take both eval time and hardware into account. Then this plot might be more meaningful when comparing runs on different machines.

This will add one new dependency to MTEB, but I think the benefits are worth it.
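
(As an illustration of how kg_co2_emissions could be filled in, a minimal sketch using CodeCarbon's EmissionsTracker around an evaluation call; the evaluate_fn callable is hypothetical, and this is not mteb's actual integration.)

```python
from codecarbon import EmissionsTracker


def run_with_emissions(evaluate_fn):
    """Run an evaluation and return (results, estimated kg CO2eq)."""
    tracker = EmissionsTracker()  # CodeCarbon also records hardware metadata
    tracker.start()
    try:
        results = evaluate_fn()  # hypothetical evaluation callable
    finally:
        kg_co2_emissions = tracker.stop()  # emissions estimate in kg CO2eq
    return results, kg_co2_emissions
```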

@Samoed marked this pull request as ready for review on November 9, 2024, 12:11.
@KennethEnevoldsen (Contributor)

+1 for enabling code-carbon by default

Re: -1 for values: this would lead to issues when, e.g., taking a mean. I would use math.nan instead (it is a float and an invalid value).

It sounds like for now:

  • Keep create_meta: it is still useful for adding certain tasks to metadata, but it will not auto-update the leaderboard
  • Use only embeddings-benchmark/results, though we plan to add the option to use a folder on HF in the future
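
(A quick sketch of the -1 vs. math.nan point above, with made-up scores: a -1 sentinel silently skews an average, while NaN either propagates visibly or can be filtered out explicitly.)

```python
import math

scores_with_sentinel = [0.7, 0.8, -1.0]   # -1 used as "missing"
scores_with_nan = [0.7, 0.8, math.nan]    # math.nan used as "missing"

print(sum(scores_with_sentinel) / len(scores_with_sentinel))  # 0.1666..., silently wrong
print(sum(scores_with_nan) / len(scores_with_nan))            # nan, so the gap is visible

# Excluding missing values explicitly gives the intended average:
valid = [s for s in scores_with_nan if not math.isnan(s)]
print(sum(valid) / len(valid))                                # 0.75
```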

@Samoed (Contributor, Author) commented Nov 9, 2024

For now, what should I change, apart from the -1 value?

@isaac-chung (Contributor)

@KennethEnevoldsen in this PR, embeddings-benchmark/mteb#1398, -1 was used in favor of NaN. To be consistent, should we keep the -1 approach? Unless some metrics aren't within [0, 1].

@KennethEnevoldsen (Contributor) commented Nov 11, 2024

That PR is more about adding a warning to existing code (I also think it is a bad idea in that case).

I would definitely change it to NA here.

@Samoed (Contributor, Author) commented Nov 13, 2024

@KennethEnevoldsen I've updated the results.

@KennethEnevoldsen (Contributor)

Hmm, I believe NaN is not valid JSON; shouldn't it be "null"? If it can be read, I guess it doesn't matter.

For the MTEB version I would probably change it to:
"mteb_version": "unknown"

@Samoed (Contributor, Author) commented Nov 13, 2024

Yes, it's technically invalid JSON, but Python can read it correctly.
Changing it to null would require updates to the MTEB loader, as it currently only accepts float values.
Now there are some tasks with NaN values too.
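
(A small sketch of the JSON point above: Python's json module writes and reads the non-standard NaN token by default, which is why these files round-trip in Python even though strict parsers would require null.)

```python
import json
import math

payload = {"evaluation_time": math.nan}

text = json.dumps(payload)       # allow_nan=True by default -> '{"evaluation_time": NaN}'
print(text)
print(json.loads(text))          # {'evaluation_time': nan} -- Python reads it back fine

try:
    json.dumps(payload, allow_nan=False)  # strict mode rejects NaN; null would be needed
except ValueError as err:
    print(err)
```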

@Samoed (Contributor, Author) commented Nov 13, 2024

It also seems the loader requires the version to be a parsable string.

@KennethEnevoldsen (Contributor)

I have added a PR which should allow the fix.

@KennethEnevoldsen (Contributor)

@Samoed feel free to just merge it in

@Samoed (Contributor, Author) commented Nov 14, 2024

@KennethEnevoldsen I don't have permission to merge in this repository. Also, should we keep evaluation_time as math.nan, or change it to None?

@KennethEnevoldsen (Contributor)

Either is fine; I just wanted to make sure that the JSON was generally valid.

@KennethEnevoldsen (Contributor)

@Samoed I think the update is in on main; updating to the latest version would allow us to get this merged in.

@x-tabdeveloping

Oh god, the problem seems to be that if mteb_version is None, TaskResult will try to load the results as if they were from before version 1.11.0, while the results are organized in the new format. @KennethEnevoldsen is it necessary to assume that if mteb_version is None then the version is <1.11.0? Because if so, we are in a bit of a deadlock. The only way I see around this error is either to assume the modern result format when mteb_version is None, or to bump the mteb version to something relatively new in this PR.

@x-tabdeveloping

Oooor (and I saw you did this at one point, @Samoed), we can say something like version == "unknown", and then when we load the results, we assume it's the new format.
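
(To make the deadlock concrete, a sketch of roughly the routing logic being described; this is an assumption about the behaviour, not mteb's actual TaskResult code, and packaging.version is used here only for illustration.)

```python
from packaging.version import Version

NEW_FORMAT_SINCE = Version("1.11.0")  # assumed cutoff, per the discussion above


def uses_new_format(mteb_version: str | None) -> bool:
    """Illustrative version routing, not the real loader."""
    if mteb_version is None:
        # The problematic assumption: a missing version is treated as legacy,
        # which misroutes new-format results fetched from model cards.
        return False
    if mteb_version == "unknown":
        # The proposed workaround: an explicit sentinel that opts into the new format.
        return True
    return Version(mteb_version) >= NEW_FORMAT_SINCE
```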

@Samoed (Contributor, Author) commented Nov 15, 2024

@x-tabdeveloping I think the loading issue was fixed in embeddings-benchmark/mteb#1453.

@x-tabdeveloping

What's up with the test failing then?

@Samoed (Contributor, Author) commented Nov 15, 2024

It was not updated after the PR merge. I tried with the new mteb version, but found a small issue: embeddings-benchmark/mteb#1460

@x-tabdeveloping

The release workflow is in action; once it's done we should probably update the mteb version here and get this baby merged.

@Samoed (Contributor, Author) commented Nov 15, 2024

@x-tabdeveloping Now the results can be imported.

@x-tabdeveloping merged commit 29546d7 into embeddings-benchmark:main on Nov 15, 2024 (2 checks passed).
@x-tabdeveloping

Awesome, thanks <3
