
Updates #6

Merged: 1 commit merged into main from updateres on Aug 12, 2024

Conversation

@Muennighoff (Contributor) commented Jul 9, 2024

Feel free to merge if it looks good!

The file size makes me feel good that we're not putting these in the main mteb repo 😅

26M results/GritLM-7B/13f00a0e36500c80ce12870ea513846a066004af/FloresBitextMining.json

@KennethEnevoldsen (Contributor):

Oh wow. The file sizes even result in an error. @orionw, what is the best approach here?

@orionw (Contributor) commented Jul 10, 2024

I added a script to the repo, remove_spaces_from_large_json_files.py, that should be run before committing; it should lower the file size for files that are too large.
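In essence it just re-serializes the JSON without whitespace; a simplified sketch (the size threshold and the results directory here are illustrative, and the actual script may differ):

    import json
    from pathlib import Path

    MAX_SIZE_MB = 10  # assumed threshold, matching the LFS limit discussed below

    for path in Path("results").rglob("*.json"):
        if path.stat().st_size > MAX_SIZE_MB * 1024 * 1024:
            data = json.loads(path.read_text())
            # separators=(",", ":") drops all whitespace between JSON tokens
            path.write_text(json.dumps(data, separators=(",", ":")))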

I don't know if it will reduce the size that much, though. Why is the GritLM one so large? All the other FloresBitextMining results were about 13 MB and went down to about 8 MB with that script.

@KennethEnevoldsen (Contributor):

> I don't know if it will reduce the size that much, though. Why is the GritLM one so large? All the other FloresBitextMining results were about 13 MB and went down to about 8 MB with that script.

I believe the previous run on Flores did not include all the current splits (since the cross-lingual grid was added). It does seem a bit excessive at the moment, but the runtime is actually quite reasonable (since each language is only embedded once).

@orionw (Contributor) commented Jul 10, 2024

I see, thanks. Would it be possible to separate the results file by split, or craft some other solution for Flores? I think it's the only one with this problem. One potential solution: I could write a script that turns Flores files into a number of chunks (by split or some other method) and reads those chunks back in that specific format in the results (see the sketch below).

Otherwise we will need to think of a new solution, as the files have to be < 10 MB and non-binary for everything to work (Git LFS is required on HF Spaces for files over 10 MB and for binary files of any size).
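A hypothetical sketch of that chunking step (it assumes the results JSON keeps per-split scores under a top-level "scores" key; the actual file layout may differ):

    import json
    from pathlib import Path

    src = Path("results/GritLM-7B/13f00a0e36500c80ce12870ea513846a066004af/FloresBitextMining.json")
    data = json.loads(src.read_text())

    # Write one chunk per split, e.g. FloresBitextMining.dev.json
    for split, scores in data.get("scores", {}).items():
        chunk = {**data, "scores": {split: scores}}
        out = src.with_name(f"{src.stem}.{split}.json")
        out.write_text(json.dumps(chunk, separators=(",", ":")))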

@KennethEnevoldsen (Contributor):

Hmm, the simplest idea I can think of is to .gz everything that is larger than 10 MB, though we lose readability from that (though removing spaces plus the large file size already gets us there).

We might also consider only saving meaningful digits (0.821602063301282 -> 0.82160). This should also make normal results files more readable, so it might be a good thing regardless.
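As a sketch, that rounding could be applied recursively to every float in a loaded results file (five decimal places here, matching the example above; purely illustrative):

    def round_floats(obj, ndigits=5):
        """Recursively round all floats in a JSON-like structure,
        e.g. 0.821602063301282 -> 0.82160."""
        if isinstance(obj, float):
            return round(obj, ndigits)
        if isinstance(obj, dict):
            return {k: round_floats(v, ndigits) for k, v in obj.items()}
        if isinstance(obj, list):
            return [round_floats(v, ndigits) for v in obj]
        return obj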

Going beyond that, we might have a script that removes everything but the main score of the task (e.g., here it is f1). I would probably only do this for >10 MB files:

  {
-   "accuracy": 0.98828125,
    "f1": 0.9845377604166666,
    "hf_subset": "asm_Beng-ben_Beng",
    "languages": [
      "asm-Beng",
      "ben-Beng"
    ],
    "main_score": 0.9845377604166666,
-   "precision": 0.9827473958333333,
-   "recall": 0.98828125
  },

We could optionally also remove languages, since they can be reconstructed (task name -> task metadata, which together with hf_subset gives the languages), but at the moment this is a 100% valid mteb result.

All of these naturally only work up to a point (at which point we should probably take a look at the task).
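A hypothetical sketch of that stripping step (the keep-list is an assumption; as in the diff above, one could additionally keep the task's named main metric, e.g. f1):

    KEEP = {"main_score", "hf_subset", "languages"}  # assumed keep-list

    def keep_main_score_only(entry: dict) -> dict:
        """Drop every metric except the identifying fields and main_score."""
        return {k: v for k, v in entry.items() if k in KEEP}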

@orionw (Contributor) commented Jul 10, 2024

> Hmm, the simplest idea I can think of is to .gz everything that is larger than 10 MB, though we lose readability from that (though removing spaces plus the large file size already gets us there).

Unfortunately, gz files are binary and so have to be in Git LFS on Hugging Face Datasets/Spaces. :(

> We might also consider only saving meaningful digits (0.821602063301282 -> 0.82160). This should also make normal results files more readable, so it might be a good thing regardless.

I like this as a general idea anyway; we shouldn't need more than ~6 significant digits, and even that is very conservative. Between that and the language removal, though, I'm not 100% sure we could reduce the file to 1/3 of its size.

> Going beyond that, we might have a script that removes everything but the main score of the task (e.g., here it is f1)

I don't love the idea of removing results if we can help it, just in case someone wants to analyze other metrics.

@KennethEnevoldsen do you think separating by split is too awkward?

If this ends up being too difficult, we could do without a GitHub mirror, but if it's just the Flores file I think I'm fine with creating a workaround hack for it.

@KennethEnevoldsen (Contributor):

> @KennethEnevoldsen do you think separating by split is too awkward?

Hmm yeah, this is a good question. Generally, I would want all the scores that make up the "main_score" within one file; if the main score is computed across two splits, I would keep both splits in that file.

You can naturally run a task on the "train" set without it being required. Therefore I have added:

    def validate_and_filter_scores(self):
        """Ensure that the scores are correct for the given task by removing
        any splits besides those specified in the task metadata. Additionally,
        ensure that all required splits, as well as the languages, are present
        in the scores.
        """

which removes all unrequired splits (and checks that all required splits are present).

Currently in PR: https://github.com/embeddings-benchmark/mteb/pull/1038/files#diff-7e2303fbc5dbdaea469f570ad2c469a88773c92be19827925dd19ade83c04b66
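A rough sketch of the filtering logic (attribute names here are assumptions; the real implementation is in the linked PR):

    def validate_and_filter_scores(self):
        required = set(self.task.metadata.eval_splits)  # assumed attribute path
        missing = required - set(self.scores)
        if missing:
            raise ValueError(f"Missing required splits: {missing}")
        # Drop any extra splits, e.g. an optional "train" run
        self.scores = {split: s for split, s in self.scores.items() if split in required}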

@KennethEnevoldsen (Contributor):

Okay, so I did an experiment here:

❯ du -h FloresBitextMining*
 28M    FloresBitextMining.json
 23M    FloresBitextMining_six_dec.json
 15M    FloresBitextMining_small_six_dec_and_no_spaces.json

That does not quite get us there, but I see that we have both dev and devtest, which seems over the top. Keeping only one of the splits:

❯ du -h FloresBitextMining*
 14M    FloresBitextMining.json
 11M    FloresBitextMining_six_dec.json
8.0M    FloresBitextMining_small_six_dec_and_no_spaces.json

The question is whether that leads to a problem for the leaderboard (I could not find cases where it was used).

I have made the required changes here:

embeddings-benchmark/mteb#1075

@orionw (Contributor) commented Jul 11, 2024

I don't think Flores is in the leaderboard yet, since we haven't shown MMTEB. As long as we have the same split for all the models we want to show, I think keeping only the one split is fine.

From their paper:

> FLORES-101 is divided into dev, devtest, and test. The dev set is meant to be used for hyperparameter tuning. The devtest is meant to be used for testing purposes during the development phase. The test set will not be released, but is available via a publicly available evaluation server, while the dev and devtest are publicly downloadable.

So I think keeping only devtest is reasonable for the leaderboard. Thanks for digging into this @KennethEnevoldsen!

@KennethEnevoldsen merged commit 3231a01 into main on Aug 12, 2024 (1 check failed).
@KennethEnevoldsen deleted the updateres branch on Aug 12, 2024.