
Updates #6

Merged: 1 commit merged into main from updateres on Aug 12, 2024

Conversation

@Muennighoff (Contributor) commented Jul 9, 2024

Feel free to merge if it looks good!

The file size makes me feel good that we're not putting these in the main mteb repo 😅

26M results/GritLM-7B/13f00a0e36500c80ce12870ea513846a066004af/FloresBitextMining.json

@KennethEnevoldsen (Contributor):

Oh wow. The file sizes even result in an error. @orionw, what is the best approach here?

@orionw (Contributor) commented Jul 10, 2024

I added a script to the repo, remove_spaces_from_large_json_files.py, that should be run before committing; it should lower the file size for files that are too large.
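In essence it just re-serializes the JSON without whitespace; a simplified sketch (the size threshold and the results directory here are illustrative, and the actual script may differ):

    import json
    from pathlib import Path

    MAX_SIZE_MB = 10  # assumed threshold, matching the LFS limit discussed below

    for path in Path("results").rglob("*.json"):
        if path.stat().st_size > MAX_SIZE_MB * 1024 * 1024:
            data = json.loads(path.read_text())
            # separators=(",", ":") drops all whitespace between JSON tokens
            path.write_text(json.dumps(data, separators=(",", ":")))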

I don't know if it will reduce the size that much, though. Why is the GritLM one so large? All the other FloresBitextMining results were about 13 MB and went down to about 8 MB with that script.

@KennethEnevoldsen (Contributor):

> I don't know if it will reduce the size that much, though. Why is the GritLM one so large? All the other FloresBitextMining results were about 13 MB and went down to about 8 MB with that script.

I believe the previous run on Flores did not include all the current splits (since the cross-lingual grid was added). It does seem a bit excessive at the moment, but the runtime is actually quite reasonable (since each language is only embedded once).

@orionw (Contributor) commented Jul 10, 2024

I see, thanks. Would it be possible to separate the results file by split, or craft some other solution for Flores? I think it's the only one with this problem. One potential solution: I could write a script that turns Flores files into a number of chunks (by split or some other method) and reads those chunks back in that specific format in the results (see the sketch below).

Otherwise we will need to think of a new solution, as the files have to be < 10 MB and non-binary for everything to work (Git LFS is required on HF Spaces for files over 10 MB and for binary files of any size).
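A hypothetical sketch of that chunking step (it assumes the results JSON keeps per-split scores under a top-level "scores" key; the actual file layout may differ):

    import json
    from pathlib import Path

    src = Path("results/GritLM-7B/13f00a0e36500c80ce12870ea513846a066004af/FloresBitextMining.json")
    data = json.loads(src.read_text())

    # Write one chunk per split, e.g. FloresBitextMining.dev.json
    for split, scores in data.get("scores", {}).items():
        chunk = {**data, "scores": {split: scores}}
        out = src.with_name(f"{src.stem}.{split}.json")
        out.write_text(json.dumps(chunk, separators=(",", ":")))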

@KennethEnevoldsen (Contributor):

Hmm, the simplest idea I can think of is to .gz everything that is larger than 10 MB, though we lose readability from that (though removing spaces plus the large file size already gets us there).

We might also consider only saving meaningful digits (0.821602063301282 -> 0.82160). This should also make normal results files more readable, so it might be a good thing regardless.
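As a sketch, that rounding could be applied recursively to every float in a loaded results file (five decimal places here, matching the example above; purely illustrative):

    def round_floats(obj, ndigits=5):
        """Recursively round all floats in a JSON-like structure,
        e.g. 0.821602063301282 -> 0.82160."""
        if isinstance(obj, float):
            return round(obj, ndigits)
        if isinstance(obj, dict):
            return {k: round_floats(v, ndigits) for k, v in obj.items()}
        if isinstance(obj, list):
            return [round_floats(v, ndigits) for v in obj]
        return obj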

Going beyond that, we might have a script that removes everything but the main score of the task (e.g., here it is f1). I would probably only do this for >10 MB files:

  {
-   "accuracy": 0.98828125,
    "f1": 0.9845377604166666,
    "hf_subset": "asm_Beng-ben_Beng",
    "languages": [
      "asm-Beng",
      "ben-Beng"
    ],
    "main_score": 0.9845377604166666,
-   "precision": 0.9827473958333333,
-   "recall": 0.98828125
  },

We could optionally also remove languages, since they can be reconstructed (task name -> task metadata, which together with hf_subset gives the languages), but at the moment this is a 100% valid mteb result.

All of these naturally only work up to a point (at which point we should probably take a look at the task).
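A hypothetical sketch of that stripping step (the keep-list is an assumption; as in the diff above, one could additionally keep the task's named main metric, e.g. f1):

    KEEP = {"main_score", "hf_subset", "languages"}  # assumed keep-list

    def keep_main_score_only(entry: dict) -> dict:
        """Drop every metric except the identifying fields and main_score."""
        return {k: v for k, v in entry.items() if k in KEEP}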

@orionw (Contributor) commented Jul 10, 2024

> Hmm, the simplest idea I can think of is to .gz everything that is larger than 10 MB, though we lose readability from that (though removing spaces plus the large file size already gets us there).

Unfortunately, gz files are binary and so have to be in Git LFS on Hugging Face Datasets/Spaces. :(

> We might also consider only saving meaningful digits (0.821602063301282 -> 0.82160). This should also make normal results files more readable, so it might be a good thing regardless.

I like this as a general idea anyway; we shouldn't need more than ~6 significant digits, and even that is very conservative. Between that and the language removal, though, I'm not 100% sure we could reduce the file to 1/3 of its size.

> Going beyond that, we might have a script that removes everything but the main score of the task (e.g., here it is f1)

I don't love the idea of removing results if we can help it, just in case someone wants to analyze other metrics.

@KennethEnevoldsen do you think separating by split is too awkward?

If this ends up being too difficult, we could do without a GitHub mirror, but if it's just the Flores file I think I'm fine with creating a workaround hack for it.

@KennethEnevoldsen (Contributor):

> @KennethEnevoldsen do you think separating by split is too awkward?

Hmm yeah, this is a good question. Generally, I would want all the scores that make up the "main_score" within one file; if the main score is computed across two splits, I would keep both splits in that file.

You can naturally run a task on the "train" set without it being required. Therefore I have added:

    def validate_and_filter_scores(self):
        """Ensure that the scores are correct for the given task by removing
        any splits besides those specified in the task metadata. Additionally,
        ensure that all required splits, as well as the languages, are present
        in the scores.
        """

which removes all unrequired splits (and checks that all required splits are present).

Currently in PR: https://github.com/embeddings-benchmark/mteb/pull/1038/files#diff-7e2303fbc5dbdaea469f570ad2c469a88773c92be19827925dd19ade83c04b66
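A rough sketch of the filtering logic (attribute names here are assumptions; the real implementation is in the linked PR):

    def validate_and_filter_scores(self):
        required = set(self.task.metadata.eval_splits)  # assumed attribute path
        missing = required - set(self.scores)
        if missing:
            raise ValueError(f"Missing required splits: {missing}")
        # Drop any extra splits, e.g. an optional "train" run
        self.scores = {split: s for split, s in self.scores.items() if split in required}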

@KennethEnevoldsen (Contributor):

Okay, so I did an experiment here:

❯ du -h FloresBitextMining*
 28M    FloresBitextMining.json
 23M    FloresBitextMining_six_dec.json
 15M    FloresBitextMining_small_six_dec_and_no_spaces.json

That does not quite get us there, but I see that we have both dev and devtest, which seems over the top. Keeping only one of the splits:

❯ du -h FloresBitextMining*
 14M    FloresBitextMining.json
 11M    FloresBitextMining_six_dec.json
8.0M    FloresBitextMining_small_six_dec_and_no_spaces.json

The question is whether that leads to a problem for the leaderboard (I could not find cases where it was used).

I have made the required changes here:

embeddings-benchmark/mteb#1075

@orionw (Contributor) commented Jul 11, 2024

I don't think Flores is in the leaderboard yet, since we haven't shown MMTEB. As long as we have the same split for all the models we want to show, I think keeping only the one split is fine.

From their paper:

> FLORES-101 is divided into dev, devtest, and test. The dev set is meant to be used for hyperparameter tuning. The devtest is meant to be used for testing purposes during the development phase. The test set will not be released, but is available via a publicly available evaluation server, while the dev and devtest are publicly downloadable.

So I think keeping only devtest is reasonable for the leaderboard. Thanks for digging into this @KennethEnevoldsen!

@KennethEnevoldsen merged commit 3231a01 into main on Aug 12, 2024 (1 check failed).
@KennethEnevoldsen deleted the updateres branch on Aug 12, 2024.