Updates #6
Conversation
Oh wow. The file sizes even result in an error. @orionw, what is the best approach here?
I added a script to the repo. I don't know if it will reduce the size that much, though -- why is the GritLM one so large? All the other FloresBitextMining ones were about 13 MB and went to 8 MB or so with that script.
I believe the previous run on Flores did not include all of the current splits (since the cross-lingual grid was added). It does seem a bit excessive at the moment, but the runtime is actually quite reasonable (since each language is only embedded once).
I see, thanks. Would it be possible to separate the results file by split, or craft some other solution for Flores? I think it's the only one with this problem. One potential solution: I could write a script that turns Flores files into a number of chunks (by split or some other method) and reads those chunks back in that specific format in the results. Otherwise we will need to think of a new solution, as the files have to be < 10 MB and non-binary for everything to work (Git LFS is required for HF Spaces at 10 MB, or for binary files of any size).
Hmm, the simplest idea I can think of is to .gz everything larger than 10 MB, though we lose readability that way (though removing spaces plus a compact format already gets us there). We might also consider only saving meaningful digits (0.821602063301282 -> 0.82160). This should also make normal results files more readable, so it might be a good thing regardless. Going beyond that, we might have a script that removes everything but the main score of the task (e.g., here it is f1). I would probably only do this for >10 MB files: {
- "accuracy": 0.98828125,
"f1": 0.9845377604166666,
"hf_subset": "asm_Beng-ben_Beng",
"languages": [
"asm-Beng",
"ben-Beng"
],
"main_score": 0.9845377604166666,
- "precision": 0.9827473958333333,
- "recall": 0.98828125
},
We could optionally also remove languages, as they can be reconstructed (task name -> task metadata -(along with hf_subset)-> languages), but at the moment this is a 100% valid mteb result. All of these naturally only work up to a point (at which point we should probably take a look at the task).
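For the meaningful-digits idea, a minimal sketch that walks a parsed results structure and rounds every float (field names in the example call are illustrative, not the full schema):

```python
def round_floats(obj, ndigits: int = 5):
    """Recursively round every float in a parsed JSON structure.

    Rounds 0.821602063301282 to 5 decimal places; strings, ints, and
    other values pass through unchanged.
    """
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, dict):
        return {k: round_floats(v, ndigits) for k, v in obj.items()}
    if isinstance(obj, list):
        return [round_floats(v, ndigits) for v in obj]
    return obj
```

Running this over a loaded results file before re-serializing it (with compact separators) is where most of the size reduction would come from.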
Unfortunately, gz files are binary and so have to go through Git LFS in Hugging Face Datasets/Spaces. :(
I like this as a general idea anyway; we shouldn't need more than ~6 significant digits, and even that is very conservative. Between that and the language removal, though, I'm not 100% sure we could reduce the file to 1/3 of its size.
I don't love the idea of removing results if we can help it, just in case someone wants to analyze other metrics. @KennethEnevoldsen, do you think separating by split is too awkward? If this ends up being too difficult we can go without a GitHub mirror, but if it's just the Flores file I think I'm fine creating a workaround hack for it.
Hmm, yeah, this is a good question. Generally, I would want all the scores that make up the "main_score" within one file; if that spans two splits, I would probably keep them together. You can naturally run a task on the "train" set without it being required. Therefore I have added the:
which removes all unrequired splits (and checks that all required splits are present). Currently in this PR: https://github.com/embeddings-benchmark/mteb/pull/1038/files#diff-7e2303fbc5dbdaea469f570ad2c469a88773c92be19827925dd19ade83c04b66
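Not the actual PR code, but roughly the kind of filtering I mean (a hypothetical helper; split names are illustrative):

```python
def filter_splits(scores: dict, required: set) -> dict:
    """Drop splits that are not required, and verify all required ones exist.

    `scores` maps split name -> split results; `required` is the set of
    splits the task's main_score needs (e.g. {"devtest"}).
    """
    missing = required - scores.keys()
    if missing:
        raise ValueError(f"missing required splits: {sorted(missing)}")
    return {split: res for split, res in scores.items() if split in required}
```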
Okay, so I did an experiment here:
That does not quite get us there, but I see that we have both dev and devtest, which seems over the top.
The question is whether that leads to a problem for the leaderboard (I could not find cases where it was used). I have made the required changes here:
I don't think Flores is in the leaderboard yet, since we haven't shown MMTEB. As long as we have the same split for all the models we want to show, I think keeping only the one is fine. From their paper:
So I think keeping only
Feel free to merge if it looks good!
The file size makes me feel good that we're not putting these in the main mteb repo 😅
26M results/GritLM-7B/13f00a0e36500c80ce12870ea513846a066004af/FloresBitextMining.json