[MODEL EVALUATION REQUEST] allenai/OLMo-2-1124-13B #658
Tried to run this, but getting the following error: [error output not preserved]
@Mikeriess The reason for that was a setting they hadn't configured. Can you try reinstalling ScandEval from the main branch:

```
pip uninstall -y -qqq scandeval && pip install -qqq "scandeval[all]@git+https://github.com/ScandEval/ScandEval"
```
@saattrupdan Using your latest: [error output not preserved]
@Mikeriess Hopefully the "AutoModelForSequenceClassification" part of your error message has now changed to "AutoModelForCausalLM", meaning that the model is being detected as a generative model. The last part remaining is updating vLLM manually, as I've currently set a hard upper bound on the vLLM version.
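For reference, the distinction mentioned here comes down to which auto class the model's declared architecture maps to. A minimal sketch of such a detection heuristic (illustrative only, not ScandEval's actual detection code):

```python
def is_generative(architectures):
    """Heuristic sketch (hypothetical, not ScandEval's actual logic):
    treat a model as generative if any declared architecture is a causal LM."""
    return any(arch.endswith("ForCausalLM") for arch in architectures or [])

# OLMo-2 declares a causal LM architecture, so it should be loaded via
# AutoModelForCausalLM rather than AutoModelForSequenceClassification.
print(is_generative(["Olmo2ForCausalLM"]))               # True
print(is_generative(["BertForSequenceClassification"]))  # False
```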
Not supported yet, it seems: [error output not preserved]
Can you try manually updating vLLM and running it again?
Updating vLLM now.
Here you go @saattrupdan - this was surprisingly slow considering its size :-) [results file attached]
Thanks @Mikeriess! Results are live now. I've noticed that some of my own evaluations have become slow as well; I'm not sure if it's due to an update in one of the packages. Let me know if it persists for other models 🙂
Will do 👌 I was using 8x H100s for this (nvidia-smi showed full utilization across all GPUs during benchmarking), yet it took about as long as a 70B model would have in previous evaluations 🤔 I could try comparing benchmark time across vLLM versions on e.g. Llama-3.1-8B?
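A comparison like the one proposed could be as simple as a wall-clock wrapper around each run (a sketch; the function name and structure are illustrative, not part of ScandEval):

```python
import time

def time_benchmark(run):
    # Wall-clock timing wrapper (sketch) for comparing the same benchmark
    # run under different vLLM versions, as suggested above.
    start = time.perf_counter()
    run()
    return time.perf_counter() - start

# Usage: call time_benchmark with the actual evaluation under each
# vLLM version and compare the elapsed times.
elapsed = time_benchmark(lambda: sum(range(10_000)))
print(elapsed >= 0.0)  # True
```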
@Mikeriess Yes, that would be helpful! Also check whether it's due to particular tasks. For instance, the NER task is sometimes the culprit, as that's the only task for which we're using structured generation.
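To illustrate why NER is the special case: structured generation constrains the model's output to a fixed JSON shape, which is what exercises the Outlines/XGrammar backends. A hypothetical sketch of the kind of structure enforced (the entity-type keys are illustrative, not ScandEval's actual schema):

```python
import json

# Hypothetical entity-type keys; the real schema lives in ScandEval.
schema_keys = {"person", "location", "organisation", "miscellaneous"}

# Structured generation forces the model to emit exactly this JSON shape,
# rejecting any token sequence that would break it.
raw_output = '{"person": ["Anna"], "location": ["Oslo"], "organisation": [], "miscellaneous": []}'
parsed = json.loads(raw_output)
print(set(parsed) == schema_keys)  # True
```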
Decided to test with gemma-2-2b instead in the interest of time :-) Currently running the baseline; I will probably know more tomorrow 👍
It ran fine with the base ScandEval install, but with the latest vLLM version the kernel died after the first 8 benchmarks; I'm re-running now to see whether it was a coincidence.
The kernel has now died a third time, again at the same point. This was the sequence I ran: [commands not preserved]
It seems like the latest ScandEval and vLLM cause some stability issues on NER. However, I am able to resume after a crash, and I can see that the NER process is extremely slow. Unfortunately my time measurements were ruined by the kernel repeatedly dying.
@Mikeriess Yeah, the crashing is a known issue with Outlines (dottxt-ai/outlines#1351), which happens because newer versions of vLLM use newer versions of Outlines and so hit the bug. That's why I added the vLLM upper bound in the first place, and I see it hasn't been fixed yet. One solution could be to change the structured generation backend to XGrammar, which is super fast and reliable, but unfortunately it is currently missing features that we're using in the benchmark (mlc-ai/xgrammar#192, mlc-ai/xgrammar#131, mlc-ai/xgrammar#104), causing significantly worse evaluation results. They say they're working on it, though. So we're at an annoying point where we can't really use the new versions of vLLM, as the structured generation packages aren't good enough yet. Outlines is the lesser evil at the moment, since it is feature complete but just requires a million GBs of memory (causing your crash). This is only relevant for newer model architectures, however, as the older ones are still supported by the older vLLM versions. The joy of relying on external packages, eh? 🤷
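The "hard upper bound" discussed here is just an exclusive version ceiling on the vLLM dependency. A self-contained sketch of the comparison involved (the version numbers are illustrative, not the actual bound ScandEval pins):

```python
def within_bound(version: str, upper: str) -> bool:
    # Compare dotted version strings numerically; the upper bound is
    # exclusive, mirroring a dependency pin like "vllm<X.Y". Real tools
    # use the `packaging` library for this; this is a toy comparison.
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(version) < as_tuple(upper)

print(within_bound("0.6.3", "0.7.0"))  # True: inside the pin
print(within_bound("0.7.1", "0.7.0"))  # False: rejected by the pin
```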
@saattrupdan Haha, yeah, it's great. But your hands are tied, in my eyes; I only see waiting as an option, unless we re-evaluate a ton of models 😅 In the meantime we can focus on fine-tunes of existing architectures in the various languages, I guess. I still think that's somewhat interesting to see.
Hello! What do you think about this approach? [proposal details not preserved]
Hi there! I would be wary of building our own custom grammars, as it would be another component that we would need to maintain over the years. Since the XGrammar team already seem to be working on this, it would probably be more beneficial to contribute directly to that repo, to ensure the longevity of the feature.
Model ID
allenai/OLMo-2-1124-13B
Model type
Decoder model (e.g., GPT)
Evaluation languages
Merged model
Not a merged model