How to use Factor to evaluate instruction-tuned LLMs? #2
Comments
Hey @HillZhang1999, thanks for your interest in our paper! We have recently added Expert-FACTOR, based on ExpertQA (https://arxiv.org/abs/2309.07852), a long-form question-answering dataset. To adapt the QA task to text completion, we first concatenated each question-answer pair into a single document and then ran it through the FACTOR data pipeline. Since this benchmark evaluates factuality in a more task-specific scenario, it may be better suited for evaluating instruction-tuned models.
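For reference, a minimal sketch of that concatenation step, assuming the ExpertQA examples are JSONL records with hypothetical `question` and `answer` fields (the actual FACTOR pipeline and field names may differ):

```python
import json

def qa_pairs_to_docs(jsonl_path: str) -> list[str]:
    """Concatenate each question-answer pair into a single document,
    so it can be fed into the completion-style FACTOR data pipeline."""
    docs = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            # Hypothetical field names; adjust to the actual ExpertQA schema.
            docs.append(record["question"].strip() + "\n" + record["answer"].strip())
    return docs
```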
Thanks a lot! I will try them asap.
Hi @HillZhang1999 @dorm-ai21, if you're still trying to benchmark / eval LLMs: I'm the maintainer of LiteLLM. I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it does not).

Try it here: https://docs.litellm.ai/docs/simple_proxy

Using LiteLLM Proxy Server

Creating a proxy server

Ollama models
$ litellm --model ollama/llama2 --api_base http://localhost:11434

Hugging Face Models
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model claude-instant-1

Anthropic
$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

Palm
$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

Set api base to proxy

Using the proxy to run an eval on lm harness:
python3 -m lm_eval \
    --model openai-completions \
    --model_args engine=davinci \
    --task crows_pairs_english_age
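For the "set api base to proxy" step, here is a minimal sketch using the OpenAI Python client (>= 1.0). The base URL below is a placeholder; use whatever address `litellm` prints when the proxy starts:

```python
from openai import OpenAI

# Point the client at the local LiteLLM proxy instead of api.openai.com.
# The base_url is a placeholder; use the address printed by `litellm` on startup.
client = OpenAI(base_url="http://0.0.0.0:8000", api_key="anything")

response = client.chat.completions.create(
    model="ollama/llama2",  # whichever model the proxy was started with
    messages=[{"role": "user", "content": "Please complete the following text: The FACTOR benchmark evaluates"}],
)
print(response.choices[0].message.content)
```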
Dear authors,
Firstly, I would like to express my gratitude for your exceptional work.
Recently, I attempted to use FACTOR to evaluate instruction-tuned models, such as llama2-chat. However, I observed that FACTOR's evaluation format is primarily designed for text completion, making it more suitable for base models than for instruction-tuned models.
To adapt it for SFT models, I experimented with instruction prompts such as "Please complete the following text." However, their performance still falls behind that of the base models. This differs from the results I obtained on other benchmarks, such as TruthfulQA.
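For concreteness, a minimal sketch of the kind of prompt wrapping I tried, using the Llama-2 chat format (the instruction wording and function name are only illustrative):

```python
def wrap_for_llama2_chat(prefix: str) -> str:
    """Wrap a FACTOR-style completion prefix in a Llama-2-chat instruction prompt.
    The instruction text here is only an example of the prompts I tried."""
    instruction = "Please complete the following text.\n\n" + prefix
    # Llama-2 chat turns are delimited with [INST] ... [/INST]
    return f"[INST] {instruction} [/INST]"
```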
I would greatly appreciate any insights or suggestions you may have. Thank you!