
How to use Factor to evaluate instruction-tuned LLMs? #2

Open
HillZhang1999 opened this issue Nov 2, 2023 · 3 comments

@HillZhang1999

Dear authors,

Firstly, I would like to express my gratitude for your exceptional work.

Recently, I attempted to use FACTOR to evaluate instruction-tuned models such as llama2-chat. However, I observed that FACTOR's evaluation format is primarily designed for text completion, making it better suited to base models than to instruction-tuned models.

To give the SFT models an explicit instruction, I experimented with prompts such as "Please complete the following text." However, their performance still lags behind that of the base models, which differs from my results on other benchmarks such as TruthfulQA.
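For reference, my wrapping looks roughly like the sketch below; the system message and exact instruction wording are only illustrative, not a fixed recipe.

# Rough sketch of how I wrap a FACTOR-style completion prefix for llama2-chat.
# The system message and instruction wording are illustrative only.
def build_chat_prompt(prefix: str) -> str:
    system = "You are a helpful assistant."
    instruction = f"Please complete the following text:\n{prefix}"
    # Llama-2-chat instruction format: [INST] <<SYS>> ... <</SYS>> user message [/INST]
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST]"

print(build_chat_prompt("The Eiffel Tower is located in"))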

I would greatly appreciate any insights or suggestions you may have. Thank you!

@dorm-ai21 (Collaborator) commented Nov 5, 2023

Hey @HillZhang1999

Thanks for your interest in our paper!

We have recently added Expert-FACTOR, based on ExpertQA (https://arxiv.org/abs/2309.07852), a long-form question answering dataset. In order to adapt the QA task to text completion, we first concatenated each question-answer pair into a single document, and then ran the FACTOR data pipeline. Since this benchmark evaluates factuality in a more task-specific scenario, it may be better suited for evaluating instruction-tuned models.
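Roughly, the concatenation step looks like the sketch below (the field names and file layout are illustrative, not the exact ExpertQA schema):

import json

def qa_pairs_to_docs(expertqa_path):
    """Concatenate each question-answer pair into a single document
    that the FACTOR data pipeline can treat as plain text."""
    docs = []
    with open(expertqa_path) as f:
        for line in f:  # assuming one JSON record per line
            record = json.loads(line)
            docs.append(f"Question: {record['question']}\nAnswer: {record['answer']}")
    return docs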

@HillZhang1999 (Author)

Thanks a lot! I will try them asap.

@ishaan-jaff

Hi @HillZhang1999 @dorm-ai21, if you're still trying to benchmark / eval LLMs:

I'm the maintainer of LiteLLM. I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it does not).

Try it here: https://docs.litellm.ai/docs/simple_proxy
https://github.com/BerriAI/litellm

Using LiteLLM Proxy Server

Creating a proxy server

Ollama models

$ litellm --model ollama/llama2 --api_base http://localhost:11434

Hugging Face Models

$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder

Anthropic

$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

Palm

$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

Set the openai api base to the proxy:

openai.api_base = "http://0.0.0.0:8000"
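For example, with the pre-1.0 openai Python client (the model name is whatever you started the proxy with, e.g. ollama/llama2, and this assumes no auth is configured on the proxy):

import openai

openai.api_base = "http://0.0.0.0:8000"  # point the client at the LiteLLM proxy
openai.api_key = "anything"              # dummy key; the proxy handles provider auth

response = openai.ChatCompletion.create(
    model="ollama/llama2",
    messages=[{"role": "user", "content": "Complete this text: The capital of France is"}],
)
print(response["choices"][0]["message"]["content"])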

Using it to run an eval with lm-evaluation-harness:

python3 -m lm_eval \
  --model openai-completions \
  --model_args engine=davinci \
  --tasks crows_pairs_english_age
