
Reproducing HF model Summaries #80

Open
Noor-Nizar opened this issue Nov 27, 2024 · 2 comments

Comments


Noor-Nizar commented Nov 27, 2024

I'm trying to reproduce the summaries generated by HF models, namely Phi-2 and Llama 3.2-1B Instruct, since the results I'm getting when following the described prompt/pipeline are not close to those on the leaderboard. Comparing the summaries I'm generating with those in the HF dataset, I found a large difference. For example, one thing these models struggle with that I don't see in the HF dataset is sentence repetition.

So my question is: what generation config is used for the Hugging Face models? I'm currently using the text-generation pipeline with do_sample=False (as I found mentioned in another issue that a temperature of 0 was used). If the code can be provided, it would also be helpful for seeing what gives rise to this variation in results.
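As a minimal sketch of the greedy-decoding setup described above: with a Hugging Face text-generation pipeline, do_sample=False selects greedy decoding, which is the deterministic equivalent of temperature 0. The helper name, model id, and token limit below are assumptions for illustration, not the leaderboard's actual code.

```python
def greedy_generation_kwargs(max_new_tokens: int = 250) -> dict:
    """Generation kwargs for deterministic (greedy) decoding with a
    Hugging Face text-generation pipeline. With do_sample=False the
    temperature/top_p settings are irrelevant, so runs are reproducible."""
    return {
        "do_sample": False,        # greedy decoding (temperature -> 0 equivalent)
        "max_new_tokens": max_new_tokens,
        "num_return_sequences": 1,
    }

kwargs = greedy_generation_kwargs()

# Usage (requires `transformers` installed and the model downloaded;
# the model id is an assumption):
# from transformers import pipeline
# generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
# summary = generator(prompt, **kwargs)[0]["generated_text"]
```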

Edit: I also can't reproduce the leaderboard score for Llama 3.2-1B using the generated summaries in the linked HF dataset. This is because:

1 - I don't know what threshold was used to determine whether a response is hallucinated / consistent. Edit: I will use top_k = 1.

2 - The dataset includes the omitted samples (its length is 1006, not ~850).
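On point 1, one common convention (an assumption here, not something confirmed by the maintainers) is to threshold the judge model's consistency score at 0.5 and count everything below it as hallucinated:

```python
def hallucination_rate(scores, threshold=0.5):
    """Fraction of summaries judged hallucinated, i.e. whose consistency
    score (e.g. from an HHEM-style judge) falls below the assumed threshold."""
    flagged = [s for s in scores if s < threshold]
    return len(flagged) / len(scores)

# Example with made-up scores: 0.2 and 0.4 fall below 0.5,
# so 2 of 4 summaries are flagged as hallucinated.
rate = hallucination_rate([0.9, 0.2, 0.75, 0.4])  # -> 0.5
```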


Miaoranmmm commented Dec 3, 2024

Hi @Noor-Nizar, thanks for your interest in our leaderboard.

For the summary generation:

  • Phi-2 is accessed via LiteLLM Python SDK HuggingFace API with temperature=0.0.
  • Llama 3.2 1B is accessed via Together AI chat endpoint with temperature=0 and max_tokens=250.

Please note that the leaderboard is scored with the HHEM-2.1 model, which excels at hallucination detection but is not open-sourced. While we offer HHEM-2.1-Open as an open-source alternative, it may produce slightly different results.

@forrestbao
Contributor

@Noor-Nizar Please let us know whether we have answered your question.
