I'm trying to reproduce the summaries generated by the HF models, namely Phi-2 and Llama 3.2-1B Instruct, since the results I'm getting by following the described prompt / pipeline are not close to those on the leaderboard. Comparing the summaries I'm generating with those in the HF dataset, I found a large difference. For example, one thing these models are struggling with in my runs, which I don't see in the HF dataset, is sentence repetition.
So my question is: what generation config is used for the Hugging Face models? I'm currently using the text-generation pipeline with do_sample=False (another issue mentioned that a temperature of 0 was used). If the generation code could be shared, that would also help me see what is giving rise to this variation in results.
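For reference, this is roughly what I'm running (a minimal sketch; the model ID, prompt, and token limit are placeholders for my actual setup):

```python
from transformers import pipeline

# Minimal sketch of my current setup; model ID and prompt are placeholders.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
)

out = pipe(
    "Provide a concise summary of the following passage: <passage here>",
    do_sample=False,          # greedy decoding, i.e. effectively temperature 0
    max_new_tokens=250,
    return_full_text=False,   # only return the generated summary, not the prompt
)
print(out[0]["generated_text"])
```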
Edit: I also can't reproduce the leaderboard score for Llama 3.2-1B using the generated summaries in the linked HF dataset, because:
1 - I don't know what threshold was used to determine whether a response is hallucinated / consistent. Edit: I will use top_k = 1.
2 - The dataset includes the omitted samples (its length is 1006, not ~850).
Hi @Noor-Nizar, thanks for your interest in our leaderboard.
For the summary generation:
Phi-2 is accessed via the LiteLLM Python SDK (Hugging Face provider) with temperature=0.0.
Llama 3.2 1B is accessed via the Together AI chat endpoint with temperature=0 and max_tokens=250.
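For reference, a minimal sketch of those two calls (the model slugs and prompt below are placeholders / assumptions, so adapt them to your environment):

```python
from litellm import completion   # pip install litellm
from together import Together    # pip install together

prompt = "Provide a concise summary of the following passage: <passage here>"

# Phi-2 via the LiteLLM Python SDK, Hugging Face provider, temperature 0.
# The "huggingface/<repo>" naming is LiteLLM's provider-prefix convention;
# you may need to set api_base if you host your own inference endpoint.
phi2_resp = completion(
    model="huggingface/microsoft/phi-2",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(phi2_resp.choices[0].message.content)

# Llama 3.2 1B via the Together AI chat endpoint, temperature 0, max_tokens 250.
client = Together()  # reads TOGETHER_API_KEY from the environment
llama_resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct-Turbo",  # exact slug on Together is an assumption
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=250,
)
print(llama_resp.choices[0].message.content)
```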
Please note that the leaderboard is scored with the HHEM-2.1 model, which excels at hallucination detection but is not open-sourced. While we offer HHEM-2.1-Open as an open-source alternative, it may produce slightly different results.
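If you want to score your own summaries with HHEM-2.1-Open, a minimal sketch following its model card looks like this (the 0.5 cut-off below is only an illustrative assumption, not necessarily the threshold used by the leaderboard):

```python
from transformers import AutoModelForSequenceClassification

# HHEM-2.1-Open is published as vectara/hallucination_evaluation_model on the Hub;
# trust_remote_code exposes its predict() helper for (premise, hypothesis) pairs.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [
    ("The source article text goes here.", "The model-generated summary goes here."),
]
scores = model.predict(pairs)  # consistency scores in [0, 1]; higher = more consistent

# Illustrative assumption: treat a summary as consistent if its score is >= 0.5.
for score in scores:
    print(float(score), "consistent" if float(score) >= 0.5 else "hallucinated")
```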