|   | metrics | score |
|---|---|---|
| 0 | context_relevance | 0.200000 |
| 1 | faithfulness | 0.611111 |
| 2 | sas | 0.546086 |

|   | questions | contexts | true_answers | predicted_answers | context_relevance | faithfulness | sas |
|---|---|---|---|---|---|---|---|
| 0 | What are the two main tasks BERT is pre-traine... | [Document(id=1996eb783b7e2934527de00e3d5f82fb5... | Masked LM (MLM) and Next Sentence Prediction (... | The two main tasks BERT is pre-trained on are ... | 0 | 1.000000 | 0.552495 |
| 1 | What model sizes are reported for BERT, and wh... | [Document(id=8906a653a71ec55161d5f8c6203335456... | BERTBASE (L=12, H=768, A=12, Total Parameters=... | The BERT model sizes reported are:\n\n1. **BER... | 0 | 0.000000 | 0.664142 |
| 2 | How does BERT's architecture facilitate the us... | [Document(id=320d3c00ef93938ee6cc92f6a742ba1ed... | BERT uses a multi-layer bidirectional Transfor... | BERT's architecture facilitates the use of a u... | 0 | 1.000000 | 0.817575 |
| 3 | Can you describe the modifications LLaMA makes... | [Document(id=f360dea1ec15f8f778718ae1e13eb855b... | LLaMA incorporates pre-normalization (using R... | None | 0 | 0.000000 | 0.015276 |
| 4 | How does LLaMA's approach to embedding layer o... | [Document(id=f360dea1ec15f8f778718ae1e13eb855b... | LLaMA introduces optimizations in its embeddin... | None | 0 | 0.000000 | 0.075397 |
| 5 | How were the questions for the multitask test ... | [Document(id=9415e713cf73ffea5ca383126c54f7ec4... | Questions were manually collected by graduate ... | The questions for the multitask test were manu... | 0 | 1.000000 | 0.652526 |
| 6 | How does BERT's performance on the GLUE benchm... | [Document(id=606c67eb5eeb136ad77616d2ef06a580b... | BERT achieved new state-of-the-art on the GLUE... | BERT significantly outperforms all previous st... | 0 | 0.833333 | 0.857448 |
| 7 | What significant improvements does BERT bring ... | [Document(id=4ca8419f5c01c094bbda9617b3ce328cb... | BERT set new records on SQuAD v1.1 and v2.0, s... | BERT brings substantial improvements to the SQ... | 0 | 1.000000 | 0.586361 |
| 8 | What unique aspect of the LLaMA training datas... | [Document(id=236e5c1e3c782e68912426a7f2543710c... | LLaMA's training dataset is distinctive for b... | The unique aspect of the LLaMA training datase... | 0 | 0.666667 | 0.962779 |
| 9 | What detailed methodology does LLaMA utilize t... | [Document(id=9885fbffa74c564acd7a255e8b66a3343... | LLaMA's methodology for ensuring data diversit... | None | 0 | 0.000000 | -0.005470 |
| 10 | What are the specific domains covered by the m... | [Document(id=9415e713cf73ffea5ca383126c54f7ec4... | The test covers 57 subjects across STEM, human... | The specific domains covered by the multitask ... | 1 | 1.000000 | 0.620999 |
| 11 | What specific enhancements are recommended for... | [Document(id=ac7c3c2e29e31cf47dc1027f7d31ea94d... | Enhancements should focus on developing models... | The context does not provide specific enhancem... | 0 | 1.000000 | 0.370423 |
| 12 | What methodology does DetectGPT use to generat... | [Document(id=a862c889a8c02afa59e422bc2cbeb2425... | DetectGPT generates minor perturbations using ... | DetectGPT generates minor perturbations in the... | 1 | 0.666667 | 0.734830 |
| 13 | Discuss the significance of DetectGPT's detect... | [Document(id=ef8ff80b74a24f6cec05be8135930ba1b... | DtectGPT's approach is significant as it provi... | DetectGPT's detection approach is significant ... | 0 | 1.000000 | 0.508008 |
| 14 | How is the student model, DistilBERT, initiali... | [Document(id=33d936e116b7764ce538130aaa40c7b37... | DistilBERT is initialized from the teacher mod... | The student model, DistilBERT, is initialized ... | 1 | 0.000000 | 0.778503 |
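
The aggregate scores in the first table appear to be the unweighted mean of the per-question scores in the detailed table. The sketch below checks that reading with the values copied by hand from the 15 rows above. The variable and function names (`context_relevance`, `aggregate`, and so on) are illustrative assumptions and not part of any evaluation library's API.

```python
from statistics import fmean

# Per-question scores copied from the detailed table above (rows 0-14).
context_relevance = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
faithfulness = [
    1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.833333, 1.0,
    0.666667, 0.0, 1.0, 1.0, 0.666667, 1.0, 0.0,
]
sas = [
    0.552495, 0.664142, 0.817575, 0.015276, 0.075397,
    0.652526, 0.857448, 0.586361, 0.962779, -0.005470,
    0.620999, 0.370423, 0.734830, 0.508008, 0.778503,
]


def aggregate(scores):
    """Hypothetical helper: unweighted mean, rounded to the table's 6 decimals."""
    return round(fmean(scores), 6)


aggregated = {
    "context_relevance": aggregate(context_relevance),
    "faithfulness": aggregate(faithfulness),
    "sas": aggregate(sas),
}
print(aggregated)
```

Under this assumption, the three questions with `None` as the predicted answer pull both `faithfulness` and `sas` down noticeably, since each contributes a score at or near zero to the mean.
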

|   | metrics | score |
|---|---|---|
| 0 | metric_context_relevance | 0.266667 |
| 1 | metric_sas | 0.537721 |
| 2 | metric_faithfulness | 0.747778 |

|   | metrics | score |
|---|---|---|
| 0 | metric_context_relevance | 0.266667 |
| 1 | metric_sas | 0.654073 |
| 2 | metric_faithfulness | 0.796429 |

|   | questions | contexts | responses | ground_truth_answers | rag_eval_metric_context_relevance | rag_eval_metric_sas | rag_eval_metric_faithfulness | harness_eval_run_gpt4_metric_context_relevance | harness_eval_run_gpt4_metric_sas | harness_eval_run_gpt4_metric_faithfulness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the two main tasks BERT is pre-traine... | [pre-trained with Ima-\ngeNet (Deng et al., 20... | The two main tasks BERT is pre-trained on are ... | Masked LM (MLM) and Next Sentence Prediction (... | 0 | 0.593595 | 1.000000 | 0 | 0.220820 | 1.000000 |
| 1 | What model sizes are reported for BERT, and wh... | [the\ntraining loss for 336M and 752M BERT mod... | The model sizes reported for BERT and their sp... | BERTBASE (L=12, H=768, A=12, Total Parameters=... | 0 | 0.626480 | 1.000000 | 0 | 0.762167 | 1.000000 |
| 2 | How does BERT's architecture facilitate the us... | [BERT: Pre-training of Deep Bidirectional Tran... | BERT's architecture facilitates the use of a u... | BERT uses a multi-layer bidirectional Transfor... | 1 | 0.878212 | 1.000000 | 1 | 0.697250 | 1.000000 |
| 3 | Can you describe the modifications LLaMA makes... | [to the transformer\narchitecture (Vaswani et ... | None | LLaMA incorporates pre-normalization (using R... | 0 | 0.015276 | 0.000000 | 0 | 0.563944 | 0.857143 |
| 4 | How does LLaMA's approach to embedding layer o... | [to the transformer\narchitecture (Vaswani et ... | None | LLaMA introduces optimizations in its embeddin... | 0 | 0.075397 | 0.000000 | 0 | 0.626173 | 1.000000 |
| 5 | How were the questions for the multitask test ... | [of subjects that either do not neatly fit into... | The questions for the multitask test were manu... | Questions were manually collected by graduate ... | 0 | 0.639905 | 0.800000 | 0 | 0.611838 | 1.000000 |
| 6 | How does BERT's performance on the GLUE benchm... | [GLUE provides a lightweight classification API... | BERT significantly outperforms previous state-... | BERT achieved new state-of-the-art on the GLUE... | 0 | 0.808857 | 1.000000 | 0 | 0.853133 | 1.000000 |
| 7 | What significant improvements does BERT bring ... | [fine-tuning data shuffling and clas-\nsifier lay... | BERT brings significant improvements to the SQ... | BERT set new records on SQuAD v1.1 and v2.0, s... | 0 | 0.653101 | 1.000000 | 0 | 0.662145 | 0.375000 |
| 8 | What unique aspect of the LLaMA training datas... | [model, Gopher, has worse\nperformance than Ch... | LLaMA was trained exclusively on publicly avai... | LLaMA's training dataset is distinctive for b... | 0 | 0.894204 | 1.000000 | 0 | 0.949199 | 1.000000 |
| 9 | What detailed methodology does LLaMA utilize t... | [the description and satisfies the\ntest cases.... | None | LLaMA's methodology for ensuring data diversit... | 0 | -0.005470 | 0.000000 | 0 | 0.681471 | 0.000000 |
| 10 | What are the specific domains covered by the m... | [of subjects that either do not neatly fit into... | The specific domains covered by the multitask ... | The test covers 57 subjects across STEM, human... | 1 | 0.581956 | 0.666667 | 1 | 0.532457 | 0.714286 |
| 11 | What specific enhancements are recommended for... | [Published as a conference paper at ICLR 2021\... | The context does not provide specific enhancem... | Enhancements should focus on developing models... | 0 | 0.310243 | 1.000000 | 0 | 0.458608 | 1.000000 |
| 12 | What methodology does DetectGPT use to generat... | [of the data distribution on DetectGPT, partic... | DetectGPT generates minor perturbations in the... | DetectGPT generates minor perturbations using ... | 1 | 0.780353 | 1.000000 | 1 | 0.822207 | 1.000000 |
| 13 | Discuss the significance of DetectGPT's detect... | [different from the\nsource model, detection p... | DetectGPT's detection approach is significant ... | DtectGPT's approach is significant as it provi... | 0 | 0.491360 | 1.000000 | 0 | 0.566447 | 1.000000 |
| 14 | How is the student model, DistilBERT, initiali... | [works focus on building task-specific distilla... | The student model, DistilBERT, is initialized ... | DistilBERT is initialized from the teacher mod... | 1 | 0.722349 | 0.750000 | 1 | 0.803231 | 0.000000 |

|   | metrics | score |
|---|---|---|
| 0 | metric_sas | 0.574303 |
| 1 | metric_faithfulness | 0.780000 |
| 2 | metric_context_relevance | 0.400000 |