Download the NQ test data from test data of NQ, the HotpotQA dev data from KILT, and the MS MARCO dev data from msmarco and msm-qa.
We directly use RocketQAv2 as the retriever on the Wikipedia-based NQ and HotpotQA datasets, and ADORE on the web-based MS MARCO dataset.
We use the entity substitution and generation method; a rough sketch of the substitution step is given below.
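The repository does not spell out the substitution procedure in this section, so the following is only a minimal sketch of one plausible entity-substitution step. The inputs `answer_entity` and `candidate_entities` are hypothetical names introduced for illustration; the actual construction of substituted passages may differ.

```python
# entity_substitution_sketch.py -- illustrative only, not the repo's implementation.
# Assumes each example provides the gold answer entity and a pool of candidate
# substitute entities of the same type (both are assumptions, not from the repo).
import random

def substitute_entity(passage: str, answer_entity: str, candidate_entities: list[str]) -> str:
    """Replace mentions of the answer entity with a randomly chosen substitute,
    yielding a counterfactual (noisy) passage."""
    substitutes = [e for e in candidate_entities if e != answer_entity]
    if answer_entity not in passage or not substitutes:
        return passage  # nothing to substitute
    replacement = random.choice(substitutes)
    return passage.replace(answer_entity, replacement)

if __name__ == "__main__":
    passage = "Barack Obama was born in Honolulu, Hawaii."
    print(substitute_entity(passage, "Honolulu", ["Chicago", "Boston"]))
```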
Filter the results returned by the existing retrievers: retrieved passages that do not contain the answer are treated as the noisy passages used for reference.
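A minimal sketch of this answer-containment filter is shown below. The JSON-lines layout (fields `question`, `answers`, `retrieved_passages`) and the file name `retrieval_results.jsonl` are assumptions for illustration; adapt them to the actual retriever output format.

```python
# filter_noisy_sketch.py -- rough sketch of the answer-containment filter described above.
import json

def contains_answer(passage: str, answers: list[str]) -> bool:
    """Simple case-insensitive substring check against all gold answers."""
    text = passage.lower()
    return any(ans.lower() in text for ans in answers)

def split_passages(example: dict) -> tuple[list[str], list[str]]:
    """Split retrieved passages into positives (contain an answer) and noisy ones."""
    positives, noisy = [], []
    for passage in example["retrieved_passages"]:
        (positives if contains_answer(passage, example["answers"]) else noisy).append(passage)
    return positives, noisy

if __name__ == "__main__":
    with open("retrieval_results.jsonl") as f:
        for line in f:
            example = json.loads(line)
            _, noisy = split_passages(example)
            # the noisy passages are kept as the noisy references for the benchmark
```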
We also provide the final GTI benchmark, which you can download from link.
Taking LLaMA 2-13B as an example, we demonstrate four testing methods: pointwise, pairwise, listwise-set, and listwise-rank. To test other models, simply replace the model.
python llama2-point.py
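For orientation, here is a minimal sketch of what a pointwise judging loop looks like; the actual prompts, parsing, and batching live in llama2-point.py. The checkpoint name and prompt wording below are assumptions, not copied from the script.

```python
# pointwise_sketch.py -- illustrative pointwise judging loop; see llama2-point.py
# for the real implementation. Model name and prompt wording are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def pointwise_judge(question: str, passage: str) -> str:
    """Ask the model, for a single (question, passage) pair, whether the passage
    answers the question -- the defining trait of the pointwise setting."""
    prompt = (
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Does the passage contain the answer to the question? Answer Yes or No.\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()
```

The pairwise and listwise variants differ only in how many passages are placed in the prompt and how the model's preference is parsed; the corresponding scripts follow the same pattern.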