Reproducing repllama on passage retrieval. #152

Open
hengran opened this issue Sep 19, 2024 · 4 comments
@hengran

hengran commented Sep 19, 2024

Hello, thanks for sharing the great work!
I downloaded the RepLLaMA checkpoint from Hugging Face, encoded the queries and passages with it, and got MRR@10 = 39.69, while the paper reports 41.2.

p_max_len=512
q_max_len=512
# encode the dev queries
CUDA_VISIBLE_DEVICES=1 python -m tevatron.retriever.driver.encode \
  --output_dir=temp \
  --model_name_or_path $model_path \
  --lora_name_or_path $lora_model_save_path \
  --normalize \
  --encode_is_query \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --passage_max_len $p_max_len \
  --pooling eos \
  --append_eos_token \
  --query_max_len $q_max_len \
  --dataset_path $dev_query_path \
  --encode_output_path $encode_path/dev_query_emb.pkl

sleep 10s
echo "====== encode corpus"
for s in 0 1 2 3 4 5 6 7;
do
gpuid=$s
CUDA_VISIBLE_DEVICES=$gpuid python -m tevatron.retriever.driver.encode \
  --output_dir=temp \
  --model_name_or_path $model_path \
  --lora_name_or_path $lora_model_save_path \
  --normalize \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --pooling eos \
  --append_eos_token \
  --passage_max_len $p_max_len \
  --dataset_path $corpus_path \
  --query_max_len $q_max_len \
  --dataset_number_of_shards 8 \
  --dataset_shard_index ${s} \
  --encode_output_path $encode_path/corpus_emb.${s}.pkl   &
  # wait for the last loop iteration (shard 7) to finish
  if [ "$s" == "7" ]; then
      wait
  fi
done
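
For reference, here is a minimal sketch of the search and scoring steps that would follow the encoding above, assuming Tevatron's search driver and pyserini's trec_eval wrapper (the run-file names are placeholders, not necessarily the exact commands used):

echo "====== search"
python -m tevatron.retriever.driver.search \
  --query_reps $encode_path/dev_query_emb.pkl \
  --passage_reps "$encode_path/corpus_emb.*.pkl" \
  --depth 1000 \
  --batch_size 64 \
  --save_text \
  --save_ranking_to $encode_path/run.dev.txt

# convert the run to TREC format and compute MRR@10 on the MS MARCO passage dev set
python -m tevatron.utils.format.convert_result_to_trec \
  --input $encode_path/run.dev.txt \
  --output $encode_path/run.dev.trec
python -m pyserini.eval.trec_eval -c -M 10 -m recip_rank \
  msmarco-passage-dev-subset $encode_path/run.dev.trec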
@hengran
Author

hengran commented Sep 19, 2024

Since the paper only reports Recall@1000, I want to reproduce the model to obtain Recall@100 and Recall@50, but I've found that my reproduction results differ from the paper's. It's possible that the parameters I've set for the reproduction are not correct.
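
For completeness, once a TREC-format run file is available (e.g. from a conversion step like the one sketched above), Recall@50/100/1000 can be read off with trec_eval cutoffs; a sketch using pyserini's wrapper and a placeholder run file:

python -m pyserini.eval.trec_eval -c -m recall.50 -m recall.100 -m recall.1000 \
  msmarco-passage-dev-subset $encode_path/run.dev.trec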

@MXueguang
Contributor

Hi @hengran, are you using the corpus downloaded from tevatron?

@MXueguang
Contributor

I used qlen=32 and plen=156 during training and encoding for repllama. Not sure if this could have made that difference.
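
If the length mismatch is the cause, re-encoding with matching lengths would presumably look like the following (a sketch reusing the variables from the script above):

q_max_len=32
p_max_len=156
# then rerun the same tevatron.retriever.driver.encode commands with
#   --query_max_len $q_max_len --passage_max_len $p_max_len
# so the encoding lengths match those used to train the checkpoint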

@hengran
Author

hengran commented Sep 28, 2024

I used qlen=32 and plen=156 during training and encoding for repllama. Not sure if this could have made that difference.

Thanks for your reply, I'll try it.
