feat: added evaluation script #14
base: main
Conversation
Add WikiCLIR Retrieval Task
Add the GerDaLIR dataset
Add German STSBenchmark task
Add German XMarket dataset
Co-authored-by: Saba Sturua <[email protected]>
add paws x dataset
Add ir_datasets as dependency
Fix: Adding MTEB_SINGLE_GPU environment variable
Add GermanDPR dataset
feat: add miracl reranking task for german
Fixes mismatch between description and HuggingFace dataset
TASK_LIST = ["MIRACL", "GermanDPR", "PawsX", "GermanSTSBenchmark", "XMarket", "GerDaLIR", "WikiCLIR"]
MODELS = ['intfloat/multilingual-e5-base', 'intfloat/multilingual-e5-large', 'T-Systems-onsite/cross-en-de-roberta-sentence-transformer', 'sentence-transformers/distiluse-base-multilingual-cased-v2']
for model_name in MODELS:
    model = SentenceTransformer(model_name, device='cuda')
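For context, a minimal sketch of how such a loop could drive the MTEB runner (not part of the diff shown here; it assumes the mteb package is installed, that the German tasks added in this PR are registered under the names in TASK_LIST, and the output folder is only illustrative):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

TASK_LIST = ["MIRACL", "GermanDPR", "PawsX", "GermanSTSBenchmark", "XMarket", "GerDaLIR", "WikiCLIR"]
MODELS = [
    'intfloat/multilingual-e5-base',
    'intfloat/multilingual-e5-large',
    'T-Systems-onsite/cross-en-de-roberta-sentence-transformer',
    'sentence-transformers/distiluse-base-multilingual-cased-v2',
]

for model_name in MODELS:
    model = SentenceTransformer(model_name, device='cuda')
    # Restrict to German and run every task in the list for this model.
    evaluation = MTEB(tasks=TASK_LIST, task_langs=["de"])
    evaluation.run(model, output_folder=f"results/{model_name}")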
This automatically limits the max_seq_length to 512. If this is desired, then I think the MTEB scores we publish should also result from the same max_seq_length of 512, not 8k.
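If the 512 limit is to be kept consistent, a small sketch of how it can be inspected and pinned after loading (this relies on sentence-transformers' max_seq_length attribute; the model name here is only an example):

from sentence_transformers import SentenceTransformer

# Example model; the point is the max_seq_length attribute on the loaded model.
model = SentenceTransformer('intfloat/multilingual-e5-base', device='cuda')
print(model.max_seq_length)   # truncation limit applied at encode time
model.max_seq_length = 512    # pin it explicitly so published scores use the same limit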
The sentence-transformers/distiluse-base-multilingual-cased-v2 model actually uses a sequence length of 128. I'm not sure how large the positional embeddings even are for these models.
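One way to check both values, as a sketch (it assumes the first module of the SentenceTransformer wraps a Hugging Face Transformers model, which is the case for this checkpoint):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
print(model.max_seq_length)  # 128 for this checkpoint
# Size of the underlying positional embedding table of the wrapped transformer.
print(model._first_module().auto_model.config.max_position_embeddings)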