Results on the train section of the data, obtained by zero-shot prompting with the following prompt (example in Croatian; the English glosses in brackets are not part of the prompt):
You will be given a task. The task definition is in English, but the task itself is in another language. Here is the task!
Given the premise "Dječak je imao problema sa zakopčavanjem svoje košulje." [EN: "The boy had trouble buttoning his shirt."], and that we are looking for the result of this premise, which hypothesis is more plausible?
Hypothesis 1: "Nije htio nositi košulju." [EN: "He did not want to wear the shirt."].
Hypothesis 2: "Tražio je svoju majku da mu pomogne." [EN: "He asked his mother to help him."].
Answer only with "1" or "2".
Answer:
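For reference, a minimal sketch of how such a zero-shot query might be issued, here assuming the OpenAI Python client (the `make_zero_shot_prompt` helper is hypothetical and simply mirrors the template above; the actual evaluation harness may differ):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def make_zero_shot_prompt(premise: str, question: str, hyp1: str, hyp2: str) -> str:
    # "question" is "cause" or "effect" in the COPA data and selects the
    # wording of the relation being asked about.
    relation = "result" if question == "effect" else "cause"
    return (
        "You will be given a task. The task definition is in English, but "
        "the task itself is in another language. Here is the task!\n"
        f'Given the premise "{premise}", and that we are looking for the '
        f"{relation} of this premise, which hypothesis is more plausible?\n"
        f'Hypothesis 1: "{hyp1}".\n'
        f'Hypothesis 2: "{hyp2}".\n'
        'Answer only with "1" or "2".\n'
        "Answer:"
    )

response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    temperature=0,  # deterministic, terse answers
    max_tokens=2,
    messages=[{"role": "user", "content": make_zero_shot_prompt(
        "Dječak je imao problema sa zakopčavanjem svoje košulje.",
        "effect",
        "Nije htio nositi košulju.",
        "Tražio je svoju majku da mu pomogne.",
    )}],
)
print(response.choices[0].message.content)  # expected: "2"
```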
system | avg | copa-en.train | copa-hr.train | copa-mk.train | copa-mk.train.trans | copa-sl-cer.train | copa-sl.train | copa-sr-tor.train | copa-sr-tor.train.trans | copa-sr.train | copa-sr.train.trans |
---|---|---|---|---|---|---|---|---|---|---|---|
bigscience/mt0-xxl | 0.766 | 0.89 | 0.738 | 0.838 | 0.782 | 0.54 | 0.787 | 0.78 | 0.713 | 0.828 | 0.765 |
CohereForAI/aya-101 | 0.678 | 0.808 | 0.645 | 0.72 | 0.63 | 0.53 | 0.728 | 0.69 | 0.623 | 0.745 | 0.665 |
google/gemma-7b-it | 0.599 | 0.797 | 0.57 | 0.605 | 0.54 | 0.522 | 0.593 | 0.57 | 0.552 | 0.627 | 0.618 |
gpt-3.5-turbo-0125 | 0.756 | 0.922 | 0.82 | 0.745 | 0.67 | 0.547 | 0.802 | 0.693 | 0.745 | 0.787 | 0.83 |
gpt-4-0125-preview | 0.912 | 0.988 | 0.96 | 0.943 | 0.92 | 0.595 | 0.96 | 0.9 | 0.925 | 0.965 | 0.968 |
meta-llama/Llama-2-7b-chat-hf | 0.533 | 0.152 | 0.035 | 0.033 | 0.02 | 0.175 | 0.043 | 0.09 | 0.095 | 0.145 | |
mistralai/Mistral-7B-Instruct-v0.1 | 0.52 | 0.652 | 0.507 | 0.502 | 0.497 | 0.487 | 0.507 | 0.502 | 0.5 | 0.525 | 0.515 |
mistralai/Mistral-7B-Instruct-v0.2 | 0.508 | 0.723 | 0.542 | 0.497 | 0.448 | 0.285 | 0.515 | 0.507 | 0.487 | 0.542 | 0.537 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 0.67 | 0.875 | 0.705 | 0.665 | 0.632 | 0.405 | 0.682 | 0.68 | 0.637 | 0.71 | 0.713 |
tiiuae/falcon-7b-instruct | 0.49 | 0.463 | 0.357 | 0.515 | 0.485 | 0.5 | 0.398 | 0.51 | 0.407 | 0.458 | |
A note on the poor performance of meta-llama/Llama-2-7b-chat-hf (and of other models on hard datasets such as sl-cer): the model is often indecisive and gives no concrete answer, which we count as an incorrect answer; this is why accuracy can fall below the 0.5 random baseline.
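For concreteness, a minimal sketch of this scoring rule (the exact answer-parsing logic in the evaluation harness is an assumption):

```python
import re

def score(raw_outputs: list[str], gold: list[str]) -> tuple[float, float]:
    """Accuracy and unclear-response rate over one dataset.

    A response is treated as clear only if exactly one of "1"/"2" occurs
    in it; refusals, hedging, or naming both options count as unclear and
    score 0, which is how accuracy can drop below the 0.5 random baseline.
    Sketch of the rule described above, not the exact evaluation code.
    """
    correct = unclear = 0
    for raw, label in zip(raw_outputs, gold):
        digits = set(re.findall(r"[12]", raw))
        if len(digits) != 1:
            unclear += 1                 # indecisive -> counted as incorrect
        elif digits.pop() == label:
            correct += 1
    n = len(gold)
    return correct / n, unclear / n
```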
The rate of unclear responses per model and dataset is as follows.
system | copa-en.train | copa-hr.train | copa-mk.train | copa-mk.train.trans | copa-sl-cer.train | copa-sl.train | copa-sr-tor.train | copa-sr-tor.train.trans | copa-sr.train | copa-sr.train.trans |
---|---|---|---|---|---|---|---|---|---|---|
bigscience/mt0-xxl | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
CohereForAI/aya-101 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
gpt-3.5-turbo-0125 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
gpt-4-0125-preview | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
meta-llama/Llama-2-7b-chat-hf | 0.105 | 0.735 | 0.927 | 0.895 | 0.95 | 0.665 | 0.905 | 0.835 | 0.807 | 0.667 |
mistralai/Mistral-7B-Instruct-v0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
mistralai/Mistral-7B-Instruct-v0.2 | 0.028 | 0.05 | 0.072 | 0.14 | 0.42 | 0.052 | 0.065 | 0.07 | 0.035 | 0.043 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 0.018 | 0.01 | 0.013 | 0.028 | 0.122 | 0.01 | 0.013 | 0.02 | 0.015 | 0.015 |
tiiuae/falcon-7b-instruct | 0.02 | 0.052 | 0.3 | 0.033 | 0.072 | 0.05 | 0.18 | 0.035 | 0.172 | 0.075 |
Results on the train section of the data for the models where few-shot prompting brings an improvement (0-, 4-, 10-, and 20-shot), using the following prompt (the 4-shot variant is shown; example in Torlak, with English glosses in brackets that are not part of the prompt):
You will be given a task. The task definition is in English, but the task itself is in another language. You are to choose the more likely hypothesis given a premise. Take into account that we are either looking for a cause or an effect of the premise. Answer only with "1" or "2". Here are some examples of the task:
Example 1:
Premise: "Čovek odvrnuja slavinu."
Question: "effect"
Hypothesis 1: "Ve-ce se napunija sas vodu."
Hypothesis 2: "Voda ističala od slavinu."
Answer: "2"
Example 2:
Premise: "Devojčica našla bubaljku među njojne žitarice."
Question: "effect"
Hypothesis 1: "Sipala mleko u činiju."
Hypothesis 2: "Izgubila si apetit."
Answer: "2"
Example 3:
Premise: "Žena otišla u penziju."
Question: "effect"
Hypothesis 1: "Primila si penziju."
Hypothesis 2: "Otplatila si hipoteku."
Answer: "1"
Example 4:
Premise: "Teja sam si ušparam struju."
Question: "effect"
Hypothesis 1: "Pomeja sam patos u praznu sobu."
Hypothesis 2: "Ugasija sam svetlo u praznu sobu."
Answer: "2"
Now to your task!
Premise: "Devojka zamislila želju."
Question: "cause"
Hypothesis 1: "Videla crnu mačku."
Hypothesis 2: "Videla zvezdu padalicu."
Answer:
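A minimal sketch of how such an N-shot prompt could be assembled from held-out examples; the field names (premise/question/choice1/choice2/label) follow the XCOPA-style schema and are an assumption:

```python
FEW_SHOT_HEADER = (
    "You will be given a task. The task definition is in English, but the "
    "task itself is in another language. You are to choose the more likely "
    "hypothesis given a premise. Take into account that we are either "
    "looking for a cause or an effect of the premise. Answer only with "
    '"1" or "2". Here are some examples of the task:'
)

def format_example(ex: dict, index: int) -> str:
    # One demonstration block; dataset labels 0/1 are rendered as "1"/"2".
    return (
        f"Example {index}:\n"
        f'Premise: "{ex["premise"]}"\n'
        f'Question: "{ex["question"]}"\n'
        f'Hypothesis 1: "{ex["choice1"]}"\n'
        f'Hypothesis 2: "{ex["choice2"]}"\n'
        f'Answer: "{ex["label"] + 1}"'
    )

def make_few_shot_prompt(shots: list[dict], target: dict) -> str:
    parts = [FEW_SHOT_HEADER]
    parts += [format_example(ex, i) for i, ex in enumerate(shots, start=1)]
    parts.append(
        "Now to your task!\n"
        f'Premise: "{target["premise"]}"\n'
        f'Question: "{target["question"]}"\n'
        f'Hypothesis 1: "{target["choice1"]}"\n'
        f'Hypothesis 2: "{target["choice2"]}"\n'
        "Answer:"
    )
    return "\n".join(parts)
```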
system | N-shot | avg | copa-en.train | copa-hr.train | copa-mk.train | copa-mk.train.trans | copa-sl-cer.train | copa-sl.train | copa-sr-tor.train | copa-sr-tor.train.trans | copa-sr.train | copa-sr.train.trans |
---|---|---|---|---|---|---|---|---|---|---|---|---|
gpt-3.5-turbo-0125 | 0 | 0.756 | 0.922 | 0.82 | 0.745 | 0.67 | 0.547 | 0.802 | 0.693 | 0.745 | 0.787 | 0.83 |
gpt-3.5-turbo-0125 | 10 | 0.793 | 0.935 | 0.84 | 0.77 | 0.77 | 0.53 | 0.845 | 0.772 | 0.802 | 0.805 | 0.858 |
gpt-4-0125-preview | 0 | 0.912 | 0.988 | 0.963 | 0.945 | 0.9 | 0.608 | 0.96 | 0.92 | 0.912 | 0.955 | 0.96 |
gpt-4-0125-preview | 10 | 0.956 | 0.995 | 0.988 | 0.978 | 0.965 | 0.738 | 0.98 | 0.97 | 0.968 | 0.99 | 0.99 |
mistralai/Mistral-7B-Instruct-v0.1 | 0 | 0.52 | 0.652 | 0.507 | 0.502 | 0.497 | 0.487 | 0.507 | 0.502 | 0.5 | 0.525 | 0.515 |
mistralai/Mistral-7B-Instruct-v0.1 | 4 | 0.59 | 0.745 | 0.593 | 0.578 | 0.56 | 0.527 | 0.598 | 0.565 | 0.542 | 0.603 | 0.595 |
mistralai/Mistral-7B-Instruct-v0.1 | 10 | 0.643 | 0.82 | 0.657 | 0.603 | 0.603 | 0.54 | 0.675 | 0.603 | 0.59 | 0.693 | 0.647 |
mistralai/Mistral-7B-Instruct-v0.1 | 20 | 0.623 | 0.838 | 0.66 | 0.603 | 0.588 | 0.492 | 0.637 | 0.58 | 0.575 | 0.625 | 0.63 |
mistralai/Mistral-7B-Instruct-v0.2 | 0 | 0.508 | 0.723 | 0.542 | 0.497 | 0.448 | 0.285 | 0.515 | 0.507 | 0.487 | 0.542 | 0.537 |
mistralai/Mistral-7B-Instruct-v0.2 | 4 | 0.7 | 0.938 | 0.718 | 0.688 | 0.647 | 0.515 | 0.738 | 0.65 | 0.63 | 0.738 | 0.743 |
mistralai/Mistral-7B-Instruct-v0.2 | 10 | 0.708 | 0.925 | 0.757 | 0.708 | 0.665 | 0.507 | 0.718 | 0.667 | 0.632 | 0.75 | 0.752 |
mistralai/Mistral-7B-Instruct-v0.2 | 20 | 0.714 | 0.935 | 0.755 | 0.688 | 0.68 | 0.512 | 0.738 | 0.68 | 0.652 | 0.762 | 0.743 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 0 | 0.67 | 0.875 | 0.705 | 0.665 | 0.632 | 0.405 | 0.682 | 0.68 | 0.637 | 0.71 | 0.713 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 4 | 0.745 | 0.927 | 0.797 | 0.705 | 0.718 | 0.487 | 0.777 | 0.713 | 0.73 | 0.807 | 0.785 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 10 | 0.755 | 0.932 | 0.818 | 0.703 | 0.682 | 0.5 | 0.802 | 0.723 | 0.748 | 0.848 | 0.795 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 20 | 0.76 | 0.95 | 0.805 | 0.735 | 0.693 | 0.555 | 0.792 | 0.713 | 0.713 | 0.845 | 0.802 |
We also investigate a few variants of few-shot prompting, for now only on Mixtral (a sketch of how the shots are rendered under each variant follows this list):
- en - giving the shots in English, thereby measuring the importance of having examples in the target language
- blank - giving the shots without the answers, thereby measuring how important the correct answer is
- list - giving the shots as plain lists of sentences, informing the model about the language but not about the task
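A minimal sketch of the three variants (assumptions: for "en" the caller passes the aligned English examples instead of the target-language ones, using the same rendering; the field names follow the XCOPA-style schema):

```python
def render_shots(shots: list[dict], variant: str = "original") -> str:
    """Render few-shot examples under one of the variants described above."""
    if variant == "list":
        # Expose the language but not the task: dump all sentences as a list.
        sentences = []
        for ex in shots:
            sentences += [ex["premise"], ex["choice1"], ex["choice2"]]
        return "\n".join(f"- {s}" for s in sentences)
    blocks = []
    for i, ex in enumerate(shots, start=1):
        block = (
            f"Example {i}:\n"
            f'Premise: "{ex["premise"]}"\n'
            f'Question: "{ex["question"]}"\n'
            f'Hypothesis 1: "{ex["choice1"]}"\n'
            f'Hypothesis 2: "{ex["choice2"]}"'
        )
        if variant != "blank":  # "blank" omits the gold answer line
            block += f'\nAnswer: "{ex["label"] + 1}"'
        blocks.append(block)
    return "\n\n".join(blocks)
```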
system | N-shot | variant | avg | copa-en.train | copa-hr.train | copa-mk.train | copa-mk.train.trans | copa-sl-cer.train | copa-sl.train | copa-sr-tor.train | copa-sr-tor.train.trans | copa-sr.train | copa-sr.train.trans |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mistralai/Mixtral-8x7B-Instruct-v0.1 | 0 | original | 0.67 | 0.875 | 0.705 | 0.665 | 0.632 | 0.405 | 0.682 | 0.68 | 0.637 | 0.71 | 0.713 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 4 | original | 0.745 | 0.927 | 0.797 | 0.705 | 0.718 | 0.487 | 0.777 | 0.713 | 0.73 | 0.807 | 0.785 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 10 | original | 0.755 | 0.932 | 0.818 | 0.703 | 0.682 | 0.5 | 0.802 | 0.723 | 0.748 | 0.848 | 0.795 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 20 | original | 0.76 | 0.95 | 0.805 | 0.735 | 0.693 | 0.555 | 0.792 | 0.713 | 0.713 | 0.845 | 0.802 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 4 | en | 0.667 | 0.927 | 0.638 | 0.693 | 0.635 | 0.58 | 0.412 | 0.645 | 0.67 | 0.637 | 0.745 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 10 | en | 0.692 | 0.935 | 0.735 | 0.65 | 0.608 | 0.445 | 0.69 | 0.705 | 0.645 | 0.787 | 0.72 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 4 | blank | 0.758 | 0.915 | 0.823 | 0.713 | 0.738 | 0.507 | 0.757 | 0.75 | 0.743 | 0.84 | 0.795 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 10 | blank | 0.758 | 0.927 | 0.805 | 0.72 | 0.685 | 0.492 | 0.818 | 0.765 | 0.745 | 0.823 | 0.802 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 20 | blank | 0.763 | 0.943 | 0.81 | 0.697 | 0.7 | 0.5 | 0.8 | 0.772 | 0.74 | 0.838 | 0.833 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 4 | list | 0.713 | 0.912 | 0.76 | 0.685 | 0.655 | 0.487 | 0.725 | 0.708 | 0.69 | 0.757 | 0.752 |
mistralai/Mixtral-8x7B-Instruct-v0.1 | 10 | list | 0.724 | 0.907 | 0.775 | 0.708 | 0.657 | 0.515 | 0.74 | 0.7 | 0.698 | 0.787 | 0.757 |