Currently the sacreCOMET signature might still be insufficient. For example, different optimization levels on CPU/GPU could lead to different results. While the differences might not be large, it's still good to be on the safe side and test things end-to-end. The idea is that there is a small public testset and it is known (to the package) that Comet20 correlates 0.1500 with humans on this testset. It is also known that Cometkiwi22 correlates 0.21412 on this testset.
So when someone runs their Comet20 or Cometkiwi22 locally, they can compute the correlations and see if they match. If they don't, something is wrong and needs to be investigated. In addition, if someone runs their own local model, they can also report the correlation in the signature, and if someone tries to replicate their work and gets a different correlation, that's also a red flag. This package should make that easier.
I suggest the following:
1. Add tooling to sacreCOMET that makes it possible to quickly run your model on the testset and compute correlations with the numbers there (which can be human scores). That is, download the small testset from raw.github and evaluate the correlation given a path to the local model, something like `sacrecomet correlation my/comet/model/model.ckpt`. The testset can be 1000 sentences; scoring that even on CPU is still reasonably fast. For the correlation, let's use Pearson(?). See the sketch after this list.
2. In case the model name is known to us (publicly available models), compare the correlation with what we measured and print a serious warning that something is wrong, for example: `The model Unbabel/wmt22-cometkiwi-da is publicly available and you obtained a correlation with human of 0.235102 while the reference correlation is 0.282311. This is a red flag and something is wrong. Please investigate.`
3. When generating the signature, add a CLI argument (with a fallback to an interactive question) asking for the obtained correlation, such as `sacrecomet --model unite-mup --prec fp32 --refs 1 --corr 0.12356`. This number should be reflected in the signature, e.g. `Python3.11.8|Comet2.2.2|fp32|unite-mup|r1|$\rho$0.12356`. I suggest making this optional for now. A second sketch below shows how the signature could be assembled.
4. Add documentation for all of this.
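A minimal sketch of what the `correlation` subcommand's core could look like. The testset URL, its TSV layout (`src`/`mt`/`ref`/`human` columns), the reference-correlation table, and the 0.01 tolerance are all illustrative assumptions, not existing sacreCOMET behavior; only `scipy.stats.pearsonr` and `comet.load_from_checkpoint` are real APIs here.

```python
# Sketch of the proposed `sacrecomet correlation` subcommand (assumptions noted above).
import csv
import io
import urllib.request

from scipy.stats import pearsonr

# Hypothetical location of the small public testset (~1000 sentences).
TESTSET_URL = "https://raw.githubusercontent.com/<org>/sacreCOMET/main/testset.tsv"

# Hypothetical reference correlations measured once for public models
# (numbers taken from the example above).
KNOWN_CORRELATIONS = {
    "Unbabel/wmt20-comet-da": 0.1500,
    "Unbabel/wmt22-cometkiwi-da": 0.21412,
}


def load_testset(url: str = TESTSET_URL) -> list[dict]:
    """Download and parse the testset into one dict per segment."""
    raw = urllib.request.urlopen(url).read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(raw), delimiter="\t"))


def correlation(model_path: str, model_name: str | None = None) -> float:
    """Score the testset with a local checkpoint and return the Pearson r."""
    # load_from_checkpoint is part of the comet package; predict() returning
    # an object with a .scores attribute is the comet>=2.0 behavior.
    from comet import load_from_checkpoint

    model = load_from_checkpoint(model_path)
    data = load_testset()
    scores = model.predict(
        [{"src": x["src"], "mt": x["mt"], "ref": x["ref"]} for x in data],
        batch_size=8,
    ).scores
    human = [float(x["human"]) for x in data]
    corr, _pvalue = pearsonr(scores, human)

    if model_name in KNOWN_CORRELATIONS:
        expected = KNOWN_CORRELATIONS[model_name]
        if abs(corr - expected) > 0.01:  # tolerance is a design choice
            print(
                f"The model {model_name} is publicly available and you obtained "
                f"a correlation with human of {corr:.6f} while the reference "
                f"correlation is {expected:.6f}. This is a red flag and "
                f"something is wrong. Please investigate."
            )
    return corr
```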
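And a sketch of folding the optional `--corr` value into the signature string. The field order mirrors the example signature above; `unbabel-comet` is the PyPI distribution name used to look up the installed COMET version, and the helper name is hypothetical.

```python
# Sketch of signature construction with the optional correlation field.
import platform
from importlib.metadata import version


def build_signature(model: str, prec: str, refs: int, corr: float | None = None) -> str:
    parts = [
        f"Python{platform.python_version()}",
        f"Comet{version('unbabel-comet')}",
        prec,
        model,
        f"r{refs}",
    ]
    if corr is not None:  # optional for now, per the proposal
        parts.append(f"ρ{corr:.5f}")  # "ρ" is the rendered form of $\rho$ above
    return "|".join(parts)


# build_signature("unite-mup", "fp32", 1, corr=0.12356)
# -> "Python3.11.8|Comet2.2.2|fp32|unite-mup|r1|ρ0.12356"
```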
This idea was originally suggested by Marcin.