Extrinsic correlation as end-to-end compatibility detector #2

Open · zouharvi (Collaborator) opened this issue Nov 24, 2024 · 0 comments

The current sacreCOMET signature might still be insufficient: for example, different optimization levels on CPU vs. GPU could lead to different results. Even if the differences are small, it is better to be on the safe side and test things end-to-end. The idea is to have a small public testset for which the package knows reference correlations, e.g. that Comet20 correlates 0.1500 with humans on this testset and that Cometkiwi22 correlates 0.21412.
When someone runs Comet20 or Cometkiwi22 locally, they can compute the correlation on this testset and check that it matches the reference. If it does not, something is wrong and needs to be investigated. In addition, if someone runs their own local model, they can report the correlation in the signature; if someone later tries to replicate their work and gets a different correlation, that is also a red flag. This package should make such checks easy.
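
A minimal sketch of such a check, assuming the comet>=2 Python API and scipy. The testset path, its line-delimited JSON format (src/mt/ref plus a human "score" per segment), the 1e-3 tolerance, and the reference values below are placeholders for illustration, not measured numbers:

```python
import json

from comet import load_from_checkpoint
from scipy.stats import pearsonr

# Hypothetical reference correlations, measured once on the public testset.
REFERENCE_CORRELATIONS = {
    "Unbabel/wmt20-comet-da": 0.1500,
    "Unbabel/wmt22-cometkiwi-da": 0.21412,
}


def check_correlation(checkpoint_path, testset_path, model_name=None):
    # Assumed testset format: one JSON object per line with
    # "src", "mt", "ref" and a human judgement under "score".
    with open(testset_path) as f:
        data = [json.loads(line) for line in f]

    model = load_from_checkpoint(checkpoint_path)
    # gpus=0 keeps this runnable on CPU; ~1000 segments is still fast.
    output = model.predict(data, batch_size=8, gpus=0)

    human = [seg["score"] for seg in data]
    corr = pearsonr(output.scores, human)[0]
    print(f"Pearson correlation with human scores: {corr:.6f}")

    # For publicly available models, compare against the stored reference.
    reference = REFERENCE_CORRELATIONS.get(model_name)
    if reference is not None and abs(corr - reference) > 1e-3:
        print(
            f"WARNING: {model_name} is publicly available and its reference "
            f"correlation is {reference:.6f}. This is a red flag and something "
            f"is wrong. Please investigate."
        )
    return corr
```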

I suggest the following:

  • Add tooling to sacreCOMET that makes it possible to quickly run your model on the testset and compute the correlation with the scores stored there (e.g. human judgements).
    • I.e. download the small testset from raw.github and evaluate the correlation given a path to the local model, something like sacrecomet correlation my/comet/model/model.ckpt. The testset can be 1000 sentences; scoring that is reasonably fast even on CPU. For the correlation, let's use Pearson(?).
    • In case the model name is known to us (publicly available models), compare the correlation with what we measured and print out a serious warning if it does not match, for example: The model Unbabel/wmt22-cometkiwi-da is publicly available and you obtained a correlation with humans of 0.235102 while the reference correlation is 0.282311. This is a red flag and something is wrong. Please investigate.
  • When generating the signature, add a CLI argument plus an interactive-question fallback that asks for the obtained correlation, such as sacrecomet --model unite-mup --prec fp32 --refs 1 --corr 0.12356, and this number should be reflected in the signature, e.g. Python3.11.8|Comet2.2.2|fp32|unite-mup|r1|$\rho$0.12356. I suggest making this optional for now (a rough sketch of the signature construction follows after this list).
  • Add documentation to all of this.
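
A rough sketch of how the optional correlation field could be appended to the signature. The "rho" prefix, the function name, and the use of importlib.metadata to read the installed unbabel-comet version are assumptions for illustration, not the package's current behaviour:

```python
import platform
from importlib.metadata import version


def build_signature(model, prec, refs, corr=None):
    parts = [
        f"Python{platform.python_version()}",
        f"Comet{version('unbabel-comet')}",
        prec,
        model,
        f"r{refs}",
    ]
    if corr is not None:
        # Optional end-to-end correlation field, written here as "rho<value>".
        parts.append(f"rho{corr:.5f}")
    return "|".join(parts)


# build_signature("unite-mup", "fp32", 1, corr=0.12356)
# -> e.g. "Python3.11.8|Comet2.2.2|fp32|unite-mup|r1|rho0.12356"
```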

This idea was originally suggested by Marcin.
