
[Question] Training Set Size For Supervised Models #66

Open
LasseMiddendorf opened this issue Feb 3, 2025 · 1 comment
@LasseMiddendorf
Hi,

Thanks for this amazing resource! I have a question about how the average Spearman correlations for supervised models on the leaderboard are generated. How many data points are used to train the supervised models for each target? Is it a fixed split for each model, or does it vary? Thanks a lot!

@pascalnotin
Contributor

Hi @LasseMiddendorf,

We compute the Spearman correlations for supervised models as follows:

  • There are three distinct cross-validation (CV) schemes: random, modulo and contiguous (see the ProteinNPT paper for more details on each scheme).
  • Each CV scheme, for each assay, is itself split into 5 folds (except for rare edge cases where only 4 positions are mutated in the corresponding assay, in which case we use 4 folds only).
  • The fold assignment for each CV scheme, for each assay is provided in the following file: cv_folds_singles_substitutions.zip (see README instructions for download).
  • We then do, for each CV scheme, for each assay, a 5-fold CV training, i.e., select one fold as the test set, train on all other folds, and repeat num_folds times with a different test fold each time (see the ProteinNPT training script for all details).
  • For each CV scheme, for each assay, the performance is then computed by concatenating the test set predictions across all folds and computing the Spearman correlation / MSE with the DMS scores.
  • Note that all DMS scores are standard normalized before training (see details here: https://github.com/OATML-Markslab/ProteinNPT/blob/master/proteinnpt/utils/data_utils.py#L120-L161) and we report the Spearman/MSE of our predictions vs these standard normalized DMS scores (this primarily affects the scale of the MSE values; Spearman values are invariant to such scaling).
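To make the steps above concrete, here is a minimal sketch in Python. The toy data, the fold assignment, and the simple ridge regressor are all illustrative stand-ins: the real fold assignments come from cv_folds_singles_substitutions.zip, and the real models are the supervised baselines trained by the ProteinNPT scripts.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy assay: feature vectors (stand-ins for model embeddings) and DMS scores.
N, D = 200, 16
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Standard-normalize the DMS scores before training; Spearman is unaffected
# by this, only the scale of the MSE changes.
y = (y - y.mean()) / y.std()

# Hypothetical fold assignment -- in practice this is loaded per assay and
# per CV scheme (random / modulo / contiguous) from the provided file.
num_folds = 5
fold = rng.integers(0, num_folds, size=N)

# k-fold CV: hold out one fold as the test set, train on the remaining
# folds, and store the held-out predictions (closed-form ridge regression
# is used here purely for illustration).
preds = np.empty(N)
lam = 1.0
for k in range(num_folds):
    test = fold == k
    Xtr, ytr = X[~test], y[~test]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(D), Xtr.T @ ytr)
    preds[test] = X[test] @ w

# Performance: concatenate the test-set predictions across all folds and
# compare against the (normalized) DMS scores.
rho, _ = spearmanr(preds, y)
mse = np.mean((preds - y) ** 2)
print(f"Spearman: {rho:.3f}  MSE: {mse:.3f}")
```

Note that the single Spearman value per assay comes from the concatenated held-out predictions, not from averaging per-fold correlations.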

Let me know if you have any questions about the above!

Kind regards,
Pascal
