
[Question] Training Set Size For Supervised Models #66

Open
LasseMiddendorf opened this issue Feb 3, 2025 · 1 comment
@LasseMiddendorf
Hi,

Thanks for this amazing resource! I have a question about how the average Spearman correlations for supervised models on the leaderboard are generated. How many data points are used to train the supervised models for each target? Is it a fixed split for each model, or does it vary? Thanks a lot!

@pascalnotin
Contributor

Hi @LasseMiddendorf,

We compute the Spearman correlations for supervised models as follows:

  • There are three distinct cross-validation (CV) schemes: random, modulo and contiguous (see the ProteinNPT paper for more details on each scheme).
  • Each CV scheme, for each assay, is itself split into 5 folds (except for rare edge cases where only 4 positions are mutated in the corresponding assay, in which case we use 4 folds only).
  • The fold assignment for each CV scheme, for each assay is provided in the following file: cv_folds_singles_substitutions.zip (see README instructions for download).
  • We then do, for each CV scheme, for each assay, a 5-fold CV training, i.e., select one fold as the test set, train on all other folds, and repeat num_folds times with a different test fold each time (see the ProteinNPT training script for all details).
  • For each CV scheme, for each assay, the performance is then computed by concatenating the test set predictions across all folds and computing the Spearman correlation / MSE with the DMS scores.
  • Note that all DMS scores are standard normalized before training (see details here: https://github.com/OATML-Markslab/ProteinNPT/blob/master/proteinnpt/utils/data_utils.py#L120-L161) and we report the Spearman/MSE of our predictions vs these standard normalized DMS scores (this primarily affects the scale of the MSE values; Spearman values are invariant to such scaling).
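To make the steps above concrete, here is a minimal sketch in Python. The toy data, the fold assignment, and the simple ridge regressor are all illustrative stand-ins: the real fold assignments come from cv_folds_singles_substitutions.zip, and the real models are the supervised baselines trained by the ProteinNPT scripts.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy assay: feature vectors (stand-ins for model embeddings) and DMS scores.
N, D = 200, 16
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Standard-normalize the DMS scores before training; Spearman is unaffected
# by this, only the scale of the MSE changes.
y = (y - y.mean()) / y.std()

# Hypothetical fold assignment -- in practice this is loaded per assay and
# per CV scheme (random / modulo / contiguous) from the provided file.
num_folds = 5
fold = rng.integers(0, num_folds, size=N)

# k-fold CV: hold out one fold as the test set, train on the remaining
# folds, and store the held-out predictions (closed-form ridge regression
# is used here purely for illustration).
preds = np.empty(N)
lam = 1.0
for k in range(num_folds):
    test = fold == k
    Xtr, ytr = X[~test], y[~test]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(D), Xtr.T @ ytr)
    preds[test] = X[test] @ w

# Performance: concatenate the test-set predictions across all folds and
# compare against the (normalized) DMS scores.
rho, _ = spearmanr(preds, y)
mse = np.mean((preds - y) ** 2)
print(f"Spearman: {rho:.3f}  MSE: {mse:.3f}")
```

Note that the single Spearman value per assay comes from the concatenated held-out predictions, not from averaging per-fold correlations.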

Let me know if you have any questions about the above!

Kind regards,
Pascal
