Thanks for this amazing resource! I have a question about how the average Spearman correlations for supervised models on the leaderboard are generated. How many data points are used to train the supervised models for each target? Is it a fixed split for each model, or does it vary? Thanks a lot!
We compute the Spearman correlations for supervised models as follows:
- There are three distinct cross-validation (CV) schemes: random, modulo, and contiguous (see the ProteinNPT paper for more details on each scheme).
- Each CV scheme, for each assay, is itself split into 5 folds (except for rare edge cases where the assay has only 4 mutated positions, in which case we use 4 folds only).
- The fold assignment for each CV scheme, for each assay, is provided in the following file: cv_folds_singles_substitutions.zip (see README instructions for download).
- For each CV scheme and each assay, we then run k-fold CV training: select one fold as the test set, train on all other folds, and repeat num_folds times with a different test fold each time (see the ProteinNPT training script for all details).
- For each CV scheme and each assay, performance is then computed by concatenating the test-set predictions across all folds and computing the Spearman correlation / MSE against the DMS scores.
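To make the evaluation concrete, here is a minimal sketch of the procedure described above. Note this is not the actual ProteinNPT code: the `train_and_predict` callback, the array-based fold assignments, and the function name are all placeholders for illustration.

```python
# Sketch of per-assay CV evaluation: hold out one fold at a time,
# concatenate the held-out predictions, then score once at the end.
# All names here are hypothetical, not taken from the ProteinNPT codebase.
import numpy as np
from scipy.stats import spearmanr


def evaluate_assay_cv(X, y, fold_ids, train_and_predict):
    """Concatenate held-out predictions across folds, then compute
    Spearman / MSE on the full assay (as done on the leaderboard).

    fold_ids: array assigning each variant to a CV fold (from the
              cv_folds_singles_substitutions.zip file, in practice).
    train_and_predict(X_train, y_train, X_test) -> predictions.
    """
    preds = np.empty_like(y, dtype=float)
    for fold in np.unique(fold_ids):
        test_mask = fold_ids == fold
        # Train on all other folds, predict the held-out fold.
        preds[test_mask] = train_and_predict(
            X[~test_mask], y[~test_mask], X[test_mask]
        )
    rho, _ = spearmanr(preds, y)
    mse = float(np.mean((preds - y) ** 2))
    return rho, mse
```

The key point is that the Spearman correlation is computed once over the concatenated predictions from all folds, rather than averaged over per-fold correlations.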