fix: sample training data after splitting #83

Merged
merged 1 commit into from
May 28, 2024

mlederbauer
Owner

This now uniformly samples the training data after the train-test split, so the test set always has a constant size of 3434 complexes.
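As a rough illustration of the ordering (split first, subsample second), here is a dependency-free sketch. The function name `split_then_subsample`, the `train_fraction` parameter, and the 10,000 stand-in complexes are assumptions made for the example and are not taken from the project code; only the test size of 3434 comes from this PR.

```python
import random


def split_then_subsample(items, test_size, train_fraction, seed=0):
    """Hold out a fixed-size test set first, then uniformly subsample the rest.

    Hypothetical helper for illustration; not the project's implementation.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    # The test set is carved out before any subsampling, so its size
    # does not depend on how much training data is requested.
    test = shuffled[:test_size]
    train_pool = shuffled[test_size:]

    # Uniformly sample (without replacement) a fraction of the remaining pool.
    n_train = round(train_fraction * len(train_pool))
    train = rng.sample(train_pool, n_train)
    return train, test


complexes = list(range(10_000))  # stand-in for the real complexes (assumed count)
for fraction in (0.1, 0.5, 1.0):
    train, test = split_then_subsample(
        complexes, test_size=3434, train_fraction=fraction
    )
    print(fraction, len(train), len(test))
```

Because the same seed drives the shuffle on every call, the held-out complexes are identical across training fractions in this sketch, which keeps results for different training set sizes comparable.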

The only issue I see here is that the current code does not handle the case where a ligand in the test set is not part of the train set. In that case the model never sees such a ligand during training, which can decrease performance significantly.

To fix this, however, we would need to build our own "iterative stratified sampling", which is rather a hassle to implement. We could just mention it in the conclusion of our project for now!

https://stackoverflow.com/questions/45516424/sklearn-train-test-split-on-pandas-stratify-by-multiple-columns
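Short of implementing a stratified split, the problem can at least be measured. The sketch below only reports which test ligands never occur in the train set; it does not repair the split. The data layout (each complex as a dict with a `"ligands"` list) and the function name `unseen_test_ligands` are assumptions for illustration, not the project's actual data format.

```python
def unseen_test_ligands(train, test):
    """Return ligands that occur in the test set but in no training complex.

    Assumes each complex is a dict with a "ligands" list (hypothetical layout).
    """
    train_ligands = {lig for cplx in train for lig in cplx["ligands"]}
    test_ligands = {lig for cplx in test for lig in cplx["ligands"]}
    return test_ligands - train_ligands


# Toy data, made up for the example.
train = [
    {"id": "c1", "ligands": ["CO", "PPh3"]},
    {"id": "c2", "ligands": ["Cl"]},
]
test = [{"id": "c3", "ligands": ["CO", "bipy"]}]

print(unseen_test_ligands(train, test))  # {'bipy'}
```

An empty result would mean every test ligand appears at least once in training; a non-empty one gives a concrete number to quote when discussing this limitation.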

@mlederbauer mlederbauer requested a review from strsamue May 28, 2024 19:13
@kbiniek kbiniek merged commit ca24f17 into main May 28, 2024
1 check passed
@mlederbauer mlederbauer deleted the fix/fixed-testset-size branch May 28, 2024 20:24