Allow for specification of a train set size #212

Open
jwa7 opened this issue May 28, 2024 · 0 comments
Labels
Discussion Issues to be discussed by the contributors Infrastructure: Data Related to data handling like readers and datasets

Comments

@jwa7
Member

jwa7 commented May 28, 2024

For demo purposes, I am trying to train a soap-bpnn model on my laptop, using a small subset of qm7 (the full dataset has > 7000 structures).

In options.yaml one can specify the fractional sizes of val_set and test_set, but not of the training set. As far as I understand, the training set size is inferred as the remaining fraction. I can make training faster by setting test_set: 0.999, for instance, but this of course makes post-training evaluation very slow.
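To make the arithmetic concrete, here is a minimal Python sketch of how fractional sizes could map to structure counts. It is not the library's implementation: `split_sizes` and its `train_set` argument are hypothetical, and the "remainder" behaviour simply mirrors my understanding stated above.

```python
def split_sizes(n_structures, val_set, test_set, train_set=None):
    """Return (n_train, n_val, n_test) for fractional split sizes.

    Hypothetical helper for illustration only. If train_set is None, the
    training fraction is inferred as whatever val_set and test_set leave over.
    """
    if train_set is None:
        train_set = 1.0 - val_set - test_set
    total = train_set + val_set + test_set
    if min(train_set, val_set, test_set) < 0 or total > 1.0 + 1e-9:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    n_val = round(val_set * n_structures)
    n_test = round(test_set * n_structures)
    # Cap the training count so rounding can never exceed the dataset size.
    n_train = min(round(train_set * n_structures), n_structures - n_val - n_test)
    return n_train, n_val, n_test


# Inferred training size: everything not used for validation or testing.
print(split_sizes(7000, val_set=0.1, test_set=0.1))
# Explicit training size, with the three fractions summing to less than 1.
print(split_sizes(7000, val_set=0.01, test_set=0.05, train_set=0.01))
```

With an explicit `train_set`, the structures not covered by any of the three fractions would simply be left unused.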

If I want to train and test on a smaller subset, I currently have to construct a smaller .xyz file manually and pass it as the input file. This is of course trivial, but being able to specify a training set size would be more convenient. For instance, allow setting train_set too, and allow train_set + val_set + test_set < 1.
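For reference, the manual workaround can be sketched in plain Python as below. This assumes a plain or extended XYZ layout (an atom-count line, one comment line, then one line per atom); the function names and file paths are made up for the example.

```python
import random


def read_xyz_frames(path):
    """Split an XYZ file into frames, each kept as a list of raw lines."""
    with open(path) as f:
        lines = f.read().splitlines()
    frames = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # tolerate blank lines between frames
            i += 1
            continue
        n_atoms = int(lines[i])
        # A frame is the count line, the comment line, and n_atoms atom lines.
        frames.append(lines[i : i + n_atoms + 2])
        i += n_atoms + 2
    return frames


def write_xyz_subset(in_path, out_path, n_keep, seed=0):
    """Write a random subset of n_keep frames and return how many were written."""
    frames = read_xyz_frames(in_path)
    rng = random.Random(seed)  # seeded so the subset is reproducible
    subset = rng.sample(frames, min(n_keep, len(frames)))
    with open(out_path, "w") as f:
        for frame in subset:
            f.write("\n".join(frame) + "\n")
    return len(subset)
```

Something like `write_xyz_subset("qm7.xyz", "qm7_small.xyz", 500)` would then produce the smaller input file (both file names are placeholders).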

Suppose I want to generate a learning curve from several runs, each with a different random seed and randomly shuffled training and validation sets of different sizes, but with a fixed test set. Can I do this with the current setup? Is it possible to point to a separate held-out .xyz file as the test set?
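To illustrate what I mean, here is a rough sketch of the index bookkeeping done outside the library. It says nothing about what options.yaml supports; `learning_curve_splits` and all of its parameters are hypothetical.

```python
import random


def learning_curve_splits(n_structures, n_test, train_sizes, n_val, seeds, test_seed=0):
    """Yield one split per (seed, training size), all sharing one fixed test set."""
    indices = list(range(n_structures))
    random.Random(test_seed).shuffle(indices)
    test_idx = sorted(indices[:n_test])  # chosen once, reused by every run
    pool = indices[n_test:]  # structures available for training and validation
    if n_val + max(train_sizes) > len(pool):
        raise ValueError("not enough structures left after removing the test set")
    for seed in seeds:
        shuffled = pool[:]
        random.Random(seed).shuffle(shuffled)  # per-run shuffle of the pool
        val_idx = shuffled[:n_val]
        for n_train in train_sizes:
            yield {
                "seed": seed,
                "n_train": n_train,
                "train": shuffled[n_val : n_val + n_train],
                "val": val_idx,
                "test": test_idx,
            }


for split in learning_curve_splits(7000, 1000, [100, 500, 2000], 200, seeds=[1, 2, 3]):
    print(split["seed"], split["n_train"], len(split["val"]), len(split["test"]))
```

Each set of indices could then be written to its own .xyz file, which is why being able to point to a separate test file would help.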

@jwa7 jwa7 added the SOAP BPNN SOAP BPNN experimental architecture label May 28, 2024
@frostedoyster frostedoyster added infrastructure and removed SOAP BPNN SOAP BPNN experimental architecture labels May 28, 2024
@PicoCentauri PicoCentauri added Discussion Issues to be discussed by the contributors Infrastructure: Data Related to data handling like readers and datasets and removed infrastructure labels Jun 3, 2024