Prebuild dataset too large for RAM #88

Open
JonathanSchmidt1 opened this issue Sep 11, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@JonathanSchmidt1

Hi,
thank you for the great package. I am trying to pre-build the graphs for some larger datasets that do not fit into RAM. Is this already possible (and also the training afterwards)?
best,
Jonathan

@JonathanSchmidt1 JonathanSchmidt1 changed the title Prebuild dataset to large for RAM Prebuild dataset too large for RAM Sep 11, 2024
@YutackPark
Member

Thanks for reaching out. The problem can be solved once the on-the-fly graph build feature for training is developed, which is already on my TODO list: #86.

It may replace .sevenn_data as it can dynamically construct a graph with a dataloader, reducing memory requirements.

The core routine is already developed (https://github.com/MDIL-SNU/SevenNet/blob/main/sevenn/train/collate.py); what remains is some extra work to use it for training.
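
For readers following the thread, here is a minimal sketch of what "on-the-fly graph build" means in general. It is not SevenNet's `collate.py` routine: the names `build_edges` and `OnTheFlyGraphDataset` are hypothetical, the neighbor search is brute force, and periodic boundary conditions are ignored. The point it illustrates is that only raw positions are held, and the edge list is computed when a sample is requested instead of being stored up front.

```python
import math
from typing import Dict, List, Tuple

Positions = List[Tuple[float, float, float]]


def build_edges(positions: Positions, cutoff: float) -> List[Tuple[int, int]]:
    """Return directed edges (i, j) for every atom pair closer than `cutoff`.

    Brute-force O(N^2) search, no periodic boundary conditions (assumption
    made to keep the sketch short).
    """
    edges = []
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i != j and math.dist(pi, pj) < cutoff:
                edges.append((i, j))
    return edges


class OnTheFlyGraphDataset:
    """Hypothetical map-style dataset: stores structures, builds graphs lazily."""

    def __init__(self, structures: List[Positions], cutoff: float):
        self.structures = structures
        self.cutoff = cutoff

    def __len__(self) -> int:
        return len(self.structures)

    def __getitem__(self, idx: int) -> Dict[str, object]:
        positions = self.structures[idx]
        # The edge list is computed here, per request, and never cached.
        return {"pos": positions, "edges": build_edges(positions, self.cutoff)}


# Two toy structures; coordinates are made up for illustration.
structures = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.5, 0.0)],
]
dataset = OnTheFlyGraphDataset(structures, cutoff=2.0)
sample = dataset[0]
```

A class with `__len__` and `__getitem__` like this could be handed to a dataloader, which would then build graphs batch by batch. Note that the structures themselves still live in memory in this sketch.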

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Sep 13, 2024

Thank you for the quick reply. Will this also work for datasets where the number of structures is already too large for memory, even without the graphs?

@YutackPark
Member

Hi @JonathanSchmidt1. Unfortunately, if the number of structures is already too large for memory even without the graphs, the method I mentioned will also fail. Some operating systems fall back on swap memory to handle the out-of-memory condition, but relying on that is not a good idea.

To overcome this, we need a technically elegant method that uses a database, such as MySQL, LMDB, SQLite, and so on. The good news is that ASE already has database interfaces for its Atoms object: https://wiki.fysik.dtu.dk/ase/ase/db/db.html

I'm personally trying to leverage the ASE db to do exactly what you are trying to do, but it is going to take some time. If you know of any other open-source MLIP package that is relevant to this topic, please let me know. It will help my development.
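
To make the database idea above concrete, here is a rough sketch that keeps structures on disk and loads one row at a time. It deliberately uses Python's standard-library `sqlite3` module instead of the ASE db interface, so it is not how ASE or SevenNet does it. The `StructureStore` class, the table layout, and the JSON encoding are all assumptions chosen for illustration.

```python
import json
import sqlite3
from typing import Dict, List


class StructureStore:
    """Hypothetical on-disk store with one structure per SQLite row."""

    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS structures ("
            "id INTEGER PRIMARY KEY, symbols TEXT, positions TEXT, energy REAL)"
        )

    def add(self, symbols: List[str], positions: List[List[float]],
            energy: float) -> int:
        cur = self.conn.execute(
            "INSERT INTO structures (symbols, positions, energy) VALUES (?, ?, ?)",
            (json.dumps(symbols), json.dumps(positions), energy),
        )
        self.conn.commit()
        return cur.lastrowid

    def __len__(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM structures").fetchone()[0]

    def __getitem__(self, idx: int) -> Dict[str, object]:
        # Row ids start at 1 and are assumed contiguous (no deletions).
        row = self.conn.execute(
            "SELECT symbols, positions, energy FROM structures WHERE id = ?",
            (idx + 1,),
        ).fetchone()
        if row is None:
            raise IndexError(idx)
        return {
            "symbols": json.loads(row[0]),
            "positions": json.loads(row[1]),
            "energy": row[2],
        }


# ":memory:" keeps the demo self-contained; a real dataset would use a file path.
store = StructureStore(":memory:")
store.add(["H", "H"], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]], -1.0)
store.add(["O", "H", "H"], [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]], -2.5)
second = store[1]
```

Only the requested row is deserialized on each `store[i]` call, so resident memory stays roughly constant as the number of stored structures grows. The toy coordinates and energies are made up.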

@JonathanSchmidt1
Author

I guess schnetpack would be an example of a package using ASE db, and https://github.com/IntelLabs/matsciml/tree/main uses lmdb, which I generally prefer. I think ALIGNN should also have a branch using lmdb.

@YutackPark YutackPark added the enhancement New feature or request label Sep 14, 2024
@YutackPark
Member

Thanks! I'll look through those repos. By the way, could you tell me why you prefer lmdb over ASE db? I don't have experience with lmdb but have some with ASE db.


2 participants