Prebuild dataset too large for RAM #88

JonathanSchmidt1 · 2024-09-11T09:53:16Z

Hi,
thank you for the great package. I am trying to pre-build the graphs for some larger datasets that do not fit into RAM. Is this already possible (along with training on them afterwards)?
best,
Jonathan

YutackPark · 2024-09-12T06:06:48Z

Thanks for reaching out. This problem can be solved once the on-the-fly graph build feature for training is developed, which is already on my TODO list: #86.

It may replace .sevenn_data as it can dynamically construct a graph with a dataloader, reducing memory requirements.

The core routine is already developed (https://github.com/MDIL-SNU/SevenNet/blob/main/sevenn/train/collate.py); what remains is some extra work to use it for training.
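To illustrate the idea of building graphs on the fly, here is a minimal, dependency-free sketch. It is not the code in `sevenn/train/collate.py`; the names `OnTheFlyGraphDataset`, `build_edges`, and `collate` are hypothetical, and the neighbor search is a naive O(N²) cutoff test without periodic boundary conditions. The point is only that the dataset holds raw structures and the edge list is created when an item is requested, so pre-built graphs never need to be kept in memory.

```python
import math
from typing import Dict, List, Tuple


def build_edges(positions: List[List[float]], cutoff: float) -> List[Tuple[int, int]]:
    """Naive O(N^2) neighbor search (assumption: no periodic boundaries)."""
    edges = []
    for i, pos_i in enumerate(positions):
        for j, pos_j in enumerate(positions):
            if i != j and math.dist(pos_i, pos_j) <= cutoff:
                edges.append((i, j))
    return edges


class OnTheFlyGraphDataset:
    """Stores only raw structures; the graph is built inside __getitem__."""

    def __init__(self, structures: List[Dict], cutoff: float):
        self.structures = structures
        self.cutoff = cutoff

    def __len__(self) -> int:
        return len(self.structures)

    def __getitem__(self, idx: int) -> Dict:
        s = self.structures[idx]
        return {
            "positions": s["positions"],
            "edges": build_edges(s["positions"], self.cutoff),
            "energy": s["energy"],
        }


def collate(graphs: List[Dict]) -> Dict:
    """Merge several graphs into one batch by offsetting node indices."""
    positions, edges, batch_index, energies = [], [], [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        positions.extend(g["positions"])
        edges.extend((i + offset, j + offset) for i, j in g["edges"])
        batch_index.extend([graph_id] * len(g["positions"]))
        energies.append(g["energy"])
        offset += len(g["positions"])
    return {
        "positions": positions,
        "edges": edges,
        "batch_index": batch_index,
        "energies": energies,
    }


# Toy structures with made-up coordinates and energies, for illustration only.
structures = [
    {"positions": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], "energy": -1.0},
    {"positions": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.5], [0.0, 0.0, 5.0]], "energy": -2.0},
]
dataset = OnTheFlyGraphDataset(structures, cutoff=2.0)
batch = collate([dataset[0], dataset[1]])
print(batch["edges"])
```

In a real training loop, a data loader would call `__getitem__` and `collate` per batch, so the memory cost of the graphs is bounded by the batch size rather than the dataset size. The raw structures themselves still have to fit in memory in this sketch.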

JonathanSchmidt1 · 2024-09-13T06:06:18Z

Thank you for the quick reply. Will this also work for datasets where the number of structures is already too large for memory, even without the graphs?

YutackPark · 2024-09-13T08:14:38Z

Hi @JonathanSchmidt1. Unfortunately, if the structures alone are already too large for memory, even without graphs, the method I mentioned will also fail. Some operating systems fall back on swap memory to handle the out-of-memory condition, but that is not a good idea.

To overcome this, we need a more technically elegant method that uses a database, such as MySQL, LMDB, or SQLite. The good news is that ASE already has database interfaces for its Atoms object: https://wiki.fysik.dtu.dk/ase/ase/db/db.html

I'm personally trying to leverage the ASE db to do exactly what you are trying to do, but it is going to take some time. If you know of any other open-source MLIP package that is relevant to this topic, please let me know. It will help my development.
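As a rough sketch of the database-backed approach, the example below keeps structures on disk and loads a single row per access. It deliberately uses Python's built-in `sqlite3` module with JSON payloads instead of the ASE db interface, so that it runs without extra dependencies; the class name `SQLiteStructureDataset`, the table layout, and the toy values are all assumptions for illustration, not anything from SevenNet or ASE.

```python
import json
import os
import sqlite3
import tempfile
from typing import Dict


class SQLiteStructureDataset:
    """Disk-backed dataset: only the requested structure is read into RAM."""

    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS structures "
            "(id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def append(self, structure: Dict) -> None:
        # Each structure is serialized as JSON (assumption: plain lists and floats).
        with self.conn:
            self.conn.execute(
                "INSERT INTO structures (payload) VALUES (?)",
                (json.dumps(structure),),
            )

    def __len__(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM structures").fetchone()[0]

    def __getitem__(self, idx: int) -> Dict:
        # SQLite row ids start at 1, while dataset indices start at 0.
        row = self.conn.execute(
            "SELECT payload FROM structures WHERE id = ?", (idx + 1,)
        ).fetchone()
        if row is None:
            raise IndexError(idx)
        return json.loads(row[0])


# Toy entries with made-up coordinates and energies, for illustration only.
db_path = os.path.join(tempfile.mkdtemp(), "structures.db")
dataset = SQLiteStructureDataset(db_path)
dataset.append({"numbers": [1, 1], "positions": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.7]], "energy": -1.0})
dataset.append({"numbers": [8, 1, 1], "positions": [[0.0, 0.0, 0.0], [0.0, 0.8, 0.6], [0.0, -0.8, 0.6]], "energy": -2.0})

print(len(dataset), dataset[1]["numbers"])
```

Writing can be done in a streaming pass over the raw data, so the full dataset never has to be in memory at once. With a multi-process data loader, each worker would likely need to open its own connection rather than share one.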

JonathanSchmidt1 · 2024-09-13T10:01:38Z

I guess schnetpack would be an example of a package using ase db, and https://github.com/IntelLabs/matsciml/tree/main uses lmdb, which I generally prefer. I think Alignn should also have a branch using lmdb.

YutackPark · 2024-09-14T10:52:29Z

Thanks! I'll look through those repos. By the way, could you tell me why you prefer lmdb over the ase db? I don't have experience with lmdb but have some with ase db.

JonathanSchmidt1 changed the title ~~Prebuild dataset to large for RAM~~ Prebuild dataset too large for RAM Sep 11, 2024

YutackPark added the enhancement New feature or request label Sep 14, 2024

YutackPark mentioned this issue Sep 25, 2024

Issues related to data preprocessing of datasets #61

Open
