Issues related to data preprocessing of datasets #61
SevenNet tries to read 'free_energy' first and falls back to 'energy' if 'free_energy' is not available. Internally, 'free_energy' is obtained from E = atoms.get_potential_energy(force_consistent=True).
For the MPF dataset, you can check the consistency of your preprocessing script, which converts an MPF dataset entry to ASE atoms, by comparing its results with another ASE atoms instance initialized from a VASP OUTCAR. This makes sense because the author of the MPF dataset says the values are raw outputs of VASP.
You can create an ASE atoms instance from energy, force, and stress like this:

    from ase.atoms import Atoms
    from ase.calculators.singlepoint import SinglePointCalculator

    # species, pos, cell, energy, force, and stress come from one dataset entry
    atom = Atoms(species, pos, cell=cell, pbc=True)
    calc_results = {"energy": energy,
                    "free_energy": energy,
                    "forces": force,
                    "stress": stress}
    calculator = SinglePointCalculator(atom, **calc_results)
    atom = calculator.get_atoms()

The MPF dataset is a special case because it does not originate from MD software.
Write your ASE atoms object to 'extxyz' format. It can be passed directly to sevenn_graph_build:

    sevenn_graph_build --format ase my_data.extxyz 5.0

The first positional argument is a file name, and the second is the cutoff radius of the model. We're planning to write tutorials in pure Python! Until then, I think it is better not to close this issue.
Hello!
When you have free time, please review and correct my code. Once again, thank you very much. Appendix: Print Output
The MPF dataset samples three structures per relaxation trajectory. I recommend checking their paper for details.
Instead of multiplying by "-0.1", try this: "stress = -1 * stress / 1602.1766208 # to eV/Angstrom^3". A standard ASE atoms instance is expected to have its stress in eV/Angstrom^3. As I mentioned, the best way to validate the script is to compare your results with an ASE atoms instance read from a VASP OUTCAR. The output in the SevenNet log file seems good. As the dataset has ~188K structures, the slow training speed you observed is expected. Sadly, pre-training SevenNet-0 is a computationally demanding task. I recommend using multi-GPU training. Cheers!
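To make the unit conversion above concrete, here is a small sketch in plain Python. It assumes the raw stress is a 3x3 matrix in kbar with the VASP sign convention, which is an assumption about the dataset rather than something confirmed in this thread; the function names are made up for the example. The factor follows from 1 eV/Angstrom^3 = 160.21766208 GPa = 1602.1766208 kbar, whereas multiplying by -0.1 only converts kbar to GPa.

```python
# 1 eV/Angstrom^3 = 160.21766208 GPa = 1602.1766208 kbar
KBAR_PER_EV_PER_A3 = 1602.1766208


def vasp_stress_to_ase(stress_kbar):
    """Flip the sign and convert a 3x3 stress from kbar to eV/Angstrom^3."""
    return [[-value / KBAR_PER_EV_PER_A3 for value in row] for row in stress_kbar]


def to_voigt(stress):
    """Flatten a symmetric 3x3 stress to ASE's Voigt order: xx, yy, zz, yz, xz, xy."""
    return [
        stress[0][0], stress[1][1], stress[2][2],
        stress[1][2], stress[0][2], stress[0][1],
    ]


# Made-up stress in kbar, for illustration only.
raw = [
    [1602.1766208, 0.0, 0.0],
    [0.0, -16.021766208, 0.0],
    [0.0, 0.0, 0.0],
]
converted = vasp_stress_to_ase(raw)
voigt = to_voigt(converted)
print(voigt)
```

Either the 3x3 form or the six-component Voigt form can be given to SinglePointCalculator as "stress".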
Thank you very much for your explanation and suggestions!
Hi @YutackPark, thank you so much.
Here's the code: Note that while the code splits the dataset into train, valid, and test sets, SevenNet-0 used all the data in MPTrj without splitting. Handling large datasets is another problem (#88): a preprocessed graph (.sevenn_data) might not fit into memory. There is an experimental feature I'm currently working on: https://github.com/MDIL-SNU/SevenNet/tree/ase_db
Hello.
After downloading the dataset from https://figshare.com/articles/dataset/MPF_2021_2_8/19470599 and merging the two .p files, I am preparing the data for model training by converting it into a list of ASE Atoms objects and then using the sevenn_graph_build command to generate a sevenn_data file. After parsing the dictionary into ['structure', 'energy', 'force', 'stress', 'id'], it is ambiguous how these fields should be used to instantiate the Atoms objects. Is there any documentation or example program that generates datasets the sevenn_graph_build program can process?
Thank you.