MLstructureMining Workflow is a package for working with crystal structures, with a particular focus on simulating and analyzing Pair Distribution Function (PDF) data. It covers the full pipeline: obtaining CIF files from databases such as the Crystallography Open Database (COD), preparing and simulating PDF data, and training an XGBoost classifier that suggests structures based on the simulated PDFs. It also supports Bayesian optimization for hyperparameter tuning and evaluates trained models against adversarial attacks using the Zeroth Order Optimization (ZOO) technique from the Adversarial Robustness Toolbox (ART), giving researchers an integrated toolkit for crystal structures and their corresponding PDFs.
To run MLstructureMining Workflow, please follow the steps below:
- Make sure that you are running Python 3.7 on Linux or macOS, as DiffPy-CMI requires this. Let us first create a new environment:
conda create --name ciff-env python=3.7
conda activate ciff-env
- To install DiffPy-CMI, which is used for simulating Pair Distribution Function (PDF) data, please follow the steps found HERE or run the following commands:
conda config --add channels diffpy
conda install diffpy-cmi
- When the installation is complete, install the required packages:
pip install -r requirements.txt
- Congratulations! You are now ready to train your own XGBoost model for structure suggestion from PDF data.
- Get structures from the Crystallography Open Database (COD).
- This section explains how to download COD, construct a local library, and search through the CIFs to obtain a desired selection of structures.
- This step is optional, as CIFs can be obtained from several other databases, such as the Inorganic Crystal Structure Database (ICSD), the American Mineralogist Crystal Structure Database (AMCSD), the Crystal Structure Database for Minerals (MINCRYST), and many more.
- Prepare data and simulate.
- Once a desired selection of CIFs has been obtained, this step checks that the CIFs are compatible with DiffPy-CMI, simulates Pair Distribution Function (PDF) data, and constructs a structure catalog that groups structures with similar PDFs using the Pearson Correlation Coefficient (PCC).
- Train model.
- Trains, validates, and tests an XGBoost classifier using the simulated PDFs.
- Bayesian optimization can be used for performing hyperparameter optimization.
- After training, the models are further evaluated against adversarial attacks using the Zeroth Order Optimization (ZOO) technique from the Adversarial Robustness Toolbox (ART).
- Utilities.
- Contains the functionalities of the package.
- Tests.
- Contains the tests of the package. Execute the following command to run the tests:
pytest ./tests
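To illustrate the idea behind the structure catalog described under "Prepare data and simulate", the following is a minimal sketch of grouping signals by their Pearson Correlation Coefficient. It is not the package's implementation: the function names `pearson` and `group_similar`, the greedy grouping strategy, the threshold of 0.9, and the toy signals are all assumptions made for this example.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient of two equally long signals."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("signals must have the same length (at least 2 points)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        raise ValueError("PCC is undefined for a constant signal")
    return covariance / norm


def group_similar(
    pdfs: Dict[str, Sequence[float]], threshold: float = 0.9
) -> List[List[str]]:
    """Greedily group signals (hypothetical strategy, for illustration only).

    Each signal joins the first group whose first member correlates with it
    at or above `threshold`; otherwise it starts a new group.
    """
    groups: List[List[str]] = []
    for name, signal in pdfs.items():
        for group in groups:
            representative = pdfs[group[0]]
            if pearson(signal, representative) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Toy signals standing in for simulated PDFs (made-up values, not real data).
example = {
    "structure_a": [0.0, 1.0, 0.0, -1.0, 0.0],
    "structure_b": [0.0, 2.0, 0.1, -2.0, 0.0],  # nearly a scaled copy of a
    "structure_c": [1.0, 0.0, -1.0, 0.0, 1.0],  # uncorrelated with a
}
catalog = group_similar(example, threshold=0.9)
print(catalog)
```

Because the PCC is insensitive to overall scale and offset, two PDFs that differ mainly in amplitude end up in the same group, which is why it is a natural similarity measure for collapsing near-identical structures into a single label.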
If you use our code or our results, please consider citing our paper. Thanks in advance!
Emil T. S. Kjaer - [email protected].
This project is licensed under the Apache License Version 2.0, January 2004 - see the LICENSE file for details.