Automating RASPA simulations and analysis for adsorption studies using a minimalist workflow with JSON input/output, CoRE MOF database integration, isotherm generation, prospecting tools and plot examples.
This package provides a demonstrative workflow for automating RASPA simulations for adsorption studies. It allows users to specify simulation parameters and default settings through a JSON file, and generates directories with input files and submission scripts for RASPA. The program retrieves CIF files from the CoRE MOF database and stores isotherms as CSV or JSON files. This workflow aims to streamline the simulations and simplify the analysis process, making it easier to study adsorption phenomena.
Part of the code presented here was heavily inspired by RASPA python wrapper, and been adapted to meet the workflow architecture.
For a ready-to-use solution of this workflow, one can use an apptainer image (similar to Docker), further information can be found on the DIAMOND website.
For development purposes, use the following instructions :
- Create a conda environment with Python 3.9:
conda create -n simple-adsorption-workflow python=3.9
conda activate simple-adsorption-workflow
- Build and install RASPA from source (using version 2.0.48)
git clone https://github.com/iRASPA/RASPA2.git
cd RASPA2
git checkout v2.0.48
See the install section in the original documentation to install from source.
- Install Openbabel using conda
conda install -c conda-forge openbabel
- Use pip to install the rest of the libraries listed in
requirements.txt
into the conda environment:
pip install -r requirements.txt
- Download and compile Zeo++ at http://www.zeoplusplus.org/download.html. To inform the workflow of the executable
network
path, modify the corresponding line inset_environment
:
export ZEO_DIR=/opt/zeo++-0.3
- Dependencies for PACMOF2
git clone https://github.com/tdpham2/pacmof2
conda install -c conda-forge pymatgen=2023.10.4
conda install -c conda-forge scikit-learn=1.3.2
cd pacmof2
wget -P pacmof2/models/ https://zenodo.org/records/12747095/files/PACMOF2_neutral.gz
pip install build
pip install .
- Define the environment variables:
source set_environment
Optionally, you can set up this recovering of the environment variables at each activation of the conda environment:
mkdir -p $CONDA_PREFIX/etc/conda/activate.d/
printf "%s\n" "#/bin/sh" "source $PWD/set_environment" > $CONDA_PREFIX/etc/conda/activate.d/simple-adsorption-workflow_set_env.sh
chmod +x $CONDA_PREFIX/etc/conda/activate.d/simple-adsorption-workflow_set_env.sh
In the case you are running in a fish shell, set_environment_fish
should be used in the commands above instead of set_environment
.
This is the standard format of the input.json
file to be provided by the user:
{
"parameters":
{
"structure":["MIBQAR","VOGTIV"],
"molecule_name": ["N2", "CO2"],
"pressure": [10,1E6],
"npoints":5,
"temperature": [298.15],
"charge_method":["EQeq"]
}
"defaults":
{
"forcefield":"ExampleMOFsForceField",
"init_cycles":10,
"cycles":20,
"print_every":5,
"grid_use":"yes"
}
}
In the field parameters
, the keywords refer to the specific workflow inputs :
structure
: six-letter CSD codes to identify the material (e.g.,MIBQAR
)molecule_name
: guest molecules (e.g., N2).temperature
: temperatures in Kpressure
: a pressure range, minimum and maximum pressuresnpoints
: number of pressure points linearly distributed in the pressure rangecharge_method
: methods for charge assignment (actual : 'None' or 'EQeq')
In the field defaults
, the keywords refer to RASPA default parameters, see more details about RASPA input file in a further section.
Note :
parameters
anddefaults
have been divided for the following purpose : if there are lists of parameters in theparameters
field (e.g., multiple materials), the program will generate simulation folders for each unique combination ofparameters
.
In this workflow, there are certain restrictions in order to keep the simulations simple and make some assumptions:
- The material must be available in the CoreMOF database, accessible through the MOFXDB database (https://github.com/n8ta/mofdb-client).
- The guest molecule should be a rare gas such as argon (Ar) or xenon (Xe), or it should have a spherical model, like nitrogen (N2), methane (CH4), or sulfur hexafluoride (SF6).
- No electrostatic interactions are considered in the simulations.
These restrictions and assumptions aim to streamline the simulations and simplify the modeling process.
python $PACKAGE_DIR/saw.py run
To specify input and output locations :
python $PACKAGE_DIR/saw.py run -i <path/to/myinput>.json -o <path/to/data/directory>
By default, when the flag -o
is not provided, a new directory will be created with the following formatting name : ./<Date>_<Time>_<Runtype>/
.
The run type is data
for normal runs and <name_of_test>
for tests cases.
python $PACKAGE_DIR/saw.py merge -i run<index1>.json run<index2>.json -o ./
Two files will be created :
run_merged.json
: the workflow file with all single-point outputsisotherms.json
: the isotherms file
python $PACKAGE_DIR/saw.py input
The schematic diagram (Fig. 1) outlines the primary functions executed within the workflow:
-
Parsing Input File: The input file undergoes parsing and interpretation to generate simulation input files. Error messages are generated if the structure names are not found in the database or if gas names do not match the default ones in RASPA.
-
Calculations: Two types of calculations are performed:
- Grand Canonical Monte Carlo simulations using RASPA.
- Pore characterization of the structures using Zeo++.
-
Storage of Output Data: The resulting gas uptakes, along with input data and metadata from RASPA, are stored in a common
run<runID>.json
file. -
Data Transformation: Output data can be converted into more user-friendly array-like tables, facilitating the plotting of isotherms.
-
Data Merging: Various workflow outputs can be merged, enabling the retrieval of isotherms based on group selections using both data and metadata.
Figure 1: Diagram of the workflow.
It runs 20 simulations on RASPA and compute geometric features using ZEO++, then stores the results in CSV files. It then reconstructs the isotherms curves from the simulation results and compares line by line all isotherms files from pre-computed data found in the package repository. The geometrical features are also stored in a CSV format, and
To run it, use -t
or --test-isotherms-csv
flags:
python $PACKAGE_DIR/saw.py run -t
The input file used here is located in $PACKAGE_DIR/tests/test_isotherms_csv/
.
It runs 20 simulations on RASPA then stores the results in a single JSON file. It then reconstructs the isotherms and store the results in JSON format.
To run it, use the -t2
or --test-isotherm-json
flag :
python $PACKAGE_DIR/saw.py run -t2
The input file used here is located in $PACKAGE_DIR/tests/test_isotherms_json/
.
To run it, use the -t3
or --test-merge-json
flag :
python $PACKAGE_DIR/saw.py merge -t3
The json files containing the data to be merged (single pressure data points) are located in $PACKAGE_DIR/tests/test_merge_json/gcmc/
.
python $PACKAGE_DIR/saw.py run --test-charges
The output of the EQeq code should appear with the atom types and the atomic experimental properties used for the calibration of the method.
python $PACKAGE_DIR/saw.py run --test-charges-pacmof2
python $PACKAGE_DIR/saw.py run --test-cif-local-directory
This test will copy the corresponding input files to run simulations from the CIF files found in the current directory (or subdirectory)
structure
: a list of MOF crystal identifiers. The input parser employs MOFXDB API, so for now the only structures that can be used are structures found in CoREMOF (2014 or 2019) and h-MOF. By default, the input parser reads only a 6-letter refcode (CSD format). It then filters out only the structures found incoremof-2019
database. The CIF corresponding files will be downloaded in the./cif
directory at the root of the output directory and will contains files suffixes that indicates the level of curation (e.g. in VOGTIV_clean_h_coremof-2019.cif,_clean
refers to the deletion of adsorbates species from the original X-crystallography characterization, and_h
refers to adjustment of the hydrogen atoms)
Note : one six-letter identifier may result in two structures in the original CORE-MOF database, then all corresponding structures will be added to the list of cifs for further calculations, e.g. : structure=["AHUTIH"] will generate calculations for both files AHUTIH01_clean_coremof-2019.cif and AHUTIH_clean_coremof-2019.cif.
Users can also supply CIF files in a local archive ./cif
. This option is enabled by setting database
=local
in the input JSON file. The ./cif
directory must be located in the current directory.
Note : The user is responsible for cleaning the structures he provides. To use the automatic EQeq charge assignment calculation, the CIF format must check the program specifications.
-
molecule_name
: a list of gas molecule names. By default, the force field for adsorbate molecules is ExampleDefinitions. Therefore a file of the same name must exist in$RASPA_DIR/share/raspa/molecules/ExampleDefinitions
. -
forcefield
: the key name of the predefined force field in RASPA. The force field are stored in$RASPA_DIR/share/raspa/forcefield/
and here the list of possible key names :- CoreShellSchroderSauer
- Dubbeldam2007FlexibleIRMOF-1
- Dubbeldam2007FlexibleIRMOF-10
- Dubbeldam2007FlexibleIRMOF-16
- ExampleMOFsForceField
- ExampleMoleculeForceField
- ExampleZeolitesForceField
- Nicholas
-
others parameters in section
defaults
: all other parameters that can be modified through a template. The workflow uses by default the following template for RASPA inputs, all keys in brackets can be modified from JSON input file:
SimulationType {simulation_type}
NumberOfCycles {cycles}
NumberOfInitializationCycles {init_cycles}
PrintEvery {print_every}
RestartFile no
Forcefield {forcefield}
CutOff 12
ChargeMethod Ewald
EwaldPrecision 1e-6
UseChargesFromCIFFile {charges_from_cif}
Framework 0
FrameworkName {structure}
InputFileType {input_file_type}
UnitCells {a} {b} {c}
HeliumVoidFraction {helium_void_fraction}
ExternalTemperature {temperature}
ExternalPressure {pressure}
Movies no
WriteMoviesEvery 100
NumberOfGrids {grid_n_atoms}
GridTypes {grid_atoms}
SpacingVDWGrid {grid_spacing}
SpacingCoulombGrid {grid_spacing}
UseTabularGrid {grid_use}
Component 0 MoleculeName {molecule_name}
StartingBead 0
MoleculeDefinition ExampleDefinitions
IdealGasRosenbluthWeight 1.0
TranslationProbability 1.0
RotationProbability 1.0
ReinsertionProbability 1.0
SwapProbability 1.0
CreateNumberOfMolecules 0
Definitions :
{molecule_name}
: The molecule to test for adsorption. A file of the same name must exist in$RASPA_DIR/share/raspa/molecules/ExampleDefinitions
.{temperature}
: The temperature of the simulation, in Kelvin.{pressure}
: The pressure of the simulation, in Pascals.{helium_void_fraction}
: The helium void fraction of the input structure. Required for excess adsorption back-calculation.{unit_cells}
: The number of unit cells to use, by dimension.{simulation_type}
: The type of simulation to run, defaults to "MonteCarlo".{cycles}
: The number of simulation cycles to run.{init_cycles}
: The number of initialization cycles to run. Defaults to the minimum of cycles / 2 and 10,000.{forcefield}
: The forcefield to use. Name must match a folder in$RASPA_DIR/share/raspa/forcefield
, which contains the properly named.def
files.{input_file_type}
: The type of input structure. Assumes cif.{grid_}
: All parameters which refer to pre-calculation of energies on grids.
Note : In the future, we might let the user provide its own custom templates files to be able to take into account other parameters.
In the first version of the workflow, the outputs were stored in CSV files. Since this format is not appropriate to store metadata, we added routines to export all simulation data in a more suitable format, a JSON format. We also add a routine that transform single-pressure-point adsorption to isotherms.
location : ./gcmc/run<runID>.json
{
"input": {
"parameters": {
"structure": ["MIBQAR", "VOGTIV"],
"molecule_name": ["N2", "CO2"],
"pressure": [10, 1000000.0],
"npoints": 5,
"temperature": [298.15]
},
"defaults": {
"unit_cells": [1, 1, 1],
"forcefield": "ExampleMOFsForceField",
"init_cycles": 10,
"cycles": 20,
"print_every": 5
}
},
"metadata": {
"timestamp": "2024-01-15 17:30:42",
"python_version": "3.9.16",
"python_compiler": "GCC 11.3.0",
"os_platform": "Linux-x86_64",
"os_version": "#40~22.04.1-Ubuntu",
"os_system": "Linux",
"os_release": "6.2.0-39-generic",
"machine": "x86_64",
"processor": "x86_64",
"workflow_package_git_hash": "b505d5e63fee21be57836e7041c429b48e7c26dc",
"cif_source": {
"database": "mofxdb",
"version": "dc8a0295db"
}
},
"results": [
{
"cycles": 20.0,
"forcefield": "ExampleMOFsForceField",
"init_cycles": 10.0,
"molecule_name": "N2",
"npoints": 5.0,
"pressure": 10.0,
"print_every": 5.0,
"simkey": "simca3ee227",
"structure": "MIBQAR16_clean_coremof-2019",
"temperature": 298.15,
"unit_cells": "[2, 2, 2]",
"Pressure(Pa)": 10.0,
"uptake(cm^3 (STP)/cm^3 framework)": 0.0
},
{
...
}
]
}
location : ./gcmc/isotherms.json
This is just a transformation in the way data is printed in JSON keys, all data are grouped by a set of identical parameters defined as follows : Pressure(Pa)
, uptake(cm^3 (STP)/cm^3 framework)
,simkey
,pressure
,npoints
.
{
"isotherms": [
{
"simkey": [
"sim36c5696b",
"sim50f808f6",
"simc4bc79e4",
"sim8d5c22ad",
"sim887bbdbe"
],
"isokey": "iso191854b1",
"molecule_name": "CO2",
"structure": "MIBQAR16_clean_coremof-2019",
"unit_cells": "[2, 2, 2]",
"temperature": 298.15,
"cycles": 20.0,
"forcefield": "ExampleMOFsForceField",
"init_cycles": 10.0,
"npoints": 5.0,
"print_every": 5.0,
"pressure": [10.0, 250008.0, 500005.0, 750002.0, 1000000.0],
"Pressure(Pa)": [10.0, 250008.0, 500005.0, 750002.0, 1000000.0],
"uptake(cm^3 (STP)/cm^3 framework)": [0.0, 16.2411601942, 27.2677672825, 33.5143673572, 40.2498317856]
},
{
"simkey": [
...
],
...
},
{
...
}
]
}
Notes : For now, all metadata are not including in the isotherms JSON database. The JSON isotherms is nevertheless useful to quickly visualize the data results. To store long-term data, we recommend to use the database JSON obtained merging from a previous file.
In the future, we would like to update these routines to use a existent JSON format, for instance the one used by MOFXDB for interoperable operations between databases.
What other data/metadata should be added ?
- units
- cif
- identifiers (e.g. : common names for MOF; InChiKey for adsorbates)
- all metadata from CSD (dois, authors, solvent, ...)
- composition to deal with co-adsorption
- more simulation-related parameters from RASPA (e.g. SimulationType, UseChargesFromCIFFile, RASPA warnings, ...)
The partial charges can be calculated automatically given the atomic positions. To use this option ,add the following key with the corresponding keyword in the JSON input :
...
"parameters":
{
...
"charge_method":<keyword>
}
...
The possible keywords are :
All atoms have a null partial charge.
It calculates the partial charges using the EQeq method from the python wrapper pyeqeq.
First each CIF file is passed through Openbabel to correct format not compatible with EQeq (e.g. CIFs with columns in wrong order). A file with _openbabel
prefix is written in the same directory.
It will then duplicate the CIF files present in ./cif
directory with a suffix name related to the method; e.g.: MIBQAR16_clean_coremof-2019_openbabel.cif_EQeq_ewald_1.20_-2.00.cif
contains an extra column for the partial charges calculated with Ewald Coulombic interaction, a dielectric parameter of 1.2 and the hydrogen electron affinity is -2.
Other default parameters can be found in the original code page. For instance, one important implementation of EQeq (with respect to the Rappé and Goddard method) is the use of non-zero centered charge (parameter chargecenters
), the charges are hence equilibrated around its number of oxidation and the oxidation number of the atom are fixed by default. If you use the same default parameters for the whole screening, a wrong oxidation number could be assign, and the calculation of the partial charges will be affected. To solve this, one can recalculate oxidation numbers from the structure file using ref1 or ref2 (MOSAEC code by Woo et al., not available yet).
This charge assigment method here (PACMOF v.2), use a descritor-based Machine Learning model to predict partial charges from the structure of the materials. The training set is based on DFT calculations in the GCA approximations with PBE fonctionals and the reference structures comes from QMOF database.
In RASPA, one can speed up GCMC calculations by computing energy grids. It stores energies (Van der Waals and electrostatic) of all host atoms for a given framework in $RASPA_DIR/share/raspa/grids/
. The position of a randomly inserted host molecule during GCMC is marked in the grid, then the energy of the host molecule is interpolated from the energy values on the neighboring grid nodes.
It can speed up calculations if a consequent number of simulations use the same framework-host (different pressures, temperatures).
To use this option, one can to pass these parameters in the default
fields :
...
"default":
{
...
"grid_use":<yes/no>
"grid_spacing":<value_in_angstrom>
}
...
If grid_use
is set to 'yes', all GCMC simulations running in the workflow will use grids, these latter are calculated in a previous step during the workflow.
TODO : There is a bug to fix with this option ! for now,when this option is on, it recalculates for each GCMC simulations the energy grid, even if the grid has already been computed previously (same pair of adsorbate atom and adsorbent material).
- If the user wants to run calculation on its own structures, several verification must be performed to be used in a GCMC simulation which is out of the scope of the present tool (curate CIF, check presence of force field parameters for the new atoms name defined, ...)
- Use a user-defined force field