Link to the pdf version of this document
Binary packages are available for linux and macOS on conda-forge:
conda config --add channels conda-forge #if not already done
conda config --set channel_priority strict #if not already done
conda install analisi
or by compiling from source as described in the installation section below.
Fastest possible example usage:
- mean square displacement
./analisi -i tests/lammps.bin -Q > output_file
- spherical harmonics correlation functions
./analisi -i tests/lammps.bin -Y 4 -F 0.7 3.5 > output_file
- g(r), with time lags too
./analisi -i tests/lammps.bin -g 200 -F 0.7 3.5 > output_file
- green kubo autocorrelation integrals
./analisi -l tests/gk_integral.txt -H -a 'c_flux[1]' 'c_vcm[1][1]' > output_file
and many others...
#read trajectory
import numpy as np
pos = np.load( 'tests/data/positions.npy')
vel = np.load( 'tests/data/velocities.npy')
box = np.load( 'tests/data/cells.npy')
#atomic types: load an array with length equal to the number of atoms
#in this example I set all the atoms to type 0 but the last 16,
#which are 8 of type 1 and 8 of type 2
types = np.zeros(pos.shape[1], dtype = np.int32)
types[-16:-8]=1
types[-8:]=2
print('position array shape is {}'.format(pos.shape))
print('first cell matrix is {}'.format(box[0]))
#create trajectory object
import pyanalisi
analisi_traj = pyanalisi.Trajectory(
pos, #shape (ntimesteps, natoms, 3)
vel, #same shape as pos
types, #shape (natoms)
box, #shape (ntimesteps, 3,3)
# or (ntimesteps, 6)
# or (ntimesteps, 9)
pyanalisi.BoxFormat.CellVectors,
# input box format is
# matrix of cell vectors
False, # don't wrap atoms around the center of
# the cell
False # don't save rotation matrix that is used
# internally if the cell is triclinic
)
#do the calculation that you want
#see MSD section
msd=pyanalisi.MeanSquareDisplacement(analisi_traj,10,4000,4,True,False,False)
msd.reset(3000)
msd.calculate(0)
#the calculation object can be used as an array
result=np.array(msd,copy=False)
#other calculations
#...
#...
Heat transport coefficient calculation: correlation functions and Green-Kubo integral for a multicomponent fluid example
import numpy as np
#read the first line (a header) and the column formatted data file;
#filepath_tests points to the tests folder of the repository
with open(filepath_tests + '/data/gk_integral.dat', 'r') as flog:
headers = flog.readline().split()
log = np.loadtxt(filepath_tests + '/data/gk_integral.dat', skiprows=1)
import pyanalisi
#wrapper for the pyanalisi interface
traj = pyanalisi.ReadLog(log, headers)
#see Green Kubo section
gk = pyanalisi.GreenKubo(traj,'',
1,['c_flux[1]','c_vcm[1][1]'],
False, 2000, 2,False,0,4,False,1,100)
gk.reset(traj.getNtimesteps()-2000)
gk.calculate(0)
#the calculation object can be used as a python array
result = np.array(gk,copy=False)
If you use this program and you like it, spread the word around and give it credit! Implementing something that is already implemented can be a waste of time. And why don't you try to implement something that you like inside it? You will get MPI parallelization and variance calculation for free, since they are already implemented in a very generic and abstract way.
This is a framework for computing averages and block averages on molecular dynamics trajectories. MPI parallelization over the blocks is implemented in a very abstract and general way, so a new calculation can be parallelized with MPI with no extra effort. The code has two interfaces: a command line interface, parallelized with MPI and threads, and a python interface, parallelized with threads only. With these parallelization levels more accurate averages can be computed, as described in detail in the next sections.
Every MPI process uses the same amount of memory, loading only the part of the trajectory that it needs for the calculation of the current block. For this reason it is suggested to use one MPI process per machine and to parallelize with threads inside each machine.
Features:
- python interface (reads numpy array)
- command line interface (reads binary lammps-like files or time series in column format)
- multithreaded (defaults to the number of threads specified by the shell variable OMP_NUM_THREADS)
- command line interface has MPI too (for super-heavy multi-machine calculations)
- the command line interface calculates the variance of every quantity and every function (in python you can do it yourself with numpy.var)
- jupyter notebook example of the python interface
Calculations:
- g of r (and time too!)
- vibrational spectrum with FFTW3
- histogram of number of neighbours
- mean square displacement
- green kubo integral of currents
- multicomponent green kubo time domain formula (MPI here can be useful...)
- spherical harmonics number density time correlation analysis (MPI here can be useful...)
- atomic position histogram
- and more ...
- ...
Dependencies:
- C++17 compatible compiler (GCC 7+, clang 6+, Intel 19.0.1+, MSVC 19.29)
- cmake 3.8+
- linux or macOS for full functionality, including the command line tool, reading of LAMMPS binary files and the full test suite
- on Windows only the python library can be used, and only python arrays can be used as trajectory data. Some functionality is not tested. Reading LAMMPS trajectories is not supported (it would require rewriting some of the code in the core library; it is doable)
- FFTW3 (included in the package)
- Eigen3 (included in the package)
- Boost (included in the package)
- MPI (optional)
- libxdrfile (for gromacs file conversion -- included in the package)
- python (optional)
Documentation build:
- r-rsvg
- pandoc
- rsvg-convert
- texlive
- readme2tex (pip)
To build with MPI enabled:
mkdir build
cd build
cmake ../ -DUSE_MPI=ON
make
or, for a serial build:
mkdir build
cd build
cmake ../
make
After compiling the code you will find the executable analisi and the shared library pyanalisi.*.so (with part of the name depending on your particular python installation) in the build folder. To install the python library, simply copy the pyanalisi folder into your site-packages folder, and then copy the pyanalisi.*.so library inside it. This is done by the script install/install_python.sh after exporting the variables SP_DIR, set to your python's package directory, BUILD_DIR, set to the build directory, and SOURCE_DIR, set to the main directory of the repository. The python package folders can be found with the command python -m site:
export SP_DIR="/path/to/python/package/folder"
export SOURCE_DIR="/path/to/repository/dir"
export BUILD_DIR="/path/to/build/directory"
install/install_python.sh
Be careful to choose the correct python executable: the same one that you used to compile the library.
To test that the library was compiled correctly and that there are no regressions, you can run the (small) test suite with the command
make test
Then you can run the test_cli.sh script inside the tests folder, which tests the command line interface. You have to export the same variables used by install/install_python.sh.
After the python library is installed, you can test it by running pytest -sv in the tests folder. If all tests pass, your installation is probably working correctly. Testing is strongly suggested; you should always do it, since unstable environments or buggy compilers are often found on supercomputers.
Other cmake options:
- -DBUILD_TESTS=OFF skips the building of the C++ tests
- -DSYSTEM_FFTW3=ON use the system's FFTW3 library
- -DSYSTEM_EIGEN3=ON use the system's Eigen3 library
- -DSYSTEM_BOOST=ON use the detected system's Boost library
- -DSYSTEM_XDRFILE=ON use the detected system's xdrfile library. If it is not found, support for gromacs file conversion is simply disabled
- -DPYTHON_EXECUTABLE=/path/to/python use this python installation to build the python interface
- -DPYTHON_INTERFACE=OFF don't build any python interface (the build process will not depend on python)
- -DBUILD_MMAP=OFF don't use unix's mmap. This prevents the compilation of the C++ test suite and the command line tool; only the python library can be built.
This document is better rendered in the pdf version (link to the pdf version). Note that it is generated by the script build_pdf.sh; the original file is README_.md.
The command line utility can read only binary trajectory files in the LAMMPS format, specified with the command line option -i input_file, or a time series in a column formatted text file with a header, specified with the command line option -l input_file. The trajectory file can be generated in many ways:
- by LAMMPS :-)
- by using the command line utility with the options -i input_file -binary-convert output_file, where input_file is the name of a plain text trajectory in the format:
[natoms]
[xlo] [xhi]
[ylo] [yhi]
[zlo] [zhi]
[id_1] [type_1] [x_1] [y_1] [z_1] [vx_1] [vy_1] [vz_1]
...
[id_natoms] [type_natoms] [x_natoms] [y_natoms] [z_natoms] [vx_natoms] [vy_natoms] [vz_natoms]
That is: for every step you provide the number of atoms, the low and high coordinates of the orthorhombic cell, and then, for every atom, its id, type id, position and velocity. A python sketch that writes this format is shown after this list.
- by using the command line utility with the options -i input_file -binary-convert-gromacs output_file typefile and a gromacs trajectory (you have to provide the xdr library in the build process)
- by using the python interface: see the section Buffer protocol interface
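For clarity, here is a hypothetical python sketch (not part of the library) that writes numpy arrays in the plain text format above; a whitespace-delimited parser and an orthorhombic cell with the origin in zero are assumptions:
import numpy as np
pos = np.load('tests/data/positions.npy') #shape (nsteps, natoms, 3)
vel = np.load('tests/data/velocities.npy') #shape (nsteps, natoms, 3)
box = np.load('tests/data/cells.npy') #shape (nsteps, 3, 3), assumed diagonal here
types = np.zeros(pos.shape[1], dtype=int)
with open('plain_trajectory.txt', 'w') as f:
    for t in range(pos.shape[0]):
        f.write('{}\n'.format(pos.shape[1])) # [natoms]
        for d in range(3): # [xlo] [xhi], [ylo] [yhi], [zlo] [zhi]
            f.write('0.0 {}\n'.format(box[t, d, d]))
        for a in range(pos.shape[1]): # [id] [type] [x y z] [vx vy vz]
            f.write('{} {} {} {} {} {} {} {}\n'.format(
                a + 1, types[a], *pos[t, a], *vel[t, a]))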
Where meaningful, the following arguments are shared by all the calculations performed by the command line utility:
- -i input_file specifies the input trajectory binary file. See the LAMMPS documentation for how to produce a binary dump. The order of the dumped atomic quantities must be id type xu yu zu vx vy vz
- -l input_file specifies an input time series in text column format with headers
- -N thread_number specifies the number of threads to use where parallelization is implemented with threads
- -B number_of_blocks when the code calculates variances, it splits the trajectory into many blocks; with this option you specify the number of blocks. If MPI parallelization is used, since different blocks are calculated in parallel, you may want to specify a number of blocks that is a multiple of the number of MPI processes, so that at the final iteration over the blocks all processes are busy
- -s timestep_number_skip usually neighbouring configurations in the trajectory are strongly correlated; with this option you specify the separation between two consecutive configurations used to calculate averages
- -S timestep_number_length specifies the length, in number of timesteps, of the function that the code calculates (for example the length of the MSD plot)
You can create a trajectory python object to be used in this library in two ways:
- start from python arrays generated by other code
- use a LAMMPS binary file that you have on the filesystem. This is the same file that is used by the command line interface.
To access the positions, velocities and cell data you can use, in both the python buffer protocol interface and the lammps binary one, the functions get_positions_copy(), get_velocities_copy() and get_box_copy(). Note that each call of these functions allocates new memory.
Internally, the cell information is stored in the following format:
[x_low, y_low, z_low, lx/2, ly/2, lz/2, xy, xz, yz]
and is accessible with the python function get_box_copy(), common to the trajectory objects. Note that internally the first cell vector always points along the x axis, the second lies in the xy plane, and the third has an arbitrary direction. This is equivalent to requiring that the cell matrix is triangular. If necessary, this is achieved by a QR decomposition of the input cell matrix, rotating everything with the Q matrix.
You must have 4 arrays. In general the interface supports any object that implements python's buffer protocol; this includes, among others, numpy arrays. Suppose you have a trajectory of t timesteps and n atoms. You need the following arrays:
- position array, of shape (t, n, 3)
- velocity array, of shape (t, n, 3)
- cell array, of shape (t, 3, 3), or, if a lammps formatted cell is provided, (t, 6) for orthorhombic and (t, 9) for triclinic. The lammps format simply lists the low and high coordinates for each dimension and the tilt factors, as described below
- types array, of shape (n) (integer array)
In general no particular units of measure are required, and the output will reflect the units that you used in the input. The calculations that the program does are reported exactly in the following sections; from those formulas you can deduce the units of the output given the units of the input.
Then you must decide whether you want the coordinates to be rewrapped around the center of the cell or not. In the triclinic case this is not equivalent to wrapping the coordinates inside the cell, but it generates a compact set of positions suitable for an efficient calculation of the minimum image distance.
The lammps format for the cell is [x_lo, x_hi, y_lo, y_hi, z_lo, z_hi, xy, xz, yz], as described in the lammps documentation.
If plain cell vectors are provided in the cell matrix and this matrix is not diagonal, a QR decomposition is performed to obtain a triangular cell matrix and achieve the desired internal format. In this case all velocity and position vectors are rotated with the rotation matrix Q, which can be obtained with the function get_rotation_matrix().
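To illustrate the convention, here is a plain numpy sketch (not the library's code) of the QR reduction described above; note that np.linalg.qr may return negative diagonal entries in R, a sign choice that a real implementation would fix:
import numpy as np
cell = np.array([[10.0, 0.0, 0.0],
                 [ 2.0, 9.0, 0.0],
                 [ 1.0, 1.0, 8.0]]) #rows are the cell vectors
Q, R = np.linalg.qr(cell.T) #columns of cell.T are the cell vectors
cell_triangular = R.T #triangular cell matrix: first vector along x,
                      #second in the xy plane
pos = np.random.rand(4, 3) @ cell #a few atomic positions
pos_rotated = pos @ Q #the same rotation is applied to positions and velocities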
The syntax for creating the object is
import pyanalisi
analisi_traj = pyanalisi.Trajectory(positions_array,
velocity_array,
types_array,
box_array,
input_box_format_id,
wrap_atomic_coordinates,
save_Q_rotation_matrix
)
where input_box_format_id is one of pyanalisi.BoxFormat.CellVectors, pyanalisi.BoxFormat.LammpsOrtho or pyanalisi.BoxFormat.LammpsTriclinic, and describes the format of the array box_array. If wrap_atomic_coordinates is True, the code wraps all the atomic coordinates around the center of the cell (which does not mean inside the cell in the triclinic case). This makes the computation of the minimum image distance more efficient if the atoms have drifted far outside the cell, but it invalidates all the calculations where the atoms are not supposed to be wrapped back inside the cell.
You can write a LAMMPS binary trajectory (that can be used by the command line interface with mpi, for example) by calling
analisi_traj.write_lammps_binary('output_path', start_timestep, end_timestep)
where start_timestep
is the first timestep index to dump (indexes start from 0) and end_timestep
is the first timestep that will not be written. If end_timestep == -1, everything until the end of the trajectory is dumped. This is a very convenient way of moving heavy computations to a cluster where MPI can be used or, more generally, of converting a generic trajectory format into the format used by the command line tool. For example:
#read trajectory. It can come from everywhere
import numpy as np
pos = np.load( 'tests/data/positions.npy') #shape (N_timesteps, N_atoms, 3)
vel = np.load( 'tests/data/velocities.npy') #shape (N_timesteps, N_atoms, 3)
box = np.load( 'tests/data/cells.npy') #shape (N_timesteps, 3, 3)
types = np.load( 'tests/data/types.npy') #shape (N_atoms), dtype=np.int32
#create trajectory object and dump to file
import pyanalisi
analisi_traj = pyanalisi.Trajectory(pos, vel, types, box,
                   pyanalisi.BoxFormat.CellVectors, # box array contains cell vector matrices
                   False, # don't wrap the coordinates
                   False # not interested in Q matrix
                   )
analisi_traj.write_lammps_binary("output_filename.bin"
, 0, # starting timestep
-1 # last timestep:
) # -1 dumps full trajectory
This interface is a little more complicated, since it was designed for computing block averages of very big files. It can read the same files that the command line program reads. The object is created with
analisi_traj = pyanalisi.Traj('path_of_binary_file')
Then you have to call some more functions, BEFORE calling the compute routines:
analisi_traj.setWrapPbc(True) #optional: if you want to wrap positions inside the cell
analisi_traj.setAccessWindowSize(size_in_number_of_steps_of_the_reading_block)
analisi_traj.setAccessStart(first_timestep_to_read)
The first line enables pbc wrapping (don't use this if you have to compute the MSD!). Since the trajectory is accessed in blocks, the second line sets the size of the reading block: the bigger the block, the more memory is allocated to store the positions and the velocities. The last call sets the index of the first timestep to be read and reads the block: at that moment the selected chunk of the trajectory is loaded into memory. At this point you can call the needed compute routine, making sure that it does not read past the last timestep of the block. Later you can call setAccessStart again to load a different part of the trajectory and compute the quantity again. Only the data of the current block is kept in memory; if there is some overlap with the previous block, the overlapping data is copied in memory and is not read again from the filesystem.
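A hypothetical block loop could look like the following sketch (the window size, the MSD parameters and the way calculate receives the first timestep are assumptions for illustration):
import numpy as np
import pyanalisi
traj = pyanalisi.Traj('tests/lammps.bin')
window, tmax = 2000, 1000
traj.setAccessWindowSize(window)
results = []
for first in range(0, traj.getNtimesteps() - window + 1, window):
    traj.setAccessStart(first) #load this chunk of the trajectory
    msd = pyanalisi.MeanSquareDisplacement_lammps(traj, 10, tmax, 4,
                                                  True, False, False)
    msd.reset(window - tmax) #so the averages never read past the block
    msd.calculate(first)
    results.append(np.array(msd, copy=True)) #copy: the buffer is reused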
The Traj class has a normal buffer protocol interface, so you can use it as a numpy array. By default the array exposed to python is the positions one. If you want to read the velocities, you can do:
analisi_traj.toggle_pos_vel() # to change from positions to velocities and vice-versa
#do whatever you want
if not analisi_traj.serving_pos():
np.sum(analisi_traj) # sum all velocities
else:
np.sum(analisi_traj) # sum all positions
To access the atomic lammps ids and the atomic lammps types, you can use the following two functions:
analisi_traj.get_lammps_id()
analisi_traj.get_lammps_type()
Don't call them too often, since every call creates a new numpy array of integers.
Some functions are common to both the LAMMPS and the numpy interfaces:
- get_positions_copy() returns a copy of the loaded positions
- get_velocities_copy() returns a copy of the loaded velocities
- get_box_copy() returns a copy of the cell data, stored in the internal format
- write_lammps_binary(fname, start, stop) writes a dump of the selected part of the trajectory in the lammps binary format. In the lammps interface this is also useful for extracting a part of the full trajectory.
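For example (shapes are indicative, see the sections above):
pos = analisi_traj.get_positions_copy() #(nsteps, natoms, 3)
vel = analisi_traj.get_velocities_copy() #(nsteps, natoms, 3)
cell = analisi_traj.get_box_copy() #internal format described above
analisi_traj.write_lammps_binary('part.bin', 0, 100) #dump the first 100 steps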
Before analyzing a time series, for example to calculate the integral of the autocorrelation function, it is necessary to create a time series object. This object will hold the data that a different function can analyze. It is created with the following:
time_series = pyanalisi.ReadLog(data_array, headers_array)
where data_array and headers_array are python objects that support the buffer protocol interface, for example numpy arrays, or plain python lists.
The data_array is a two dimensional array where the first index is the timestep index and the second index is the column index.
The headers_array object must describe, using python strings, the columns that are stored in the two dimensional floating point array data_array.
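For instance, a minimal time series object can be built from synthetic data (the column names here are arbitrary):
import numpy as np
import pyanalisi
data = np.random.rand(1000, 4) #1000 timesteps, 4 columns
headers = ['c_flux[1]', 'c_flux[2]', 'c_flux[3]', 'c_vcm[1][1]']
time_series = pyanalisi.ReadLog(data, headers)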
Given a trajectory $\mathbf{x}^a_i(t)$, where $a$ is the atomic index, $i$ the atomic species and $t$ the timestep index, and defining the center of mass position of the atomic species $i$ at the timestep $t$ as

$$\mathbf{x}^{cm}_i(t) = \frac{1}{N_i}\sum_{a=1}^{N_i} \mathbf{x}^a_i(t)$$

where $N_i$ is the number of atoms of the species $i$, the code computes the following:

$$MSD_i(t) = \frac{1}{N_i}\sum_{a=1}^{N_i}\left\langle \left|\mathbf{x}^a_i(t+t_0) - \mathbf{x}^a_i(t_0)\right|^2\right\rangle_{t_0}$$

If the option `--mean-square-displacement-self` is provided on the command line, or the corresponding argument is set to `True` in the python interface, the atomic mean square displacement of each atomic species is calculated in the reference system of the center of mass of that particular species; that is, the following is computed:

$$MSD^{self}_i(t) = \frac{1}{N_i}\sum_{a=1}^{N_i}\left\langle \left|\left(\mathbf{x}^a_i(t+t_0) - \mathbf{x}^{cm}_i(t+t_0)\right) - \left(\mathbf{x}^a_i(t_0) - \mathbf{x}^{cm}_i(t_0)\right)\right|^2\right\rangle_{t_0}$$

The output has many columns, one for each atomic species: first the block of the atomic MSD columns and then, if requested, the block of the center of mass MSD columns. The center of mass MSD is computed only if the command line option `-Q` is provided or the corresponding argument is set to `True` in the python interface constructor. In the command line output, after each column you will find its variance, calculated with a block average over the specified number of blocks. In summary, the options that you can use for this calculation are:
./analisi -i lammp_binary [-N number of threads] [-S stop timestep] [-s skip every] [-B number of blocks] -q | -Q [--mean-square-displacement-self]
where any of the options -q, -Q and --mean-square-displacement-self activates the calculation of the MSD.
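As a cross-check of the definition above, the quantity computed for each species corresponds to this naive (and slow) numpy sketch, which is not the library's implementation:
import numpy as np
def msd_reference(pos, types, tmax):
    #pos: (nsteps, natoms, 3); types: (natoms,)
    nsteps = pos.shape[0]
    species = np.unique(types)
    out = np.zeros((tmax, len(species)))
    for i, s in enumerate(species):
        idx = np.where(types == s)[0]
        for dt in range(tmax):
            disp = pos[dt:, idx, :] - pos[:nsteps - dt, idx, :]
            #average over time origins and over the atoms of the species
            out[dt, i] = np.mean(np.sum(disp**2, axis=-1))
    return out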
To use this calculation in the python interface, first create a trajectory object (either from a lammps binary or from python array objects, as described in the section about creating a trajectory object); then, if the numpy array trajectory is used, create the following instance:
msd_calculation = pyanalisi.MeanSquareDisplacement(
trajectory_instance, # Trajectory instance
skip, # in calculating averages, skip this number of timesteps
# (as in the -s option)
tmax, # size of the MSD(t) plot in number of timesteps
nthreads, # number of parallel threads to use in the computation
center_of_mass, # if True calculates center of mass displacement too
use_cm_reference, # use the reference system of the center of mass
# of all atoms of the same type
debug_flag # if True produces some dump files...
)
if the lammps binary trajectory instance is used, you must use pyanalisi.MeanSquareDisplacement_lammps with exactly the same arguments; the calculation will be exactly the same. After initializing the object, the calculation is prepared with
msd_calculation.reset(block_size)
which sets the number of timesteps that will be used to calculate the average over the trajectory. If tmax is greater than block_size, tmax will be set to block_size. The calculation is started with:
msd_calculation.calculate(first_timestep)
where first_timestep
is the index of the first timestep that will be used to calculate the averages. After the calculation over this block is finished, the result can be collected in a numpy array with:
result = np.array(msd_calculation, copy=False)
In general the msd_calculation object supports the buffer protocol interface, so you can retrieve the result with anything that can read a python buffer.
Given $M$ vector time series of length $N$, $\{{}^a\mathbf{J}_i\}_{i=1,\dots,N}$, $a=1,\dots,M$, the code implements an expression equivalent to the following formulas:

$$C_{ab}(t) = \frac{1}{N-t}\sum_{i=1}^{N-t} {}^a\mathbf{J}_i \cdot {}^b\mathbf{J}_{i+t}$$

$$L_{ab}(t) = \sum_{j=0}^{t}{}^{*}\, C_{ab}(j)$$

$$\lambda(t) = \frac{1}{\left[L(t)^{-1}\right]_{00}}$$

but with the trapezoidal rule in place of the sums marked with $*$. Note that $L$ is a matrix. To get the correct units of measure, you still have to multiply all the quantities but the $C$s by the integration timestep. $N$ is the number of timesteps on which the code runs the average. Every quantity ($C_{ab}$, $L_{ab}$ and $\lambda$) is written in the output, one column per matrix element. If the command line tool is used, the variance of the block average is automatically computed, and after each column you find its variance; moreover, you find in the output a useful description of the columns with their indexes.

Given a central atom of type $\alpha$ at timestep $t$, the code calculates the histogram of the minimum image's radial distance of atoms of type $\beta$ at timestep $t+\bar{t}$. The histogram can be of the same atom (you know, it spreads around while time passes) or of all atoms different from the original one. Everything is averaged over $t$ and over atoms of the same type. In particular, defining the bin index of a pair of atoms $a$ and $b$ at a timelag $\bar{t}$ and a timestep $t$ as

$$h^{ab}(t,\bar{t}) = \left\lfloor \frac{\left|\mathbf{x}^a(t) - \mathbf{x}^b(t+\bar{t})\right|_{\text{minimum image}} - r_{min}}{\Delta r} \right\rfloor$$

the histogram between the species $\alpha$ and $\beta$ is defined as

$$H_{\alpha\beta}(i,\bar{t}) = \sum_{t}\sum_{a\in\alpha}\sum_{b\in\beta} \delta_{i,\,h^{ab}(t,\bar{t})}$$

where $\delta$ is the Kronecker delta. In practice the program, for each pair of atoms, adds 1 to the corresponding bin of the histogram. For each $\bar{t}$, with $0 \le \bar{t} < \bar{t}_{max}$, the histograms for all the combinations of $\alpha$ and $\beta$ are stored one after the other in memory, in this order (0 is the first).
The layout of the command line output is a gnuplot-friendly one: the output is organized in blocks, one for each $\bar{t}$, separated by two blank lines, and every line corresponds to a histogram bin, with one column for every combination of $\alpha$, $\beta$ (the two indexes are collapsed into a single number, the position order of the histogram). As usual, in the command line output every column is an average over all the blocks and is followed by the variance of the mean.
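The histogram defined above corresponds, for a single species and an orthorhombic cell, to this plain numpy sketch (not the library's implementation):
import numpy as np
def gofr_tau(pos, L, tau, rmin, rmax, nbins):
    #pos: (nsteps, natoms, 3); L: (3,) cell edge lengths
    nsteps, natoms, _ = pos.shape
    hist = np.zeros(nbins)
    for t in range(nsteps - tau):
        d = pos[t, :, None, :] - pos[t + tau, None, :, :]
        d -= np.rint(d / L) * L #minimum image convention
        r = np.sqrt((d**2).sum(axis=-1))
        if tau == 0:
            r = r[~np.eye(natoms, dtype=bool)] #exclude the self distance
        h, _ = np.histogram(r, bins=nbins, range=(rmin, rmax))
        hist += h
    return hist / (nsteps - tau) #averaged over the starting timestep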
The implemented formula for the real spherical harmonics is the following:

$$Y_{\ell m}(\theta,\phi) = \begin{cases} \sqrt{2}\, K_\ell^{m} \cos(m\phi)\, P_\ell^{m}(\cos\theta) & m > 0 \\ K_\ell^{0}\, P_\ell^{0}(\cos\theta) & m = 0 \\ \sqrt{2}\, K_\ell^{|m|} \sin(|m|\phi)\, P_\ell^{|m|}(\cos\theta) & m < 0 \end{cases} \qquad K_\ell^{m} = \sqrt{\frac{2\ell+1}{4\pi}\frac{(\ell-m)!}{(\ell+m)!}}$$

where $\cos\theta$, $\sin\theta$, $\cos\phi$, $\sin\phi$ are calculated using cartesian components:

$$\cos\theta = \frac{z}{r}, \quad \sin\theta = \sqrt{1-\cos^2\theta}, \quad \cos\phi = \frac{x}{r\sin\theta}, \quad \sin\phi = \frac{y}{r\sin\theta}$$

and $\cos(m\phi)$, $\sin(m\phi)$ are evaluated using Chebyshev polynomials with a recursive definition:

$$\cos(m\phi) = 2\cos\phi\,\cos((m-1)\phi) - \cos((m-2)\phi)$$
$$\sin(m\phi) = 2\cos\phi\,\sin((m-1)\phi) - \sin((m-2)\phi)$$

The $P_\ell^m$ are the associated Legendre polynomials, calculated with the following set of recursive definitions:

$$P_\ell^\ell(x) = -(2\ell-1)\sqrt{1-x^2}\, P_{\ell-1}^{\ell-1}(x), \qquad P_0^0(x) = 1$$
$$P_\ell^{\ell-1}(x) = (2\ell-1)\, x\, P_{\ell-1}^{\ell-1}(x)$$

that allow us to calculate every $P_\ell^\ell$ and every $P_\ell^{\ell-1}$. Then we have the recursion to go up in $\ell$, for any value of $m$:

$$P_\ell^m(x) = \frac{(2\ell-1)\, x\, P_{\ell-1}^m(x) - (\ell+m-1)\, P_{\ell-2}^m(x)}{\ell-m}$$

The program, given a number $\ell_{max}$ and a triplet $(x,y,z)$, is able to calculate every value of $\cos(m\phi)$, $\sin(m\phi)$ and $P_\ell^m$ for all allowed values of $\ell \le \ell_{max}$ and $m$ with a single recursion. Let

$$\rho^{I,\alpha}_{\ell m}(t) = \int_{V_I} \rho^\alpha(\mathbf{x},t)\, Y_{\ell m}\!\left(\frac{\mathbf{x}-\mathbf{x}^I(t)}{\left|\mathbf{x}-\mathbf{x}^I(t)\right|}\right) d\mathbf{x}$$

for some timestep $t$ in some volume $V_I$, taken as the difference of two concentric spheres centered on the atom $I$ of radii $r_{outer}$ and $r_{inner}$, where $\rho^\alpha$ is the atomic density of the species $\alpha$. Since the densities are taken as sums of Dirac delta functions, it is sufficient to evaluate the spherical harmonics functions at the positions of the atoms. Then we calculate the following:

$$c^{\alpha\beta}_{\ell}(\bar{t}) = \left\langle \sum_{m=-\ell}^{\ell} \rho^{I,\beta}_{\ell m}(t+\bar{t})\, \rho^{I,\beta}_{\ell m}(t) \right\rangle_{t,\, I\in\alpha}$$

where $\langle\cdot\rangle$ is an average operator and we do an additional average over all the central atoms $I$ of the type $\alpha$. The average is implemented as an average over the starting timestep.

The implemented equation for the vibrational spectrum is:

$$D^{c}_{\alpha}(\omega) = \frac{1}{3N_\alpha}\sum_{a\in\alpha}\left|\int v^{c}_{a}(t)\, e^{-i\omega t}\, dt\right|^2$$

where $\alpha$ is the type of atom and $c$ is one of the spatial directions x, y, z. The diffusivity can be computed as half of the zero-frequency value of $D$.
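Up to normalization, the formula above is the periodogram of the atomic velocities, as in this numpy sketch for a single species (not the library's implementation):
import numpy as np
def vdos_reference(vel):
    #vel: (nsteps, natoms, 3), atoms of one species
    v_omega = np.fft.rfft(vel, axis=0) #FFT over time, per atom and component
    return np.mean(np.abs(v_omega)**2, axis=1) #average over atoms: (nfreq, 3)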
The columns of the output are ordered as follows: the frequency index comes first, followed by the spectrum columns of each of the atomic species. Note that in the command line version each column but the first is followed by its variance.
The options that you can use for this calculation are simply:
./analisi -i lammp_binary [-N number of threads] [-V] [-B number of blocks]
the code will generate a file with a number of lines equal to the number of frequencies and with ntypes_atoms*2+1 columns, where:
- the first column represents the index of the frequencies
- then, for each atomic type, there are the block average and the variance of the spectrum
The code is transparent to the units of measure of the quantities. If you want the diffusivity in the correct units (e.g. metal units), proceed in the following way:
- the first column can be multiplied by 1/(nstep*dt) to obtain the frequencies in multiples of Hz
- the other columns can be multiplied by dt/nstep;
where nstep is the total number of steps of the block used to compute the VDOS and dt is the time difference between two consecutive molecular dynamics steps.
There are two optimal methods of computing the VDOS with the python interface, depending on which trajectory object you are using. The two functions can be found in the common.py file:
- with the lammps Traj interface: analyze_vdos_lammps(traj, nstep=None, start=0, resetAccess=True, print=print), where:
  - traj is the lammps trajectory object
  - nstep is the number of steps to use in the computation of the VDOS; if None, all steps are included
  - start is the step used as the starting point of the trajectory
  - resetAccess, if True, resets traj.setAccessWindowSize(traj.getNtimesteps()) and traj.setAccessStart(0)
- with the numpy Trajectory interface: analyze_vdos_numpy(pos, vel, types, box, nstep=ltot, start=start), where:
  - pos: positions array (N_timesteps, N_atoms, 3)
  - vel: velocities array (N_timesteps, N_atoms, 3)
  - types: types vector (N_atoms)
  - box: box array
  - matrix_format: bool, matrix format for the box array
  - wrap: bool, wrap the coordinates
  - nstep: int, number of steps to use
  - start: int, starting step
Both functions return a numpy array with dimensions (ntypes, freq, cartesian_coordinate). The units are the same as in the command line version, thus the result must be multiplied by dt/nstep.
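A hypothetical usage, with the unit conversion described above (the import path of common.py and the numeric values are assumptions):
import numpy as np
from pyanalisi.common import analyze_vdos_numpy
nstep, dt = 4000, 0.001 #steps per block and MD timestep in your units
vdos = analyze_vdos_numpy(pos, vel, types, box, nstep=nstep, start=0)
vdos_in_units = vdos * dt / nstep #(ntypes, freq, cartesian_coordinate)
freqs = np.arange(vdos.shape[1]) / (nstep * dt) #frequency axis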
The CalcolaMultiThread class tries to abstract away the features common to most of the calculations over the trajectory, making it easy to write parallel calculations that take advantage of the modern multicore architectures of every computer.
All the logic about MPI and block averages is self contained in mediablocchi.h, a small file of only ~200 lines, which takes care of computing block averages in both the MPI and the non-MPI case. Calculations that need (or can take advantage of) block averages and want to use this tool must implement a few functions, as found in all the calculation classes, and must use the Traiettoria class to read the trajectory. In this way the user can concentrate on implementing the actual calculation without writing the boilerplate code needed to compute block averages and MPI-parallelize it. The calculation class holding the data that must be averaged over the blocks must derive from OperazioniSuLista.
Written by Riccardo Bertossa during his time at SISSA.