Skip to content

Using LOOS for Principal Component Analysis

Grossfield Lab edited this page Oct 12, 2020 · 3 revisions

Principal component analysis is one of the most commonly used techniques for dimension reduction of simulations of macromolecules. This article will run through the tools and workflows for performing, understanding, and visualizing PCA results using LOOS.

How do I run the PCA?

There are 2 tools to actually perform the PCA: svd and big-svd; in this context, svd stands for "singular value decomposition".

We'll start by focusing on svd, and on the simplest case where you're analyzing 1 trajectory. Before one can actually do the PCA or svd, the first step is to align the structures. Otherwise, any systematic rotation and translation undergone during the simulation will be convoluted into the computed modes. The svd tool will do this automatically -- by default, it will optimally align the frames using an iterative algorithm using the atoms named "CA", although you can specify a different selection using the --align flag. Similarly, by default, the PCA will also be done using all atoms named "CA"; one can override this with the --svd flag. If you've already aligned the trajectory, you can skip that step with the --noalign flag.

svd will write several files (you can change "output" to something more useful using the --prefix flag):

  • output_s.asc: single column of numbers representing the singular values (square roots of the eigenvalues). In units of angstroms, this represents the RMS displacement along each mode in the trajectory.
  • output_U.asc: matrix whose column vectors (left singular vectors) are the normalized eigenvectors of the trajectory. These are the directions of motion, sorted from largest RMS displacement to smallest.
  • output_V.asc: matrix whose column vectors (right singular vectors) are the time series of the projection of the trajectory onto each eigenvector
  • output.map: mapping of the atom selections onto the rows of the output matrices (for convenience in identifying which rows are which)
  • output_avg.pdb: PDB file containing the iteratively determined average structure for only the atoms used in the SVD.

Note: the U and V columns are normalized, so to recover realistic atomic displacements you'll need to multiple them by the number of atoms and the number of frames, respectively.

If you have multiple trajectories that you want to process in a single calculation, you can simply specify them on the command line. If you want to track their time series independently (e.g. so you can plot the displacements on the first and second modes and color code by trajectory), you can use the --splitv argument to write the v-matrix in separately for the frames in each trajectory; the --autoname flag will give these files more intelligible names.

If your system is big or has a lot of frames, you may wish to reduce the memory requirements by using the --terms N flag to tell the program to only output the first N terms of the eigendecomposition. If that's not sufficient, you can use the big-svd tool; this tool skips the alignment (it assumes you've already done it), and will let you specify how many right singluar vectors you want using the --rsv flag. It only takes 1 trajectory, but uses a more memory-efficient algorithm.

What do I do with the results?

Often, the most important quantity from PCA is the eigenvectors: we want to understand the primary directions of motion. LOOS gives you 2 primary ways to do it. To make a static structure, you can use the porcupine tool, which writes a PDB file containing extra particles that will let you draw "sticks" pointing along the directions of the modes. The enmovie tool will create a short trajectory in DCD format to let you visualize a given mode or modes.

The other obvious thing to do with PCA results is the look at the displacement of the trajectory along the modes by plotting the right singular vectors (output_V.asc); you can plot a single mode vs. time (the index), or look at the distribution of two modes (x-axis is mode 1, y-axis is mode 2). This is a useful way to compare how similar a set of trajectories are -- using the --splitv option will give you a separate file for each trajectory, making it easy to distinguish them on a single plot.

It's important to note that for this comparison to be meaningful, the two trajectories must have been part of the same svd calculation; otherwise, they will have different eigenvectors, and so the axes would be meaningless.

Using covariance overlap to compare trajectories

The covariance overlap, first proposed by Hess in Phys Rev E, 2002, is a powerful method for evaluating the similarity of 2 eigendecompositions of the same system. It gives a normalized value saying how similar or different the statistical sampling is in 2 trajectories, so in the limit of perfect sampling the value would be 1. For this calculation, one must perform the svd separately on the two trajectories. However, they must be aligned to a common reference structure, because the eigenvectors are computed relative the average structure, and there's no way post-calculation to infer the rotation. Probably the very best way to do it is merge the trajectories into 1 (using merge-trap or catdcd from NAMD), optimally align all of the structures using aligner, then split the aligned trajectory back into pieces using subsetter using the --range option to pick out the correct frames.

The coverlap tool in LOOS will perform this calculation (as well as simpler subspace overlap calculations). This tool has many options related to comparing eigendecompositions from network models as well as svd, but the flags aren't necessary if both eigendecompositions came from svd.

How do I use PCA to compare the fluctuations of similar (but not identical) systems?

Often one wants to compare the fluctuations of two related systems (e.g. wild type and mutant forms of a protein, or one system with a post-translational modification). PCA/svd is a good tool to do this, but the svd tool in LOOS can't do it directly; it takes one system file as input and one set of selection strings, so there's no way to read 2 different systems that might have different numbers of atoms.

In this case, we suggest performing the calculation in 2 parts. First, prepare trajectories for the two systems containing only the atoms you want to use in the PCA using the subsetter tool. You will need to specify your selections carefully such that the resulting systems are identical (same atoms in the same order); using just the CA's or the backbone would remove issues due to a point mutation, while residue ranges can be used if one chain is slightly longer than the other. Once this is done, the system file from either subsetted trajectory can be used for the svd calculation. Generating the extra trajectory files is a minor nuisance, but they'll generally be 1-2 orders of magnitude smaller than the originals, since they won't contain solvent or most of the atoms from the macromolecule.