A structural alignment attempts to establish equivalences between two or more polymer structures based on their shape and three-dimensional conformation. In contrast to simple structural superposition (see below), where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions.
A structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Therefore, a structural alignment can be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be exercised when using the results as evidence for shared evolutionary ancestry, because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.
A structural alignment of other biological polymers can also be made in BioJava. For example, nucleic acids can be structurally aligned to find common structural motifs, independent of sequence similarity. This is specially important for RNAs, because their 3D structure arrangement is important for their function.
For more info see the Wikipedia article on structure alignment.
BioJava comes with a number of algorithms for aligning structures. The following five options are displayed by default in the graphical user interface (GUI), although others can be accessed programmatically using the methods in StructureAlignmentFactory.
- Combinatorial Extension (CE)
- Combinatorial Extension with Circular Permutation (CE-CP)
- FATCAT - rigid
- FATCAT - flexible.
- Smith-Waterman superposition
CE and FATCAT both use structural similarity to align the structures, while Smith-Waterman performs a local sequence alignment and then displays the result in 3D. See below for descriptions of the algorithms.
Since BioJava version 4.1.0, multiple structures can be compared at the same time in a multiple structure alignment, that can later be visualized in Jmol. The algorithm is described in detail below. As an overview, it uses any pairwise alignment algorithm and a reference structure to perform an alignment of all the structures. Then, it runs a Monte Carlo optimization to determine the residue equivalencies among all the structures, identifying conserved structural motifs.
Before going the details how to use the algorithms programmatically, let's take a look at the user interface that comes with the biojava-structure-gui module.
Generating an instance of the GUI is just one line of code:
AlignmentGui.getInstance();
This code shows the following user interface:
You can manually select structure chains, domains, or custom files to be aligned. Try to align 2hyn vs. 1zll. This will show the results in a graphical way, in 3D:
and also a 2D display, that interacts with the 3D display
Because of the inherent difference between multiple and pairwise alignments, a separate GUI is used to trigger multiple structural alignments. Generating an instance of the GUI is analogous to the pairwise alignment GUI:
MultipleAlignmentGUI.getInstance();
This code shows the following user interface:
The input format is a free text field, where the structure identifiers are indicated, space separated. A structure identifier is a String that uniquely identifies a structure. It is basically composed of the pdbID, the chain letters and the ranges of residues of each chain. For the formal description visit StructureIdentifier.
As an example, a multiple structure alignment of 6 globins is shown here. Their structure identifiers are shown in the previous figure of the GUI. The results are shown in a graphical way, as for the pairwise alignments:
The only difference with the Pairwise Alignment View is the possibility to show a subset of structures to be visualized, by checking the boxes under the 3D window and pressing the Show Only button afterwards.
A sequence alignment panel that interacts with the 3D display can also be shown.
Explore the coloring options in the Edit menu, and through the View menu for alternative representations of the alignment.
The functionality to perform and visualize these alignments can also be used from your own code. Let's first have a look at the alignment algorithms.
The Combinatorial Extension (CE) algorithm was originally developed by
Shindyalov and Bourne in
1998 .
It works by identifying segments of the two structures with similar local
structure, and then combining those to try to align the most residues possible
while keeping the overall root-mean-square deviation (RMSD) of the superposition low.
CE is a rigid-body alignment algorithm, which means that the structures being compared are kept fixed during superposition. In some cases it may be desirable to break large proteins up into domains prior to aligning them (by manually inputting a subrange, using the SCOP or CATH databases, or by decomposing the protein automatically using the Protein Domain Parser algorithm).
BioJava class: org.biojava.bio.structure.align.ce.CeMain
CE and FATCAT both assume that aligned residues occur in the same order in both
structures (e.g. they are both sequence-order dependent algorithms). In proteins
related by a circular permutation, the N-terminal part of one protein is related
to the C-terminal part of the other, and vice versa. CE-CP allows circularly
permuted proteins to be compared. For more information on circular
permutations, see the
Wikipedia or
Molecule of the Month
articles .
For proteins without a circular permutation, CE-CP results look very similar to CE results (with perhaps some minor differences and a slightly longer calculation time). If a circular permutation is found, the two halves of the proteins will be shown in different colors:
CE-CP was developed by Spencer E. Bliven, Philip E. Bourne, and Andreas Prlić.
BioJava class: org.biojava.nbio.structure.align.ce.CeCPMain
This is a Java implementation of the original FATCAT algorithm by Yuzhen Ye
& Adam Godzik in
2003
.
It performs similarly to CE for most structures. The 'rigid' flavor uses a
rigid-body superposition and only considers alignments with matching sequence
order.
BioJava class: org.biojava.nbio.structure.align.fatcat.FatCatRigid
FATCAT-flexible introduces 'twists' between different parts of the structures which are superimposed independently. This is ideal for proteins which undergo large conformational shifts, where a global superposition cannot capture the underlying similarity between domains. For instance, the structures of calmodulin with and without calcium bound can be much better aligned with FATCAT-flexible than with one of the rigid alignment algorithms. The downside of this is that it can lead to additional false positives in unrelated structures.
BioJava class: org.biojava.nbio.structure.align.fatcat.FatCatFlexible
This aligns residues based on Smith and Waterman's 1981 algorithm for local
sequence alignment . No structural information is included in the alignment, so
this only works for structures with significant sequence similarity. It uses the
Blosum65 scoring matrix.
The two structures are superimposed based on this alignment. Be aware that errors locating gaps can lead to high RMSD in the resulting superposition due to a small number of badly aligned residues. However, this method is faster than the structure-based methods.
BioJava Class: org.biojava.nbio.structure.align.ce.CeCPMain
The following methods are not presented in the user interface by default:
- BioJavaStructureAlignment A structure-based alignment method able of returning multiple alternate alignments. It was written by Andreas Prlić and based on the PSC++ algorithm provided by Peter Lackner.
- CeSideChainMain A variant of CE using CB-CB distances, which sometimes improves alignments in proteins with parallel sheets and helices.
- OptimalCECPMain An alternate (much slower) algorithm for finding circular permutations.
Additional methods can be added by implementing the StructureAlignment interface.
This Java implementation for multiple structure alignments, named MultipleMC, is based on the original CE-MC implementation by Guda C, Scheeff ED, Bourne PE & Shindyalov IN in 2001
.
The idea remains unchanged: perform all-to-all pairwise alignments of the structures, choose the reference as the most similar structure to all others and run a Monte Carlo optimization of the multiple residue equivalencies (EQRs) to minimize a score function that depends on the inter-residue distances.
However, some details of the implementation have been changed in the BioJava version. They are described in the main class, as a summary:
- It accepts any pairwise alignment algorithm (instead of being attached to CE), so any of the algorithms described before is suitable for generating a seed for optimization. Note that this property allows non-topological and flexible multiple structure alignments, always restricted by the pairwise alignment algorithm limitations.
- The moves in the Monte Carlo optimization have been simplified to 3.
- A new move to insert and delete individual gaps has been added.
- The scoring function has been modified to a continuous function, maintaining the properties that the authors described.
- The probability function is normalized in synchronization with the optimization progression, to improve the convergence into a maximum score after some random exploration of the multidimensional alignment space.
The algorithm performs similarly to other multiple structure alignment algorithms for most protein families. The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment.
BioJava class: org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain
The pairwise structure alignment algorithms in BioJava implement the
StructureAlignment
interface, and are usually accessed through
StructureAlignmentFactory
. Here's an example of how to create a CE-CP
alignment and print some information about it.
// Fetch CA atoms for the structures to be aligned
String name1 = "3cna.A";
String name2 = "2pel";
AtomCache cache = new AtomCache();
Atom[] ca1 = cache.getAtoms(name1);
Atom[] ca2 = cache.getAtoms(name2);
// Get StructureAlignment instance
StructureAlignment algorithm = StructureAlignmentFactory.getAlgorithm(CeCPMain.algorithmName);
// Perform the alignment
AFPChain afpChain = algorithm.align(ca1,ca2);
// Print text output
System.out.println(afpChain.toCE(ca1,ca2));
To display the alignment using Jmol, use:
GuiWrapper.display(afpChain, ca1, ca2);
// Or using the biojava-structure-gui module
StructureAlignmentDisplay.display(afpChain, ca1, ca2);
Note that these require that you include the structure-gui package and the Jmol binary in the classpath at runtime.
For creating multiple structure alignments, the code is a little bit different, because the returned data structure and the number of input structures are different. Here is an example of how to create and display a multiple alignment:
//Specify the structures to align: some ASP-proteinases
List<String> names = Arrays.asList("3app", "4ape", "5pep", "1psn", "4cms", "1bbs.A", "1smr.A");
//Load the CA atoms of the structures and create the structure identifiers
AtomCache cache = new AtomCache();
List<Atom[]> atomArrays = new ArrayList<Atom[]>();
List<StructureIdentifier> identifiers = new ArrayList<StructureIdentifier>();
for (String name:names) {
atomArrays.add(cache.getAtoms(name));
identifiers.add(new SubstructureIdentifier(name));
}
//Generate the multiple alignment algorithm with the chosen pairwise algorithm
StructureAlignment pairwise = StructureAlignmentFactory.getAlgorithm(CeMain.algorithmName);
MultipleMcMain multiple = new MultipleMcMain(pairwise);
//Perform the alignment
MultipleAlignment result = multiple.align(atomArrays);
// Set the structure identifiers, so that each atom array can be identified in the outputs
result.getEnsemble().setStructureIdentifiers(identifiers);
//Output the FASTA sequence alignment
System.out.println(MultipleAlignmentWriter.toFASTA(result));
//Display the results in a 3D view
MultipleAlignmentJmolDisplay.display(result);
Many of the alignment algorithms are available in the form of command line tools. These can be accessed through the main methods of the StructureAlignment classes.
Example:
runCE.sh -pdb1 4hhb.A -pdb2 4hhb.B -show3d
Using the command line tool it is possible to run pairwise alignments, several alignments in batch mode, or full database searches. Some additional parameters are available which are not exposed in the GUI, such as outputting results to a file in various formats.
For details about the structure alignment data models in BioJava, see Structure Alignment Data Model
Thanks to P. Bourne, Yuzhen Ye and A. Godzik for granting permission to freely use and redistribute their algorithms.
Navigation: Home | Book 3: The Structure Modules | Chapter 8 : Structure Alignments