Skip to content

Commit

Permalink
sequence alignment operators
Browse files Browse the repository at this point in the history
  • Loading branch information
0x00b1 committed Apr 24, 2024
1 parent 9317f67 commit 25c46fb
Showing 1 changed file with 317 additions and 0 deletions.
317 changes: 317 additions & 0 deletions notebooks/sequence_alignment.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,317 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a2a15b8f-a7a9-4a33-b9b1-f6f22387dd44",
"metadata": {},
"source": [
"# Sequence alignment"
]
},
{
"cell_type": "markdown",
"id": "796a7f6c-5ae8-47f0-a845-d94dae2af9cb",
"metadata": {},
"source": [
"## Idea\n",
"\n",
"- No “best practice” for computing MSAs\n",
"- Learn an alignment jointly with downstream task (e.g., structure prediction)\n",
"- Consider multiple alignments simultaneously\n",
"- `logsumexp` rather than `max`, i.e., DAG traversal: \n",
"\n",
"$$f^{S}(v) = \\log{\\sum_{u \\in N^{-}(v)}\\exp f^{S}(u) + w(u \\to v)}$$\n",
"\n",
"- Each vertex on the DAG has a score from a scoring function, use the scores to define possible paths, $s \\to t$ (i.e., alignments):\n",
"\n",
"$$\\mathbb{T} (u \\to v) = \\frac{\\exp{(f^{S}(u) + w(u \\to v)})}{\\sum_{u' \\in N^{-}(v)} \\exp{(f^{S}(u') + w(u' \\to v))}}$$\n",
"\n",
"*See Blondel, et al. for proof.*"
]
},
{
"cell_type": "markdown",
"id": "37c39397-5f96-4b65-974e-6baff2f41fc2",
"metadata": {},
"source": [
"## `beignet.operators.smith_waterman`\n",
"\n",
"```Python\n",
"def smith_waterman(\n",
" input: Tensor,\n",
" lengths: (int, int),\n",
" *,\n",
" gap_penalty: float | None = 0.0,\n",
" temperature: float | None = 1.0,\n",
") -> Tensor:\n",
" \"\"\"\n",
" Differentiable pairwise Smith-Waterman local sequence alignment.\n",
"\n",
" Parameters\n",
" ----------\n",
" input : Tensor, shape=(..., N, N)\n",
" Similarity matrix for two sequences.\n",
"\n",
" lengths : (int, int)\n",
" Lengths of the two sequences.\n",
"\n",
" gap_penalty : float, optional\n",
" Penalty for creating a gap in the alignment, default `0.0`.\n",
"\n",
" temperature : float, optional\n",
" Sharpness of the score distribution, default `1.0`.\n",
"\n",
" Returns\n",
" -------\n",
" output : Tensor\n",
" Smith-Waterman alignment score.\n",
"\"\"\"\n",
"```\n",
"\n",
"The `smith_waterman` operator constructs a probability distribution over sequence alignments, $\\text{alignment}_{ij}$, the probability that position $i$ of $\\text{sequence}_1$ is aligned to position $j$ of $\\text{sequence}_2$. `input` has the shape:\n",
"\n",
"$$\\dots \\times \\text{maximum-length}_{1} \\times \\text{maximum-length}_{2}$$\n",
"\n",
"where `input[i, :, :]` is the similarity matrix of the $i^{\\text{th}}$ pair. The `lengths` parameter tells the algorithm where the sequences end. The first and second sequences must have a fixed size (i.e., they should be padded). `sizes` has the shape:\n",
"\n",
"$$\\dots \\times 2$$\n",
"\n",
"where $\\text{sizes}_{1}$ and $\\text{sizes}_{2}$ are the lengths of $\\text{sequence}_1$ and $\\text{sequence}_2$ of the $i^{\\text{th}}$ pair."
]
},
{
"cell_type": "markdown",
"id": "61604a22-f8bb-4e61-b71f-c38a34a08819",
"metadata": {},
"source": [
"$\\text{input}$ is a similarity matrix where $\\text{input}_{ij}$ is the degree of similarity between position $i$ of $\\text{sequence}_1$ and position $j$ of $\\text{sequence}_2$. It could, for example, represent the BLOSUM score between the residues at position $i$ in $\\text{sequence}_1$ and position $j$ in $\\text{sequence}_2$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffeb0032-fca5-4539-823b-6eecc8a6491b",
"metadata": {},
"outputs": [],
"source": [
"import beignet.constants\n",
"import torch\n",
"\n",
"sequence_1 = \"NHKFRTDK\"\n",
"sequence_2 = \"DWHYWMMK\"\n",
"\n",
"substitution_matrix, vocabulary = beignet.constants.BLOSUM62, beignet.constants.BLOSUM_VOCABULARY\n",
"\n",
"sequence_1_tokens = []\n",
"sequence_2_tokens = []\n",
"\n",
"for k_1, k_2 in zip(sequence_1, sequence_2):\n",
" sequence_1_tokens = [*sequence_1_tokens, vocabulary[k_1]]\n",
" sequence_2_tokens = [*sequence_2_tokens, vocabulary[k_2]]\n",
"\n",
"input = torch.empty([1, 8, 8], dtype=torch.float32)\n",
"\n",
"for i, k_1 in enumerate(sequence_1_tokens):\n",
" for j, k_2 in enumerate(sequence_2_tokens):\n",
" input[0, i, j] = substitution_matrix[k_1][k_2]\n",
"\n",
"print(input)\n",
"\n",
"lengths = torch.tensor([[8, 8]], dtype=torch.uint8)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ac2e09d-a98a-4c55-86c9-bdef9d2f0644",
"metadata": {},
"outputs": [],
"source": [
"alignment = beignet.operators.smith_waterman(input, lengths, gap_penalty=-1.0, temperature=torch.finfo(torch.float32).eps)\n",
"\n",
"matplotlib.pyplot.imshow(alignment[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d6bf8f2-6252-4337-9b31-c3ed8079895d",
"metadata": {},
"outputs": [],
"source": [
"alignment = beignet.operators.smith_waterman(input, lengths, gap_penalty=-1.0, temperature=32.0)\n",
"\n",
"matplotlib.pyplot.imshow(alignment[0])"
]
},
{
"cell_type": "markdown",
"id": "34f7624d-ce34-4ed1-9889-d04d36ed26e0",
"metadata": {},
"source": [
"Alternatively, it might reflect the dot product of vector representations corresponding to these positions in each sequence:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "308e02f0-921e-483b-8e68-9a4f1e94a6e3",
"metadata": {},
"outputs": [],
"source": [
"import beignet.constants\n",
"import torch\n",
"\n",
"sequence_1 = \"NHKFRTDK\"\n",
"sequence_2 = \"DWHYWMMK\"\n",
"\n",
"substitution_matrix, vocabulary = beignet.constants.BLOSUM62, beignet.constants.BLOSUM_VOCABULARY\n",
"\n",
"substitution_matrix = torch.tensor(substitution_matrix)\n",
"\n",
"sequence_1_tokens = []\n",
"sequence_2_tokens = []\n",
"\n",
"for k_1, k_2 in zip(sequence_1, sequence_2):\n",
" sequence_1_tokens = [*sequence_1_tokens, vocabulary[k_1]]\n",
" sequence_2_tokens = [*sequence_2_tokens, vocabulary[k_2]]\n",
"\n",
"sequence_1_vectors = torch.empty([8, 25], dtype=torch.int8)\n",
"sequence_2_vectors = torch.empty([8, 25], dtype=torch.int8)\n",
"\n",
"for index, (i, j) in enumerate(zip(sequence_1_tokens, sequence_2_tokens)):\n",
" sequence_1_vectors[index] = substitution_matrix[i]\n",
" sequence_2_vectors[index] = substitution_matrix[j]\n",
"\n",
"input = torch.empty([1, 8, 8], dtype=torch.float32)\n",
"\n",
"for i, v_1 in enumerate(sequence_1_vectors):\n",
" for j, v_2 in enumerate(sequence_2_vectors):\n",
" input[0, i, j] = torch.dot(v_1, v_2)\n",
"\n",
"print(input)\n",
"\n",
"lengths = torch.tensor([[8, 8]], dtype=torch.uint8)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ddab374-bfb6-4069-843a-146697630355",
"metadata": {},
"outputs": [],
"source": [
"alignment = beignet.operators.smith_waterman(input, lengths, gap_penalty=-1.0, temperature=32.0)\n",
"\n",
"matplotlib.pyplot.imshow(alignment[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ef0d775-de31-426b-8631-7031f700dc16",
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"import beignet.operators\n",
"import ipywidgets\n",
"import matplotlib.pyplot\n",
"import torch.nested\n",
"\n",
"gap_penalty = -1.0\n",
"temperature = +1.0\n",
"\n",
"minimum_size = 32\n",
"maximum_size = 64\n",
"\n",
"pairs = 16\n",
"\n",
"sequences = []\n",
"\n",
"sizes = torch.zeros(pairs, 2, dtype=torch.uint8)\n",
"\n",
"for index in range(pairs):\n",
" sequence_1_size = random.randint(minimum_size, maximum_size)\n",
" sequence_2_size = random.randint(minimum_size, maximum_size)\n",
"\n",
" sizes[index, 0] = sequence_1_size\n",
" sizes[index, 1] = sequence_2_size\n",
"\n",
" sequence = torch.rand(sequence_1_size, sequence_2_size)\n",
"\n",
" sequences = [*sequences, sequence]\n",
"\n",
"sequences = torch.nested.nested_tensor(sequences)\n",
"\n",
"sequences = torch.nested.to_padded_tensor(sequences, 0.0)\n",
"\n",
"alignment = beignet.operators.smith_waterman(sequences, sizes, gap_penalty=gap_penalty, temperature=temperature)\n",
"\n",
"figure, axes = matplotlib.pyplot.subplots(4, 4, figsize=[8, 8])\n",
"\n",
"figure.suptitle(f\"gap_penalty = {gap_penalty}, temperature = {temperature}\")\n",
"\n",
"for index, axes in enumerate(axes.flat):\n",
" if index < len(alignment):\n",
" axes.imshow(alignment[index])\n",
"\n",
" axes.axis(\"off\")\n",
" else:\n",
" axes.axis(\"off\")\n",
"\n",
"matplotlib.pyplot.tight_layout()"
]
},
{
"cell_type": "markdown",
"id": "eca13ab0-e75b-41e8-aade-ee829840e7db",
"metadata": {},
"source": [
"## Glossary\n",
"\n",
"- **AHo numbering scheme**: A residue numbering system used in protein sequence alignments, focusing on conserving structural features across related proteins rather than maintaining sequential integrity.\n",
"- **Alignment score**: A numerical value that quantifies the similarity between two or more sequences, calculated based on a scoring system that rewards matches, penalizes mismatches, and may include gap penalties.\n",
"- **Kabat numbering scheme**: Primarily used for immunoglobulins, this scheme assigns numbers to amino acids in an antibody sequence, accommodating insertions and deletions to maintain alignment of conserved regions.\n",
"- **Protein sequence numbering**: Method of assigning ordinal numbers to amino acids in a protein sequence, providing a reference for structural and functional analysis.\n",
"- **Local alignment**: Identifies the highest-scoring alignment for a subsequence within larger sequences; useful for finding regions of similarity among sequences with dispersed conserved regions. The Smith-Waterman algorithm is the canonical local alignment algorithm. The algorithm fills a matrix with scores representing match, mismatch, and gap penalties, and then traces back from the highest scoring cell to determine the optimal local alignment.\n",
"- **Gap penalty**: A negative score applied in sequence alignments to account for introducing gaps (insertions or deletions), used to discourage excessive gapping in the alignment.\n",
"- **Global alignment**: Aligns entire lengths of two or more sequences from start to finish, optimizing for the best possible match across the entire region. Needleman-Wunsch algorithm is the canonical global alignment algorithm and is based on dynamic programming. The algorithm involves filling a matrix with scores and backtracking to find the optimal global alignment.\n",
"- **Multiple sequence alignment (MSA)**: Alignment of three or more biological sequences, identifying similar regions that may indicate functional, structural, or evolutionary relationships.\n",
"- **Optimal alignment**: The highest scoring sequence alignment under specified scoring rules, which may include penalties for mismatches and gaps.\n",
"- **Pairwise sequence alignment**: The process of aligning two sequences to achieve maximal levels of similarity, used as a basis for inferring structural, functional, or evolutionary relationships.\n",
"- **Point accepted mutation (PAM)**: A unit representing a single evolutionary change in a protein sequence over a specified fraction of accepted mutations.\n",
"- **Point accepted mutation (PAM) matrix**: A matrix used in sequence alignment that quantifies amino acid substitutions, calculated from evolutionary data on acceptable mutations over specific evolutionary periods.\n",
"- **Query sequence**: A sequence used as the reference in database searches to find similar sequences, often in the context of sequence alignment or biological database searching.\n",
"- **Scoring function**: An algorithmic approach to assign a numerical score to alignments or arrangements of data, guiding optimal choices in computational processes like sequence alignment. One common scoring function used in sequence alignment is the affine gap penalty, which incorporates both a gap opening and a gap extension penalty. The score might be calculated as $(S = \\text{matches} \\times m + \\text{mismatches} \\times s + \\text{gap openings} \\times o + \\text{gap extensions} \\times e)$, where $m$ is the score for matches, $s$ is the penalty for mismatches, $o$ is the penalty for opening a gap, and $e$ is the penalty for extending a gap.\n",
"- **Scoring matrix**: A grid of values used in sequence alignment that quantifies the similarity or difference between each pair of elements (like amino acids), guiding alignment decisions.\n",
"- **Semi-global alignment**: Aligns entire sequences but allows overhangs at the beginning or end without penalty, useful for aligning sequences where one is a subsequence of the other.\n",
"- **Similarity matrix**: A matrix displaying scores reflecting the degree of similarity between sequences, often used to compare and visualize relationships in a dataset.\n",
"- **Sequence database**: A collection of biological sequence data, typically nucleotide or protein sequences, organized for efficient retrieval and analysis.\n",
"- **Substitution matrix**: Used in sequence alignment, it provides scores for replacing one amino acid with another, reflecting the likelihood of such substitutions in evolutionary terms. BLOSUM (Blocks Substitution Matrix) matrices are commonly used substitution matrices in protein sequence alignment. BLOSUM62, one of the most popular in this family, scores substitutions based on observed replacements in blocks of local alignments of protein sequences, helping to guide alignment by favoring more likely substitutions."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit 25c46fb

Please sign in to comment.