This repository has been archived by the owner on Jul 20, 2021. It is now read-only.

Commit

Merge pull request #93 from gregcaporaso/0.1.0-rc1
0.1.0 release
gregcaporaso committed Mar 27, 2015
2 parents eb894d6 + ab68d4d commit 1d33d1c
Showing 4 changed files with 69 additions and 73 deletions.
12 changes: 2 additions & 10 deletions Index.ipynb
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:5fc7b5198a3fdef06c4cd47a8083aa86dab5b7dc70eb652ef5edd545cbb2b735"
"signature": "sha256:d2aaeb936946d20bcd915fed579c1222561be4246a56903890bd3c3c451a64a7"
},
"nbformat": 3,
"nbformat_minor": 0,
@@ -20,7 +20,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"float: right; margin-left: 30px; width: 200px\"><img title=\"Logo by @gregcaporaso.\" style=\"float: right;margin-left: 30px;\" src=\"https://raw.github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/master/images/logo.png\" align=right height=50/></div>\n",
"<div style=\"float: right; margin-left: 30px; width: 200px\"><img title=\"Logo by @gregcaporaso.\" style=\"float: right;margin-left: 30px;\" src=\"https://raw.github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/master/images/logo.png\" align=right height=250/></div>\n",
"\n",
"Bioinformatics, as I see it, is the application of the tools of computer science (things like programming languages, algorithms, and databases) to address biological problems (for example, inferring the evolutionary relationship between a group of organisms based on fragments of their genomes, or understanding if or how the community of microorganisms that live in my gut changes if I modify my diet). Bioinformatics is a rapidly growing field, largely in response to the vast increase in the quantity of data that biologists now grapple with. Students from varied disciplines (e.g., biology, computer science, statistics, and biochemistry) and stages of their educational careers (undergraduate, graduate, or postdoctoral) are becoming interested in bioinformatics.\n",
"\n",
@@ -69,14 +69,6 @@
" 1. [Studying biological diversity](applications/biological-diversity.ipynb)\n",
"3. Wrapping up"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
61 changes: 30 additions & 31 deletions fundamentals/multiple-sequence-alignment.ipynb
@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:f383f1072392952b7b5389d597f80f7492454531ef74bfc6da0be869a178f19f"
"signature": "sha256:4760d0f817697b8b03d8addd60b88dfe2cecb94766621c19139a0c8161d6c283"
},
"nbformat": 3,
"nbformat_minor": 0,
@@ -22,41 +22,40 @@
"source": [
"It's possible to generalize Smith-Waterman and Needleman-Wunsch, the dynamic programming algorithms that we explored for pairwise sequence aligment, to identify the optimal alignment of more than two sequences. Remember that our scoring scheme for pairwise alignment with Smith-Waterman looked like the following:\n",
"\n",
"$$\n",
"\\begin{align}\n",
"& F(0, 0) = 0\\\\\n",
"& F(i, 0) = F(i-1, 0) - d\\\\\n",
"& F(0, j) = F(0, j-1) - d\\\\\n",
"\\\\\n",
"& F(i, j) = max \\begin{pmatrix}\n",
"& F(i-1, j-1) + s(c_i, c_j)\\\\\n",
"& F(i-1, j) - d\\\\\n",
"& F(i, j-1) - d)\\\\\n",
"\\end{pmatrix}\n",
"\\end{align}\n",
"$$\n",
"\n",
"$F(0, 0) = 0$\n",
"\n",
"$F(i, 0) = F(i-1, 0) - d$\n",
"\n",
"$F(0, j) = F(0, j-1) - d$\n",
"\n",
"$\n",
"F(i, j) = max \\left(\\begin{align}\n",
"F(i-1, j-1) + s(c_i, c_j)\\\\\n",
"F(i-1, j) - d\\\\\n",
"F(i, j-1) - d)\n",
"\\end{align}\\right)$\n",
"\n",
"To generalize this to three sequences, we could create 3x3 score, dynamic programming, and traceback matrices. Our scoring scheme would then look like the following:\n",
"\n",
"\n",
"$F(0, 0, 0) = 0$\n",
"\n",
"$F(i, 0, 0) = F(i-1, 0, 0) - d$\n",
"\n",
"$F(0, j, 0) = F(0, j-1, 0) - d$\n",
"\n",
"$F(0, 0, k) = F(0, 0, k-1) - d$\n",
"To generalize this to three sequences, we could create $3 \\times 3$ scoring, dynamic programming, and traceback matrices. Our scoring scheme would then look like the following:\n",
"\n",
"$\n",
"F(i, j, k) = max \\left(\\begin{align}\n",
"$$\n",
"\\begin{align}\n",
"& F(0, 0, 0) = 0\\\\\n",
"& F(i, 0, 0) = F(i-1, 0, 0) - d\\\\\n",
"& F(0, j, 0) = F(0, j-1, 0) - d\\\\\n",
"& F(0, 0, k) = F(0, 0, k-1) - d\\\\\n",
"\\\\\n",
"& F(i, j, k) = max \\begin{pmatrix}\n",
"F(i-1, j-1, k-1) + s(c_i, c_j) + s(c_i, c_k) + s(c_j, c_k)\\\\\n",
"F(i, j-1, k-1) + s(c_j, c_k) - d\\\\\n",
"F(i-1, j, k-1) + s(c_i, c_k) - d\\\\\n",
"F(i-1, j-1, k) + s(c_i, c_j) - d\\\\\n",
"F(i, j, k-1) - 2d\\\\\n",
"F(i, j-1, k) - 2d\\\\\n",
"F(i-1, j, k) - 2d\\\\\n",
"\\end{align}\\right)$\n",
"\\end{pmatrix}\n",
"\\end{align}\n",
"$$\n",
"\n",
"However the complexity of this algorithm is much worse than for pairwise alignment. For pairwise alignment, remember that if aligning two sequences of lengths $m$ and $n$, the runtime of the algorithm will be proportional to $m \\times n$. If $n$ is longer than or as long as $m$, we simplify the statement to say that the runtime of the algorithm will be be proportional to $n^2$. This curve has a pretty scary trajectory: runtime for pairwise alignment with dynamic programming is said to scale quadratically."
]
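
As a rough illustration of the three-sequence recurrence in the hunk above, the sketch below computes a single cell $F(i, j, k)$ from its seven neighbors. The function name `three_way_cell_score`, the nested-list layout of `F`, and the callable `s` are assumptions made for this example; they are not taken from the notebook.

```python
def three_way_cell_score(F, i, j, k, seq1, seq2, seq3, s, d):
    """Return the best score for cell (i, j, k) of a 3-D dynamic programming matrix.

    F is a nested list indexed as F[i][j][k] (an assumed layout), s(a, b) is a
    hypothetical substitution-score callable, and d is the gap penalty.
    """
    a, b, c = seq1[i - 1], seq2[j - 1], seq3[k - 1]
    candidates = [
        # align all three characters
        F[i - 1][j - 1][k - 1] + s(a, b) + s(a, c) + s(b, c),
        # gap in exactly one sequence
        F[i][j - 1][k - 1] + s(b, c) - d,
        F[i - 1][j][k - 1] + s(a, c) - d,
        F[i - 1][j - 1][k] + s(a, b) - d,
        # gaps in two sequences
        F[i][j][k - 1] - 2 * d,
        F[i][j - 1][k] - 2 * d,
        F[i - 1][j][k] - 2 * d,
    ]
    return max(candidates)
```

Each cell inspects seven neighbors, and for three sequences of length $n$ there are on the order of $n^3$ cells to fill, which is why the cost grows so much faster than in the pairwise case.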
@@ -800,17 +799,17 @@
"\n",
"For example, if we want to align the alignment column from $aln1$:\n",
"\n",
"``\n",
"```\n",
"A\n",
"C\n",
"``\n",
"```\n",
"\n",
"to the alignment column from $aln2$:\n",
"\n",
"``\n",
"```\n",
"T\n",
"G\n",
"``\n",
"```\n",
"\n",
"we could compute the subsitution score using the matrix $m$ as:\n",
"\n",
67 changes: 36 additions & 31 deletions fundamentals/pairwise-alignment.ipynb
@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:296651182a216272a5bda0c9c27a14f21621e84338b38a032feb74d38bb9ddbd"
"signature": "sha256:63b4f660fec9abeef40537f13e87660973a53d00457c3ddd6e1b61491e720359"
},
"nbformat": 3,
"nbformat_minor": 0,
@@ -793,14 +793,15 @@
"In the next step we determine the best alignment given the sequences and scoring scheme in what we'll call the **dynamic programming matrix**, and then define programmatically how to transcribe the alignment in what we'll call the **traceback matrix** to yield a pair of aligned sequences. \n",
"\n",
"\n",
"For the convenience of coding this algorithm, it helps to define the dynamic programming matrix with one extra row and one extra column relative to the score matrix, and make these the first column and row of the matrix. These then represent the beginning of the alignment position $(0, 0)$. The score $F$ for cell $(i, j)$, where $i$ represents the row number and $j$ represents the column number, is defined for the first row and column as follows. \n",
"For the convenience of coding this algorithm, it helps to define the dynamic programming matrix with one extra row and one extra column relative to the score matrix, and make these the first column and row of the matrix. These then represent the beginning of the alignment position $(0, 0)$. The score $F$ for cell $(i, j)$, where $i$ represents the row number and $j$ represents the column number, is defined for the first row and column as follows:\n",
"\n",
"\n",
"$F(0, 0) = 0$\n",
"\n",
"$F(i, 0) = F(i-1, 0) - d$\n",
"\n",
"$F(0, j) = F(0, j-1) - d$\n",
"$$\n",
"\\begin{align}\n",
"& F(0, 0) = 0\\\\\n",
"& F(i, 0) = F(i-1, 0) - d\\\\\n",
"& F(0, j) = F(0, j-1) - d\\\\\n",
"\\end{align}\n",
"$$\n",
"\n",
"This matrix, pre-initialization, would look like the following. As an exercise, try computing the values for the cells in the first four rows in column zero, and the first four columns in row zero. As you fill in the value for a cell, for all cells with a score based on another score in the matrix (i.e., everything except for $F(0, 0)$), draw an arrow from that cell to the cell whose score it depends on. \n",
"\n",
@@ -897,13 +898,13 @@
"\n",
"In a Needleman-Wunsch alignment, the score $F$ for cell $(i, j)$ (where $i$ is the row number and $j$ is the column number, and $i > 0$ and $j > 0$) is computed as the maximum of three possible values.\n",
"\n",
"\n",
"$\n",
"$$\n",
"F(i, j) = max \\left(\\begin{align}\n",
"F(i-1, j-1) + s(c_i, c_j)\\\\\n",
"F(i-1, j) - d\\\\\n",
"F(i, j-1) - d)\n",
"\\end{align}\\right)$\n",
"& F(i-1, j-1) + s(c_i, c_j)\\\\\n",
"& F(i-1, j) - d\\\\\n",
"& F(i, j-1) - d\n",
"\\end{align}\\right)\n",
"$$\n",
"\n",
"In this notation, $s$ refers to the substitution matrix, $c_i$ and $c_j$ refers to characters in `seq1` and `seq2`, and $d$ again is the gap penalty. Describing the scoring function in English, we score a cell with the maximum of three values: either the value of the cell up and to the left plus the score for the substitution taking place in the current cell (which you find by looking up the substitution in the substitution matrix); the value of the cell above minus the gap penalty; or the value of the cell to the left minus the gap penalty. In this way, you're determining whether the best (highest) score is obtained by inserting a gap in sequence 1 (corresponding to $F(i-1, j) - d$), inserting a gap in sequence 2 (corresponding to $F(i, j-1) - d$), or aligning the characters in sequence 1 and sequence 2 (corresponding to $F(i-1, j-1) + s(c_i, c_j)$).\n",
"\n",
@@ -1415,12 +1416,13 @@
"The Smith-Waterman algorithm is used for performing pairwise local alignment. It is nearly identical to Needleman-Wunsch, with three small important differences. \n",
"\n",
"First, initialization is easier:\n",
"\n",
"$F(0, 0) = 0$\n",
"\n",
"$F(i, 0) = 0$\n",
"\n",
"$F(0, j) = 0$"
"$$\n",
"\\begin{align}\n",
"& F(0, 0) = 0\\\\\n",
"& F(i, 0) = 0\\\\\n",
"& F(0, j) = 0\n",
"\\end{align}\n",
"$$"
]
},
{
@@ -1500,12 +1502,14 @@
"source": [
"Next, there is one additional term in the scoring function:\n",
"\n",
"$\n",
"$$\n",
"F(i, j) = max \\left(\\begin{align}\n",
"0\\\\\n",
"F(i-1, j-1) + s(c_i, c_j)\\\\\n",
"F(i-1, j) - d\\\\F(i, j-1) - d)\n",
"\\end{align}\\right)$"
"& 0\\\\\n",
"& F(i-1, j-1) + s(c_i, c_j)\\\\\n",
"& F(i-1, j) - d\\\\\n",
"& F(i, j-1) - d)\n",
"\\end{align}\\right)\n",
"$$\n"
]
},
{
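
The two Smith-Waterman changes visible in the hunks above (all-zero initialization and the extra $0$ term in the maximum) can be sketched as follows. The name `smith_waterman_matrix` and the callable `s` are assumptions for illustration, and traceback is again omitted.

```python
def smith_waterman_matrix(seq1, seq2, s, d):
    """Fill the local-alignment dynamic programming matrix F.

    s(a, b) is a hypothetical substitution-score callable and d is the
    (positive) gap penalty.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # Initialization: the first row and column stay at zero.
    F = [[0] * cols for _ in range(rows)]

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                0,  # the additional term: a score never drops below zero
                F[i - 1][j - 1] + s(seq1[i - 1], seq2[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F
```

Because no cell can go negative, a poorly matching prefix cannot drag down the score of a well matching region later in the sequences.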
@@ -1996,14 +2000,15 @@
"source": [
"The second limitation of the our simple alignment algorithm, and one that is also present in our version of Smith-Waterman as implemented above, is that all gaps are scored equally whether they represent the opening of a new insertion/deletion, or the extension of an existing insertion/deletion. This isn't ideal based on what we know about how insertion/deletion events occur (see [this discussion of replication slippage](http://www.ncbi.nlm.nih.gov/books/NBK21114/)). Instead, **we might want to incur a large penalty for opening a gap, but a smaller penalty for extending a gap**. To do this, **we need to make two small changes to our scoring scheme**. When we compute the score for a gap, we should incurr a *gap open penalty* if the previous max score was derived from inserting a gap character in the same sequence. If we represent our traceback matrix as $T$, our gap open penalty as $d^0$, and our gap extend penalty as $d^e$, our scoring scheme would look like the following:\n",
"\n",
"$\n",
"$$\n",
"F(i, j) = max \\left(\\begin{align}\n",
" 0\\\\\n",
" F(i-1, j-1) + s(c_i, c_j)\\\\\n",
" \\left\\{\\begin{array}{l l} F(i-1, j) - d^e \\quad \\text{if $T(i-1, j)$ is gap}\\\\ F(i-1, j) - d^o \\quad \\text{if $T(i-1, j)$ is not gap} \\end{array} \\right\\} \\\\\n",
" \\left\\{\\begin{array}{l l} F(i, j-1) - d^e \\quad \\text{if $T(i, j-1)$ is gap}\\\\ F(i, j-1) - d^o \\quad \\text{if $T(i, j-1)$ is not gap} \\end{array} \\right\\}\n",
"& 0\\\\\n",
"& F(i-1, j-1) + s(c_i, c_j)\\\\\n",
"& \\left\\{\\begin{array}{l l} F(i-1, j) - d^e \\quad \\text{if $T(i-1, j)$ is gap}\\\\ F(i-1, j) - d^o \\quad \\text{if $T(i-1, j)$ is not gap} \\end{array} \\right\\} \\\\\n",
"& \\left\\{\\begin{array}{l l} F(i, j-1) - d^e \\quad \\text{if $T(i, j-1)$ is gap}\\\\ F(i, j-1) - d^o \\quad \\text{if $T(i, j-1)$ is not gap} \\end{array} \\right\\}\n",
" \\end{align}\\right)\n",
"$\n",
"$$\n",
"\n",
"\n",
"Notice how we only use the gap extend penalty if the previous max score resulted from a gap in the same sequence (which we know by looking in the traceback matrix) because it represents the continuation of an existing gap in that sequence. This is why we check for a specific type of gap in $T$, rather than checking whether $T$ `!= '\\'`. \n",
"\n",
2 changes: 1 addition & 1 deletion setup.py
@@ -7,7 +7,7 @@
# http://creativecommons.org/licenses/by-nc-sa/4.0/.
# -----------------------------------------------------------------------------

__version__ = '0.0.0-dev'
__version__ = '0.1.0'

from setuptools import find_packages, setup

