Coursework for UCB BioE 231 - Lab 2

This lab analyzes and plots the data given in the sequence file seq.fa

Command

A step by step explanation of Lab 2

Generating a phylogenetic tree

To align multiple sequences and generate multiple sequence alignment (MSA), we used the command:

be231-13@jitterbug ~/lab2 [5:21pm]> muscle -in seqss.fa -out seqss.aligned.fa

The result:

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

seqss 48 seqs, max length 2217, avg  length 2209
00:00:00     11 MB(1%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00     11 MB(1%)  Iter   1  100.00%  K-mer dist pass 2
00:00:07     83 MB(5%)  Iter   1  100.00%  Align node
00:00:07     83 MB(5%)  Iter   1  100.00%  Root alignment
00:00:11     83 MB(5%)  Iter   2  100.00%  Refine tree
00:00:11     83 MB(5%)  Iter   2  100.00%  Root alignment
00:00:11     83 MB(5%)  Iter   2  100.00%  Root alignment
00:00:26     83 MB(5%)  Iter   3  100.00%  Refine biparts
00:00:42     83 MB(5%)  Iter   4  100.00%  Refine biparts
00:00:42     83 MB(5%)  Iter   5  100.00%  Refine biparts
00:00:42     83 MB(5%)  Iter   5  100.00%  Refine biparts

Comparison of input and output of the command:

The seqs.fa file The seqs.aligned.fa file

After this, we utilized Morgan Price’s excellent tool FastTree turn multiple alignment into a Newick-formatted tree with the following command:

be231-13@jitterbug ~/lab2 [5:41pm]> FastTree -nt < seqss.aligned.fa > trees.nwk

The result:

FastTree Version 2.1.10 SSE3
Alignment: standard input
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Initial topology in 0.07 seconds
Refining topology: 22 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
Total branch-length 0.954 after 0.87 sec2, 1 of 46 splits
ML-NNI round 1: LogLk = -15621.453 NNIs 8 max delta 16.59 Time 1.37
Switched to using 20 rate categories (CAT approximation)20 of 20
Rate categories were divided by 0.740 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -14251.186 NNIs 2 max delta 0.00 Time 1.69
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3: LogLk = -14251.086 NNIs 0 max delta 0.00 Time 2.04 (final)
Optimize all lengths: LogLk = -14251.086 Time 2.15
Total time: 2.81 seconds Unique: 48/48 Bad splits: 0/45

The resulting newwick will be drawn in the iPython file.

Identifying sequences by BLAST & Calculating sequences statistics for each cluster

All the details are explained in the iPython file.

Here we show how the tree was clustered.

Cluster 1
hu.31
hu.32
hu.14
Cluster 2
hu.44
hu.46
hu.43
hu.48
Cluster 3
pi.3
pi.1
pi.2
Cluster 4
rh.43
Cluster 5
rh.58
rh.57
hu.39
rh.49
rh.51
rh.61
rh.52
rh.50
rh.53
rh.64
Cluster 6
bb.1
bb.2
rh.10
hu.17
hu.6
Cluster 7
rh.2
rh.40
hu.41
rh.38
hu.66
hu.67
hu.42
hu.37
hu.40
Cluster 8
cy.2
rh.54
rh.55
rh.48
rh.62
Cluster 9
rh.35
rh.36
rh.37
Cluster 10
cy.4
cy.6
cy.3
cy.5
rh.13

The cluster representatives are the first specimen of each cluster in the list. Their BLAST analysis reveals that:

The hu.31 sequence corresponds to the Adeno-associated virus isolate hu.31 capsid protein VP1 (cap) gene, complete dds.
The hu.44 sequence corresponds to the Adeno-associated virus isolate hu.44 capsid protein VP1 (cap) gene, complete cds.
The pi.3 sequence corresponds to the Adeno-associated virus isolate pi.3 capsid protein VP1 (cap) gene, complete cds.
The rh.58 sequence corresponds to the Adeno-associated virus isolate rh.58 capsid protein VP1 (cap) gene, complete cds.
The rh.58 sequence corresponds to the Adeno-associated virus isolate rh.58 capsid protein VP1 (cap) gene, complete cds.
The bb.1 sequence corresponds to the Non-human primate Adeno-associated virus isolate AAVbb.1 capsid protein (VP1) gene, complete cds.
The rh.2 sequence corresponds to the Non-human primate Adeno-associated virus isolate AAVrh.2 capsid protein (VP1) gene, complete cds.
The cy.2 sequence corresponds to the Non-human primate Adeno-associated virus isolate AAVcy.2 capsid protein (VP1) gene, complete cds.
The rh.35 sequence corresponds to the Non-human primate Adeno-associated virus isolate AAVrh.35 capsid protein (VP1) gene, complete cds.
The cy.4 sequence corresponds to the Non-human primate Adeno-associated virus isolate AAVcy.4 capsid protein (VP1) gene, complete cds.

Built With

Matplotlib - Python 2D plotting library
BioPhython - a set of freely available tools for biological computation

Authors

Thomas Galeon - Initial work - TheGaga
Zach Lyu - Initial work - bijiuni

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Clusters.odt		Clusters.odt
Lab2.ipynb		Lab2.ipynb
README.md		README.md
seqs.aligned.fa		seqs.aligned.fa
seqs.fa		seqs.fa
tree.nwk		tree.nwk

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Coursework for UCB BioE 231 - Lab 2

Command

Generating a phylogenetic tree

Identifying sequences by BLAST & Calculating sequences statistics for each cluster

Built With

Authors

About

Releases

Packages

Languages

TheGaga/Phylogenetic-Tree

Folders and files

Latest commit

History

Repository files navigation

Coursework for UCB BioE 231 - Lab 2

Command

Generating a phylogenetic tree

Identifying sequences by BLAST & Calculating sequences statistics for each cluster

Built With

Authors

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages