{\rtf1\ansi\ansicpg1252\cocoartf1404\cocoasubrtf470
{\fonttbl\f0\fswiss\fcharset0 Helvetica;\f1\fmodern\fcharset0 Courier;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue233;\red66\green1\blue120;\red118\green0\blue2;
}
{\*\listtable{\list\listtemplateid1\listhybrid{\listlevel\levelnfc4\levelnfcn4\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{lower-alpha\}}{\leveltext\leveltemplateid1\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listlevel\levelnfc2\levelnfcn2\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{lower-roman\}}{\leveltext\leveltemplateid2\'01\'01;}{\levelnumbers\'01;}\fi-360\li1440\lin1440 }{\listname ;}\listid1}
{\list\listtemplateid2\listhybrid{\listlevel\levelnfc2\levelnfcn2\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{lower-roman\}}{\leveltext\leveltemplateid101\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid2}}
{\*\listoverridetable{\listoverride\listid1\listoverridecount0\ls1}{\listoverride\listid2\listoverridecount0\ls2}}
\margl1440\margr1440\vieww13920\viewh13060\viewkind0
\deftab720
\pard\pardeftab720\partightenfactor0
{\field{\*\fldinst{HYPERLINK "http://fasta.bioch.virginia.edu/bims6000/"}}{\fldrslt
\f0\b\fs42\fsmilli21333 \cf2 \expnd0\expndtw0\kerning0
\ul \ulc2 fasta.bioch.virginia.edu/bims6000}}
\f0\b\fs42\fsmilli21333 \expnd0\expndtw0\kerning0
\'a0\
\
\pard\pardeftab720\partightenfactor0
\b0\fs28 \cf0 10/24/2017 FASTA ProblemSet\
\
\pard\pardeftab720\partightenfactor0
\b\fs29\fsmilli14667 \cf0 1.
\b0 Use the {\field{\*\fldinst{HYPERLINK "http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=fa&query=295842263&db=p&annot_seq2=5"}}{\fldrslt \cf3 \ul \ulc3 FASTA search page [pgm]}} to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME {\field{\*\fldinst{HYPERLINK "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=295842263&dopt=fasta"}}{\fldrslt \cf3 \ul \ulc3 [seq]}} (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .\
\
\pard\tx220\tx720\pardeftab720\li720\fi-720\partightenfactor0
\ls1\ilvl0\cf0 \kerning1\expnd0\expndtw0 {\listtext a }\expnd0\expndtw0\kerning0
Take a look at the output. \
\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\partightenfactor0
\ls1\ilvl1\cf0 \kerning1\expnd0\expndtw0 {\listtext i }\expnd0\expndtw0\kerning0
How long is the query sequence?
\f1\fs26 217 aa
\f0\fs29\fsmilli14667 \
\ls1\ilvl1\kerning1\expnd0\expndtw0 {\listtext ii }\expnd0\expndtw0\kerning0
How many sequences are in the PIR1 database?
\f1\fs26 13144 sequences
\f0\fs29\fsmilli14667 \
\ls1\ilvl1\kerning1\expnd0\expndtw0 {\listtext iii }\expnd0\expndtw0\kerning0
What scoring matrix was used?
\f1\fs26 BL50 matrix
\f0\fs29\fsmilli14667 \
\ls1\ilvl1\kerning1\expnd0\expndtw0 {\listtext iv }\expnd0\expndtw0\kerning0
What were the gap penalties? (what is the penalty for a one-residue gap? two residues?)
\f1\fs26 open/ext: -10/-2 (opening a gap costs -10 and each residue in the gap costs a further -2), so a one-residue gap scores -10 + -2 = -12
\f0\fs29\fsmilli14667 \
\pard\tx940\tx1440\tx2160\pardeftab720\li1440\fi-1440\partightenfactor0
\ls1\ilvl1\cf0 \kerning1\expnd0\expndtw0 a two-residue gap scores -10 + 2(-2) = -14\
\pard\tx720\tx1440\tx2160\pardeftab720\partightenfactor0
\cf0 \
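The gap penalty arithmetic in the answer to iv can be checked with a short sketch. It assumes the affine convention implied by the -12 and -14 figures (the open penalty is charged once and the extension penalty once per gapped residue); the function name and its defaults are illustrative only and are not part of the FASTA programs.\

```python
def affine_gap_penalty(gap_length, gap_open=-10, gap_extend=-2):
    # Assumed convention: the open penalty is charged once per gap and the
    # extension penalty once for every residue in the gap.
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    return gap_open + gap_extend * gap_length


# Print the penalty for gaps of one, two and three residues.
for length in (1, 2, 3):
    print(length, affine_gap_penalty(length))
```

\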
\pard\tx720\tx1440\pardeftab720\partightenfactor0
\cf0 v. \expnd0\expndtw0\kerning0
What does each of the numbers after the description of the library sequence mean? Which one is best for inferring homology? \
\
\pard\pardeftab720\partightenfactor0
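\f1\fs26 \cf0 opt: raw optimized alignment score\
bits: normalized bit score\
\cf4 E(13144): E-value, the number of alignments with a score this high expected by chance in a search of 13144 sequences (best for inferring homology)\cf0 \
%_id: percent identity\
%_sim: percent similarity\
alen: alignment length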
\f0\fs29\fsmilli14667 \
\pard\tx720\tx1440\pardeftab720\partightenfactor0
\cf0 \
vi. \
How similar is the highest scoring sequence?
\fs24 81.1%
\fs29\fsmilli14667 \
\
What is the difference between %_id and %_sim?\
\
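\fs24 %_id is the percentage of aligned positions where the two residues are identical\
%_sim is the percentage of aligned positions where the two residues are identical or similar (a positive score in the BLOSUM matrix)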
\fs29\fsmilli14667 \
\
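The distinction between %_id and %_sim can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: SIMILAR_PAIRS is a tiny hand-picked stand-in for the positive off-diagonal entries of a BLOSUM matrix, the two aligned strings are made up, and dividing by the full alignment length (gap columns included) may not match how the FASTA programs normalize these values.\

```python
# Hypothetical stand-in for residue pairs with a positive BLOSUM score;
# a real analysis would consult the full scoring matrix.
SIMILAR_PAIRS = set([frozenset("IV"), frozenset("IL"), frozenset("LV"),
                     frozenset("KR"), frozenset("DE"), frozenset("FY")])


def percent_id_and_sim(aligned_query, aligned_library):
    # Both inputs are rows of one alignment, with "-" marking gaps.
    if len(aligned_query) != len(aligned_library):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    similar = 0
    for a, b in zip(aligned_query, aligned_library):
        if a == "-" or b == "-":
            continue  # gap columns add to the length but never match
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif frozenset((a, b)) in SIMILAR_PAIRS:
            similar += 1
    alen = len(aligned_query)  # assumption: gap columns are included
    return 100.0 * identical / alen, 100.0 * similar / alen


# Made-up five-column alignment: one identity, three similar pairs, one gap.
print(percent_id_and_sim("IKD-F", "VKEAY"))
```

\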
Why is there no 100% identity match?
\fs24 the query sequence itself is not in the PIR1 database
\fs29\fsmilli14667 \
\
\kerning1\expnd0\expndtw0 vii. \expnd0\expndtw0\kerning0
Looking at an alignment, where are the boundaries of the alignment (the best local region)? \
\
\fs24 start: residue 3 (I in the query; V in 
\f1 GSTT1_DROME)\
end: approximately residue 212
\fs26 \
\f0\fs29\fsmilli14667 \
How many gaps are in the best alignment? The second best?\
\
\fs24 one gap in 
\f1 GSTT1_DROME\
in the GSTF1_MAIZE alignment there are 4 gaps in the library sequence, and the query also contains gaps that are needed to align the two sequences
\fs26 \
\pard\tx720\tx1440\pardeftab720\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0 \
\
b.
\f0\fs29\fsmilli14667 \expnd0\expndtw0\kerning0
Homologs, non-homologs, and the statistical control.\
\pard\pardeftab720\partightenfactor0
\cf0 \
\pard\tx720\pardeftab720\partightenfactor0
\cf0 i. \
\
What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) \
\
\fs24 take a candidate non-homolog and use it as the query in a search of another database\
\
sp|P08355|GB_SUHVF Envelope glycoprotein \
\
align > general re-search > change the search database to SwissProt (which contains viral proteins)\
\
this search didn\'92t find any glutathione transferases\
\
try another one!\
\
\pard\pardeftab720\partightenfactor0
\f1\fs26 \cf0 sp|Q9SI20|EF1D2_ARATH Elongation factor\
yields a lot of elongation factors, no glutathione transferases\
\
try another one!\
\
sp|P09211|GSTP1_HUMAN Glutathione S-transferase P\
\
\cf4 E=0.12\cf0 \
\
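the re-search finds many glutathione transferases, so GSTP1_HUMAN is a distant homolog rather than a non-homolog; the highest scoring non-homolog is the best-scoring candidate whose re-search finds no glutathione transferases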
\f0\fs24 \
\pard\tx720\pardeftab720\partightenfactor0
\cf0 \
\fs29\fsmilli14667 If the statistical estimates are accurate, what should the E()-value for the highest non-homolog (the highest score by chance) be? \
\
\fs24 should be ~1 (an alignment score this high is expected about once per search by chance)
\fs29\fsmilli14667 \
\
(This is a control for statistical accuracy.) You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. \
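The expectation that the best chance score has E() near 1 can be sketched numerically. The sketch assumes the usual reading of E(13144) as a per-comparison probability multiplied by the number of library sequences searched; the function name and the example probability are hypothetical and are not taken from the search output.\

```python
def expected_chance_hits(p_single, db_size=13144):
    # Assumed model: expected number of chance hits equals the probability
    # of reaching the score in one comparison times the number of
    # library sequences compared.
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return p_single * db_size


# A score expected once in 13144 unrelated comparisons gives E() of about 1.
print(expected_chance_hits(1.0 / 13144))
```

\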
\uc0\u8232 ii. What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs?\
\
\fs20 GSTA4_RAT Glutathione S-transferase
\fs29\fsmilli14667 \
\fs24 yes there could be more distant homologs
\fs29\fsmilli14667 \
\
iii.\
\
How would you confirm that your candidate non-homolog was truly unrelated? (
\i Hint
\i0 - compare your candidate non-homolog with
\b SwissProt
\b0 or
\b QFO78/Uniprot Ref
\b0 for a more comprehensive test.)
\f1\fs26 \kerning1\expnd0\expndtw0 \
\pard\tx720\tx1440\pardeftab720\partightenfactor0
\cf0 \
\
\
\
\
\
c.
\f0\fs29\fsmilli14667 \expnd0\expndtw0\kerning0
Domains and alignment regions\
\pard\pardeftab720\partightenfactor0
\cf0 \
\pard\tx220\tx720\pardeftab720\li720\fi-720\partightenfactor0
\ls2\ilvl0\cf0 \kerning1\expnd0\expndtw0 {\listtext i }\expnd0\expndtw0\kerning0
There are three parts to the domain display: the domain structure of the query (top) sequence (if available), the domain structure of the library (bottom) sequence, and the domain alignment boundaries in the middle (inside the alignment box). The boundaries and colors of the aligned domains match the 
\f1 Region:
\f0 sub-alignment scores.\
\pard\tx720\pardeftab720\partightenfactor0
\cf0 \
ii. Note that the alignment of Honey bee
\f1 GSTD1
\f0 and
\f1 SSPA_ECO57
\f0 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why do you think the alignments do not include the complete domains?\
\
\fs24 if the sequences are too distantly related, the most diverged regions probably won\'92t score as significant\
so only a partial (local) alignment is possible\
if you forced the alignment to cover the complete domains, the score would be worse\
so the aligner aligns only a portion of the query to the library sequence
\fs29\fsmilli14667 \
\fs24 \
\fs29\fsmilli14667 \
iii. Is your explanation for the partial domain alignment consistent with the argument that domains have a characteristic length? How might you test whether a complete domain is present? In the sub-alignment scores, the 
\f1 Q
\f0 value is -10 * log10(p) for the sub-alignment score, so Q=30.0 means p = 0.001.\
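The relationship between Q and p can be inverted with a short sketch. It assumes the logarithm is base 10, which is what makes Q=30.0 correspond to p = 0.001; the function names are illustrative and are not part of the FASTA output.\

```python
import math


def q_to_p(q):
    # Invert Q = -10 * log10(p) to recover the probability.
    return 10.0 ** (-q / 10.0)


def p_to_q(p):
    # Forward direction: a smaller probability gives a larger Q.
    if p <= 0.0:
        raise ValueError("p must be positive")
    return -10.0 * math.log10(p)


# Q values of 10, 20 and 30 correspond to p of 0.1, 0.01 and 0.001.
for q in (10.0, 20.0, 30.0):
    print(q, q_to_p(q))
```

\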
\
\
\
\
\
\
PARSING BLAST RESULTS\
\
}