DISOPRED RELEASE NOTES
======================
DISOPRED Version 3.1
Copyright 2014 D. Jones, D. Cozzetto & J. Ward. All Rights Reserved.
Here are some brief notes on using the DISOPRED 3.1 software.
Please see the LICENSE file for the license terms for the software.
Basically it is free to academic users as long as you don't want to sell
the software or, for example, store the results obtained with it in a
database and then try to sell the database. If you do wish to sell the
software or use it commercially, then please contact [email protected]
to discuss licensing terms.
What is new in DISOPRED3
========================
DISOPRED3 represents the latest release of our successful machine-learning
based approach to the detection of intrinsically disordered regions. The
method was originally trained on evolutionarily conserved sequence features
of disordered regions from missing residues in high-resolution X-ray structures.
DISOPRED2 mainly addressed the marked class imbalance between ordered and
disordered amino acids as well as the different sequence patterns associated
with terminal and internal disordered regions using SVMs.
DISOPRED3 extends the previous architecture with two independent predictors of
intrinsic disorder - a neural network and a nearest neighbour classifier - which
were trained to identify long intrinsically disordered regions using data from
the PDB and DisProt databases. The intermediate results are integrated by an
additional neural network.
To provide insights into the biological roles of proteins, DISOPRED3 also predicts
protein binding sites within disordered regions using an SVM that examines patterns
of evolutionary sequence conservation, positional information and amino acid
composition of putative disordered regions.
INSTALLATION VIA GITHUB AND ANSIBLE
===================================
First ensure that Ansible is installed on your system, then clone the GitHub
repository:
% pip install ansible
% git clone https://github.com/psipred/disopred.git
% cd disopred/ansible_installer
Next, edit config_vars.yml to reflect where you would like DISOPRED and
its underlying data to be installed. You can choose a version of UniRef, but
we recommend uniref90 for the most accurate results.
You can now run the Ansible playbook with:
% ansible-playbook -i hosts install.yml
You can edit the hosts file to install disopred on one or more machines.
Installing DISOPRED3
====================
The program is supplied in source code form - some components must be
compiled before they can be used. On a standard Unix or Linux system,
DISOPRED can be compiled and installed from the src/ directory with:
make clean
make
make install
The process will place the executables in the DISOPRED bin/ directory, where
the script "run_disopred.pl" expects to find them. A copy of the svm-predict
program from the LIBSVM package Version 3.17 is also included for the prediction
of protein binding sites within disordered regions. Full details of LIBSVM,
including the licence, can be found at:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
You will additionally need to download the DISOPRED library into a directory
called dso_lib. From the main disopred directory, run:
wget http://bioinfadmin.cs.ucl.ac.uk/downloads/DISOPRED/dso_lib.tar.gz
tar -zxvf dso_lib.tar.gz
You must also set the environment variable DSO_LIB_PATH to the path of the
newly extracted dso_lib/ directory.
Configuring DISOPRED3
=====================
A simple Perl script called "run_disopred.pl" predicts intrinsically
disordered regions and the protein binding sites within them. The script
assumes that the NCBI BLAST binaries and appropriate sequence databases have
been installed locally. Their locations are specified through the variables:
my $NCBI_DIR = "/home/bin/blast-2.2.26/bin/"; # directory where the BLAST binaries are
my $SEQ_DB = "/home/uniref/uniref90"; # the path to the formatdb'ed sequence database
The NCBI executables can be obtained from ftp://ftp.ncbi.nih.gov/blast
Suitable sequence data banks are available from ftp://ftp.ncbi.nih.gov/blast/db/
and ftp://ftp.ebi.ac.uk/pub/databases/uniprot/
******************** IMPORTANT NOTE ON BLAST+ *****************
NCBI are encouraging users to switch over from the classic BLAST package
to the new BLAST+ package. On the one hand this is a cleaner and nicer
version of BLAST, but on the other hand, it omits some useful features.
In particular, BLAST+ no longer offers the facility to extract more precise
PSSM scores from checkpoint files in a "supported" way (i.e. using the
makemat utility for this purpose).
Eventually, we will probably switch over to BLAST+ as the preferred way of
searching for similar sequences, but for the time being no interface to
BLAST+ is provided.
***************************************************************************
The Perl script also expects to find the directories bin/, data/ and dso_lib/ alongside it.
If you need to move these directories somewhere else, change the values of the following
variables to the new full paths:
my $EXE_DIR = abs_path(join '/', dirname($0), "bin"); # the path of the bin directory
my $DATA_DIR = abs_path(join '/', dirname($0),"data"); # the path of the data directory
$ENV{DSO_LIB_PATH} = join '', abs_path("./dso_lib"), '/'; # the path of the library directory used by the nearest neighbour classifier
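Note that, in the lines above, bin/ and data/ are located relative to the
script itself (dirname($0)), whereas ./dso_lib is resolved against the current
working directory. When the script is launched from another program, it may
therefore help to start it from the DISOPRED directory and to pass the input
file as an absolute path. The following is only a sketch in Python under that
assumption; the function name build_invocation and the install location
/opt/disopred are hypothetical and not part of DISOPRED.

```python
import os
from pathlib import Path
from typing import List, Tuple


def build_invocation(install_dir: str, fasta: str) -> Tuple[List[str], str]:
    """Return (command, working directory) for a run_disopred.pl call."""
    root = Path(install_dir)
    # The input path is made absolute because the working directory
    # will be switched to the DISOPRED directory.
    fasta_abs = os.path.abspath(fasta)
    command = ["perl", str(root / "run_disopred.pl"), fasta_abs]
    return command, str(root)


# Hypothetical paths, for illustration only.
command, workdir = build_invocation("/opt/disopred", "/data/query.fasta")
# A caller could then run: subprocess.run(command, cwd=workdir, check=True)
```

The command is built but not executed here, so the sketch can be adapted to
whatever job launcher is in use.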
Running DISOPRED3
=================
The script "run_disopred.pl" requires as input a text file containing one
amino acid sequence for which predictions are sought. A few parameters can
be tuned from inside the script, including the PSI-BLAST search options and
the DISOPRED2 SVM specificity level. During the execution, a number of
temporary files will be generated (e.g. PSI-BLAST output files, the PSSM file,
the intermediate disordered residue prediction files, the input file to
svm-predict), which are identified by concatenating the input file name, the
process id of the Perl job and the numeric identifier for the host. These
files are removed after the final output has been generated in the same
directory as the input.
Here is the output of a successful DISOPRED run for the file examples/example.fasta:
./run_disopred.pl examples/example.fasta
Running PSI-BLAST search ...
Generating PSSM ...
Predicting disorder with DISOPRED2 ...
Running neural network classifier ...
Running nearest neighbour classifier ...
Combining disordered residue predictions ...
Predicting protein binding residues within disordered regions ...
Cleaning up ...
Finished
Disordered residue predictions in absolute-path/examples/example.diso
Protein binding disordered residue predictions in absolute-path/examples/example.pbdat
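Because the results are written next to the input file, a calling program can
work out where to look for them. The sketch below is in Python and assumes,
based only on the example above, that the output names are the input name with
its extension replaced by .diso and .pbdat. The helper expected_outputs is
hypothetical, and DISOPRED itself reports absolute paths.

```python
from pathlib import Path
from typing import Tuple


def expected_outputs(fasta_path: str) -> Tuple[Path, Path]:
    """Guess the result file locations for a given input file."""
    source = Path(fasta_path)
    # Assumption: the input extension is swapped for .diso / .pbdat,
    # as in examples/example.fasta -> examples/example.diso.
    return source.with_suffix(".diso"), source.with_suffix(".pbdat")


diso_file, pbdat_file = expected_outputs("examples/example.fasta")
```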
OUTPUT FILE FORMAT
==================
Results are saved in plain ASCII text format. Disordered region predictions are presented
in tabular format with four fields on each line representing the amino acid position, the
residue single letter code, the order/disorder assignment code, and the corresponding
confidence level. Ordered residues are marked with dots (.) and have scores in [0.00, 0.49];
disordered residues are labelled with asterisks (*) and are scored in [0.50, 1.00].
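The four-field layout described above can be read with a few lines of Python.
This is a minimal sketch, not part of DISOPRED: it assumes whitespace-separated
fields and skips blank lines and lines starting with "#" in case the file
carries comment lines. The sample rows are invented for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class DisoRow(NamedTuple):
    position: int
    residue: str
    code: str      # "." = ordered, "*" = disordered
    score: float


def parse_diso(lines: Iterable[str]) -> List[DisoRow]:
    """Parse disorder prediction rows; comment and blank lines are skipped."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pos, res, code, score = line.split()
        rows.append(DisoRow(int(pos), res, code, float(score)))
    return rows


def disordered_regions(rows: Iterable[DisoRow]) -> List[Tuple[int, int]]:
    """Group runs of '*' residues into inclusive (start, end) ranges."""
    regions, start, prev = [], None, None
    for row in rows:
        if row.code == "*":
            if start is None:
                start = row.position
            prev = row.position
        elif start is not None:
            regions.append((start, prev))
            start = None
    if start is not None:
        regions.append((start, prev))
    return regions


# Invented sample rows, not real DISOPRED output.
sample_diso = ["1 M * 0.91", "2 S * 0.77", "3 K . 0.31", "4 L . 0.12", "5 E * 0.58"]
regions = disordered_regions(parse_diso(sample_diso))
```

For a real run, the same functions can be applied to the lines of the .diso
file, for example by passing an open file handle to parse_diso.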
Putative disordered protein binding sites are annotated in a similar way, with one row for
each amino acid and four fields representing the sequence position, the single letter code,
the assignment code, and the confidence level. Ordered residues are labelled with dots (.)
and have no associated score, so the value in the last field is "NA". Protein-binding disordered
residues are indicated by carets (^) and their confidence scores are in [0.50, 1.00], while
all other unstructured positions are tagged with dashes (-) and are scored in [0.00, 0.49].
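The binding site file can be read in much the same way, with the one
difference that the score may be "NA". The Python sketch below is not part of
DISOPRED: it assumes whitespace-separated fields, skips blank lines and lines
starting with "#", and uses invented sample rows.

```python
from typing import Iterable, List, NamedTuple, Optional


class PbRow(NamedTuple):
    position: int
    residue: str
    code: str                 # "." ordered, "^" binding, "-" other disordered
    score: Optional[float]    # None when the file says "NA"


def parse_pbdat(lines: Iterable[str]) -> List[PbRow]:
    """Parse protein binding rows; 'NA' scores become None."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pos, res, code, raw = line.split()
        score = None if raw == "NA" else float(raw)
        rows.append(PbRow(int(pos), res, code, score))
    return rows


def binding_positions(rows: Iterable[PbRow]) -> List[int]:
    """Return the positions labelled as protein binding ('^')."""
    return [row.position for row in rows if row.code == "^"]


# Invented sample rows, not real DISOPRED output.
sample_pbdat = ["1 M ^ 0.83", "2 S - 0.22", "3 K . NA"]
binding = binding_positions(parse_pbdat(sample_pbdat))
```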
Citing DISOPRED3
================
Please cite:
Jones, D.T. and Cozzetto, D. (2014) DISOPRED3: Precise disordered region
predictions with annotated protein binding activity, Bioinformatics