Failing for long sequences #4

Open
ErikMarklund opened this issue Apr 12, 2022 · 4 comments

Comments

@ErikMarklund

run_disopred.pl fails for the three titin variants A2ASS6, E9Q8K5, and E9Q8N1 (UniProt accession codes). These are very long sequences, >30000 aa. The sequences contain no non-standard amino acids. See the output below.

E9Q8N1
Running PSI-BLAST search ...

Generating PSSM ...

Predicting disorder with DISOPRED2 ...

/domus/h1/marklund/src/disopred3/disopred/bin/disopred2 /domus/h1/marklund/calc_disorder/wd/E9Q8N1 /domus/h1/marklund/calc_disorder/wd/E9Q8N1_26496_12accd0b.mtx /domus/h1/marklund/src/disopred3/disopred/data/ 5 
mv /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso2 
Running neural network classifier ...

/domus/h1/marklund/src/disopred3/disopred/bin/diso_neu_net /domus/h1/marklund/src/disopred3/disopred/data/weights.dat.nmr_nonpdb /domus/h1/marklund/calc_disorder/wd/E9Q8N1_26496_12accd0b.mtx > /domus/h1/marklund/calc_disorder/wd/E9Q8N1.nndiso 
Running nearest neighbour classifier ...

/domus/h1/marklund/src/disopred3/disopred/bin/diso_neighb /domus/h1/marklund/calc_disorder/wd/E9Q8N1_26496_12accd0b.mtx /domus/h1/marklund/src/disopred3/disopred/data/dso.lst > /domus/h1/marklund/calc_disorder/wd/E9Q8N1.dnb 
Combining disordered residue predictions ...

/domus/h1/marklund/src/disopred3/disopred/bin/combine /domus/h1/marklund/src/disopred3/disopred/data/weights_comb.dat /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso2 /domus/h1/marklund/calc_disorder/wd/E9Q8N1.nndiso /domus/h1/marklund/calc_disorder/wd/E9Q8N1.dnb > /domus/h1/marklund/calc_disorder/wd/E9Q8N1.diso 
[/home/marklund/src/disopred3/disopred/run_disopred.pl] ERROR: Different numbers of elements in the profile data structure and the array of disordered region lengths

My Perl skills are just too weak to figure out what goes wrong, but the sequence length seems like a likely culprit. All of the other proteins in my dataset (more than 55000) worked fine.

@DanBuchan
Contributor

DanBuchan commented Apr 22, 2022

Hi,

Sorry for the time it's taken me to get back to you (been on holiday and moved house). This is unlikely to be an issue with Perl. The Perl code is just checking that the outputs of the various C programs are OK before it proceeds. If you look in the src/ directory you'll find assorted .c and .cpp files; the problem likely arises in one or more of them. If you can find places where the sequence length is capped at a fixed number, then you can likely fix this by patching your copy of DISOPRED and recompiling.

For instance, I see that both disordcomb_pred.c and diso_neighb.c contain the line #define MAXSEQLEN 50000. I'd guess you can just change that to some other big number (e.g. 70000) and it should work. So track down as many similar sequence-length definitions (maxseqlen, seqlen, sequence_length) as you can find, increase their sizes, and then recompile with:

cd src
make clean
make
make install

And, fingers crossed, it should work.
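
To help track down the hard-coded limits mentioned above, here is a minimal Python sketch. It is not part of DISOPRED. It assumes the limits are written as plain `#define NAME <integer>` lines whose name contains LEN; the function names `find_length_macros` and `scan_sources` are hypothetical, and the pattern is deliberately broad, so expect a few unrelated hits.

```python
import re
from pathlib import Path

# Assumption: length limits look like "#define MAXSEQLEN 50000".
# Any macro whose name contains LEN (case-insensitive) is reported.
DEFINE_RE = re.compile(r"^\s*#\s*define\s+(\w*LEN\w*)\s+(\d+)\b", re.IGNORECASE)


def find_length_macros(text):
    """Return (line_number, name, value) for each matching #define in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = DEFINE_RE.match(line)
        if match:
            hits.append((lineno, match.group(1), int(match.group(2))))
    return hits


def scan_sources(src_dir):
    """Print every matching #define found in C/C++ files under src_dir."""
    for path in sorted(Path(src_dir).rglob("*")):
        if path.is_file() and path.suffix in {".c", ".cpp", ".h"}:
            text = path.read_text(errors="replace")
            for lineno, name, value in find_length_macros(text):
                print(f"{path}:{lineno}: {name} = {value}")
```

Calling `scan_sources("src")` from the DISOPRED directory would list the candidate macros with file and line number, which can then be edited by hand before recompiling.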

@ErikMarklund
Author

C I know! Will try to change the macros to accommodate the titin sequences. Many thanks!

@ErikMarklund
Author

I will report back once I've tried it. Just need to wait a week or so until my current calculations end. Don't want to recompile mid-analysis.

@ErikMarklund
Author

Hi again,

I realised that the buffers were probably long enough, since they are 50000 by default and the sequences in question are about 32500 aa each. Because I reran my entire analysis using a larger reference database (UniRef90), I also reran these long sequences under the same conditions. This time I get another error:
ERROR: Different numbers of elements in the profile data structure and the array of disordered region lengths
which occurs for all three proteins.

It is not obvious to me what I can do to fix this, and I can live without these three proteins. But in case you are interested in digging deeper, the proteins in question have uniprot IDs A2ASS6, E9Q8K5, and E9Q8N1. I'd be happy to answer any questions about what I did to get this error, but I think it is pretty straightforward since I have not used anything unorthodox or modified anything.
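
For anyone wanting to repeat the length check mentioned above, a minimal Python sketch follows. It assumes the sequences are available as ordinary FASTA text; the function names are hypothetical, and the limit of 50000 is the MAXSEQLEN value quoted earlier in this thread. It only counts residues and compares them with that limit, so it says nothing about what actually triggers the error.

```python
# Assumption: this is the default buffer size quoted in the thread.
MAXSEQLEN = 50000


def fasta_lengths(text):
    """Return {record_id: residue_count} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def over_limit(lengths, limit=MAXSEQLEN):
    """Return the records whose length reaches or exceeds the limit.

    Using >= is a conservative assumption, since a C buffer may need
    one extra slot beyond the sequence itself.
    """
    return {name: n for name, n in lengths.items() if n >= limit}
```

An empty result from `over_limit(fasta_lengths(text))` would be consistent with the buffers being long enough for the sequences in question.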


2 participants