-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More rigorous implementation of findMissingResidues needed (e.g., doesn't work for antibodies) #170
Comments
Ooops, just saw someone else wrote up something similar #154 |
Thanks, that's a good writeup. There's one more important complication to add: chain IDs are usually not unique. Pick any PDB file, and you'll probably find it has more than one chain with the same ID. So you can't rely on them for matching up ATOM records to SEQRES records or REMARKs. Currently it does a sequence alignment to try to guess which chain each record actually corresponds to. |
Hi @peastman , here's my implementation. It looks like the easiest approach for me was just to modify
so that basically, I could substitute my own value for
It makes the following assumptions:
If you'd support it, I could prepare this as a Pull Request. Regardless, if you're interested at least in some suggestions:
|
Just found another bug. |
I am also experiencing issues with how PDBFixer's For example, chain H has the following missing residues:
But none of them are returned by Can we fix this bug? Using @nitroamos's code, I am able to identify the correct missing residues, but a bit more work is needed for the missing residues to actually be added. Also, it doesn't seem like his changes #178 were ever merged. |
Thanks! I can reproduce the problem. I'm looking into it now. |
This looks to me like an error in the PDB file. Either that, or I don't understand what's going on. Here's the relevant series of residues:
Notice there are two alternatives for residue 52. So the sequence of this span could be either ILE ILE VAL or else ILE PRO VAL. Now look at the corresponding SEQRES records. These residues span the end of line 4 and the start of line 5:
Here it lists the sequence as ILE ILE PRO VAL. It includes both alternatives for residue 52 as if they were two separate residues in the sequence. This leads PDBFixer to conclude the sequences don't match and it doesn't try to add missing residues for that chain. |
Thanks for looking into this! I was interpreting 52A as an insertion, not an alternative to residue 52. Would it be possible to implement changes in PDBFixer that handles insertion codes better, i.e. renumber the residues if insertion codes exist and then output a dictionary mapping old to new residue numbers? Or instead of a dictionary, just returning a PDB with the same residue numbers as the original but with missing residues added? |
Ok, I see. So given the set of residues for which there are ATOM records, and the residue indices and insertion codes for them, how do I determine the index into the SEQRES sequence that each one corresponds to? The documentation is very unclear about it. I assume the correct algorithm is something like this: for every residue that has an insertion code, add 1 to the index of that residue and every other residue that comes after it? |
Yes, I think you would add 1 to the index of the residue with the insertion code and every other residue after it, but I think you also need to maintain the gaps in subsequent residue numbers. I used the following code to renumber the residues in mdtraj before loading into PDBFixer:
|
@ivyzhang999 the fix for your issue is in #184. Could you try it and make sure it works for you? |
@peastman The fix works for me, thanks! |
Thanks! I've merged it. |
While writing up my response to openmm/openmm#2087 I went poking around in
findMissingResidues
to figure out how it's been implemented, and I found a problem.First, to recap that discussion:
You can't rely on residue numbers alone to determine where gaps in the structures are. Residues numbers are often chosen on the basis of an alignment with a family of proteins and as such can jump up or down. If you want to take an approach like this, you should use
r.index
instead ofr.id
. Example.You can't rely on
SEQRES
to find gaps in the structure. For example, ifSEQRES
says the sequence isAAA
and the coordinate section saysAA
, there are 3 places you could put the missing A.I'm not a PDB historian so I don't know why it's in a REMARK section, but REMARK 465 is supposed to be where you look to find where the missing residues are and given the template they provide, it is intended to be machine readable.
I went looking in the code for
findMissingResidues
to understand it. As far as I can tell, it makes these assumptions:REMARK 465
. This means it could be fooled bySEQRES
resulting in misthreaded structures.REMARK 465
, it doesn't know if the missing residues are already named (i.e., id and insertionCode)Take a look at this line:
the key used for residues
int(r.id)-minResidue
is not the correct index intoSEQRES
when the structure has insertion codes or when jumps in residue numbers are intended. For example, this code won't work for any antibody structures where the complementarity determining regions (CDRs) all have insertion codes. Take for example 1hzh, one of the PDB-101 Molecule of the Month structures.In its chain K, it has many missing residues:
but
findMissingResidues
doesn't find any of them.I wouldn't know for sure until trying, but it seems like addressing this could work like this if the PDB has a
REMARK 465
section:REMARK 465
section of the PDB file when read inSEQRES
SSSEQ
. At least in this example, it looks likeSSSEQ
is indexed in residue number space. Perhaps we must assume sane residue numbering near these gaps. In this example, it looks like we'll still be missing residues K128 and K129.SSSEQ
and the provided insertion code are supposed to be used in the new residues.SEQRES
and throw an exception or at least a warning if they don't match.If it doesn't have a
REMARK 465
section, then at the very least,findMissingResidues
should be refactored to user.index
(i.e., the index in the chain) rather thanr.id
.The text was updated successfully, but these errors were encountered: