PDB_cleaner

A Python 3 script for cleaning up PDB files

PDB files are often complicated and contain a lot of redundant information, as the cases below show.

  • ANISOU (data copied from 1lk2.pdb)
ATOM      1  N   GLY A   1      66.440  45.780   5.177  1.00 14.10           N  
ANISOU    1  N   GLY A   1     1908   1789   1659     99    -37     -3       N
ATOM      2  CA  GLY A   1      65.947  45.284   3.863  1.00 12.08           C  
ANISOU    2  CA  GLY A   1     1484   1486   1620     47    -39     75       C   
ATOM      3  C   GLY A   1      64.961  46.275   3.303  1.00 10.99           C  
ANISOU    3  C   GLY A   1     1471   1204   1500     36     50    108       C  
ATOM      4  O   GLY A   1      64.683  47.291   3.943  1.00 11.91           O  
ANISOU    4  O   GLY A   1     1390   1542   1593    -61     88    -19       O   

The simplest way is to delete the ANISOU lines.
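
As a minimal sketch of this step (not the script's actual code; the function name `drop_anisou` is hypothetical), ANISOU records can be filtered out by their record name, which occupies the first six columns of every PDB line:

```python
def drop_anisou(lines):
    """Return the PDB lines with every ANISOU record removed."""
    return [line for line in lines if not line.startswith("ANISOU")]


# Two atoms from the 1lk2.pdb excerpt, each followed by its ANISOU record.
sample = [
    "ATOM      1  N   GLY A   1      66.440  45.780   5.177  1.00 14.10           N  ",
    "ANISOU    1  N   GLY A   1     1908   1789   1659     99    -37     -3       N",
    "ATOM      2  CA  GLY A   1      65.947  45.284   3.863  1.00 12.08           C  ",
    "ANISOU    2  CA  GLY A   1     1484   1486   1620     47    -39     75       C   ",
]

cleaned = drop_anisou(sample)
for line in cleaned:
    print(line)
```

Only the ATOM lines remain; every other record type passes through untouched.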

  • HETATM

  • non-standard amino acid residues (data copied from 2o2x.pdb)

ATOM    821  OD1 ASP A 112      25.580  11.019  35.906  1.00 12.28           O  
ATOM    822  OD2 ASP A 112      24.586   9.016  35.848  1.00 11.81           O  
HETATM  823  N   MSE A 113      25.018  10.050  30.641  1.00  9.26           N  
HETATM  824  CA  MSE A 113      25.494  10.026  29.262  1.00  9.59           C  
HETATM  825  C   MSE A 113      24.291   9.758  28.359  1.00  8.63           C  
HETATM  826  O   MSE A 113      23.362   9.026  28.750  1.00  9.51           O  
HETATM  827  CB  MSE A 113      26.563   8.959  29.078  1.00  8.81           C  
HETATM  828  CG  MSE A 113      27.157   8.896  27.700  1.00  8.23           C  
HETATM  829  SE  MSE A 113      28.681   7.732  27.499  0.75 12.65          SE  
HETATM  830  CE  MSE A 113      30.013   8.895  28.258  1.00 18.85           C  
ATOM    831  N   VAL A 114      24.306  10.362  27.178  1.00  7.56           N  
ATOM    832  CA  VAL A 114      23.308  10.072  26.129  1.00  7.93           C  

Since PDBSlicer could not deal with non-standard residues, the simplest way is to delete them.
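
One possible way to express this filter, assuming that "non-standard" means any residue name outside the 20 common amino acids (the set `STANDARD_RESIDUES` and the function `drop_nonstandard` are illustrative names, not taken from the script):

```python
# Assumption: only these 20 residue names count as "standard".
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def drop_nonstandard(lines):
    """Drop coordinate records whose residue name (columns 18-20) is non-standard."""
    kept = []
    for line in lines:
        is_coord = line.startswith(("ATOM", "HETATM", "ANISOU"))
        if is_coord and line[17:20] not in STANDARD_RESIDUES:
            continue
        kept.append(line)
    return kept


# Three lines from the 2o2x.pdb excerpt: ASP 112, MSE 113, VAL 114.
sample = [
    "ATOM    822  OD2 ASP A 112      24.586   9.016  35.848  1.00 11.81           O  ",
    "HETATM  823  N   MSE A 113      25.018  10.050  30.641  1.00  9.26           N  ",
    "ATOM    831  N   VAL A 114      24.306  10.362  27.178  1.00  7.56           N  ",
]

cleaned = drop_nonstandard(sample)
print([line[17:20] for line in cleaned])
```

Deleting MSE 113 leaves residue 112 followed by 114, so the removed residue then shows up as a sequence gap, which is the next case.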

  • Missing Residue(s) or so-called Sequence Gap(s) (data copied from 1nzj.pdb)
ATOM   1756  O   LEU A 222      48.274   3.534  34.949  1.00 27.98           O  
ANISOU 1756  O   LEU A 222     3531   3513   3584     47    -26      6       O  
ATOM   1757  CB  LEU A 222      45.906   2.133  33.476  1.00 26.02           C  
ANISOU 1757  CB  LEU A 222     3274   3295   3315     29     58     -5       C  
ATOM   1758  N   ASN A 223      47.050   5.216  34.027  1.00 28.96           N  
ANISOU 1758  N   ASN A 223     3698   3595   3707     53      7     21       N  
ATOM   1759  CA  ASN A 223      47.326   6.262  35.028  1.00 29.62           C  
ANISOU 1759  CA  ASN A 223     3782   3735   3737     11     12    -25       C  
ATOM   1760  C   ASN A 223      48.230   5.851  36.192  1.00 30.19           C  
ANISOU 1760  C   ASN A 223     3871   3833   3767     45     -3     -6       C  
ATOM   1761  O   ASN A 223      47.951   6.165  37.354  1.00 31.23           O  
ANISOU 1761  O   ASN A 223     4074   3965   3824     76     26    -70       O  
ATOM   1762  CB  ASN A 223      46.003   6.831  35.561  1.00 29.98           C  
ANISOU 1762  CB  ASN A 223     3798   3785   3804     36     15      8       C  
ATOM   1763  N   ALA A 237      50.141  13.856  28.172  1.00 30.51           N  
ANISOU 1763  N   ALA A 237     3895   3875   3821     32     28    -17       N  
ATOM   1764  CA  ALA A 237      50.857  13.904  26.900  1.00 30.22           C  
ANISOU 1764  CA  ALA A 237     3816   3837   3827      7      2      7       C  
ATOM   1765  C   ALA A 237      52.347  13.656  27.124  1.00 30.06           C  
ANISOU 1765  C   ALA A 237     3809   3808   3803     26     11    -16       C  
ATOM   1766  O   ALA A 237      52.869  13.962  28.189  1.00 30.54           O  
ANISOU 1766  O   ALA A 237     3901   3866   3834     34    -52     -8       O  
ATOM   1767  CB  ALA A 237      50.648  15.254  26.254  1.00 30.32           C  
ANISOU 1767  CB  ALA A 237     3814   3832   3871     20      0      0       C  
ATOM   1768  N   LEU A 238      53.035  13.117  26.121  1.00 29.79           N  
ANISOU 1768  N   LEU A 238     3760   3773   3785      9     -4    -12       N  
ATOM   1769  CA  LEU A 238      54.470  12.845  26.250  1.00 29.52           C  
ANISOU 1769  CA  LEU A 238     3743   3740   3733      9      9    -18       C  

Residue 223 (ASN) is followed directly by residue 237 (ALA): the sequence numbers are discontinuous (223 ...empty... 237) because the residues in between are missing. We call this case a sequence gap. It is a serious problem because the Ramachandran subunit is defined by three adjacent residues, so it is impossible to directly choose the residue number series (222, 223, 237) or (223, 237, 238) as the members of a Ramachandran subunit. The solution is to treat the peptide as segments, e.g. from the beginning to residue number 223, then from residue number 237 to the end. If the PDB file has more than one gap, we divide it into several segments based on the locations of the gaps. Note: a discontinuity in sequence numbers between different chains is also treated as a 'gap', simply because that is easier to program.

Improvement (Nov. 16, 2017) In the printed output and the report, the chain ID is now shown next to the sequence number, e.g. ('A:223', 'A:237'). Previously, only the sequence numbers on either side of the gap(s) were shown.
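
The segment idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the script's code: `split_segments` and `triplets` are hypothetical names, the input is assumed to be an ordered list of (chain ID, sequence number) pairs with one entry per residue, and residues 221 and 239 are made up to pad the example.

```python
def split_segments(residues):
    """Split an ordered residue list into gap-free segments.

    A new segment starts whenever the chain changes or the sequence
    number does not increase by exactly one.
    """
    segments, gaps, current = [], [], []
    for res in residues:
        if current:
            prev = current[-1]
            contiguous = res[0] == prev[0] and res[1] == prev[1] + 1
            if not contiguous:
                # Report both sides of the gap with the chain ID attached.
                gaps.append((f"{prev[0]}:{prev[1]}", f"{res[0]}:{res[1]}"))
                segments.append(current)
                current = []
        current.append(res)
    if current:
        segments.append(current)
    return segments, gaps


def triplets(segment):
    """All runs of three adjacent residues inside one gap-free segment."""
    return [tuple(segment[i:i + 3]) for i in range(len(segment) - 2)]


residues = [("A", 221), ("A", 222), ("A", 223),
            ("A", 237), ("A", 238), ("A", 239)]
segments, gaps = split_segments(residues)
print(gaps)
for seg in segments:
    print(triplets(seg))
```

Because triplets are built inside each segment only, a combination such as (222, 223, 237) can never be produced.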

  • alternate locations (data copied from 3ife.pdb)
ATOM     21  CE1 PHE A  -4      40.991  47.856  19.364  1.00 27.65           C  
ATOM     22  CE2 PHE A  -4      41.948  49.936  20.068  1.00 28.56           C  
ATOM     23  CZ  PHE A  -4      40.841  49.190  19.686  1.00 28.08           C  
ATOM     24  N  AGLN A  -3      46.967  45.549  21.004  0.50 23.13           N  
ATOM     25  N  BGLN A  -3      46.998  45.555  20.982  0.50 22.90           N  
ATOM     26  CA AGLN A  -3      48.373  45.164  21.046  0.50 23.16           C  
ATOM     27  CA BGLN A  -3      48.400  45.139  20.949  0.50 22.72           C  
ATOM     28  C  AGLN A  -3      48.567  43.661  20.812  0.50 22.63           C  
ATOM     29  C  BGLN A  -3      48.554  43.631  20.764  0.50 22.37           C  
ATOM     30  O  AGLN A  -3      49.384  43.259  19.986  0.50 21.47           O  
ATOM     31  O  BGLN A  -3      49.344  43.191  19.930  0.50 21.20           O  
ATOM     32  CB AGLN A  -3      49.002  45.601  22.384  0.50 23.61           C  
ATOM     33  CB BGLN A  -3      49.160  45.591  22.201  0.50 23.04           C  
ATOM     34  CG AGLN A  -3      48.488  44.854  23.614  0.50 25.25           C  
ATOM     35  CG BGLN A  -3      50.631  45.172  22.185  0.50 23.58           C  
ATOM     36  CD AGLN A  -3      46.975  44.863  23.719  0.50 25.86           C  
ATOM     37  CD BGLN A  -3      51.390  45.619  23.424  0.50 24.16           C  
ATOM     38  OE1AGLN A  -3      46.364  43.846  24.056  0.50 23.20           O  
ATOM     39  OE1BGLN A  -3      50.935  46.485  24.167  0.50 26.65           O  
ATOM     40  NE2AGLN A  -3      46.361  45.990  23.375  0.50 25.35           N  
ATOM     41  NE2BGLN A  -3      52.563  45.035  23.640  0.50 27.37           N  
ATOM     42  N   SER A  -2      47.792  42.842  21.521  1.00 21.72           N  
ATOM     43  CA  SER A  -2      47.888  41.386  21.401  1.00 22.23           C  
ATOM     44  C   SER A  -2      47.402  40.921  20.036  1.00 19.65           C  
ATOM     45  O   SER A  -2      48.008  40.034  19.456  1.00 20.72           O  
  • special cases in alternate locations (data copied from 5DXX.pdb)
ATOM    448  N   MET A  61      48.127   9.414  21.012  1.00  8.02           N  
ANISOU  448  N   MET A  61      952    878   1219    -50    501     95       N  
ATOM    449  CA AMET A  61      47.494   8.918  22.231  0.58  8.39           C  
ANISOU  449  CA AMET A  61     1091    827   1271     24    428    219       C  
ATOM    450  CA BMET A  61      47.420   8.922  22.202  0.42  8.88           C  
ANISOU  450  CA BMET A  61     1144    895   1334    -61    457    185       C  
ATOM    451  C   MET A  61      47.346   7.404  22.267  1.00  8.78           C  
ANISOU  451  C   MET A  61     1223    782   1330     59    378    169       C  
ATOM    452  O   MET A  61      46.991   6.766  21.272  1.00 10.08           O  
ANISOU  452  O   MET A  61     1398    943   1491     14     75    122       O  
ATOM    453  CB AMET A  61      46.118   9.546  22.410  0.58  8.06           C  
ANISOU  453  CB AMET A  61      903    838   1320    372    380    212       C  
ATOM    454  CB BMET A  61      45.980   9.455  22.241  0.42  8.97           C  
ANISOU  454  CB BMET A  61      930    991   1486      2    458    168       C  
ATOM    455  CG AMET A  61      46.138  11.063  22.501  0.58  8.72           C  
ANISOU  455  CG AMET A  61     1253    809   1251    307    330    165       C  
ATOM    456  CG BMET A  61      45.805  10.973  22.171  0.42  9.66           C  
ANISOU  456  CG BMET A  61     1110   1045   1516     57    274    111       C  
ATOM    457  SD AMET A  61      44.516  11.746  22.852  0.58  9.87           S  
ANISOU  457  SD AMET A  61     1393   1136   1221    329    227    121       S  
ATOM    458  SD BMET A  61      44.071  11.452  21.925  0.42 11.25           S  
ANISOU  458  SD BMET A  61     1357   1344   1573    -97    206    -11       S  
ATOM    459  CE AMET A  61      43.632  11.262  21.374  0.58  8.79           C  
ANISOU  459  CE AMET A  61      912   1125   1304    448    193     62       C  
ATOM    460  CE BMET A  61      43.308  10.818  23.419  0.42 10.70           C  
ANISOU  460  CE BMET A  61     1120   1371   1573     26    296    -15       C  
...
ATOM   2041  N   ARG A 268      68.983  -6.030  20.233  1.00 12.62           N  
ANISOU 2041  N   ARG A 268     1676    819   2299    101    -35    523       N  
ATOM   2042  CA BARG A 268      68.988  -4.603  20.530  0.60 12.88           C  
ANISOU 2042  CA BARG A 268     1398    984   2513    141    107    402       C  
ATOM   2043  CA CARG A 268      68.989  -4.603  20.527  0.40 12.83           C  
ANISOU 2043  CA CARG A 268     1483    920   2471     82    157    473       C  
ATOM   2044  C   ARG A 268      67.641  -3.953  20.247  1.00 11.56           C  
ANISOU 2044  C   ARG A 268     1170    935   2286    -23     70    342       C  
ATOM   2045  O   ARG A 268      66.930  -4.345  19.316  1.00 12.73           O  
ANISOU 2045  O   ARG A 268     1496   1160   2181    -23     37    354       O  
ATOM   2046  CB BARG A 268      70.061  -3.890  19.701  0.60 15.09           C  
ANISOU 2046  CB BARG A 268     1382   1451   2901    308    108    405       C  
ATOM   2047  CB CARG A 268      70.065  -3.894  19.699  0.40 14.76           C  
ANISOU 2047  CB CARG A 268     1631   1189   2787    164    317    597       C  
ATOM   2048  CG BARG A 268      71.428  -4.538  19.755  0.60 20.70           C  
ANISOU 2048  CG BARG A 268     2380   2183   3300    506    138    227       C  
ATOM   2049  CG CARG A 268      71.458  -4.466  19.860  0.40 18.82           C  
ANISOU 2049  CG CARG A 268     2367   1677   3108    350    438    578       C  
ATOM   2050  CD BARG A 268      72.280  -3.968  20.869  0.60 24.96           C  
ANISOU 2050  CD BARG A 268     3408   2535   3540    600    301    -84       C  
ATOM   2051  CD CARG A 268      72.378  -3.477  20.542  0.40 22.04           C  
ANISOU 2051  CD CARG A 268     3150   1893   3329    467    727    531       C  
ATOM   2052  NE BARG A 268      73.616  -4.559  20.871  0.60 27.23           N  
ANISOU 2052  NE BARG A 268     3846   2843   3658    816    402   -233       N  
ATOM   2053  NE CARG A 268      73.461  -3.031  19.670  0.40 25.28           N  
ANISOU 2053  NE CARG A 268     3900   2201   3505    545    882    478       N  
ATOM   2054  CZ BARG A 268      74.606  -4.169  20.074  0.60 29.73           C  
ANISOU 2054  CZ BARG A 268     4396   3111   3790   1084    535   -418       C  
ATOM   2055  CZ CARG A 268      74.657  -3.607  19.612  0.40 28.07           C  
ANISOU 2055  CZ CARG A 268     4528   2513   3625    423    993    395       C  
ATOM   2056  NH1BARG A 268      74.412  -3.180  19.206  0.60 30.54           N  
ANISOU 2056  NH1BARG A 268     4601   3217   3787   1268    624   -504       N  
ATOM   2057  NH1CARG A 268      74.925  -4.665  20.369  0.40 29.81           N  
ANISOU 2057  NH1CARG A 268     4898   2709   3720    472    964    299       N  
ATOM   2058  NH2BARG A 268      75.794  -4.766  20.144  0.60 30.36           N  
ANISOU 2058  NH2BARG A 268     4511   3196   3828   1150    633   -562       N  
ATOM   2059  NH2CARG A 268      75.586  -3.125  18.795  0.40 28.14           N  
ANISOU 2059  NH2CARG A 268     4497   2583   3613    248   1151    380       N  

In this case (5DXX.pdb), there are three alternate location labels: A, B, and C. However, they are distributed irregularly. For instance, residue 61 has A and B, whereas residue 268 has B and C. As a result, it is impossible to simply use pdb_info[(pdb_info.Alt_Loc == ' ') | (pdb_info.Alt_Loc == 'A')], because that would delete all B- and C-labeled atoms in residue 268!

Improvement or Debug (Sep. 04, 2017) By using pandas df.groupby() on the ['Seq_Num', 'ChainID'] columns, we can focus on each specific residue and keep the first alternate location, no matter whether the first one is 'A', 'B', or 'C'. The code is shown below:

    #### delete the redundant alternate locations, only keep the first appearance
    if altloc:
        groups = pdb_info.groupby(['Seq_Num', 'ChainID'], sort=False)
        # within each residue, keep the first record of every atom name;
        # residues with fewer than two Alt_Loc labels are returned unchanged
        pdb_info = groups.apply(lambda x:
                                x.drop_duplicates(subset=["AtomTyp"],
                                                  keep='first')
                                if x['Alt_Loc'].nunique() >= 2 else x)
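
The same "keep the first appearance" rule can be illustrated without pandas. The sketch below is an assumption-laden stand-in, not the script's implementation: `keep_first_altloc` is a hypothetical name, it works on raw PDB lines instead of the `pdb_info` DataFrame, and it assumes ANISOU lines have already been removed.

```python
def keep_first_altloc(lines):
    """Keep only the first record of each atom name within each residue."""
    seen = set()
    kept = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            # chain ID, sequence number + insertion code, atom name
            key = (line[21], line[22:27], line[12:16])
            if key in seen:
                continue  # a later alternate location of the same atom
            seen.add(key)
        kept.append(line)
    return kept


# Lines from the 5DXX.pdb excerpt: residue 61 uses A/B, residue 268 uses B/C.
sample = [
    "ATOM    448  N   MET A  61      48.127   9.414  21.012  1.00  8.02           N  ",
    "ATOM    449  CA AMET A  61      47.494   8.918  22.231  0.58  8.39           C  ",
    "ATOM    450  CA BMET A  61      47.420   8.922  22.202  0.42  8.88           C  ",
    "ATOM   2041  N   ARG A 268      68.983  -6.030  20.233  1.00 12.62           N  ",
    "ATOM   2042  CA BARG A 268      68.988  -4.603  20.530  0.60 12.88           C  ",
    "ATOM   2043  CA CARG A 268      68.989  -4.603  20.527  0.40 12.83           C  ",
]

kept = keep_first_altloc(sample)
print([line[16] for line in kept])  # alternate location label of each kept line
```

Residue 61 keeps its 'A' atoms and residue 268 keeps its 'B' atoms, so no residue loses an atom that has alternate locations. The kept lines still carry their original label and partial occupancy.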
  • insertion codes
ATOM   1258  CD1 ILE A 185       4.002  11.557  18.921  1.00 19.47           C  
ANISOU 1258  CD1 ILE A 185     2567   2632   2200    -66   -252    125       C  
ATOM   1259  N   PRO A 186       6.584  15.226  16.396  1.00 16.95           N  
ANISOU 1259  N   PRO A 186     2324   2351   1766    -93   -218    271       N  
ATOM   1260  CA  PRO A 186       6.984  16.463  15.718  1.00 17.27           C  
ANISOU 1260  CA  PRO A 186     2382   2394   1786   -103   -219    330       C  
ATOM   1261  C   PRO A 186       6.139  17.642  16.167  1.00 19.26           C  
ANISOU 1261  C   PRO A 186     2626   2598   2094    -86   -245    374       C  
ATOM   1262  O   PRO A 186       4.907  17.532  16.301  1.00 18.40           O  
ANISOU 1262  O   PRO A 186     2500   2480   2011    -67   -280    374       O  
ATOM   1263  CB  PRO A 186       6.742  16.159  14.234  1.00 20.29           C  
ANISOU 1263  CB  PRO A 186     2785   2831   2092   -115   -244    345       C  
ATOM   1264  CG  PRO A 186       6.728  14.695  14.124  1.00 25.31           C  
ANISOU 1264  CG  PRO A 186     3421   3497   2701   -115   -240    282       C  
ATOM   1265  CD  PRO A 186       6.252  14.151  15.432  1.00 19.86           C  
ANISOU 1265  CD  PRO A 186     2702   2765   2078   -100   -238    244       C  
ATOM   1266  N   ASP A 186A      6.812  18.768  16.413  1.00 16.88           N  
ANISOU 1266  N   ASP A 186A    2335   2266   1814    -93   -227    410       N  
ATOM   1267  CA  ASP A 186A      6.193  20.046  16.803  1.00 18.33           C  
ANISOU 1267  CA  ASP A 186A    2517   2396   2051    -76   -248    453       C  
ATOM   1268  C   ASP A 186A      5.389  19.957  18.110  1.00 21.71           C  
ANISOU 1268  C   ASP A 186A   2920   2782   2548    -46   -251    420        C  
ATOM   1269  O   ASP A 186A      4.477  20.754  18.337  1.00 23.99           O  
ANISOU 1269  O   ASP A 186A    3201   3034   2879    -21   -276    447       O  
ATOM   1270  CB  ASP A 186A      5.342  20.626  15.640  1.00 20.86           C  
ANISOU 1270  CB  ASP A 186A    2848   2731   2345    -71   -295    510       C  
ATOM   1271  CG  ASP A 186A      6.138  20.870  14.377  1.00 27.21           C  
ANISOU 1271  CG  ASP A 186A    3681   3578   3078   -102   -290    551       C  
ATOM   1272  OD1 ASP A 186A      7.316  21.272  14.485  1.00 27.14           O  
ANISOU 1272  OD1 ASP A 186A    3686   3563   3064   -125   -254    561       O  
ATOM   1273  OD2 ASP A 186A      5.578  20.677  13.277  1.00 34.63           O  
ANISOU 1273  OD2 ASP A 186A    4630   4560   3967   -104   -324    575       O  
ATOM   1274  N   SER A 186B      5.742  18.999  18.983  1.00 16.28           N  
ANISOU 1274  N   SER A 186B    2218   2098   1871    -47   -223    364       N  
ATOM   1275  CA  SER A 186B      5.050  18.813  20.239  1.00 16.16           C  
ANISOU 1275  CA  SER A 186B    2178   2050   1911    -22   -220    332       C  
ATOM   1276  C   SER A 186B      6.014  18.876  21.407  1.00 16.84           C  
ANISOU 1276  C   SER A 186B    2267   2109   2024    -28   -181    302       C  
ATOM   1277  O   SER A 186B      7.167  18.490  21.277  1.00 17.17           O  
ANISOU 1277  O   SER A 186B    2317   2170   2035    -52   -156    289       O  
ATOM   1278  CB  SER A 186B      4.378  17.452  20.244  1.00 17.47           C  
ANISOU 1278  CB  SER A 186B    2323   2250   2066    -18   -229    294       C  
ATOM   1279  OG  SER A 186B      3.785  17.181  21.503  1.00 16.37           O  
ANISOU 1279  OG  SER A 186B    2158   2085   1978      2   -220    264       O  
ATOM   1280  N   LYS A 187       5.518  19.323  22.546  1.00 14.62           N  
ANISOU 1280  N   LYS A 187     1974   1786   1795     -5   -177    290       N  

As shown above, the same sequence number (186) carries two insertion codes (A and B), and these labels belong to two different residues, ASP and SER! The simplest way is to delete the residues labeled with insertion codes.
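
A minimal sketch of that deletion, assuming the insertion code sits in column 27 of each coordinate line as in the excerpt above (`drop_insertion_codes` is a hypothetical helper name, not the script's own):

```python
def drop_insertion_codes(lines):
    """Drop coordinate records whose insertion code (column 27) is not blank."""
    kept = []
    for line in lines:
        is_coord = line.startswith(("ATOM", "HETATM", "ANISOU"))
        if is_coord and line[26:27].strip():
            continue
        kept.append(line)
    return kept


# One atom each from residues 186, 186A, 186B and 187 of the excerpt above.
sample = [
    "ATOM   1259  N   PRO A 186       6.584  15.226  16.396  1.00 16.95           N  ",
    "ATOM   1266  N   ASP A 186A      6.812  18.768  16.413  1.00 16.88           N  ",
    "ATOM   1274  N   SER A 186B      5.742  18.999  18.983  1.00 16.28           N  ",
    "ATOM   1280  N   LYS A 187       5.518  19.323  22.546  1.00 14.62           N  ",
]

cleaned = drop_insertion_codes(sample)
print([line[17:27] for line in cleaned])
```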

  • When I saved the cleaned results, I found another alignment issue... (data copied from 1BTY.pdb)
ATOM      1  N   ILE A  16      35.700  19.589  20.234  1.00 10.94           N  
ATOM      2  CA  ILE A  16      35.550  20.497  19.066  1.00 10.97           C  
ATOM      3  C   ILE A  16      36.807  20.237  18.234  1.00  9.79           C  
ATOM      4  O   ILE A  16      37.894  20.256  18.772  1.00 10.26           O  
ATOM      5  CB  ILE A  16      35.544  21.989  19.514  1.00 11.47           C  
ATOM      6  CG1 ILE A  16      34.399  22.321  20.484  1.00 12.32           C  
ATOM      7  CG2 ILE A  16      35.560  22.968  18.278  1.00 12.30           C  
ATOM      8  CD1 ILE A  16      33.034  22.335  19.785  1.00 13.18           C  
ATOM      9  HA  ILE A  16      34.673  20.230  18.499  1.00 10.47           H  
ATOM     10  HB  ILE A  16      36.473  22.161  20.042  1.00 11.57           H  
ATOM     11 HG12 ILE A  16      34.396  21.655  21.334  1.00 11.90           H  
ATOM     12 HG13 ILE A  16      34.579  23.313  20.881  1.00 12.02           H  
ATOM     13 HG21 ILE A  16      34.717  22.818  17.621  1.00 11.98           H  
ATOM     14 HG22 ILE A  16      35.548  23.994  18.620  1.00 12.00           H  
ATOM     15 HG23 ILE A  16      36.462  22.839  17.694  1.00 11.56           H  
ATOM     16 HD11 ILE A  16      32.786  21.397  19.326  1.00 12.90           H  
ATOM     17 HD12 ILE A  16      32.266  22.577  20.509  1.00 12.70           H  
ATOM     18 HD13 ILE A  16      33.010  23.114  19.032  1.00 12.55           H
ATOM     19  N   VAL A  17      36.640  20.021  16.964  1.00 11.69           N  
ATOM     20  CA  VAL A  17      37.785  19.760  16.052  1.00 10.93           C  
ATOM     21  C   VAL A  17      37.896  21.020  15.170  1.00  9.18           C  
ATOM     22  O   VAL A  17      36.905  21.499  14.639  1.00 11.67           O  
ATOM     23  CB  VAL A  17      37.466  18.517  15.170  1.00 12.02           C  
ATOM     24  CG1 VAL A  17      38.603  18.296  14.156  1.00 11.39           C  
ATOM     25  CG2 VAL A  17      37.202  17.225  16.050  1.00 13.94           C  
ATOM     26  H   VAL A  17      35.748  20.036  16.564  1.00 11.21           H  
ATOM     27  HA  VAL A  17      38.694  19.634  16.621  1.00 10.27           H  
ATOM     28  HB  VAL A  17      36.577  18.735  14.593  1.00 11.73           H  
ATOM     29 HG11 VAL A  17      39.545  18.156  14.663  1.00 11.37           H  
ATOM     30 HG12 VAL A  17      38.402  17.438  13.536  1.00 11.87           H  
ATOM     31 HG13 VAL A  17      38.686  19.156  13.503  1.00 11.42           H  
ATOM     32 HG21 VAL A  17      38.046  16.986  16.679  1.00 12.88           H  
ATOM     33 HG22 VAL A  17      36.338  17.370  16.683  1.00 13.55           H  
ATOM     34 HG23 VAL A  17      36.989  16.368  15.427  1.00 13.73           H  
ATOM     35  N   GLY A  18      39.101  21.479  15.085  1.00 10.03           N  
ATOM     36  CA  GLY A  18      39.440  22.677  14.271  1.00 12.83           C  
ATOM     37  C   GLY A  18      38.928  24.015  14.824  1.00 14.65           C  
ATOM     38  O   GLY A  18      38.710  24.947  14.072  1.00 14.74           O  
ATOM     39  H   GLY A  18      39.816  21.025  15.573  1.00 10.61           H  
ATOM     40  HA2 GLY A  18      40.513  22.729  14.176  1.00 11.84           H  
ATOM     41  HA3 GLY A  18      39.023  22.532  13.283  1.00 11.64           H  

Here the four-character atom names "HG12", "HG13", "HG21", "HG22", "HG23", "HD11", "HD12", and "HD13" start one character further left than the atom names on the other lines.

Improvement or Debug (Jul. 31, 2017) Implemented two printing formats to deal with this issue.
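
The two layouts can be sketched as a single helper that picks the right one for the atom-name field (columns 13-16). This is an illustration, not the script's printing code: `format_atom_name` and `atom_prefix` are hypothetical names, and the rule that a two-letter element symbol also starts in column 13 follows the PDB format convention, as seen in the MG records of 1HQ2.pdb.

```python
def format_atom_name(name, element=""):
    """Return the 4-character atom-name field (PDB columns 13-16).

    Four-character names and names of two-letter elements start in
    column 13; all other names start in column 14.
    """
    name = name.strip()
    if len(name) == 4 or len(element.strip()) == 2:
        return name.ljust(4)
    return " " + name.ljust(3)


def atom_prefix(serial, name, res_name, chain, seq_num, element=""):
    """Build the first 26 columns of an ATOM record."""
    field = format_atom_name(name, element)
    return f"ATOM  {serial:>5} {field} {res_name:>3} {chain}{seq_num:>4}"


print(atom_prefix(2, "CA", "ILE", "A", 16))
print(atom_prefix(11, "HG12", "ILE", "A", 16))
```

Both printed prefixes line up with the corresponding lines of the 1BTY.pdb excerpt, with the residue name staying in the same columns.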

  • Usually, PDB files do not contain hydrogen atoms (possibly because hydrogen atoms are highly mobile, or because of the limited X-ray resolution). However, some PDB files, e.g. 5JRY.pdb, do record hydrogen atoms.
ATOM      1  N   MET A   1       3.164  22.103 135.939  1.00 28.43           N  
ANISOU    1  N   MET A   1     3558   4003   3241    245    418   -535       N  
ATOM      2  CA  MET A   1       3.182  20.676 135.533  1.00 27.33           C  
ANISOU    2  CA  MET A   1     3398   3863   3124    254    483   -543       C  
ATOM      3  C   MET A   1       3.889  20.519 134.187  1.00 26.06           C  
ANISOU    3  C   MET A   1     3199   3710   2993    137    508   -477       C  
ATOM      4  O   MET A   1       3.671  21.292 133.254  1.00 26.86           O  
ANISOU    4  O   MET A   1     3353   3787   3064    176    487   -391       O  
ATOM      5  CB  MET A   1       1.755  20.132 135.441  1.00 27.72           C  
ANISOU    5  CB  MET A   1     3472   3915   3143    245    541   -550       C  
ATOM      6  H   MET A   1       2.770  22.181 136.733  1.00 34.12           H  
ATOM      7  HA  MET A   1       3.659  20.162 136.203  1.00 32.80           H  
ATOM      8  N   LEU A   2       4.740  19.510 134.085  1.00 23.71           N  
ANISOU    8  N   LEU A   2     2783   3443   2781    -63    570   -535       N  
ATOM      9  CA  LEU A   2       5.389  19.232 132.817  1.00 21.69           C  
ANISOU    9  CA  LEU A   2     2418   3187   2638   -244    572   -551       C  
ATOM     10  C   LEU A   2       4.358  18.763 131.793  1.00 20.93           C  
ANISOU   10  C   LEU A   2     2144   3167   2643   -261    488   -592       C  
ATOM     11  O   LEU A   2       3.268  18.294 132.137  1.00 21.65           O  
ANISOU   11  O   LEU A   2     2148   3333   2746   -381    548   -733       O  
ATOM     12  CB  LEU A   2       6.449  18.148 133.007  1.00 20.95           C  
ANISOU   12  CB  LEU A   2     2388   3018   2555   -362    577   -509       C  
ATOM     13  CG  LEU A   2       7.526  18.465 134.041  1.00 20.97           C  
ANISOU   13  CG  LEU A   2     2473   2948   2547   -376    504   -471       C  
ATOM     14  CD1 LEU A   2       8.523  17.324 134.149  1.00 21.51           C  
ANISOU   14  CD1 LEU A   2     2624   2963   2586   -440    390   -438       C  
ATOM     15  CD2 LEU A   2       8.236  19.754 133.671  1.00 21.33           C  
ANISOU   15  CD2 LEU A   2     2560   2944   2601   -381    439   -472       C  
ATOM     16  H   LEU A   2       4.955  18.978 134.726  1.00 28.45           H  
ATOM     17  HA  LEU A   2       5.819  20.035 132.484  1.00 26.03           H  
ATOM     18  HB2 LEU A   2       6.007  17.331 133.287  1.00 25.14           H  
ATOM     19  HB3 LEU A   2       6.894  18.000 132.158  1.00 25.14           H  
ATOM     20  HG  LEU A   2       7.110  18.586 134.909  1.00 25.16           H  
ATOM     21 HD11 LEU A   2       9.193  17.553 134.812  1.00 25.81           H  
ATOM     22 HD12 LEU A   2       8.053  16.519 134.418  1.00 25.81           H  
ATOM     23 HD13 LEU A   2       8.943  17.189 133.285  1.00 25.81           H  
ATOM     24 HD21 LEU A   2       8.916  19.941 134.336  1.00 25.60           H  
ATOM     25 HD22 LEU A   2       8.646  19.648 132.798  1.00 25.60           H  
ATOM     26 HD23 LEU A   2       7.588  20.475 133.648  1.00 25.60           H  

Improvement (Nov. 16, 2017) For this case, I added a new option to the PDB_cleaner script that lets users choose whether or not to remove all hydrogen atoms.

Improvement (Dec. 13, 2017) Modified the script to use the 'Element' field (columns 77-78 of a PDB line, i.e. the Python slice 76:78) as the condition.

(Previously, I used pdb_info.ResName.str.startswith("H"), which is slower.)
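
A plain-Python sketch of the element-based filter, assuming the element column is filled in as it is in 5JRY.pdb (`drop_hydrogens` is a hypothetical name and this is not the script's pandas code):

```python
def drop_hydrogens(lines):
    """Drop coordinate records whose element symbol (slice 76:78) is H."""
    kept = []
    for line in lines:
        is_coord = line.startswith(("ATOM", "HETATM", "ANISOU"))
        if is_coord and line[76:78].strip() == "H":
            continue
        kept.append(line)
    return kept


# Four atoms of MET 1 from the 5JRY.pdb excerpt: N, CA, H and HA.
sample = [
    "ATOM      1  N   MET A   1       3.164  22.103 135.939  1.00 28.43           N  ",
    "ATOM      2  CA  MET A   1       3.182  20.676 135.533  1.00 27.33           C  ",
    "ATOM      6  H   MET A   1       2.770  22.181 136.733  1.00 34.12           H  ",
    "ATOM      7  HA  MET A   1       3.659  20.162 136.203  1.00 32.80           H  ",
]

heavy_atoms = drop_hydrogens(sample)
print([line[12:16].strip() for line in heavy_atoms])
```

Testing the element symbol avoids any guesswork based on the first letter of an atom or residue name.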

  • ligands/solvents

Some PDB files contain ligands/solvents (e.g. 1HQ2.pdb, which contains MG, CL, ACT, APC, PH2, HOH). This information is listed below the last TER line of the protein chains. Only part of it is shown here as an example.

TER    1288      TRP A 158                                                      
HETATM 1289 MG    MG A 161      -2.797   1.884  19.740  1.00  6.70          MG  
ANISOU 1289 MG    MG A 161     1097    971    478     15    338    -47      MG  
HETATM 1290 MG    MG A 162      -5.869   3.399  19.011  1.00  7.19          MG  
ANISOU 1290 MG    MG A 162      855   1269    610    165    324   -165      MG  
HETATM 1291 CL    CL A 163     -16.840 -10.191  19.213  1.00 15.49          CL  
ANISOU 1291 CL    CL A 163     2248   1922   1713    -61    398      8      CL  
HETATM 1292  C   ACT A 164      -6.064  -1.027  24.199  1.00 37.58           C  
ANISOU 1292  C   ACT A 164     7931   1868   4482    143  -3066   -290       C  
HETATM 1293  O   ACT A 164      -6.343  -1.714  23.182  1.00 14.54           O  
ANISOU 1293  O   ACT A 164     1249   1159   3116    -81    364     -9       O  
HETATM 1294  OXT ACT A 164      -6.052   0.230  24.235  1.00 21.94           O  
ANISOU 1294  OXT ACT A 164     2886   1823   3627     17   -396   -183       O  
HETATM 1295  CH3 ACT A 164      -5.715  -1.844  25.481  1.00 22.07           C  
ANISOU 1295  CH3 ACT A 164     2913   2265   3206   1466    -77   -427       C  
...
HETATM 1308  PG  APC A 171      -7.079   2.750  21.870  1.00  6.88           P  
ANISOU 1308  PG  APC A 171      911    790    915     71    401   -193       P  
HETATM 1309  O1G APC A 171      -6.616   1.344  22.152  1.00  9.60           O  
ANISOU 1309  O1G APC A 171      871    835   1940    100    305    -71       O  
HETATM 1310  O2G APC A 171      -8.226   3.192  22.715  1.00  6.03           O  
ANISOU 1310  O2G APC A 171      938    744    609     31    592     49       O  
HETATM 1311  O3G APC A 171      -7.294   3.042  20.400  1.00  7.08           O  
ANISOU 1311  O3G APC A 171      758   1522    408     29    337   -202       O  
HETATM 1312  PB  APC A 171      -4.370   3.764  21.854  1.00  6.31           P  
ANISOU 1312  PB  APC A 171      909   1010    479     80    206    -23       P  
HETATM 1313  O1B APC A 171      -4.334   3.108  20.508  1.00  6.48           O  
ANISOU 1313  O1B APC A 171     1088    961    411    197    230    187       O  
HETATM 1314  O2B APC A 171      -3.797   5.194  21.941  1.00  8.43           O  
ANISOU 1314  O2B APC A 171     1093    812   1296   -168    237    134       O  
HETATM 1315  O3B APC A 171      -5.859   3.729  22.378  1.00  6.62           O  
ANISOU 1315  O3B APC A 171      943    913    661     23    285    -98       O  
HETATM 1316  PA  APC A 171      -1.763   2.605  22.706  1.00  6.54           P  
ANISOU 1316  PA  APC A 171      934    931    619    -41    125    -54       P  
HETATM 1317  O1A APC A 171      -1.570   2.627  21.218  1.00  6.39           O  
ANISOU 1317  O1A APC A 171     1032    863    534   -110     85    176       O  
HETATM 1318  O2A APC A 171      -1.020   3.644  23.495  1.00  7.30           O  
ANISOU 1318  O2A APC A 171     1040    900    833   -144    100   -132       O  
HETATM 1319  C3A APC A 171      -3.506   2.689  22.980  1.00  6.45           C  
ANISOU 1319  C3A APC A 171     1048    908    494    -10     50   -250       C  
HETATM 1320  O5' APC A 171      -1.282   1.138  23.187  1.00  7.04           O  
ANISOU 1320  O5' APC A 171     1051    899    726     99    216    199       O  
HETATM 1321  C5' APC A 171      -1.316   0.838  24.562  1.00  6.44           C  
ANISOU 1321  C5' APC A 171     1106    788    551     17    188     29       C  
HETATM 1322  C4' APC A 171      -1.315  -0.661  24.737  1.00  6.29           C  
ANISOU 1322  C4' APC A 171     1058   1000    331    -50    125    -50       C  
HETATM 1323  O4' APC A 171      -2.428  -1.236  24.248  1.00  6.72           O  
ANISOU 1323  O4' APC A 171     1097    794    660   -168     83   -206       O  
HETATM 1324  C3' APC A 171      -0.144  -1.406  24.035  1.00  5.09           C  
ANISOU 1324  C3' APC A 171      840    635    461    114   -260    152       C  
HETATM 1325  O3' APC A 171       1.112  -1.294  24.673  1.00  8.26           O  
ANISOU 1325  O3' APC A 171      951   1232    954    -47   -388     51       O  
HETATM 1326  C2' APC A 171      -0.589  -2.790  23.968  1.00  6.71           C  
ANISOU 1326  C2' APC A 171      899    800    850   -297    285    216       C  
HETATM 1327  O2' APC A 171      -0.030  -3.702  24.840  1.00  7.41           O  
ANISOU 1327  O2' APC A 171     1086    962    766     50    -20    387       O  
HETATM 1328  C1' APC A 171      -2.056  -2.689  24.271  1.00  6.78           C  
ANISOU 1328  C1' APC A 171      966   1005    607     69    -51    -28       C  
HETATM 1329  N9  APC A 171      -3.025  -3.347  23.426  1.00  6.15           N  
ANISOU 1329  N9  APC A 171      878   1013    448    -50      4    190       N  
HETATM 1330  C8  APC A 171      -4.109  -4.250  23.860  1.00  5.38           C  
ANISOU 1330  C8  APC A 171      963    639    442    128    139   -256       C  
HETATM 1331  N7  APC A 171      -4.834  -4.707  22.964  1.00  6.53           N  
ANISOU 1331  N7  APC A 171     1035    948    498     64    114   -303       N  
HETATM 1332  C5  APC A 171      -4.263  -4.110  21.750  1.00  6.10           C  
ANISOU 1332  C5  APC A 171      827    920    570     95     98    -94       C  
HETATM 1333  C6  APC A 171      -4.711  -4.298  20.433  1.00  6.59           C  
ANISOU 1333  C6  APC A 171      929    875    698    -65    341   -235       C  
HETATM 1334  N6  APC A 171      -5.694  -5.022  20.065  1.00  5.61           N  
ANISOU 1334  N6  APC A 171     1039    448    646    -77    172     -3       N  
HETATM 1335  N1  APC A 171      -3.967  -3.599  19.483  1.00  5.96           N  
ANISOU 1335  N1  APC A 171      967    811    486    -27    349    -21       N  
HETATM 1336  C2  APC A 171      -2.985  -2.876  19.852  1.00  6.29           C  
ANISOU 1336  C2  APC A 171      770    937    683     62    -16    219       C  
HETATM 1337  N3  APC A 171      -2.475  -2.631  21.110  1.00  7.13           N  
ANISOU 1337  N3  APC A 171     1440    715    552    -17    159     88       N  
HETATM 1338  C4  APC A 171      -3.227  -3.343  22.105  1.00  5.85           C  
ANISOU 1338  C4  APC A 171      888    820    514     95    -85    145       C  
HETATM 1339  N1  PH2 A 181      -7.610   6.951  18.003  1.00  6.05           N  
ANISOU 1339  N1  PH2 A 181     1052    952    296    -94     78   -172       N  
HETATM 1340  C2  PH2 A 181      -7.491   7.106  19.276  1.00  6.24           C  
ANISOU 1340  C2  PH2 A 181      751   1399    220     93    132     13       C  
HETATM 1341  C3  PH2 A 181      -8.350   8.309  19.918  1.00 12.59           C  
ANISOU 1341  C3  PH2 A 181     2656   1496    632   1012   -233   -450       C  
HETATM 1342  N4  PH2 A 181      -9.107   9.037  19.073  1.00  8.47           N  
ANISOU 1342  N4  PH2 A 181     1999    603    614     77    272     74       N  
HETATM 1343  N5  PH2 A 181      -9.913   9.531  16.981  1.00  6.33           N  
ANISOU 1343  N5  PH2 A 181      859    710    836   -209    210    247       N  
HETATM 1344  C6  PH2 A 181     -10.042   9.367  15.609  1.00  6.17           C  
ANISOU 1344  C6  PH2 A 181     1209    401    734    -75    256    195       C  
HETATM 1345  N6  PH2 A 181     -10.760  10.085  14.925  1.00  7.94           N  
ANISOU 1345  N6  PH2 A 181     1288    739    992    607    231     -6       N  
HETATM 1346  N7  PH2 A 181      -9.294   8.335  15.116  1.00  6.20           N  
ANISOU 1346  N7  PH2 A 181     1142    360    855     36     21    218       N  
HETATM 1347  C8  PH2 A 181      -8.506   7.520  15.753  1.00  5.42           C  
ANISOU 1347  C8  PH2 A 181      958    690    411     48    430    -24       C  
HETATM 1348  O8  PH2 A 181      -7.860   6.607  15.236  1.00  6.00           O  
ANISOU 1348  O8  PH2 A 181      592   1089    600    260     34   -140       O  
HETATM 1349  C9  PH2 A 181      -8.445   7.773  17.143  1.00  6.49           C  
ANISOU 1349  C9  PH2 A 181      857   1146    461    217     -8    -87       C  
HETATM 1350  C10 PH2 A 181      -9.183   8.808  17.707  1.00  6.03           C  
ANISOU 1350  C10 PH2 A 181     1024    777    489     36    384      5       C  
HETATM 1351  C11 PH2 A 181      -6.690   6.344  20.210  1.00  7.41           C  
ANISOU 1351  C11 PH2 A 181     1201   1148    469    155   -176    -66       C  
HETATM 1352  O4  PH2 A 181      -5.798   5.476  19.528  1.00  8.32           O  
ANISOU 1352  O4  PH2 A 181     1571    989    601    322    507   -127       O  
HETATM 1353  O   HOH A 201      -1.216   0.760  18.974  1.00  7.10           O  
ANISOU 1353  O   HOH A 201     1048   1158    492    106     60    -57       O  
HETATM 1354  O   HOH A 202     -10.274 -11.677  16.318  1.00  6.62           O  
ANISOU 1354  O   HOH A 202      865    817    835    -50    295    115       O  
HETATM 1355  O   HOH A 203      -7.887   3.653  14.884  1.00  8.89           O  
ANISOU 1355  O   HOH A 203     1091   1226   1061   -110    356    -28       O  

Improvement (Dec. 13, 2017): PDB_cleaner now lists them in the final report.txt file.

Modules required

  • Numpy (version 1.9.1 or above)
  • Pandas (version 1.19.2 or above)

How to run this script

This script runs on both Linux and Windows. The command is shown below:

$ python pdb_cleaner.py

The program then asks you to specify the directory where the PDB files are located and how to handle multiple chains (keep all the chains or just one of them).

If you choose "one", the program will choose the longest chain in the PDB file (if all chains have the same length, the first chain will be kept).
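The "one" option can be illustrated with a minimal sketch. It is not the script's actual code: the function name `longest_chain` is hypothetical, and it assumes that chain length means the number of distinct residues found in `ATOM` records, read from the fixed columns of the PDB format (chain ID in column 22, residue number in columns 23-26, insertion code in column 27).

```python
from collections import OrderedDict


def longest_chain(pdb_lines):
    """Return the ID of the chain with the most residues (hypothetical helper).

    Assumption: "length" is the number of distinct residues in ATOM records.
    If several chains share the maximum length, the first one is returned.
    """
    residues = OrderedDict()  # chain ID -> set of (residue number, insertion code)
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]                 # column 22
        res_key = (line[22:26], line[26])   # columns 23-26 and column 27
        residues.setdefault(chain_id, set()).add(res_key)
    if not residues:
        return None
    # max() returns the first maximal item, so a tie keeps the earliest chain.
    return max(residues, key=lambda chain: len(residues[chain]))
```

Called on the lines of a file (for example `longest_chain(open("1lk2.pdb").readlines())`), it returns a single chain ID, which a cleaner could then use to filter the records it writes out.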

  • Workflow:
  1. Collect all the PDB files in the given directory;

  2. In each PDB file, check the following items:

    2.1. ligands;

    2.2. alternate locations;

    2.3. non-standard amino acid residues;

    2.4. negative sequence numbers (less important);

    2.5. sequence gaps;

    2.6. insertion codes;

    2.7. multiple chains;

    2.8. hydrogen atoms;

    2.9. **to do: missing atoms;**

    2.10. **to do: keep ligands/solvents or not. Currently, all ligands and solvents are removed.**

  3. Clean each PDB file if any of the above items are found, using the following options (the chain options apply when the protein has multiple chains);

    3.1. remove hydrogen atoms, if the user specified "y";

    3.2. keep all chains if the user specified "all";

    3.3. keep the longest chain (or the 1st chain, if all chains have the same length), if the user specified "one".

  4. Save the cleaned PDB files one by one;

  5. Save the summary report.
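
Several of the checks in step 2 can be sketched by reading the fixed columns of `ATOM` and `HETATM` records. The code below is only an illustration under stated assumptions, not the script's implementation: the function name `check_pdb`, the report keys, and the `STANDARD_RESIDUES` set are hypothetical. It also assumes that hydrogens are identified by the element field (columns 77-78) and that a sequence gap is a jump of more than one in the residue number within a chain. Telling modified residues such as MSE apart from true ligands among `HETATM` records needs extra logic that is not shown here.

```python
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def check_pdb(pdb_lines):
    """Scan PDB lines and summarize the items listed in step 2 (a sketch)."""
    report = {
        "hetero_groups": set(),         # residue names found in HETATM records
        "alternate_locations": False,
        "nonstandard_residues": set(),  # non-standard names in ATOM records
        "negative_numbers": False,
        "sequence_gaps": [],            # (chain, previous number, next number)
        "insertion_codes": False,
        "chains": [],                   # chain IDs in order of appearance
        "hydrogens": False,
    }
    last_seq = {}  # chain ID -> last residue number seen in ATOM records
    for raw in pdb_lines:
        record = raw[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        line = raw.rstrip("\n").ljust(80)   # pad short lines to full width
        res_name = line[17:20].strip()
        chain_id = line[21]
        res_seq = int(line[22:26])
        if record == "HETATM":
            report["hetero_groups"].add(res_name)
            continue
        if line[16] != " ":                 # altLoc, column 17
            report["alternate_locations"] = True
        if res_name not in STANDARD_RESIDUES:
            report["nonstandard_residues"].add(res_name)
        if res_seq < 0:
            report["negative_numbers"] = True
        if line[26] != " ":                 # iCode, column 27
            report["insertion_codes"] = True
        if chain_id not in report["chains"]:
            report["chains"].append(chain_id)
        if line[76:78].strip() == "H":      # element, columns 77-78
            report["hydrogens"] = True
        previous = last_seq.get(chain_id)
        if previous is not None and res_seq - previous > 1:
            report["sequence_gaps"].append((chain_id, previous, res_seq))
        last_seq[chain_id] = res_seq
    return report
```

A report of this shape could drive step 3: for instance, a file with more than one entry in `report["chains"]` is the case where the "all" or "one" option matters.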