Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GROMOS11 Reader #4292

Merged
merged 49 commits into from
Dec 17, 2023
Merged

GROMOS11 Reader #4292

merged 49 commits into from
Dec 17, 2023

Conversation

JoStoe
Copy link
Contributor

@JoStoe JoStoe commented Sep 18, 2023

Fixes #4291

Changes made in this Pull Request:

  • This PR implements a basic reader for GROMOS11 md-software trajectories.
    It can read timestep information, box-sizes and atom-coordinates.

You can find more information about the GROMOS software here: https://www.gromos.net/

Technical details about GROMOS data structures and formats can be found here:
https://gromos.net/gromos11_pdf_manuals/vol4.pdf
The relevant documentation of the GROMOS11 trajectory format, used in this PR, can be found in chapter 2 and 4 (page number 3 and 37-56).

The trajectory format organizes its data in blocks, which allows it to carry data depending on the task of the simulation. This PR implements the blocks TIMESTEP, GENBOX and POSITIONRED, which should cover the vast majority of trajectories generated with the GROMOS11 software. In a non-supported block is encountered, a warning is served to the user and the data in the block is ignored.

\Edit: Updated on 2023-11-16 to give more information about the trajectory format

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4292.org.readthedocs.build/en/4292/

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on the developer mailing list so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.

@github-actions
Copy link

github-actions bot commented Sep 18, 2023

Linter Bot Results:

Hi @JoStoe! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/7237918204/job/19718122962


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@JoStoe
Copy link
Contributor Author

JoStoe commented Sep 18, 2023

Please have a look at my comment in issue #3743 (comment)

The tests for this reader use assert_allclose() instead of assert_almost_equal()

@codecov
Copy link

codecov bot commented Sep 18, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (5eca713) 93.37% compared to head (3e666a8) 93.41%.
Report is 1 commits behind head on develop.

Files Patch % Lines
package/MDAnalysis/coordinates/TRC.py 98.24% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4292      +/-   ##
===========================================
+ Coverage    93.37%   93.41%   +0.04%     
===========================================
  Files          170      185      +15     
  Lines        22340    23626    +1286     
  Branches      4085     4129      +44     
===========================================
+ Hits         20859    22070    +1211     
- Misses         963     1036      +73     
- Partials       518      520       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@IAlibay
Copy link
Member

IAlibay commented Sep 18, 2023

Thanks for this @JoStoe - before we review, is there a format specification we can look at somewhere?

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution! I have some initial comments, the most important one being that we need to have a specification of the file format.

Please note that our User Group Meeting is coming up and we are all fairly busy until that's over, so please be patient with reviewers! Thanks.

package/MDAnalysis/coordinates/TRC.py Show resolved Hide resolved
package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
Reads coordinates, timesteps and box-sizes from GROMOS11 TRC trajectories.

To load the trajectory into :class:`~MDAnalysis.core.universe.Universe`,
you need to provide topology information using a pdb::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a pdb absolutely necessary or could it be any file that MDA can create a topology from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made no artificial limitation on which source can be used as topology. It is only important, that the order of atoms in the topology is strictly the same as in the trajectory file. As MDAnalysis does not support gromos *.cnf files as well as gromos native topologies, I think all users will use PDB files as topology.

I tested the functionality of the reader only with pdb files as topology, as other topology formats are difficult to obtain starting from a gromos simulation.

package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
@JoStoe
Copy link
Contributor Author

JoStoe commented Sep 19, 2023

Thank you for the contribution! I have some initial comments, the most important one being that we need to have a specification of the file format.

Please note that our User Group Meeting is coming up and we are all fairly busy until that's over, so please be patient with reviewers! Thanks.

Thank you for the fast initial review and your valued feedback.

Regarding the documentation, I will get into contact with more senior GROMOS developers to find proper specifications of the trajectory format. For now, I edited the initial PR to show a link to the technical documentation of GROMOS. If I don't find a more handy specification, I will write that on my own. Please note, that in this case I would need a positive sign-off from the GROMOS Maintainers, which may take a month.

Regarding the feedback on the code, I will work through the comments and prepare an update to the code.

@orbeckst orbeckst self-assigned this Dec 9, 2023
@JoStoe
Copy link
Contributor Author

JoStoe commented Dec 10, 2023

Thank you for putting in the effort to give feedback to this PR; I think we are making progress.

With the last few commits I hope to have addressed most of the change requests. Most should be marked as resolved; I kept the ones open where I wrote longer answers. I further hope that I did not miss any requests; if so please inform me so I can solve them.

Some points that might still be open:

  1. @hmacdope made some annotations about how the internal API should be used. I think it is reasonable how I did it, but I have to say it was quite some work to figure out how it (should) work. Please have a look again if there is any misuse of the API that I'm not aware of.
  2. I tried to add my files to the test_multiprocessing.py file for serialized reader testing as requested by @orbeckst . The additions lead to four additional tests, but I don't know if those changes are sufficient. Please check it out.
  3. When using the chain-reader with the continuous argument set to True, I receive the following warning: UserWarning: Missing frame in continuous chain . It is because the trajectories of the chain are not overlapping (this is intended). I think this is fine.
  4. Using the chain-reader, the reader gets very slow. The first few frames of each file are fine but frames from the end of the file are slow. It seems to me, that the reader (in chain-reader mode) is reading each frame separately calling _read_frame(i) for each frame. This is not the case when one single trajectory is read without use of the chain-reader. I think this is just how ChainReaders work and has nothing to do with this PR.
  5. This PR has quite a lot of commits. If it gets merged, is it possible to squash the commits? There were quite a lot of changes that are only distracting/uninteresting once this PR is merged.

Thank you :)

@JoStoe JoStoe marked this pull request as ready for review December 10, 2023 16:32
@orbeckst
Copy link
Member

Thanks! Re commit history: You can rewrite your history on this commit yourself if you think it fits in a handful (1-3 typically) well-defined commits. Otherwise we will squash it all into a single commit when we merge.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work @JoStoe , you addressed all my comments.

Please fix the merge conflict and I'll wait for any remaining comments from @hmacdope .

@orbeckst
Copy link
Member

Using the chain-reader, the reader gets very slow. The first few frames of each file are fine but frames from the end of the file are slow. It seems to me, that the reader (in chain-reader mode) is reading each frame separately calling _read_frame(i) for each frame. This is not the case when one single trajectory is read without use of the chain-reader. I think this is just how ChainReaders work and has nothing to do with this PR.

Can you put some code here that you used to observe/benchmark this behavior?

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments, but in general if this is working for you then this would be a great addition as-is. Things can always be iterated on.

if (ext[1:] == "trc") or (ext[1:] == "trj"):
self.compression = None
else:
self.compression = ext[1:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see where else this attribute is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not used but should have been the documented optional attribute compressed of the trajectory reader class that might be accessed by the user.
I corrected it from compression to compressed

The readers XYZ and H5MD also use the attribute compression. Might be worth to check out, if that is also a typo.

package/MDAnalysis/coordinates/TRC.py Show resolved Hide resolved
package/MDAnalysis/coordinates/TRC.py Outdated Show resolved Hide resolved
frameDat["time"] = float(tmp_time)

elif "POSITIONRED" == line.strip():
tmp_buf = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than creating a list, appending to it, then converting this to an array, I think we've found it's faster to create a np.array of the correct size, pass the strings in to this (numpy converts strings faster than Python). You'll just have to guard against setting more than n_atoms rows of the numpy array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested that out today and have seen different results (at least with the gromos11 file format).

When I pre-allocate a numpy array of size (n_atoms,3) then I have to cast strings to np.float64 every time an atom is added and I need to guard the range of the array to avoid a possible overflow-error. Both are very costly and in the end, it does not yield any performance improvement.

As it is, I fill up the list with strings and convert and check for length only once after reading. I ask to stick with the existing solution.

Overall, the reader is slow. This is likely to to the nature of the compressed, very flexible and human-readable gromos trajectories, which are costly to parse.

@orbeckst
Copy link
Member

@hmacdope could you please have a look at this PR in the next few days and see if any outstanding comments have been addressed?

@orbeckst
Copy link
Member

@JoStoe please have a careful look at @richardjgowers' comments, which all look like things that are easy to address to improve code robustness. Please also resolve conflicts (as this will put the PR into a stage where we can merge quickly once everybody is happy).

Do feel free to ping me if you're waiting on anything specific to happen.

@JoStoe
Copy link
Contributor Author

JoStoe commented Dec 16, 2023

@orbeckst I tried to address/answer to the recommendations in the comments above.

Regarding the merge conflicts, may I ask you to look through the pull requests if there are any that are likely to be merged before this PR? It seems to me there are some that might get merged very soon (#4366, #4368, #4370) which do changes to the AUTHORS and CHANGELOG files. Every time something is commited to these files, I get a new merge conflict to fix. If you think they take more time, I will resolve the merge conflict with this PR but I want to avoid doing it every other day.

@IAlibay
Copy link
Member

IAlibay commented Dec 16, 2023

@JoStoe - I can very much sympathise that merge conflicts are a lot of effort to keep on top of. However, in order for your code to get reviewed you need to ensure that CI can pass. If there's a direct merge conflict for any files you touched then CI will be blocked by it. As far as I'm aware, none of the PRs you point to are touching files you have touched except the changelog (which is a relatively low cost merge conflict, at worst we can deal with that one ourselves).

@hmacdope
Copy link
Member

@hmacdope could you please have a look at this PR in the next few days and see if any outstanding comments have been addressed?

The amazing @JoStoe has addressed all my comments, the only open question being conversion of trunc-oct boxes which we will probably address down the line. With tests passing this should be good to merge from my POV. Thanks for all your work @JoStoe!

Copy link
Member

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wahooo 🥳

@JoStoe
Copy link
Contributor Author

JoStoe commented Dec 17, 2023

Thank you for the good feedback :)

I still added my github handle to the CHANGELOG, I think this is usually done - correct me if I'm wrong.
Somewhere above I was asked on how to squash the commits - I think all changes together into 1 commit is fine for me.

@orbeckst orbeckst merged commit b808ef1 into MDAnalysis:develop Dec 17, 2023
23 checks passed
@orbeckst
Copy link
Member

It's merged, congratulations to your first merged PR, @JoStoe 🎉 , that was a sizable one! And thank you very much for your professional (and patient) interactions with everyone on the PR, with very diligent attention to detail. We hope to see more of you :-).

I'll also send you a "contributor" invite for the MDA org.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GROMOS11 trajectory reader
6 participants