Discuss Output format #2
Comments
We also have some open discussions here: https://github.com/choderalab/fah-projects/issues |
Love the whiteboard. |
The idea is that #1 generates the protein HDF5 files, which we then rsync to your desktop or cluster for analysis. |
My interest isn't so much in the automation side. I'm thinking about things like onboarding new students and standardizing protocols. For all of its downsides, I think the MSMBuilder2 "Project" format has been pretty useful. In particular, it helps me jump into debugging another student's MSM work, because I'm already pretty familiar with how their stuff is laid out. I like that most of the Mixtape API doesn't insist that you structure your files according to any particular layout, but I think that more opinionated conversion / munging code is good. |
So my pipeline can be converted into MSMB2 format with like 10 lines of Python. However, I think we can improve on MSMB2 in several ways:
|
I definitely see that linking the trajectory files and their output provenance has the advantage that it's harder to get them out of sync, but that's not the only concern. Putting the provenance inside the trajectory file
Have you looked at any of the more 'standardized' formats for expressing provenance information? Maybe we should be using JSON-LD or something, for example. |
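To make the JSON-LD suggestion concrete, here is a minimal sketch of what a provenance record for one trajectory might look like, using terms from the W3C PROV vocabulary. The `urn:example:` identifiers and the `dump_provenance` helper are hypothetical and only for illustration; nothing here reflects an agreed-upon schema for this project.

```python
import json

# Hypothetical provenance record for a single trajectory. Every identifier
# below is illustrative and not taken from a real FAH project.
record = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@id": "urn:example:trajectory/run0-clone0",
    "@type": "prov:Entity",
    "prov:wasGeneratedBy": {
        "@id": "urn:example:activity/convert-run0-clone0",
        "@type": "prov:Activity",
        # The raw data that the conversion step consumed.
        "prov:used": {"@id": "urn:example:raw/run0-clone0"},
    },
}


def dump_provenance(rec):
    """Serialize a provenance record to stable, human-readable JSON text."""
    return json.dumps(rec, indent=2, sort_keys=True)


text = dump_provenance(record)
```

Because the record is plain JSON, it could live in a sidecar file or be stored as a string inside the trajectory file, so the choice of vocabulary is independent of where the provenance is kept.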
Using HDF5 format is not a real barrier if it's treated as a "munging intermediate". If we're just interested in the protein coordinates, it's quite fast to convert to the output format of choice. So I'm not really sure we're "tied". I agree that more standardized provenance is desirable, but the HDF5 fields are a reasonable near-term solution. |
Another thing is that, for the purpose of statistical analysis, the FAH project layout can essentially be thought of as "a bunch of nested directories in a tree, the leaves of which each contain an MD trajectory which is saved in 'chunks' following some filename pattern". For classic FAH, the directories are RUN/CLONE, and the pattern is 'frame_.xtc'. For siegetank, you've got 'results-_.tar.bz2', but in general this stuff is not so different. |
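The "tree of directories whose leaves hold chunked trajectories" view can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `find_chunked_trajectories` is made up, the glob `frame*.xtc` is an assumed pattern rather than the exact one any project uses, and any directory containing matching files is treated as one trajectory.

```python
import fnmatch
import os
import re
import tempfile


def find_chunked_trajectories(root, pattern):
    """Map each directory under root that holds chunk files to those files.

    Chunks are ordered by the integers embedded in their names, so that
    frame10 sorts after frame9 rather than after frame1.
    """
    def chunk_key(name):
        return ([int(s) for s in re.findall(r"\d+", name)], name)

    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        chunks = sorted(fnmatch.filter(filenames, pattern), key=chunk_key)
        if chunks:
            result[os.path.relpath(dirpath, root)] = chunks
    return result


# Small demonstration on a throwaway RUN/CLONE tree with empty chunk files.
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "RUN0", "CLONE0")
    os.makedirs(leaf)
    for i in (0, 1, 2, 10):
        open(os.path.join(leaf, "frame%d.xtc" % i), "w").close()
    demo = find_chunked_trajectories(root, "frame*.xtc")
```

Only the pattern and the root directory change between layouts, which is the sense in which the classic FAH and siegetank cases are "not so different".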
Yeah, this is kind of the conversation I want to have. Like, what does the 'right' long-term solution look like? |
I kind of like the idea of cramming everything into an HDF5 file, including provenance information. It is enormously quick to slice and reslice to extract the bits you want, and shipping it as a single object makes the data easy to distribute and analyze. The biggest drawback seems to be that if the HDF5 file keeps growing, you can't easily |
(regarding the tree form) I agree, but IMHO there is considerable utility in constructing "continuous" trajectories for simplified visualization etc. It's quite useful to massage the data into the most "human meaningful" form. The right "long-term solution" should be brilliant and implemented by someone who's not me... |
Siegetank directories are not quite result-*.tar.bz2. Siegetank doesn't even try to auto-tar or auto-compress anything. Frames On Mon, Sep 15, 2014 at 9:33 PM, Robert McGibbon [email protected]
Yutong Zhao www.proteneer.com | simbios.stanford.edu |
FWIW, here's the script I'm using to convert an old FAH project into something I can analyze: |
It's nothing fancy, but using |
So this is fine too. I'm happy with either approach. Maybe we should just add both approaches to this github, test them out, pick a winner, fill it out with the additional functionality, and deprecate the loser? |
Continued from #1.
So I'm hesitant about the separate metadata files, as IMHO it's very important to keep the metadata attached to the coordinates throughout the pipeline. Ideally, the metadata-data attachment would be an "atomic" operation, in the sense that the two could never become separated. That's more effort than I can invest in this right now, but it could probably be done later via a context-manager-type approach.
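As a rough illustration of the context-manager idea, the sketch below only writes the output file if both coordinates and metadata were supplied, and it moves the file into place with a single rename. It is an assumption-laden stand-in: `atomic_bundle` is a hypothetical name, and JSON is used in place of HDF5 purely to keep the example self-contained.

```python
import contextlib
import json
import os
import tempfile


@contextlib.contextmanager
def atomic_bundle(final_path):
    """Yield a dict to fill in; write it to final_path in one atomic step.

    Nothing is written if the body raises or if either field is missing,
    so coordinates never reach disk without their metadata.
    """
    bundle = {"coordinates": None, "metadata": None}
    yield bundle
    if bundle["coordinates"] is None or bundle["metadata"] is None:
        raise ValueError("both coordinates and metadata are required")
    directory = os.path.dirname(os.path.abspath(final_path))
    # Write to a temporary file in the same directory, then rename it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(bundle, handle)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A caller would fill in `bundle["coordinates"]` and `bundle["metadata"]` inside a `with atomic_bundle(path) as bundle:` block; readers of `path` then see either the complete file or no file at all.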
If we can make this metadata an official part of the MDTraj format, I'd be happy to follow that route as well.
The current pull request separates the directory structure from the trajectory structure.
fah.py contains the tools for concatenating a single "CLONE", "stream", or "trajectory" object.
automation.py contains the tools for iterating over FAH projects.
Obviously one can engineer these things ad nauseam. Right now, I just want something that works, as we're generating 6 datasets and grabbing the coordinates from the bzips takes about 7 days.