Initial Draft of Automated Data Munging #1

Merged 13 commits on Sep 17, 2014

Conversation

kyleabeauchamp
Collaborator

  1. Convert bzipped XTC files to all-atom HDF5 files, with extra metadata recording the filenames that have already been processed
  2. Strip water from all-atom HDF5 files to create protein HDF5 files
  3. Do this periodically on all local FAH datasets
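
The incremental part of these steps can be sketched with the standard library alone. This is a minimal illustration, not the code in this pull request: the function names `bunzip` and `munge_new_files`, the `*.xtc.bz2` file pattern, and the in-memory `processed` set are all assumptions made for the example. The conversion to HDF5 and the water stripping are left as a comment.

```python
import bz2
import shutil
from pathlib import Path


def bunzip(src: Path, dst: Path) -> None:
    """Stream-decompress one .bz2 file to dst without loading it all into memory."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def munge_new_files(in_dir: Path, out_dir: Path, processed: set) -> list:
    """Decompress every *.xtc.bz2 in in_dir whose name is not yet in `processed`.

    Returns the names handled on this pass and adds them to `processed`,
    so a periodic job only pays the decompression cost for new files.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    newly_done = []
    for src in sorted(in_dir.glob("*.xtc.bz2")):
        if src.name in processed:
            continue  # already converted on an earlier pass
        dst = out_dir / src.name[: -len(".bz2")]
        bunzip(src, dst)
        # Hypothetical next steps: load dst, append it to the all-atom HDF5
        # file, then strip water to produce the protein-only HDF5 file.
        processed.add(src.name)
        newly_done.append(src.name)
    return newly_done
```

Calling `munge_new_files` repeatedly with the same `processed` set is what makes the job safe to run periodically: a second pass over an unchanged directory does no work.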

@jchodera
Member

What if we called this fah-tools? Or do you really want a separate repo for "munging"?

@kyleabeauchamp
Collaborator Author

Tools is pretty general; right now, the code is just for munging. We can change the name if the scope of the code expands in the future.

@jchodera
Member

This looks pretty good! The only thing I'd ask for is more documentation in the code about what the various "munging" steps do.

@schwancr

What about periodic image issues?

@kyleabeauchamp
Collaborator Author

I suppose we'll have to add that later, as AFAIK we don't have pbc whole implemented in MDTraj. I'll look into that.

Right now, the key issue is automating the bunzip, which currently takes nearly 15 seconds per WU and makes it nearly impossible to do meaningful real-time analysis / reporting...

@kyleabeauchamp
Collaborator Author

@schwancr I just adjusted the stripping function to keep the unitcell information in the protein HDF5, which should allow us to perform downstream PBC changes.
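
To illustrate why the unitcell information is needed downstream, here is a toy sketch of one kind of PBC fix: making a single chain whole in an orthorhombic box. It is not the gromacs or MDTraj algorithm. The function name `make_chain_whole` is hypothetical, and the sketch assumes that atoms are ordered along the chain and that consecutive atoms are closer than half a box length.

```python
def make_chain_whole(coords, box):
    """Shift each atom to the periodic image nearest to the previous atom.

    coords: list of (x, y, z) tuples for one molecule, ordered along the chain
    box:    (Lx, Ly, Lz) edge lengths of an orthorhombic unit cell
    """
    whole = [tuple(coords[0])]
    for atom in coords[1:]:
        prev = whole[-1]
        # Minimum-image convention applied per axis: subtract the whole
        # number of box lengths separating this atom from its predecessor.
        fixed = tuple(
            x - length * round((x - p) / length)
            for x, p, length in zip(atom, prev, box)
        )
        whole.append(fixed)
    return whole
```

Without the box lengths stored alongside the stripped coordinates, the shift in the inner expression cannot be computed at all, which is the reason for carrying the unitcell through the stripping step.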

@schwancr

Yea that sounds like a good idea. Ideally mdtraj will be able to do this in the future, though it's not trivial to implement.

@rmcgibbo

Has anyone looked at the PBC-whole code in gromacs or ambertools? It might actually not be that complex.

-Robert

On Mon, Sep 15, 2014 at 12:15 PM, Christian Schwantes wrote:

> Yea that sounds like a good idea. Ideally mdtraj will be able to do this in the future, though it's not trivial to implement.


Reply to this email directly or view it on GitHub
#1 (comment).

@schwancr

But it doesn't work that well. Their (gromacs) recipe for doing it involves several calls of the same command-line script, and even then they admit it doesn't work in all cases.

@rmcgibbo

@kyleabeauchamp: what's the appropriate forum to discuss the provenance metadata storage (e.g. processed_filenames), and the directory structure we want to encourage for FAH projects and mixtape?

I'm not sure that storing extra attributes on the HDF5 files is the best way to go -- if we really want to do that, we should consider simply adding that field to the MDTraj HDF5 format spec. We could also do something more akin to the MSMBuilder 2 design, where a separate metadata file is stored which contains the provenance info. It might be nice, also, not to irreversibly tie this data munging step to the use of HDF5 files for the output.
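
As a concrete picture of the separate-metadata-file option, here is a minimal sketch that keeps the provenance list in a JSON sidecar rather than as attributes on the HDF5 file. The helper names `load_processed` and `save_processed` and the JSON layout are assumptions for illustration; only the `processed_filenames` key comes from the discussion above, and this is not the MSMBuilder 2 format.

```python
import json
from pathlib import Path


def load_processed(meta_path: Path) -> set:
    """Return the set of already-processed filenames, or an empty set if none recorded."""
    if not meta_path.exists():
        return set()
    with open(meta_path) as f:
        return set(json.load(f)["processed_filenames"])


def save_processed(meta_path: Path, processed: set) -> None:
    """Write the provenance list next to the trajectory, independent of its format."""
    tmp = meta_path.with_suffix(meta_path.suffix + ".tmp")
    with open(tmp, "w") as f:
        json.dump({"processed_filenames": sorted(processed)}, f, indent=2)
    # Rename over the old file so readers never see a half-written sidecar.
    tmp.replace(meta_path)
```

Because the sidecar is plain JSON, the munging step would not be tied to HDF5 output; the same file could sit beside any trajectory format.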

It would be helpful to get to some consensus on these design choices, especially as we start pushing mixtape for end users.

@kyleabeauchamp
Collaborator Author

This is working well enough for now; we will discuss future iterations in issue #2.

kyleabeauchamp added a commit that referenced this pull request Sep 17, 2014
Initial Draft of Automated Data Munging
@kyleabeauchamp kyleabeauchamp merged commit 512bdbe into master Sep 17, 2014
steven-albanese added a commit that referenced this pull request Nov 20, 2015
Merge pull request #12 from steven-albanese/master

4 participants