Initial Draft of Automated Data Munging #1

kyleabeauchamp · 2014-09-12T18:31:25Z

Convert bzipped XTC files to all-atom HDF5 files, with extra metadata listing the filenames already processed
Strip water from all-atom HDF5 files to create protein HDF5 files
Do this periodically on all local FAH datasets
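
As a rough illustration of the "process only what is new" bookkeeping described above, here is a minimal standard-library sketch. It covers only the bunzip step and the set of already-processed filenames; the XTC-to-HDF5 conversion and water stripping are left out. The names `decompress_bz2` and `munge_once` and the `*.xtc.bz2` glob pattern are assumptions made for this example, not the code in this pull request.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: Path, dst: Path) -> None:
    """Stream-decompress a .bz2 file into dst without reading it all into memory."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def munge_once(in_dir: Path, out_dir: Path, processed: set) -> list:
    """Decompress every new *.xtc.bz2 in in_dir; return the names handled this pass.

    `processed` is the (hypothetical) set of filenames handled on earlier
    passes; it is updated in place so repeated calls skip old files.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    new_names = []
    for src in sorted(in_dir.glob("*.xtc.bz2")):
        if src.name in processed:
            continue  # already handled on an earlier pass
        # Drop the trailing ".bz2" to get the output filename.
        decompress_bz2(src, out_dir / src.name[: -len(".bz2")])
        processed.add(src.name)
        new_names.append(src.name)
    return new_names
```

Running this periodically would amount to calling `munge_once` in a loop with a sleep between passes; persisting `processed` between runs is a separate design question.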

jchodera · 2014-09-12T22:04:03Z

What if we called this fah-tools? Or do you really want a separate repo
for "munging"?

kyleabeauchamp · 2014-09-12T22:08:05Z

Tools is pretty general; right now, the code is just for munging. We can change the name if the scope of the code expands in the future.

jchodera · 2014-09-15T03:56:11Z

This looks pretty good! The only thing I'd ask for is more documentation in the code about what the various "munging" steps do.

schwancr · 2014-09-15T18:27:43Z

What about periodic image issues?

kyleabeauchamp · 2014-09-15T18:53:45Z

I suppose we'll have to add that later, as AFAIK we don't have pbc whole implemented in MDTraj. I'll look into that.

Right now, the key issue is automating the bunzip, which currently takes nearly 15 seconds per WU and makes it nearly impossible to do meaningful real-time analysis / reporting...

kyleabeauchamp · 2014-09-15T19:08:28Z

@schwancr I just adjusted the stripping function to keep the unitcell information in the protein HDF5, which should allow us to perform downstream PBC changes.
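
To make the idea concrete, the sketch below strips water atoms from a single frame while carrying the unit cell over unchanged. It uses a hypothetical `Frame` dataclass and an assumed set of water residue names purely for illustration; it is not the MDTraj API or the stripping function in this pull request.

```python
from dataclasses import dataclass

# Assumed residue names for water; real topologies may use others.
WATER_NAMES = {"HOH", "WAT", "SOL"}


@dataclass
class Frame:
    resnames: list   # residue name for each atom
    xyz: list        # (x, y, z) coordinates for each atom
    unitcell: tuple  # box lengths (a, b, c)


def strip_water(frame: Frame) -> Frame:
    """Return a copy of the frame without water atoms, keeping the unit cell."""
    keep = [i for i, name in enumerate(frame.resnames) if name not in WATER_NAMES]
    return Frame(
        resnames=[frame.resnames[i] for i in keep],
        xyz=[frame.xyz[i] for i in keep],
        unitcell=frame.unitcell,  # carried over so PBC fixes remain possible later
    )
```

Because the box dimensions travel with the stripped coordinates, any later periodic-image correction can still be applied to the protein-only data.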

schwancr · 2014-09-15T19:15:39Z

Yea that sounds like a good idea. Ideally mdtraj will be able to do this in the future, though it's not trivial to implement.

rmcgibbo · 2014-09-15T19:28:58Z

Has anyone looked at the PBC-whole code in gromacs or ambertools? It might
actually not be that complex.

-Robert


schwancr · 2014-09-15T19:33:12Z

But it doesn't work that well. Their (gromacs) recipe for doing it involves several calls of the same command-line script, and even then they admit it doesn't work in all cases.
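
For readers unfamiliar with the "PBC whole" problem, the simplest version can be sketched as a walk over the bond graph that moves each atom to the periodic image nearest its bonded neighbor. The function below is an illustrative sketch only: it assumes an orthorhombic box and bonds shorter than half the box length, and the name `make_whole` is hypothetical. It is not the gromacs or ambertools algorithm, and it does not handle the harder cases mentioned above, such as complexes whose parts are not connected by bonds.

```python
from collections import deque


def make_whole(xyz, bonds, box):
    """Reassemble bonded molecules split across the edges of an orthorhombic box.

    xyz   : list of (x, y, z) coordinates
    bonds : list of (i, j) atom-index pairs
    box   : (a, b, c) box lengths
    """
    n = len(xyz)
    neighbors = [[] for _ in range(n)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    out = [list(p) for p in xyz]
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first walk over one bonded molecule, anchored at `start`.
        seen[start] = True
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in neighbors[i]:
                if seen[j]:
                    continue
                # Shift atom j to the periodic image closest to atom i.
                for d in range(3):
                    delta = out[j][d] - out[i][d]
                    out[j][d] -= box[d] * round(delta / box[d])
                seen[j] = True
                queue.append(j)
    return out
```

Atoms with no bonds are left where they are, which is one reason a bond walk alone cannot fix every system.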

rmcgibbo · 2014-09-16T00:51:57Z

@kyleabeauchamp: what's the appropriate forum to discuss the provenance metadata storage (e.g. processed_filenames), and the directory structure we want to encourage for FAH projects and mixtape?

I'm not sure that storing extra attributes on the HDF5 files is the best way to go -- if we really want to do that, we should consider simply adding that field to the MDTraj HDF5 format spec. We could also do something more akin to the MSMBuilder 2 design, where a separate metadata file is stored which contains the provenance info. It might be nice, also, not to irreversibly tie this data munging step to the use of HDF5 files for the output.

It would be helpful to get to some consensus on these design choices, especially as we start pushing mixtape for end users.
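
One way to picture the separate-metadata-file option raised above is a small JSON sidecar stored next to each output file, which keeps the provenance record independent of the output format. The sketch below is an assumption-laden illustration: the `.provenance.json` suffix, the `processed_filenames` key layout, and the helper names are all hypothetical, and this is not the MSMBuilder 2 format.

```python
import json
from pathlib import Path


def sidecar_path(data_path: Path) -> Path:
    """Path of the (hypothetical) provenance file stored beside data_path."""
    return data_path.with_name(data_path.name + ".provenance.json")


def load_processed(data_path: Path) -> list:
    """Return the recorded source filenames, or an empty list if none exist yet."""
    path = sidecar_path(data_path)
    if not path.exists():
        return []
    return json.loads(path.read_text())["processed_filenames"]


def record_processed(data_path: Path, filenames) -> list:
    """Merge new source filenames into the sidecar and return the full sorted list."""
    merged = sorted(set(load_processed(data_path)) | set(filenames))
    sidecar_path(data_path).write_text(
        json.dumps({"processed_filenames": merged}, indent=2)
    )
    return merged
```

Since the sidecar never touches the trajectory file itself, the same bookkeeping would work whether the output is HDF5 or some other format.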

kyleabeauchamp · 2014-09-17T13:43:15Z

This is working well enough for now; we will discuss future iterations in issue #2.


kyleabeauchamp added 11 commits September 10, 2014 14:41

f745fa2  Added initial checkin of automation script
7f6c3d2  Updated
6f612f6  munge
aa80204  typo
a0dbe6e  Fixes
d44187c  Updates
52d9ed9  Change output data structure to support faster rsync
20aebb7  Overhaul of water striping.
409bebd  Print message when sleeping.
bb13e5e  Docstrings.
5cd169d  Updated munged directories to new names.
1cbbfbe  Added some docs to readme.md
99b6a3c  Added unitcells to stripped HDF5 file.

kyleabeauchamp mentioned this pull request Sep 16, 2014

Discuss Output format #2

Open

kyleabeauchamp added a commit that referenced this pull request Sep 17, 2014

Merge pull request #1 from FoldingAtHome/automation

512bdbe

Initial Draft of Automated Data Munging

kyleabeauchamp merged commit 512bdbe into master Sep 17, 2014

steven-albanese added a commit that referenced this pull request Nov 20, 2015

Merge pull request #1 from choderalab/master

eb2e3bc

Merge pull request #12 from steven-albanese/master
