
job.restart copies OUTPUT of parent job?? #1575

Open
freyso opened this issue May 16, 2024 · 9 comments
freyso (Contributor) commented May 16, 2024

I ran into a nasty problem with "job.restart" for SPHInX, but I suspect the root issue lies in pyiron_base.

I used job.restart to create a follow-up job of a successful calculation (changing the charge of the system). Then I used new_job.get_electronic_structure() to get the eigenvalues, and they were identical to the PREVIOUS calculation, even though the raw output of SPHInX shows different numbers.

After quite a bit of debugging, I figured out what happened: apparently, the entire OUTPUT of the parent job is copied to the restart. Then (as a bug or feature?), the sphinx eps.dat parser puts the newly parsed values into output.generic.dft only if there aren't any already (this was introduced by @samwaseda last November). Thus, the eigenvalues never get updated to the new ones.
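To make the mechanism concrete, here is a toy illustration (not the actual pyiron code) of the guard pattern described above: a parser that only stores values when the key is absent silently keeps stale data copied over from a parent job.

```python
def collect_eps(output, new_eigenvalues):
    # Writes eigenvalues only if none are present yet -- the pattern that
    # keeps the parent's copied values in place after a restart.
    if "eigenvalues" not in output:
        output["eigenvalues"] = new_eigenvalues
    return output

parent_output = {"eigenvalues": [1.0, 2.0]}   # output copied into the restart
restart_output = dict(parent_output)           # restart inherits the output
collect_eps(restart_output, [3.0, 4.0])        # freshly parsed values are ignored
print(restart_output["eigenvalues"])           # -> [1.0, 2.0], still the parent's
```

With an empty output dict the same call would store the new values, which is why the bug only shows up in combination with the copied parent output.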

You may argue in many ways why I am stupid, or why I shouldn't use restart for this, or that collect_eps_dat should simply update existing values. But for me the core question is this: Is there any good reason to copy the output of the parent in the first place? If the restarted job does not populate the same fields in the output as its parent (for whatever reason, and I could give you reasonable ones), the restarted output will contain junk from its parent.

You can quickly reproduce this behavior: just call a restart from any job and inspect its output before running the job. Common sense says that a job that has not run cannot have any output. pyiron behaves differently.
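A minimal sketch of the observed behavior (toy classes, not pyiron's API): if restart is implemented by copying the whole job object, state included, the "new" job carries output before it has ever run.

```python
import copy

class ToyJob:
    """Stand-in for a pyiron job: holds input and output containers."""
    def __init__(self):
        self.input = {}
        self.output = {}

    def restart(self):
        # copy-based restart, as described in the issue: output comes along
        return copy.deepcopy(self)

job = ToyJob()
job.input["charge"] = 0
job.output["generic"] = {"eigenvalues": [1.0, 2.0]}  # job has run

new_job = job.restart()
print(new_job.output)  # non-empty, although new_job never ran
```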

samwaseda (Member) commented

I cannot immediately say whether it's a feature or not. It does indeed seem somewhat improper to me that the output is also copied, but I presume it's not an easy fix: inputs and outputs are not unified, and if we base restart on copy, we would need to figure out how to erase data that shouldn't be in the restart job.

On the other hand, I don't really know why we shouldn't overwrite the data, so I opened a PR with a fix here.

freyso (Contributor, author) commented May 16, 2024

> but I presume it's not an easy fix, because inputs and outputs are not unified, and if we base restart on copy, then we would need to somehow figure out how to erase data that shouldn't be in the restart job.

If we are able to remove stuff from hdf5 by modifying 'output' (does that work?), we could do this at any level: GenericJob/GenericAtomisticJob/GenericDFT/Sphinx; at least the atomistic job knows that all output is in 'output'.

If modifying hdf5 is hard, maybe not copy it at all upon restart for atomistic/dft/sphinx jobs? What's the rationale behind it? It might mean moving the hdf5 copying upon restart into those job classes that rely on it.

pmrv (Contributor) commented May 16, 2024

I seem to recall there's an 'input_only' flag somewhere, but I'd need to hunt for it.

freyso (Contributor, author) commented Aug 22, 2024

OK, @skatnagallu Shyam ran into the same problem, so I did some further digging into the details. (The depth of the object hierarchy is a pain.)
@pmrv was entirely right: there is an 'input_only' flag, and it's being used in
pyiron_base.jobs.job.generic.GenericJob.copy_to
This cleans up the new hdf5 file from all 'output', and I verified that the file on disk is ok. However, it beforehand creates a copy of the in-memory object in
pyiron_base.jobs.job.core.JobCore._internal_copy_to via self.copy(), which apparently resolves to
pyiron_base.jobs.job.generic.GenericJob.copy
which (if I understand correctly) saves the parent object to disk (hdf5) and then loads the new object from the parent's hdf5 file (or an identical copy?). In other words, the in-memory object inherits the output of the parent via this reloading. I assume that when the newly created in-memory object is run, it is dumped from memory to disk again, including its inherited output, thereby overwriting the cleaned-up hdf5 file.
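The flow described above can be modeled with a toy job (JSON standing in for hdf5): copy_to() strips the output from the on-disk file, but the in-memory copy still carries it, and the next dump overwrites the cleaned file.

```python
import copy
import json
import os
import tempfile

class ToyJob:
    """Toy model: self.data stands in for the hdf5 content."""
    def __init__(self, data):
        self.data = data

    def copy_to(self, path, input_only=False):
        new = copy.deepcopy(self)          # in-memory copy keeps 'output'
        on_disk = dict(new.data)
        if input_only:
            on_disk.pop("output", None)    # the file on disk is cleaned up ...
        with open(path, "w") as f:
            json.dump(on_disk, f)
        return new                         # ... but the returned object is not

    def run(self, path):
        with open(path, "w") as f:
            json.dump(self.data, f)        # dump overwrites the cleaned file

parent = ToyJob({"input": {"charge": 1}, "output": {"eps": [1.0]}})
path = os.path.join(tempfile.mkdtemp(), "job.json")
child = parent.copy_to(path, input_only=True)
child.run(path)
with open(path) as f:
    print(json.load(f))  # 'output' is back, inherited from the parent
```

Right after copy_to() the file contains only the input; after run() the parent's output has reappeared, matching the behavior described in the comment.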

Allow me to find this entire copying procedure entangled and weird.

And I am not sure in which of the many classes it should be fixed. The 'output' of a job created from scratch is a job-type-specific, non-empty output class created somewhere in the __init__ of the corresponding class, so the pyiron_base classes will not be able to renew it without a hook-in. The copy function itself (even if overloaded later) has no argument to indicate that the output should not be copied.

A possible solution would be to create a clear_output callback and call it from within copy_to when the input_only flag is set, right after the in-memory object has been created. clear_output could then be defined in the class that sets up the output class for fresh jobs.
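The proposed hook could look roughly like this (toy classes with hypothetical names; only the wiring is the point, not the actual pyiron signatures):

```python
import copy

class GenericToyJob:
    def __init__(self):
        self.input = {}
        self.output = {}

    def clear_output(self):
        # default: nothing to do; job classes that build a non-empty
        # output object in __init__ override this
        pass

    def copy_to(self, input_only=False):
        new = copy.deepcopy(self)
        if input_only:
            new.clear_output()  # hook called right after the copy is made
        return new

class SphinxToyJob(GenericToyJob):
    def clear_output(self):
        # reset to the same "fresh state" that __init__ would create
        self.output = {"generic": {"dft": {}}}

job = SphinxToyJob()
job.output = {"generic": {"dft": {"eigenvalues": [1.0]}}}
restart = job.copy_to(input_only=True)
print(restart.output)  # {'generic': {'dft': {}}}
```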

Comments/thoughts are welcome.

freyso (Contributor, author) commented Aug 26, 2024

Conclusion from discussion over lunch/in pyiron Q&A:
In the long run, this should be fixed together with the "new way of copying" via dictionaries that is currently being planned.

As a short-term solution, this can be fixed at the level of the sphinx job as follows:

1. Create a SphinxOutput.clear function that resets the output to its "fresh state" (as in __init__):

   ```python
   self.generic = DataContainer(table_name="output/generic")
   self.charge_density = SphinxVolumetricData()
   self.electrostatic_potential = SphinxVolumetricData()
   self.generic.create_group("dft")
   ```

   Not sure about the _job parameter (a weird construct anyway: it creates a cyclic reference chain, job.output._job -> job).

2. Call that new_job.output.clear() function in the restart function of the sphinx job class.
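Put together, the two steps might look like this sketch; DataContainer and SphinxVolumetricData are pyiron classes, stubbed here so the example is self-contained.

```python
class DataContainer(dict):
    """Minimal stub of pyiron's DataContainer."""
    def __init__(self, table_name=None):
        super().__init__()
        self.table_name = table_name

    def create_group(self, name):
        self[name] = DataContainer()
        return self[name]

class SphinxVolumetricData:
    """Minimal stub."""

class SphinxOutput:
    def __init__(self):
        self.clear()  # __init__ and clear share the same "fresh state"

    def clear(self):
        # step 1: reset the output exactly as __init__ would set it up
        self.generic = DataContainer(table_name="output/generic")
        self.charge_density = SphinxVolumetricData()
        self.electrostatic_potential = SphinxVolumetricData()
        self.generic.create_group("dft")

output = SphinxOutput()
output.generic["dft"]["eigenvalues"] = [1.0, 2.0]  # stale data from a parent
output.clear()                                      # step 2: called in restart()
print(output.generic["dft"])                        # empty again
```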

@skatnagallu will take care of it.

pmrv (Contributor) commented Aug 26, 2024

I had a brief look at this just now. A good plan to implement the above is probably to modify and overload the _after_generic_copy_to method: add a parameter input_only there (in pyiron_base), then overload the method in the sphinx job to initialize the object as @freyso wrote above.

freyso (Contributor, author) commented Aug 26, 2024

> I had a brief look at this just now. A good plan to implement the above is probably modifying and overloading the _after_generic_copy_to method. Add here also a parameter input_only (in pyiron_base), then overload this in the sphinx job to initialize the object as @freyso wrote above.

@pmrv: I politely disagree. The strategy suggested in today's meeting is to make NO changes to pyiron_base at this stage, because that would interfere with @jan-janssen's plans to clean up. Thus, we fix the sphinx-specific issue inside the sphinx classes as a temporary solution and leave the generic solution to Jan. Jan said he now has the copy-without-output issue on his radar.

If I understood @jan-janssen's (current) plans correctly, the future strategy would be to convert the parent pyiron object to a plain dictionary, which can be manipulated to implement the input_only functionality, and then to create the new object from the dictionary.
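A rough sketch of that dict-based strategy (the interface is my assumption, not Jan's actual design): serialize the job to a plain dict, drop what a restart should not inherit, and rebuild from the dict.

```python
import copy

class ToyJob:
    def __init__(self):
        self.input = {}
        self.output = {}

    def to_dict(self):
        return {"input": copy.deepcopy(self.input),
                "output": copy.deepcopy(self.output)}

    @classmethod
    def from_dict(cls, data):
        job = cls()
        job.input = data.get("input", {})
        job.output = data.get("output", {})
        return job

def copy_job(job, input_only=False):
    data = job.to_dict()
    if input_only:
        data.pop("output", None)  # single place where the output is stripped
    return ToyJob.from_dict(data)

parent = ToyJob()
parent.input["charge"] = 1
parent.output["eps"] = [1.0]
restart = copy_job(parent, input_only=True)
print(restart.output)  # {}
```

Because there is only one representation (the dict) between the two objects, there is no second copy that could smuggle the output back in.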

The current mess comes from the fact that there are two competing representations in the current copying, namely the hdf file and the object in memory, and for SPHInX the wrong one was manipulated. pyiron_base cannot tell which of the two representations is relevant (for VASP output, for instance, it is the hdf file). Augmenting the parameter list of the post-copy hook would add to the mess, while Jan's solution is probably more straightforward. Passing parameters around across function calls is a strong sign of a design flaw.

Shyam was told to add a very BOLD comment about his SPHInX changes being a temporary solution until the "copy issue" is fixed.

pmrv (Contributor) commented Aug 28, 2024

> Shyam was told to make a very BOLD comment about his SPHInX changes being a temporary solution until the "copy-issue" is fixed.

Ah, I just missed that another plan was communicated to him in the pyiron meeting already. All good then.

jan-janssen (Member) commented

@pmrv Yes, this was partially discussed in the pyiron FAQ meeting and on the way to lunch between Christoph and me. Still, I agree that this implicit communication is hard to follow. @skatnagallu Could you give a quick summary of your changes and our future strategy in the next pyiron meeting? You could simply add it to the agenda https://github.com/orgs/pyiron/discussions/243 . This way we make sure it is at least communicated to the pyiron developers.

@jan-janssen jan-janssen transferred this issue from pyiron/pyiron_base Sep 29, 2024