
Add transfer calculation for density restart #706

Open · wants to merge 2 commits into main from transferdens
Conversation

ramirezfranciscof
Member

(This is a required addition for the LUMI pilot)

This is a workflow that runs a TransferCalcJob which copies the specific files necessary for restarting QE calculations from a source_data input. The exact behavior depends on the type of this input and on some extra parameters:

  • If this input is of type RemoteData, the workflow will create a FolderData output that stores the files locally (without requiring any other input).
  • If the input is of type FolderData, it also needs to be provided with a computer (remote_computer) or a code (nscf.pw.code). The computer just lets the workflow know on which computer to create the RemoteData into which the contents of source_data are copied; if a code is provided instead, the workflow will also run an NSCF calculation to regenerate the wavefunctions and other files necessary to then run normal PW calculations (at least until we implement the startingpot option, see PwBaseWorkChain: revisit restarting from calculations #691).

So far, compatibility works in the sense that all inputs are accepted even if they are redundant, as long as they are coherent (i.e. if you provide a RemoteData and a remote_computer, the data node should point to that same computer; likewise if you provide a remote_computer and an nscf.pw.code). I think the only exception is if the nscf input is provided together with a RemoteData input, in which case the nscf will be ignored (and a warning will be issued).
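To make the two modes concrete, here is a minimal usage sketch; the workchain class name (PwTransferWorkChain) and the node pks are placeholders, only the input names (source_data, remote_computer, nscf.pw.code) come from the description above:

```python
from aiida import orm
from aiida.engine import submit

# Case 1: source_data is a RemoteData -> retrieve the restart files into a FolderData output.
inputs_retrieve = {
    'source_data': orm.load_node(1234),  # RemoteData of a previous PwCalculation (placeholder pk)
}

# Case 2: source_data is a FolderData -> put it on a computer and rebuild the wavefunctions.
inputs_restore = {
    'source_data': orm.load_node(5678),  # FolderData holding the stored charge density (placeholder pk)
    'nscf': {'pw': {'code': orm.load_code('pw@cluster')}},  # providing a code also triggers the NSCF run
    # alternatively, pass only `remote_computer` to just place the files on that computer
}

# submit(PwTransferWorkChain, **inputs_retrieve)  # placeholder class name
```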

@ramirezfranciscof
Member Author

@mbercx I will now start working on adding some tests, so let me know if there is something important I should be testing for so that I don't miss it. I may also add a docs section on how to use it after that. You can also give it a first pass and check whether the workchain follows the general guidelines; I tried to make it as similar as possible to the existing ones (the only thing missing is the yaml for the protocol, which I haven't had time to investigate, but since it shouldn't affect the interface I'm hoping it is not critical right now).

@ramirezfranciscof
Member Author

ramirezfranciscof commented Jul 21, 2021

@mbercx erm... any idea why all the tests are failing like that? I didn't modify anything that should make the old tests fail...

@sphuber
Contributor

sphuber commented Jul 21, 2021

@ramirezfranciscof it's because of a broken release of psycopg that has already been worked around in aiida-core, but the fix hasn't been released yet, I think.

@ramirezfranciscof
Member Author

ramirezfranciscof commented Jul 21, 2021

Hey @sphuber, thanks for the reply! You mean this one? That should already have been released in v1.6.4 (at least it is there in the commit log).

EDIT: hmm, I now see that this plugin has aiida_core[atomic_tools]~=1.4,>=1.4.4 in its requirements; any reason why this can't be bumped to ~=1.6? Alternatively, if that is not possible, pinning psycopg2-binary~=2.8.3 in this plugin should also work, no?
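For reference, the second option would just be an extra pin next to the existing requirement; a sketch, assuming a standard setuptools install_requires list (the plugin's actual packaging files may be organized differently):

```python
# Sketch only: pin psycopg2-binary alongside the existing aiida-core requirement so that
# environments resolving aiida-core < 1.6 do not pick up the broken psycopg release.
install_requires = [
    'aiida_core[atomic_tools]~=1.4,>=1.4.4',
    'psycopg2-binary~=2.8.3',
    # ... other requirements unchanged
]
```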

@sphuber
Contributor

sphuber commented Jul 21, 2021

The problem is that the fix was only applied to aiida-core~=1.6, which dropped Python 3.6 support. The plugin still supports 3.6 and so installs aiida-core==1.5.2 for that version, which doesn't have the requirement limitation on psycopg. This means that all Python 3.6 builds are currently broken, and maybe you should make a patch release for v1.5 with the same fix. There are some other recent bug fixes that might be useful to backport.

@sphuber
Contributor

sphuber commented Jul 21, 2021

Thanks @ramirezfranciscof. Before going into a detailed review, I first have some general questions about the purpose and design of the workchain. If I understand correctly, this work chain is a utility to run the TransferCalcJob to prepare the output folder of a previously completed PwCalculation to be used for a restart run through the PwBaseWorkChain. If that is the case, then why:

  1. Is the running of the PwBaseWorkChain optional? If you don't want to run that, what is the point of this work chain? Shouldn't you simply run the TransferCalculation directly?
  2. Why the asymmetry in behavior of the TransferCalculation based on whether the data_source is a RemoteData or a FolderData? Surely, when using this workchain, the purpose is to get the source folder to the computer where the PwCalculation of the PwBaseWorkChain will be run? So why not always create a RemoteData on the target computer?

Then I have a comment on the design of the spec and the protocol. First of all, thanks for making an effort to reproduce the style of the other workchains; I appreciate it. The generic design principle is to make the input spec as "unrestricted" as possible. That is to say, unless you really need to, I think it is best to fully expose the inputs of sub-processes. You already did this for the PwBaseWorkChain but not for the TransferCalcJob. I imagine you did this because, this workchain being a simple utility wrapper, you wanted it to automatically build the inputs for the TransferCalcJob based on a single input. Although this is nice, this is exactly what the protocols were invented for. By not restricting the input namespaces of sub-processes and instead fully exposing them, you give full flexibility (this will become important, for example, once the TransferCalcJob provides functionality to restore from stashed data, which would require specific inputs). With your design, a user cannot touch any of the other inputs of the TransferCalcJob even if they wanted to. By fully exposing them you don't have this limitation, but of course you put the onus on the user to fully define the inputs.

This is where the get_builder_from_protocol method comes in. Its purpose is exactly to build the full inputs dictionary based on a reduced number of inputs. Therefore I would suggest simply moving the logic that builds the instructions and other inputs for the TransferCalcJob to get_builder_from_protocol.
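A minimal sketch of what that spec design could look like (illustrative only: the workchain name is a placeholder and the entry point of the transfer calculation depends on the aiida-core version):

```python
from aiida.engine import WorkChain
from aiida.plugins import CalculationFactory, WorkflowFactory

# Entry point name is an assumption: 'transfer' in aiida-core 1.6, 'core.transfer' in later versions.
TransferCalculation = CalculationFactory('transfer')
PwBaseWorkChain = WorkflowFactory('quantumespresso.pw.base')


class PwTransferWorkChain(WorkChain):  # placeholder name
    """Transfer the restart files and optionally rerun an NSCF."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # Fully expose the sub-process inputs instead of restricting them to a single input ...
        spec.expose_inputs(TransferCalculation, namespace='transfer')
        spec.expose_inputs(PwBaseWorkChain, namespace='nscf')
        # ... and leave the convenience logic that builds `instructions` and friends
        # to `get_builder_from_protocol`, as suggested above.
```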

@ramirezfranciscof
Member Author

> Thanks @ramirezfranciscof. Before going into a detailed review, I first have some general questions about the purpose and design of the workchain. If I understand correctly, this work chain is a utility to run the TransferCalcJob to prepare the output folder of a previously completed PwCalculation to be used for a restart run through the PwBaseWorkChain. If that is the case, then why:
>
> 1. Is the running of the `PwBaseWorkChain` optional? If you don't want to run that, what is the point of this work chain? Shouldn't you simply run the `TransferCalculation` directly?
>
> 2. Why the asymmetry in behavior of the `TransferCalculation` based on whether the `data_source` is a `RemoteData` or a `FolderData`? Surely, when using this workchain, the purpose is to get the source folder to the computer where the `PwCalculation` of the `PwBaseWorkChain` will be run? So why not always create a `RemoteData` on the target computer?

Hey @sphuber, thanks for the comments! Yeah, in fact this would be a helper function to automate moving the restart information from/to external clusters (RemoteData) and local storage (FolderData). This is in part doing what you describe (preparing a folder to be used as the parent_folder of other calculations), but it can also retrieve this data from a remote to store in the AiiDA repository (which explains the asymmetry mentioned in item 2).

  1. This is optional because re-constructing the wavefunctions might not be needed in the future (for example, once we implement the startingpot restart option discussed in PwBaseWorkChain: revisit restarting from calculations #691), in which case it is true that it would just be a wrapper around TransferCalcJob that auto-populates the inputs with the data QE needs.

  2. I addressed the asymmetry just above, but the other point about allowing the use case of moving the restart information from one computer to another is interesting. I think I originally discarded it quickly because the engine doesn't allow a single TransferCalcJob to copy between remotes, but it is true that nothing prevents me from running two transfers within the workchain (first copying locally and then to the other remote). If you think this might be a useful use case, I can adapt the workflow to add this option as well.

> Then I have a comment on the design of the spec and the protocol. First of all, thanks for making an effort to reproduce the style of the other workchains; I appreciate it. The generic design principle is to make the input spec as "unrestricted" as possible. That is to say, unless you really need to, I think it is best to fully expose the inputs of sub-processes. You already did this for the PwBaseWorkChain but not for the TransferCalcJob. I imagine you did this because, this workchain being a simple utility wrapper, you wanted it to automatically build the inputs for the TransferCalcJob based on a single input. Although this is nice, this is exactly what the protocols were invented for. By not restricting the input namespaces of sub-processes and instead fully exposing them, you give full flexibility (this will become important, for example, once the TransferCalcJob provides functionality to restore from stashed data, which would require specific inputs). With your design, a user cannot touch any of the other inputs of the TransferCalcJob even if they wanted to. By fully exposing them you don't have this limitation, but of course you put the onus on the user to fully define the inputs.
>
> This is where the get_builder_from_protocol method comes in. Its purpose is exactly to build the full inputs dictionary based on a reduced number of inputs. Therefore I would suggest simply moving the logic that builds the instructions and other inputs for the TransferCalcJob to get_builder_from_protocol.

Yeah, this is a good point; I don't know why I didn't expose the TransferCalcJob like the PwBaseWorkChain (well, I do have an idea of the explanation, but it doesn't come with any good reason, so I guess there's no point in mentioning it =P). I'll try to adapt it now before adding the tests.

@sphuber
Contributor

sphuber commented Jul 21, 2021

Thanks for the clarification. I think the purpose of the workchain just needs to be cleared up a bit. If I understand correctly, the optional PwBaseWorkChain is just to regenerate a complete output directory from a partial output directory, which then can be used in some other work chain that restarts from the reconstructed one. And the only reason that it is optional is that in some cases the source output directory already contains all the files necessary and the directory merely needs to be put on some target remote computer?

If the intention of the user is really to prepare the source data on the target computer such that it can be restarted from, do we really need the separate nscf run? Can't I just run the TransferCalcJob and then run whatever I actually want to run (PwCalculation or PwBaseWorkChain) and use the remote folder of the TransferCalcJob as the parent_folder? Whatever the contents, it will be able to restart from it, wouldn't it? (If the nscf step in your work chain would have been able to, so would any other calculation.) I guess ultimately what I am asking is: does this use case really need an entire work chain? Couldn't all of this just be satisfied with some utility function that simply prepares the inputs of a TransferCalcJob that prepares a parent_folder?

@ramirezfranciscof
Member Author

ramirezfranciscof commented Jul 22, 2021

> If I understand correctly, the optional PwBaseWorkChain is just to regenerate a complete output directory from a partial output directory, which then can be used in some other work chain that restarts from the reconstructed one.

Yes, this is correct.

> If the intention of the user is really to prepare the source data on the target computer such that it can be restarted from, do we really need the separate nscf run? Can't I just run the TransferCalcJob and then run whatever I actually want to run (PwCalculation or PwBaseWorkChain) and use the remote folder of the TransferCalcJob as the parent_folder? Whatever the contents, it will be able to restart from it, wouldn't it?

Well, this is the problem. Currently this is not possible for two separate reasons:

  1. On the QE side: while an NSCF calculation can regenerate all the wavefunctions from the density file, an SCF can't "restart" a calculation without them already being there.
  2. On the aiida-qe side: SCF calculations can use a density file to generate the starting guess without the wavefunctions (the startingpot keyword), but this has to be run with restart_mode = 'from_scratch'. The problem is that the PW calculation plugin will automatically set restart_mode = 'restart' if I provide a parent folder from which to copy the density, and I can't toggle that (see the parameter sketch just below this list). This is what we are discussing how to change in PwBaseWorkChain: revisit restarting from calculations #691, but we are still discussing the best way to modify this and it may take a while to settle.
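For context, here is a sketch of the pw.x parameters that the startingpot route in point 2 corresponds to; the keywords are standard QE input flags, and how to set them through the plugin is exactly what is still open in #691:

```python
from aiida import orm

# Restarting an SCF from the charge density only (no wavefunctions). This is what the
# plugin currently cannot express, because it forces restart_mode = 'restart' whenever
# a parent_folder is provided.
parameters = orm.Dict(dict={
    'CONTROL': {
        'calculation': 'scf',
        'restart_mode': 'from_scratch',  # required for startingpot to take effect
    },
    'ELECTRONS': {
        'startingpot': 'file',  # read the initial potential from the copied charge density
    },
})
```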

> I guess ultimately what I am asking is: does this use case really need an entire work chain? Couldn't all of this just be satisfied with some utility function that simply prepares the inputs of a TransferCalcJob that prepares a parent_folder?

I can reorganize this so that there is a separate "long term solution" auxiliary function that does what you say, and the rest of the workchain can be deprecated later; but for now, running the NSCF after the transfer is a necessity when copying to a remote for running.

> And the only reason that it is optional is that in some cases the source output directory already contains all the files necessary and the directory merely needs to be put on some target remote computer?

Not exactly: the output already contains all the necessary files if the next process starts with an NSCF. The main reasons it was optional were:

  1. So that it could be used in the future, once the startingpot option becomes available.
  2. Because copying TO a remote is not the only thing the workchain takes care of; the idea was that it could also be used to retrieve FROM a remote to your local computer, so the density could be stored in the DB for future use. (This use case is why I emphasized that sentence when responding to the previous quote.)

This made sense because I was thinking that the WorkChain would be the only tool dealing with the transfer. If you want me to go the auxiliary-function route, then I would move all the preparations there and the workchain would retain only the function of bundling the transfer and the NSCF run together (thus eliminating most of the "optional" stuff, in the sense that the NSCF would no longer be optional, since just running the assisted TransferCalcJob can be done with the auxiliary functions).

Actually, I think I like that organization better.
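Roughly, such an auxiliary function could look like the sketch below; the helper name, the instruction keys and the QE file paths are illustrative only and would need to be checked against the TransferCalcJob documentation:

```python
from aiida import orm


def get_transfer_inputs(source_data, computer=None):
    """Hypothetical helper (not the PR code): prepare TransferCalcJob inputs for a restart folder.

    The instruction keys ('retrieve_files', 'remote_files', 'local_files') and the QE file paths
    are stated from memory and should be checked against the TransferCalcJob documentation.
    """
    if isinstance(source_data, orm.RemoteData):
        # Bring the density from the remote into the repository (FolderData output).
        instructions = {
            'retrieve_files': True,
            'remote_files': [['source_node', 'out/aiida.save/charge-density.dat', 'charge-density.dat']],
        }
    else:
        # Put the locally stored density back onto a remote computer to restart from there.
        instructions = {
            'retrieve_files': False,
            'local_files': [['source_node', 'charge-density.dat', 'out/aiida.save/charge-density.dat']],
        }
    return {
        'source_nodes': {'source_node': source_data},
        'instructions': orm.Dict(dict=instructions),
        'metadata': {'computer': computer or source_data.computer},
    }
```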

ramirezfranciscof force-pushed the transferdens branch 2 times, most recently from 81a91a1 to 4078d77 on July 22, 2021 at 16:24
ramirezfranciscof force-pushed the transferdens branch 3 times, most recently from aa79f5c to 2ad963b on July 23, 2021 at 16:56
@ramirezfranciscof
Member Author

I added some tests for the utility function. Before I add any tests for the actual workchain and documentation for the process, I would like you guys (@sphuber @mbercx) to take a first look in case any other significant modification is required.

Moreover, I was just reminded by some test failures that the TransferCalcJob was only incorporated in aiida-core v1.6, so I can't run the new tests in CI yet. I think it still makes sense to have them there (for real reasons, but also because they were a whole day's work), and you should be able to run them locally if you want to try them out.

PS: I'm not super sure why the CI for Python 3.8 is failing; it looks like some database contamination between tests, maybe? I didn't touch any other tests of the suite and Python 3.6 works, so it is strange...
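(As an aside, one way to keep the new tests in the suite while CI still installs aiida-core < 1.6 for Python 3.6 would be a version-based skip; this is a generic pytest sketch, not what the PR currently does:)

```python
import pytest
from packaging.version import parse

import aiida

# Skip the transfer tests when the installed aiida-core predates the TransferCalcJob (added in 1.6).
requires_transfer = pytest.mark.skipif(
    parse(aiida.__version__) < parse('1.6.0'),
    reason='the TransferCalcJob requires aiida-core>=1.6',
)


@requires_transfer
def test_transfer_utility():
    ...
```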

ramirezfranciscof marked this pull request as ready for review on July 23, 2021 at 17:17
@sphuber
Contributor

sphuber commented Aug 2, 2021

Thanks for the detailed additional explanation, @ramirezfranciscof. Now that the goal is a bit clearer, I still have to wonder why this needs to be in the main repository, since it seems all of this is not fully hashed out yet, contains workarounds for issues that are being addressed elsewhere, and is mostly to be used for a private project. What is the benefit of adding this to the public repository if so much is still unclear and might even get deprecated soon? Why not just keep this in a personal or the private QE repo?

@ramirezfranciscof
Member Author

> What is the benefit of adding this to the public repository if so much is still unclear and might even get deprecated soon? Why not just keep this in a personal or the private QE repo?

Well, I'm not sure how "soon" it would be deprecated. The discussion on the startingpot option seems to still be ongoing, and since it would imply modifying the PwCalcjob, which underlies a lot of QE workflows, it is worth taking the proper time to discuss. This, added to the fact that I don't know how much of a priority that change is in general, means that an ETA is not super clear to me at this point.

On the other hand, I was also encouraged by "higher powers" to clean up and publish some of the tools I'm using for my research project. I was under the impression that making it easier to restart calculations from stored densities was a feature of general interest that we wanted to offer more publicly. I may have misinterpreted, though, and perhaps the private repo is enough for this (although I feel it is a bit more awkward to keep up to date with the public repo?); maybe @giovannipizzi can chip in with some clarifications.

@giovannipizzi
Member

Hi all, thanks for the work @ramirezfranciscof and for the review @sphuber.

Indeed, we've been discussing this and suggesting that Francisco release his code in the public repository for public use.
After rediscussing it a few days ago, the guideline would be the following:

  • Code to restart from a previous QE calculation (in various forms: from the charge density, the potential, ...) is a very common and general need for many QE users (and one needs to spend weeks to understand how this actually works in QE...), so this would definitely be something to add to aiida-quantumespresso.
  • Code specific to a scientific project (e.g. workflows to compute the specific materials properties Francisco is working on) is better suited for another repository.

So as long as the code is 1) generic enough (e.g. restarting a previous QE calculation), 2) written clearly and with a clear interface, 3) has tests, and 4) is documented, I would vote to merge it here. If there are specific questions I'm happy to discuss (maybe in person is faster if many more discussions arise on the scope of the code?).
