Rerun a job that has child jobs #139

Open
QuantumChemist opened this issue Jul 8, 2024 · 2 comments

@QuantumChemist
Contributor

QuantumChemist commented Jul 8, 2024

Hey 😃

I want to rerun a job that generates supercells based on a matrix and the unit cell of my structure. The state of the job I want to rerun is COMPLETED. I had chosen a matrix that was too large and thought I could just override the matrix entry and rerun the job.

But when I do, I get the following message:

(auto) certural@...:~$ jf job rerun -f 999
The selected project is auto from config file /home/certural/.jfremote/auto.yaml
The Runner is active. This operation may lead to inconsistencies in this case. Proceed anyway? [y/n] (n): y
[14:17:13] ERROR    Error executing the operation
                    ValueError: Job 96aa6541-d219-431d-bfe0-c5f7b09efa48 has a child job (cb3e4f3b-2938-414b-9da3-0c65368f283b) which is not the last index (1.
                    Rerunning the Job will lead to inconsistencies and is not allowed.

I'm using the interactive branch.

Do you have an idea what I could try? Thank you in advance!

@gpetretto
Contributor

Hi. Unfortunately at the moment that message means that the code is working as intended, so I am afraid I have no easy solution other than rerunning the workflow.
The problem is that rerunning a Job that dynamically generates a "replace" will result in the creation of new jobs in the db that are generated by the Job that executes the "replace". Reverting the dynamic operation of jobs that create children would make the logic of rerun much more involved, so at the moment we decided not to support this option.
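To illustrate the problem described above, here is a toy, in-memory model (not jobflow-remote code). It assumes that a "replace" adds a record with the same uuid and the next index plus one record per dynamically generated job; the function name `run_job_with_replace` and the record layout are hypothetical.

```python
import itertools

# Hypothetical counter used only to give generated jobs unique names.
_counter = itertools.count()


def run_job_with_replace(db, uuid, index, n_children):
    """Toy model: executing a job whose output is a 'replace' adds records to db."""
    # Assumption: the replacement reuses the uuid with the next index.
    db.append({"uuid": uuid, "index": index + 1})
    # Assumption: each dynamically generated job gets its own new record.
    for _ in range(n_children):
        db.append({"uuid": f"child-{next(_counter)}", "index": 1})


db = [{"uuid": "supercell-job", "index": 1}]

run_job_with_replace(db, "supercell-job", 1, n_children=2)  # first execution
records_after_first_run = len(db)

run_job_with_replace(db, "supercell-job", 1, n_children=2)  # naive rerun
records_after_rerun = len(db)

# The records from the first execution are still there, so the replacement
# (same uuid, index 2) now exists twice.
duplicates = [d for d in db if d == {"uuid": "supercell-job", "index": 2}]
```

In this toy model nothing removes the records created by the first execution, which is the kind of inconsistency the error message guards against.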

One possibility could be to introduce a job deletion functionality intended for "very advanced users". This would allow you to manually delete jobs, at the risk of leaving the Flow in an inconsistent state, but if you carefully select which jobs to delete it could work. What do you think @davidwaroquiers?

Actually, this could also be an option for you now. If you remove directly from the queue DB all the jobs that were generated by the job that you want to rerun, including the job with uuid 96aa6541-d219-431d-bfe0-c5f7b09efa48 and index 2, you should then be able to rerun the job. Note that the Jobs are in the jobs collection, but the flows collection also contains references to the Jobs belonging to the Flow and the connections among them. The dynamically generated Jobs should also be removed from there. I never tried that though, so I cannot guarantee that it will work properly. 😁
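A minimal sketch of the selection step for such a manual cleanup, working on plain dictionaries rather than on the real database. The document layout is an assumption made for illustration (job documents with `db_id`, `uuid` and `index`; a flow document with a `jobs` list of uuids and an `ids` list of `[db_id, uuid, index]` entries), and `plan_cleanup`, `generated-a` and `unrelated` are hypothetical names. Check the actual documents in your queue DB, and back them up, before changing anything.

```python
PARENT = "96aa6541-d219-431d-bfe0-c5f7b09efa48"


def plan_cleanup(job_docs, flow_doc, generated):
    """Return (job docs to delete, cleaned copy of the flow doc).

    generated: set of (uuid, index) pairs created by the job to be rerun.
    Nothing is written anywhere; this only computes what would change.
    """
    to_delete = [d for d in job_docs if (d["uuid"], d["index"]) in generated]
    # Assumed layout of each entry in flow_doc["ids"]: [db_id, uuid, index].
    remaining_ids = [e for e in flow_doc["ids"] if (e[1], e[2]) not in generated]
    remaining_uuids = {e[1] for e in remaining_ids}
    cleaned = dict(flow_doc)
    cleaned["ids"] = remaining_ids
    cleaned["jobs"] = [u for u in flow_doc["jobs"] if u in remaining_uuids]
    return to_delete, cleaned


# Hypothetical documents: the job to rerun, its replacement (index 2),
# one dynamically generated job, and one job that must be kept.
job_docs = [
    {"db_id": "1", "uuid": PARENT, "index": 1},
    {"db_id": "2", "uuid": PARENT, "index": 2},
    {"db_id": "3", "uuid": "generated-a", "index": 1},
    {"db_id": "4", "uuid": "unrelated", "index": 1},
]
flow_doc = {
    "jobs": [PARENT, "generated-a", "unrelated"],
    "ids": [[d["db_id"], d["uuid"], d["index"]] for d in job_docs],
}

to_delete, cleaned_flow = plan_cleanup(
    job_docs, flow_doc, generated={(PARENT, 2), ("generated-a", 1)}
)
deleted_db_ids = [d["db_id"] for d in to_delete]
```

Keeping the planning separate from any database write makes it possible to inspect `to_delete` and `cleaned_flow` first. If the flow document also stores connections between jobs, those references would need the same treatment.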

@QuantumChemist
Contributor Author

Hi,

well, I already guessed that the code is working as intended. 😆

I was just thinking of how it's handled in FireWorks. In FireWorks, when you rerun a job, it reruns in a new directory, all the dependent jobs also rerun in new directories, and then all the documents that keep track of the jobs are updated.
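As a rough sketch of the behaviour described above (not FireWorks code), rerunning a job together with everything that depends on it amounts to a traversal of the dependency graph. The `children` mapping and the job names below are hypothetical.

```python
from collections import deque


def downstream(children, start):
    """Return start plus every job reachable from it, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


# Hypothetical workflow: one supercell job feeding two relaxations and a final step.
children = {
    "supercells": ["relax_1", "relax_2"],
    "relax_1": ["collect"],
    "relax_2": ["collect"],
}
to_rerun = downstream(children, "supercells")
```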

But I understand that it's pretty complicated to introduce the handling of such dynamic jobs. Sorry for making all kinds of advanced requests haha 😅

I'm trying to rerun the jobs with more resources, and if this fails, just rerunning this part of the workflow sounds like the easiest solution.
