Watcher incorrect handling of failed job #307

Closed
mbthornton-lbl opened this issue Nov 26, 2024 · 0 comments · Fixed by #308

The process_failed_job routine is not behaving right:

  • new_jobs_from state is not handling failed jobs right - keeps looping
  • failed jobs are not being re-submitted to Cromwell
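
The two symptoms above suggest the behavior one would expect instead: a job restored with a Failed status is resubmitted a bounded number of times and then retired, so it is not revisited on every poll. The following is only a minimal sketch of that idea, not the project's implementation and not the fix in #308. `Job`, `MAX_RETRIES`, `failed_count`, `done`, `poll_once`, and the `submit` callable are all hypothetical names; only `process_failed_job` is borrowed from the issue text, and its body here is an assumption.

```python
from dataclasses import dataclass

MAX_RETRIES = 2  # hypothetical limit, not taken from the issue


@dataclass
class Job:
    """Hypothetical stand-in for a job restored from the state file."""
    workflow_id: str
    last_status: str
    failed_count: int = 0
    done: bool = False


def process_failed_job(job, submit):
    """Resubmit a failed job, or retire it once the retry limit is reached.

    `submit` is any callable that sends the job to the workflow engine.
    Returns True if the job was resubmitted.
    """
    if job.failed_count >= MAX_RETRIES:
        job.done = True  # stop revisiting this job on later polls
        return False
    job.failed_count += 1
    job.last_status = "Submitted"  # no longer "Failed", so it is not retried again
    submit(job)
    return True


def poll_once(jobs, submit):
    """One polling pass: act on each unfinished job at most once."""
    resubmitted = []
    for job in jobs:
        if job.done:
            continue
        if job.last_status == "Succeeded":
            job.done = True
        elif job.last_status == "Failed":
            if process_failed_job(job, submit):
                resubmitted.append(job.workflow_id)
    return resubmitted
```

With the two jobs from the example log, a first call to `poll_once` would resubmit only `nmdc:wfmag-12-h52r0792.1`, and a second call would resubmit nothing unless that job's status had been set back to Failed.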

Example log

2024-11-26 09:42:30,896 INFO: Initializing Watcher: config file: /global/homes/n/nmdcda/nmdc_automation/dev/site_configuration_nersc.toml
2024-11-26 09:42:30,897 INFO: Using state file from config: /global/cfs/cdirs/m3408/var/dev/agent.state
2024-11-26 09:42:30,897 INFO: New Job from State: nmdc:wfmag-11-g7msr323.1, nmdc:66cf64b6-7462-11ef-8b84-deaa01ab0f49
2024-11-26 09:42:30,898 INFO: Last Status: Succeeded
2024-11-26 09:42:30,898 INFO: New Job from State: nmdc:wfmag-12-h52r0792.1, nmdc:c2b7c884-ab78-11ef-8298-3e652b5abb3d
2024-11-26 09:42:30,898 INFO: Last Status: Failed
2024-11-26 09:42:30,898 INFO: Adding 2 new jobs from state file.
2024-11-26 09:42:32,277 INFO: Entering polling loop
2024-11-26 09:42:32,312 DEBUG: Starting new HTTPS connection (1): api-dev.microbiomedata.org:443
2024-11-26 09:42:32,862 DEBUG: https://api-dev.microbiomedata.org:443 "POST /token HTTP/11" 200 None
2024-11-26 09:42:32,873 DEBUG: Starting new HTTPS connection (1): api-dev.microbiomedata.org:443
2024-11-26 09:42:33,293 DEBUG: https://api-dev.microbiomedata.org:443 "GET /jobs?max_page_size=100&filter=%7B%22workflow.id%22:%20%7B%22$in%22:%20%5B%22Sequencing%20Noninterleaved:%20%22,%20%22Sequencing%20Interleaved:%20%22,%20%22Reads%20QC:%20v1.0.13%22,%20%22Reads%20QC%20Interleave:%20v1.0.12%22,%20%22Metagenome%20Assembly:%20v1.0.7%22,%20%22Metagenome%20Annotation:%20v1.1.0%22,%20%22MAGs:%20v1.3.12%22,%20%22Readbased%20Analysis:%20v1.0.8%22%5D%7D,%20%22claims%22:%20%7B%22$size%22:%200%7D%7D HTTP/11" 200 16
2024-11-26 09:42:33,294 INFO: Found 0 unclaimed jobs.
2024-11-26 09:42:33,531 INFO: Checking for finished jobs.
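
The `filter` parameter in the `GET /jobs` request above is URL-encoded JSON, which is hard to read in the log. As a reading aid, here is a small sketch that decodes it with the standard library. It assumes the wrapped log lines join without any inserted characters; the variable names are illustrative only.

```python
import json
from urllib.parse import unquote

# Encoded filter copied from the GET /jobs log line, with the
# terminal line wraps joined back together (an assumption).
encoded_filter = (
    "%7B%22workflow.id%22:%20%7B%22$in%22:%20%5B"
    "%22Sequencing%20Noninterleaved:%20%22,%20"
    "%22Sequencing%20Interleaved:%20%22,%20"
    "%22Reads%20QC:%20v1.0.13%22,%20"
    "%22Reads%20QC%20Interleave:%20v1.0.12%22,%20"
    "%22Metagenome%20Assembly:%20v1.0.7%22,%20"
    "%22Metagenome%20Annotation:%20v1.1.0%22,%20"
    "%22MAGs:%20v1.3.12%22,%20"
    "%22Readbased%20Analysis:%20v1.0.8%22"
    "%5D%7D,%20%22claims%22:%20%7B%22$size%22:%200%7D%7D"
)

# Percent-decode, then parse the resulting JSON text.
query = json.loads(unquote(encoded_filter))

# Pretty-print so the workflow list and the claims condition are readable.
print(json.dumps(query, indent=2))
```

Decoded, the request asks for jobs whose `workflow.id` is in a list of eight workflow names and whose `claims` array has size 0, which matches the "Found 0 unclaimed jobs" line that follows it in the log.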