Scheduler not installed raised as an Autosubmit Critical error in the middle of the run. ( Should be an Autosubmit Error ) #2102

Open
mbatllem opened this issue Feb 5, 2025 · 7 comments

@mbatllem

mbatllem commented Feb 5, 2025

Hello,

I'm not sure whether this issue is related to the workflow or to AS, but I suspect it is more related to AS.

I'm using AS 4.1.11 and WF 4.2.0 on MN5.

In my experiment a1yv, the last chunk apparently COMPLETED successfully. I can see this from the model logs and also from the file:
/gpfs/scratch/ehpc01/bsc998159/a1yv/LOG_a1yv/a1yv_19900101_fc0_332_SIM_COMPLETED.

However, when I run autosubmit monitor a1yv or check the AS GUI, this same SIM job still appears as RUNNING. The experiment crashed, writing the following to the nohup output:

[CRITICAL] Scheduler is not installed. [eCode=7052]  

I would like to continue the experiment ASAP. Would it be okay if I change the job status from RUNNING to WAITING and re-submit? Or would this prevent you from investigating the root cause of the issue?

@LuiggiTenorioK
Member

I think this is related to Autosubmit and not the API. Transferring the issue ➡➡➡

@LuiggiTenorioK LuiggiTenorioK transferred this issue from BSC-ES/autosubmit-api Feb 5, 2025
@mbatllem
Author

mbatllem commented Feb 5, 2025

Oh, you’re right! I'm sorry, I actually thought I was writing in the AS repo. Thanks @LuiggiTenorioK

@dbeltrankyl
Contributor

Hello @mbatllem

It is not shown as completed in the GUI/API or autosubmit monitor because the Autosubmit instance is stopped.

If you haven't run the recovery or setstatus commands yet, running autosubmit run $expid should be enough for Autosubmit to continue the run, and it is the recommended way of doing it.

It should also be fine to set the job to COMPLETED, or even to run autosubmit recovery $expid -s (--all is not needed there).

The error is strange, though. It shows up when a scheduler command (sacct, squeue, ...) is not found on the remote platform. Maybe the platform had some odd error in which Slurm was not detected. Just resume the experiment and we'll see if it still happens.
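
If it happens again, one way to check by hand whether the scheduler commands are visible is sketched below. This is only an illustration and not how Autosubmit performs its own check: `scheduler_commands_missing` and `run_local` are hypothetical helpers, and `run_local` uses a local POSIX shell as a stand-in for the remote one (a real check would go through SSH to the platform).

```python
import subprocess


def scheduler_commands_missing(commands, run_shell):
    """Return the commands that the shell behind `run_shell` cannot find.

    `run_shell` is a hypothetical callable that takes one shell command line
    and returns its exit status.
    """
    missing = []
    for name in commands:
        # `command -v NAME` exits with 0 only if the shell can resolve NAME.
        status = run_shell(f"command -v {name} >/dev/null 2>&1")
        if status != 0:
            missing.append(name)
    return missing


def run_local(command_line):
    """Run a command line in a local POSIX shell (stand-in for the remote)."""
    return subprocess.run(["sh", "-c", command_line]).returncode


if __name__ == "__main__":
    # An empty list means both Slurm commands were found.
    print(scheduler_commands_missing(["sacct", "squeue"], run_local))
```

An empty result while Autosubmit still reports the scheduler as not installed would point to a temporary problem on the platform rather than a missing Slurm installation.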

@dbeltrankyl
Contributor

dbeltrankyl commented Feb 5, 2025

Also, you don't need to resubmit the job, as it is already completed.

@dbeltrankyl dbeltrankyl added this to the 4.1.13 milestone Feb 5, 2025
@dbeltrankyl
Contributor

I'll update the issue title to

Scheduler not installed raises an Autosubmit Critical in the middle of the run.

I think we need to change this critical raise so that it only happens when you try to connect to the platforms. If it happens in the middle of the run, it should be raised as an error so that Autosubmit can reconnect to the platform.
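
A rough sketch of the proposed behaviour, to make the idea concrete. It is not Autosubmit code: the two exception classes are simplified stand-ins named after the Autosubmit Critical and Autosubmit Error levels, and `raise_scheduler_not_found`, `poll_with_reconnect`, `check_scheduler`, `reconnect` and `max_retries` are hypothetical names. The only value taken from this issue is the error code 7052.

```python
class AutosubmitCritical(Exception):
    """Stand-in for the fatal level: the run stops."""


class AutosubmitError(Exception):
    """Stand-in for the recoverable level: the run may retry."""


def raise_scheduler_not_found(during_initial_connection):
    """Pick the severity depending on when the scheduler went missing."""
    message = "Scheduler is not installed."
    if during_initial_connection:
        # At connection time the platform is most likely misconfigured.
        raise AutosubmitCritical(message, 7052)
    # Mid-run the scheduler was already seen once, so treat it as recoverable.
    raise AutosubmitError(message)


def poll_with_reconnect(check_scheduler, reconnect, max_retries=3):
    """Poll the scheduler mid-run, reconnecting after recoverable errors."""
    for attempt in range(max_retries + 1):
        try:
            return check_scheduler()
        except AutosubmitError:
            if attempt == max_retries:
                raise  # give up once the retries are exhausted
            reconnect()
```

With this split, a scheduler that disappears briefly mid-run costs a few reconnection attempts instead of stopping the whole experiment, while a platform that never had the scheduler still fails fast at connection time.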

@dbeltrankyl dbeltrankyl changed the title SIM job COMPLETED but monitored as RUNNING Scheduler not installed raises an Autosubmit Critical in the middle of the run. Feb 5, 2025
@dbeltrankyl dbeltrankyl changed the title Scheduler not installed raises an Autosubmit Critical in the middle of the run. Scheduler not installed raised as an Autosubmit Critical in the middle of the run. Feb 5, 2025
@dbeltrankyl dbeltrankyl changed the title Scheduler not installed raised as an Autosubmit Critical in the middle of the run. Scheduler not installed raised as an Autosubmit Critical error in the middle of the run. ( Should be Autosubmit Error ) Feb 5, 2025
@dbeltrankyl dbeltrankyl changed the title Scheduler not installed raised as an Autosubmit Critical error in the middle of the run. ( Should be Autosubmit Error ) Scheduler not installed raised as an Autosubmit Critical error in the middle of the run. ( Should be an Autosubmit Error ) Feb 5, 2025
@mbatllem
Author

mbatllem commented Feb 5, 2025

Thank you for your quick responses!

@mbatllem
Author

mbatllem commented Feb 5, 2025

Hello again, apparently this also happened here: /gpfs/scratch/ehpc01/bsc998159/a236/LOG_a236/a236_19900101_wf_5_LRA_GENERATOR_COMPLETED
