Error handling on the grid
galerykaeser edited this page Jun 1, 2021 · 2 revisions
- Generates state successors
- Submits slurm array jobs to evaluate successors (default array length up to 200)
- Polls array job status with `sacct`
- Parses evaluation results and continues search
Possible Errors | Safety Measures |
---|---|
Job submission with `sbatch` is not successful | If the `enforce_order` flag is not set, continue with the next successor batch; else, abort the search by returning the current state |
Slurm task is in a state other than `PENDING`, `RUNNING` or `COMPLETED` when polled | If the `enforce_order` flag is not set, ignore failed tasks; else, only consider tasks before the first failed one |
Evaluation result file is not present (something went wrong in the execution of the evaluation script on the compute node) | After waiting for the file for a maximum of 60 s (checking every 3 s), the corresponding task is ignored (without `enforce_order`) or the search is aborted by returning the current state (with `enforce_order`) |
Script crashes for any reason | None, the user is responsible for noticing the crash and re-executing the script |
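
The two polling-related safety measures in the table above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `select_usable_tasks` and `wait_for_result_file` are hypothetical, and the sketch assumes the task states have already been extracted from the `sacct` output as a list of strings ordered by task index.

```python
import time
from pathlib import Path

# States the table treats as healthy when polling with sacct.
OK_STATES = {"PENDING", "RUNNING", "COMPLETED"}


def select_usable_tasks(states, enforce_order):
    """Return the indices of tasks whose results may still be used.

    `states` is a list of Slurm state strings ordered by task index
    (hypothetical input format, assumed for this sketch).
    """
    usable = []
    for index, state in enumerate(states):
        if state in OK_STATES:
            usable.append(index)
        elif enforce_order:
            # With enforce_order, only tasks before the first failed one count.
            break
        # Without enforce_order, the failed task is simply skipped.
    return usable


def wait_for_result_file(path, timeout=60.0, interval=3.0):
    """Poll for `path`; return True if it shows up within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if Path(path).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the defaults, `wait_for_result_file` gives up after roughly 60 s of checking every 3 s; the caller then either ignores the task or aborts the search, depending on `enforce_order`.
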
- Limits memory of child processes with `ulimit` (setting the soft limit to 98% of the product of the Slurm parameters `cpus-per-task` and `mem-per-cpu`)
- Executes the evaluation script (similar to: `./script.py --evaluate /path/to/state-dump`) inside a sub-shell and redirects the stdout and stderr outputs to log files
Possible Errors | Safety Measures |
---|---|
Space character in `/path/to/state` causes an error in the execution of the evaluation script | The script path is checked for spaces at the beginning of the grid search |
Evaluation process consumes too much memory | Memory of child processes of the bash script is limited with `ulimit` at the beginning of the script; by only allowing 98% of the theoretically possible memory for a task, we ensure that Slurm is not responsible for killing tasks that consume too much memory |
Evaluation process takes too much time | Time limits are taken care of in level 4, where Run objects can only be defined with a time limit |
Script crashes for any reason | None, this is not expected to happen; in case it does, the output in the slurm.err log file might provide insight |
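
As a worked example of the 98% rule and the path check above, the arithmetic might look like this. The sketch is written in Python for illustration even though the real limit is set in a bash script. It assumes that `mem-per-cpu` is given in MiB and that the limit is expressed in KiB (the unit bash's `ulimit` uses for memory sizes); the function names are hypothetical.

```python
def soft_memory_limit_kib(cpus_per_task, mem_per_cpu_mib, percent=98):
    """Soft memory limit in KiB: `percent`% of cpus-per-task * mem-per-cpu.

    Assumes mem-per-cpu is given in MiB (an assumption of this sketch).
    """
    total_kib = cpus_per_task * mem_per_cpu_mib * 1024
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_kib * percent // 100


def path_has_space(path):
    """True if the path contains a space and would break the script call."""
    return " " in str(path)
```

For instance, a task with `cpus-per-task=2` and `mem-per-cpu=4096` may theoretically use 8192 MiB, so the soft limit under these assumptions is 8220835 KiB, slightly below what Slurm would allow.
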
- Parses the corresponding state from its dump and runs the evaluation defined by the user (generally by executing the Run instances in the state and processing the parsed output streams in a meaningful way)
Possible Errors | Safety Measures |
---|---|
Evaluation consumes too much memory | Memory limit set in the bash script with ulimit causes the evaluation script in the sub-shell to terminate on a memory error |
Evaluation takes too much time | `Run` classes (which define the program executions to be evaluated) have a mandatory argument `time_limit` that is applied to each run when its command is started as a subprocess (using `resource.setrlimit` from Python); therefore, the subprocess of the run always terminates at the latest when its time limit expires |
Script crashes for any reason | Any crash is reflected in the evaluation exit code, which is always captured and determines whether a state is accepted (i.e., evaluates to true) or rejected, meaning that a state whose evaluation crashes is simply rejected |
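
The time limit and the exit-code rule from the table above could be sketched like this. It is a rough illustration under stated assumptions rather than the `Run` implementation itself: it assumes the limit is set on CPU time via `resource.RLIMIT_CPU` (the page only names `resource.setrlimit`), that it runs on a Unix system, and that exit code 0 is what counts as "accepted". The names `run_with_time_limit` and `is_accepted` are hypothetical.

```python
import resource
import subprocess


def run_with_time_limit(command, time_limit):
    """Run `command` with a CPU-time limit of `time_limit` seconds.

    Returns the subprocess return code; a negative value means the
    process was terminated by a signal (for example after hitting the limit).
    """
    def set_limit():
        # Executed in the child just before the command starts (Unix only).
        # RLIMIT_CPU is an assumption of this sketch.
        resource.setrlimit(resource.RLIMIT_CPU, (time_limit, time_limit))

    completed = subprocess.run(command, preexec_fn=set_limit)
    return completed.returncode


def is_accepted(returncode):
    """Assumed mapping: exit code 0 accepts the state, anything else rejects."""
    return returncode == 0
```

Because a crash or a kill by the operating system also yields a non-zero return code, such a state is rejected without any special handling.
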
- Any program whose execution the user wants to analyze, e.g., `./fast-downward.py domain.pddl problem.pddl --search "astar(lmcut())"`
- Time and memory constraints are already handled by levels 2 and 3
Possible Errors | Safety Measures |
---|---|
Any error or crash that can occur during the execution of the run | Not needed, as any behavior of the program is captured via the produced outputs and returncode, which are then parsed and processed as part of the evaluation |
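
To illustrate why no extra safety measure is needed at this level, here is a minimal sketch of collecting everything a run produces. The helper name `run_and_collect` and the dictionary layout are hypothetical; the sketch uses only the standard `subprocess` module.

```python
import subprocess


def run_and_collect(command):
    """Run `command` and return its outputs and return code.

    A failing or crashing program does not raise here; the failure is
    visible only through the collected values.
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    return {
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "returncode": completed.returncode,
    }
```

The evaluation can then parse `stdout` and `stderr` and inspect `returncode` to decide how to treat the run, whatever the program did.
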