Add retry system #44

ggtools · 2020-05-15T06:28:50Z

Behaviour

I noticed today that my nightly job didn't run and was rejected with the following message:

 Rejected 5 hours ago    "No such image: whatever-image:latest@sha256:[...]"

Steps to reproduce this issue

Hard to reproduce reliably; the cause might be a network issue, a Docker bug, etc.

Expected behaviour

Since the job didn't run correctly, it should be retried. swarm-cronjob should probably keep track of failed runs and retry a couple of times before giving up.

Actual behaviour

The job is not restarted until the next scheduled slot.

Configuration

Target Docker version (the host/cluster you manage) : 19.03.4
Platform (windows/linux) : Linux
System info (type uname -a) : Linux xxxxx 4.19.75-v7+ #1270 SMP Tue Sep 24 18:45:11 BST 2019 armv7l GNU/Linux
Target Swarm version : 1.6.0

Docker info

Output of command docker info

Logs

swarm-cronjob service logs (set LOG_LEVEL to debug) and cron based service logs if useful


camo-f · 2021-06-22T10:26:38Z

Hello,

I faced the same issue recently. As a workaround, I tried the on-failure condition of the restart_policy provided by Docker Swarm (see https://docs.docker.com/compose/compose-file/compose-file-v3/#restart_policy).

It seems to work with the following minimal example:

  test-exit1:
    image: alpine:3.12.5
    deploy:
      replicas: 0
      restart_policy:
        condition: on-failure
        max_attempts: 3
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=*/5 * * * *"
        - "swarm.cronjob.skip-running=true"
    entrypoint: /bin/sh -c "echo 'test exit 1' && exit 1"

  test-exit0:
    image: alpine:3.12.5
    deploy:
      replicas: 0
      restart_policy:
        condition: on-failure
        max_attempts: 3
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=*/5 * * * *"
        - "swarm.cronjob.skip-running=true"
    entrypoint: /bin/sh -c "echo 'test exit 0' && exit 0"

Results

tcdmgu79hgfb   swarm-cronjob-jobs_test-exit1.1       alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
a1xypld4onk9    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
u02ret3246sv    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
vv79doiga2ej    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
l5lkanbekc4z   swarm-cronjob-jobs_test-exit0.1       alpine:3.12.5   w1.lab.lan   Shutdown   Complete about a minute ago
znh3msh857qe    \_ swarm-cronjob-jobs_test-exit0.1   alpine:3.12.5   w1.lab.lan   Shutdown   Complete 6 minutes ago
kybup9t116o1    \_ swarm-cronjob-jobs_test-exit0.1   alpine:3.12.5   w1.lab.lan   Shutdown   Complete 11 minutes ago

This shows two things:

When the job exits with 0, it is not restarted.
When the job exits with 1, it is restarted as expected, up to the configured max_attempts. swarm-cronjob also starts the job again every 5 minutes, as scheduled.
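
For a transient failure, immediate restarts may all hit the same problem. The Compose v3 restart_policy also accepts delay and window, so the attempts can be spaced out. This is a minimal sketch, not something tested in this thread: the service name test-retry, the nightly schedule, and the 30s and 120s values are illustrative assumptions.

```yaml
  # Hypothetical service; name, schedule and timings are illustrative only.
  test-retry:
    image: alpine:3.12.5
    deploy:
      replicas: 0
      restart_policy:
        condition: on-failure
        delay: 30s          # wait between restart attempts
        max_attempts: 3     # give up after three restarts
        window: 120s        # how long to wait before judging a restart successful
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=0 2 * * *"
        - "swarm.cronjob.skip-running=true"
    entrypoint: /bin/sh -c "echo 'nightly job' && exit 0"
```

Whether this policy also covers a task that is rejected before it starts (as with the "No such image" message in the original report) is not shown by the test above.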

@ggtools if your job exits 0 on success, I guess using on-failure as a restart condition for the service should work.

@crazy-max Do you see any issue with the behavior I showed above? If not, I would suggest updating the documentation to mention that on-failure can be used if the job exits 0 on success. Currently, only the none condition is documented. I can make a PR if you don't have time for that :)

crazy-max added the kind/enhancement label Jun 2, 2020

crazy-max added the 📌 pinned label Mar 7, 2021

crazy-max removed the 📌 pinned label May 26, 2021
