Add retry system #44

Open
ggtools opened this issue May 15, 2020 · 1 comment

Comments

ggtools commented May 15, 2020

Behaviour

I noticed today that my nightly job did not run. The task was rejected with the following message:

 Rejected 5 hours ago    "No such image: whatever-image:latest@sha256:[...]"

Steps to reproduce this issue

Hard to reproduce on demand; the cause might be a network issue, a Docker bug, etc.

Expected behaviour

As the job didn't run correctly, it should be retried. swarm-cronjob should probably keep track of failed runs and retry a couple of times before giving up.

Actual behaviour

The job is not restarted until the next scheduled slot.

Configuration

  • Target Docker version (the host/cluster you manage) : 19.03.4
  • Platform (windows/linux) : Linux
  • System info (type uname -a) : Linux xxxxx 4.19.75-v7+ #1270 SMP Tue Sep 24 18:45:11 BST 2019 armv7l GNU/Linux
  • Target Swarm version : 1.6.0

Docker info

Output of command docker info

Logs

swarm-cronjob service logs (set LOG_LEVEL to debug) and cron based service logs if useful

camo-f commented Jun 22, 2021

Hello,

I faced the same issue recently. As a workaround, I tried the on-failure condition of the restart_policy provided by Docker Swarm (see https://docs.docker.com/compose/compose-file/compose-file-v3/#restart_policy).

It seems to work with the following minimal example:

  test-exit1:
    image: alpine:3.12.5
    deploy:
      replicas: 0
      restart_policy:
        condition: on-failure
        max_attempts: 3
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=*/5 * * * *"
        - "swarm.cronjob.skip-running=true"
    entrypoint: /bin/sh -c "echo 'test exit 1' && exit 1"

  test-exit0:
    image: alpine:3.12.5
    deploy:
      replicas: 0
      restart_policy:
        condition: on-failure
        max_attempts: 3
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=*/5 * * * *"
        - "swarm.cronjob.skip-running=true"
    entrypoint: /bin/sh -c "echo 'test exit 0' && exit 0"

Results

tcdmgu79hgfb   swarm-cronjob-jobs_test-exit1.1       alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
a1xypld4onk9    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
u02ret3246sv    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
vv79doiga2ej    \_ swarm-cronjob-jobs_test-exit1.1   alpine:3.12.5   w1.lab.lan   Shutdown   Failed about a minute ago     "task: non-zero exit (1)"
l5lkanbekc4z   swarm-cronjob-jobs_test-exit0.1       alpine:3.12.5   w1.lab.lan   Shutdown   Complete about a minute ago
znh3msh857qe    \_ swarm-cronjob-jobs_test-exit0.1   alpine:3.12.5   w1.lab.lan   Shutdown   Complete 6 minutes ago
kybup9t116o1    \_ swarm-cronjob-jobs_test-exit0.1   alpine:3.12.5   w1.lab.lan   Shutdown   Complete 11 minutes ago

This shows two things:

  • When the job exits with 0, it does not restart.
  • When the job exits with 1, it is restarted as expected, up to the configured max_attempts. swarm-cronjob also starts the job again every 5 minutes, as scheduled.

@ggtools if your job exits 0 on success, I guess using on-failure as a restart condition for the service should work.
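For a nightly job, that suggestion might look like the fragment below. This is only a sketch: the service name `nightly-job` and the `0 2 * * *` schedule are made up for illustration, and `delay` and `window` are optional `restart_policy` settings from the Compose v3 reference that were not part of the test above. Note also that the test only exercised non-zero exits, so whether this also covers a rejected task such as "No such image" is an open assumption.

```yaml
services:
  nightly-job:                       # hypothetical service name
    image: whatever-image:latest     # image name taken from the original report
    deploy:
      replicas: 0                    # swarm-cronjob scales the service up on schedule
      restart_policy:
        condition: on-failure        # restart only when the task fails
        delay: 30s                   # optional: wait between restart attempts
        max_attempts: 3              # give up after 3 restarts
        window: 120s                 # optional: time to wait before judging a restart successful
      labels:
        - "swarm.cronjob.enable=true"
        - "swarm.cronjob.schedule=0 2 * * *"   # assumed schedule: every night at 02:00
        - "swarm.cronjob.skip-running=true"
```

A short delay between attempts may help if the failure is a transient network problem, since immediate restarts could all hit the same outage.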

@crazy-max Do you see any issue with the behavior I showed above? If not, I would suggest updating the documentation to specify that on-failure can be used if the job exits 0 on success. Currently, only the none condition is documented. I can make a PR if you don't have time for that :)
