Adding an alert to all workflows if they get timed out (#2593)
* Adding an alert to all workflows if they get timed out

* Let's add some logic to create the secret

* Let's just delete and recreate at all times

* Changing image

* Removing APK command
AidanHilt authored Jul 17, 2024
1 parent e7fb972 commit d4e2651
Showing 2 changed files with 26 additions and 0 deletions.
12 changes: 12 additions & 0 deletions gen3/bin/kube-setup-argo.sh
@@ -204,6 +204,18 @@ EOF
  aws iam put-role-policy --role-name ${roleName} --policy-name ${internalBucketPolicy} --policy-document file://$internalBucketPolicyFile || true
fi

# Create a secret for the slack webhook
alarm_webhook=$(g3kubectl get cm global -o yaml | yq .data.slack_alarm_webhook | tr -d '"')

if [ -z "$alarm_webhook" ] || [ "$alarm_webhook" = "null" ]; then
  gen3_log_err "Please set a slack_alarm_webhook in the 'global' configmap. This is needed to alert for failed workflows."
  exit 1
fi

g3kubectl -n argo delete secret slack-webhook-secret --ignore-not-found
g3kubectl -n argo create secret generic "slack-webhook-secret" --from-literal=SLACK_WEBHOOK_URL="$alarm_webhook"


## if new bucket then do the following
# Get the aws keys from secret
# Create and attach lifecycle policy
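
A slightly stricter variant of the emptiness check in the hunk above can be sketched as follows. It assumes the value was extracted with yq, which prints the literal string `null` when a key is missing, so an empty-string test alone would let that case through. The function name `webhook_is_set` and the example URL are hypothetical and not part of gen3.

```shell
#!/bin/sh
# Hypothetical helper (not part of gen3): succeeds only when the value
# looks like a real setting rather than an empty string or the word "null".
webhook_is_set() {
  value="$1"
  [ -n "$value" ] && [ "$value" != "null" ]
}

# Example values; the URL is a placeholder, not a real webhook.
for candidate in "" "null" "https://hooks.example.invalid/services/T000/B000/XXXX"; do
  if webhook_is_set "$candidate"; then
    echo "set: $candidate"
  else
    echo "missing: '$candidate'"
  fi
done
```

Keeping the check in a small function makes it easy to exercise without a cluster, since it needs neither `g3kubectl` nor `yq`.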
14 changes: 14 additions & 0 deletions kube/services/argo/values.yaml
@@ -61,6 +61,20 @@ controller:
  workflowDefaults:
    spec:
      archiveLogs: true
      onExit: alert-on-timeout
      templates:
        - name: alert-on-timeout
          script:
            image: quay.io/cdis/amazonlinux-debug:master
            command: [sh]
            envFrom:
              - secretRef:
                  name: slack-webhook-secret
            source: |
              failure_reason=$(echo {{workflow.failures}} | jq 'any(.[]; .message == "Step exceeded its deadline")' )
              if [ "$failure_reason" = "true" ]; then
                curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"ALERT: Workflow {{workflow.name}} has been killed due to timeout\"}" "$SLACK_WEBHOOK_URL"
              fi
  # -- [Node selector]
  nodeSelector:
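
The jq filter used by the exit handler prints either `true` or `false`, and both are non-empty strings, so its result needs an explicit comparison with `"true"` before alerting. The sketch below runs the same filter against two made-up payloads shaped like a JSON list of failed nodes. The helper name `has_timeout` and the sample data are assumptions for illustration, and the demo loop is skipped when jq is not installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): prints "true" when any entry
# in a JSON array of failed nodes carries the deadline message, else "false".
has_timeout() {
  printf '%s' "$1" | jq 'any(.[]; .message == "Step exceeded its deadline")'
}

# Made-up payloads shaped like a list of failed nodes.
timed_out='[{"displayName":"step-a","message":"Step exceeded its deadline"}]'
other_failure='[{"displayName":"step-b","message":"exit code 1"}]'

if command -v jq >/dev/null 2>&1; then
  for payload in "$timed_out" "$other_failure"; do
    # Both "true" and "false" are non-empty strings, so compare explicitly.
    if [ "$(has_timeout "$payload")" = "true" ]; then
      echo "alert"
    else
      echo "no alert"
    fi
  done
fi
```

With jq available, the first payload should take the alert branch and the second should not.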