Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support per-component tolerations (+ manager) #154

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

loganmzz
Copy link

@loganmzz loganmzz commented Jan 6, 2025

Should partially fix aws/containers-roadmap#2515

Issue #, if available:

aws/containers-roadmap#2515

Description of changes:

  1. Let's support per-component tolerations
  2. Add tolerations for manager

Behavior

  • Root tolerations is applied by default
  • Manager tolerations is disabled by default ([] value, see below)
  • Each component may override tolerations
  • Tolerations can be disabled by using empty list [] (root or specific)
  • If specific tolerations is null (nil in Go/Template), root ones are used.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@giany giany mentioned this pull request Feb 5, 2025
@@ -1272,6 +1272,9 @@ manager:
podLabels: {}
service:
name:
# When true, root tolerations aren't added to this one.
tolerationsReplaced: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be very intuitive to set another boolean just to express that the tolerations at the specific component level should be respected.

Could we instead have the tolerations commented out at the specific component level, such as:

# tolerations: []  # Uncomment this line to override root-level tolerations.

Once you merge from latest main, you will notice the root tolerations have been defined & only apply to daemonsets and do NOT apply to the controller manager deployment. So just for controller manager alone, we truly do want to explicitly set tolerations: [] and let it override the root level ones to maintain compatibility i.e. not comment that line.

We can then probably have logic that does something like:

{{- if hasKey .Values.manager "tolerations" -}}
  {{- toYaml .Values.manager.tolerations -}}
{{- else -}}
  {{- toYaml .Values.tolerations -}}
{{- end -}}
{{- end -}}

Could probably put this in a shared method in _helpers.tpl and call it for the all components.

The components to cover would be:

  • manager:
  • containerLogs.fluentBit
  • agent
  • dcgmExporter
  • neuronMonitor

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sky333999,

Thank you for the feedback!

The PR has been designed to be conservative to avoid unexpected impact for existing deployment. I will check new code changes and update my PR accordingly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sky333999

I have updated the PR based on:

  • Root tolerations is applied by default
  • Manager tolerations is disabled by default ([] value, see below)
  • Each component may override tolerations
  • Tolerations can be disabled by using empty list [] (root or specific)
  • If specific tolerations is null (nil in Go/Template), root ones are used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats great! Thank you so much for the changes. Looks good to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it a bit more, you could probably get away with something like:

  {{- with (.Values.agent.tolerations | default .Values.tolerations ) }}
  tolerations: {{- toYaml . | nindent 2}}
  {{- end }}

Im debating if this is just simpler for each of the components instead of the helper function. Im fine either way.

Copy link
Author

@loganmzz loganmzz Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I miss your two last review feedbacks.

Because nil and empty ([]) are handled differently, default function won't do what is expected.

@loganmzz loganmzz force-pushed the aws-containers-roadmap-issues-2515 branch from 26ec8ef to 260e688 Compare February 11, 2025 07:46
@loganmzz loganmzz force-pushed the aws-containers-roadmap-issues-2515 branch from 260e688 to 238c132 Compare February 11, 2025 08:24
@sky333999
Copy link
Contributor

Thanks for the updates @loganmzz . Would you mind adding some testing information to the description so we have that captured before I merge this in?
Some manual test suggestions if you could please try out locally:

  • Render the chart prior to and after your change to make sure the rendered manifests are identical since your change should be a no-op
  • Override the deployment tolerations to something other than [] and ensure its reflected correctly in the rendered manifests
  • Override one of the daemonset tolerations to [] or something eles to ensure its reflected correctly in the rendered manifests

@loganmzz
Copy link
Author

loganmzz commented Feb 12, 2025

Thanks for the updates @loganmzz . Would you mind adding some testing information to the description so we have that captured before I merge this in? Some manual test suggestions if you could please try out locally:

  • Render the chart prior to and after your change to make sure the rendered manifests are identical since your change should be a no-op
  • Override the deployment tolerations to something other than [] and ensure its reflected correctly in the rendered manifests
  • Override one of the daemonset tolerations to [] or something eles to ensure its reflected correctly in the rendered manifests

For sure here the values I have tested on:

# 00 - Empty 
# 01 - Disable all tolerations with empty
tolerations: []
# 02 - Disable all tolerations with null
tolerations:
# 03 - Set manager tolerations
manager:
  tolerations:
  - key: "specific"
    operator: "Equal"
    value: "manager"
# 04 - Set manager tolerations to default
manager:
  tolerations:
# 05 - Set agent tolerations
agent:
  tolerations:
  - key: "specific"
    operator: "Equal"
    value: "agent"
# 06 - Disable agent tolerations
agent:
  tolerations: []

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[EKS] [request]: Add toleration for each Amazon CloudWatch Observability EKS add-on components
2 participants