
[ADR]: 🎚️ Limiting resource requests and limits on the Dagster webserver and daemon #7

Closed
JasperHG90 opened this issue Feb 18, 2024 · 0 comments
Labels
accepted Suggestion has been accepted ADR This issue is labeled as an ADR enhancement New feature or request

JasperHG90 commented Feb 18, 2024

✍️ Context

To reduce spending on the Dagster deployment on GKE, we should limit the resources allocated to its long-running services.

These services are:

  • Dagster webserver
  • Dagster daemon
  • Dagster code locations

The resource limits can be set via the Dagster Helm chart's values.yaml.

Current resource requests

Looking at the pod deployments, we see that the following resource requests and limits are set for the webserver, daemon, and code location:

Limits:
  cpu:                500m
  ephemeral-storage:  1Gi
  memory:             2Gi
Requests:
  cpu:                500m
  ephemeral-storage:  1Gi
  memory:             2Gi
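To compare these requests against actual usage numerically, the Kubernetes quantity strings ("500m", "2Gi") have to be converted to plain numbers. A minimal, hypothetical helper (not part of this repo) could look like this; it only handles the suffixes that appear in this ADR:

```python
# Hypothetical helper: convert Kubernetes quantity strings to base units.
# Only supports the suffixes used in this ADR (m, Mi, Gi); real Kubernetes
# quantities allow more forms (k, M, G, Ki, exponent notation).

def parse_cpu(quantity: str) -> float:
    """'500m' -> 0.5 cores, '2' -> 2.0 cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory(quantity: str) -> int:
    """'400Mi' -> bytes, '2Gi' -> bytes."""
    units = {"Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)

print(parse_cpu("500m"))    # 0.5
print(parse_memory("2Gi"))  # 2147483648
```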

Currently, all three Dagster services have unused resources:

CPU (ranked in terms of unused resources)

  1. Webserver
  2. Daemon
  3. Code location

Memory (ranked in terms of unused resources)

  1. Code location
  2. Daemon
  3. Webserver

Requested versus used resources (from the GKE workload overview, 24-hour window)

Daemon: [screenshot: requested vs. used resources]

Webserver: [screenshot: requested vs. used resources]

Code location: [screenshot: requested vs. used resources]

🤝 Decision

Set the resource constraints as follows:

Daemon

Limits:
  cpu:                200m
  ephemeral-storage:  1Gi
  memory:             400Mi
Requests:
  cpu:                200m
  ephemeral-storage:  1Gi
  memory:             400Mi

Webserver

Limits:
  cpu:                120m
  ephemeral-storage:  1Gi
  memory:             400Mi
Requests:
  cpu:                120m
  ephemeral-storage:  1Gi
  memory:             400Mi

Code location

Limits:
  cpu:                250m
  ephemeral-storage:  1Gi
  memory:             400Mi
Requests:
  cpu:                250m
  ephemeral-storage:  1Gi
  memory:             400Mi
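Back-of-the-envelope arithmetic on what these values free up across the three services (a sketch; it ignores ephemeral storage, which is unchanged):

```python
# Compare total requested CPU/memory before and after this ADR.
old = {"cpu_m": 500 * 3, "mem_mi": 2048 * 3}         # 3 services at 500m / 2Gi
new = {"cpu_m": 200 + 120 + 250, "mem_mi": 400 * 3}  # daemon + webserver + code location

cpu_saved = old["cpu_m"] - new["cpu_m"]
mem_saved = old["mem_mi"] - new["mem_mi"]
print(f"CPU:    {old['cpu_m']}m -> {new['cpu_m']}m (frees {cpu_saved}m)")
print(f"Memory: {old['mem_mi']}Mi -> {new['mem_mi']}Mi (frees {mem_saved}Mi)")
```

That is roughly a 62% reduction in requested CPU and an 80% reduction in requested memory.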

These values can be set in "dagster-infra/app.tf" as follows:

resource "helm_release" "dagster" {
  name       = "dagster-${var.environment}"
  repository = "https://dagster-io.github.io/helm"
  chart      = "dagster"
  namespace  = kubernetes_namespace.dagster.metadata[0].name

  values = [
    file("${path.module}/static/values.yaml")
  ]
  # ...
  # Resource requests
  set {
    name = "dagsterWebserver.resources.limits.cpu"
    value = "120m"
  }

  set {
    name = "dagsterWebserver.resources.limits.memory"
    value = "400Mi"
  }

  set {
    name = "dagsterWebserver.resources.requests.cpu"
    value = "120m"
  }

  set {
    name = "dagsterWebserver.resources.requests.memory"
    value = "400Mi"
  }

  set {
    name = "dagsterDaemon.resources.limits.cpu"
    value = "200m"
  }

  set {
    name = "dagsterDaemon.resources.limits.memory"
    value = "400Mi"
  }

  set {
    name = "dagsterDaemon.resources.requests.cpu"
    value = "200m"
  }

  set {
    name = "dagsterDaemon.resources.requests.memory"
    value = "400Mi"
  }
}

These values can be set in "dagster-dags/values.yaml.j2" as follows:

...
deployments:
  - ...
    resources:
      limits:
        cpu: 250m
        memory: 400Mi
      requests:
        cpu: 250m
        memory: 400Mi
💥 Impact

This shouldn't impact users as long as we properly monitor resource usage. We don't currently do this; there is a ticket to pick it up, see #5.

☝️ Consequences

  • Saves money

What becomes harder:

  • We need to start monitoring resource usage and send out alerts in case resources are specified too tightly.
  • Because the code locations are split off from the webserver and daemon, the resource requests/limits have to be specified in two places (see RFC: Rethink strategy for filling in values when deploying code location. #6):
    • In dagster-infra, via Terraform
    • In dagster-dags, directly in values.yaml

📝 Checklist (after ADR has been accepted)

  • I've set the appropriate status label
  • I've linked relevant issues, PRs, and RFCs
@JasperHG90 JasperHG90 added enhancement New feature or request proposed Suggestion has been proposed ADR This issue is labeled as an ADR labels Feb 18, 2024
@JasperHG90 JasperHG90 added accepted Suggestion has been accepted and removed proposed Suggestion has been proposed labels Feb 18, 2024