Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 0 additions & 2 deletions manifests/v2/base/manager/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,2 @@
resources:
- manager.yaml
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system
Comment on lines -3 to -4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this is going to work since we don't copy JobSet manifests in the Kubeflow Training ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the JobSet manifest is referred from a remote URL, so the namespace configuration in the base or overlays cannot override the JobSet manifests, I think.

From the changlog, it seems that the latest release hasn’t included the changes in kubernetes-sigs/jobset/#719. We may need to manually modify the namespace setting for JobSet. I'm not sure if I need to deal with it in this pr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I tried it with:

$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.2/manifests.yaml -n kubeflow-test

namespace/jobset-system serverside-applied
customresourcedefinition.apiextensions.k8s.io/jobsets.jobset.x-k8s.io serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-manager-role serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-metrics-reader serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-proxy-role serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-manager-rolebinding serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jobset-proxy-rolebinding serverside-applied
mutatingwebhookconfiguration.admissionregistration.k8s.io/jobset-mutating-webhook-configuration serverside-applied
validatingwebhookconfiguration.admissionregistration.k8s.io/jobset-validating-webhook-configuration serverside-applied
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.
the namespace from the provided object "jobset-system" does not match the namespace "kubeflow-test". You must pass '--namespace=jobset-system' to perform this operation.

It does not work. Maybe we should modify the jobset manifests to make namespace configurable.

And just a question, do we really need to install all components in the same namespace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recently @kannon92 made a PR to support other namespaces in JobSet: kubernetes-sigs/jobset#719.
Probably, we should use kubeflow namespace to install the JobSet operator within Kubeflow Platform if @kubeflow/wg-training-leads or @kubeflow/wg-manifests-leads don't have any other suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this PR seems not to be included in the current release of jobset:

kubernetes-sigs/jobset@release-6.0...v0.7.2

In the manifest, the namespace is hardcoded into jobset-system:

https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.2/manifests.yaml

Copy link
Member

@Electronic-Waste Electronic-Waste Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. For me, I just don't know how to overlay namespace items outside of the .metadata field if we only specify namespace field in kustomization.yaml. Here is the example of which namespace field lies outside of the .metadata field:

截屏2025-01-14 10 28 31

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the var references, similar to this example. We will tell Kustomize how to override the namespace value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. That's what I want to do with the JobSet manifests!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Doris-xm So now we can wait for the next release of JobSet. And make some appropriate changes to the manifest after that:)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you all for the insightful discussion! I’ve learned a lot from this. Much appreciated!

2 changes: 0 additions & 2 deletions manifests/v2/base/rbac/kustomization.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,3 @@ resources:
- role.yaml
- role_binding.yaml
- service_account.yaml
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system
2 changes: 0 additions & 2 deletions manifests/v2/base/webhook/kustomization.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,3 @@ patches:
kind: ValidatingWebhookConfiguration
configurations:
- kustomizeconfig.yaml
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-training-admin-v2
labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin-v2: "true"
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-training-admin-v2: "true"
rules: []

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-training-edit-v2
labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit-v2: "true"
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-training-admin-v2: "true"
rules:
- apiGroups:
- kubeflow.org
resources:
- clustertrainingruntimes
- trainingruntimes
- trainjobs
verbs:
- create
- delete
- get
- list
- patch
- update
- watch
- apiGroups:
- kubeflow.org
resources:
- trainjobs/status
verbs:
- get
- apiGroups:
- ""
resources:
- persistentvolumeclaims
verbs:
- create
- delete
- get
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- get
- list
- watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-training-view-v2
labels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view-v2: "true"
rules:
- apiGroups:
- kubeflow.org
resources:
- clustertrainingruntimes
- trainingruntimes
- trainjobs
verbs:
- get
- list
- watch
- apiGroups:
- kubeflow.org
resources:
- trainjobs/status
verbs:
- get
20 changes: 20 additions & 0 deletions manifests/v2/overlays/kubeflow-platform/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask a question? Since we have defined the namespace for manager and webhook:

# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system

Does kubeflow namespace take any effect here? WDYT👀 @kubeflow/wg-training-leads @Doris-xm

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reviewing. It seems that namespace: kubeflow in the overlay won't override namespace: kubeflow-system defined for manager and webhook. I'm wondering if we should unify the namespace as kubeflow when installing them within Kubeflow Platform, aligning with other modules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should make our namespace configurable and use kubeflow namespace for Kubeflow Platform installation for now.
By default the Kubeflow Training V2 will use the kubeflow-system namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Doris-xm FYI, I think what @andreyvelich means is that we should address these TODO:

# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system

# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system

And specify their namespace in overlays and kubeflow-platform dir. Because we need to make namesapce configurable and install these components in different namspace:

  1. Standalone training-operator: Install all components in kubeflow-system namespace
  2. Kubeflow Platform: Install all components in kubeflow namespace

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your detailed explanation! I've moved the namespace configuration from base to overlays. Appreciate your time!

resources:
- ../../base/crds
- ../../base/manager
- ../../base/rbac
- ../../base/webhook
- ../../base/runtimes/pre-training
- kubeflow-training-roles.yaml
# TODO (andreyvelich): JobSet should support kubeflow namespace.
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml
images:
- name: kubeflow/training-operator-v2
newTag: latest
secretGenerator:
- name: training-operator-v2-webhook-cert
namespace: kubeflow
options:
disableNameSuffixHash: true
1 change: 1 addition & 0 deletions manifests/v2/overlays/only-manager/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow-system
resources:
- namespace.yaml
- ../../base/crds
Expand Down
1 change: 1 addition & 0 deletions manifests/v2/overlays/standalone/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow-system
resources:
- namespace.yaml
- ../../base/crds
Expand Down
Loading