Releases: kubernetes-sigs/kueue
Kueue v0.5.1
Changes since v0.5.0:
Bug or Regression
- Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources such as ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)
- Fixed fungibility policy `whenCanPreempt: Preempt`. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)
- Fix a bug where plain pods managed by kueue remained in a terminating condition forever. (#1342, @tenzen-y)
- Fix fungibility policy `Preempt` where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor, @KunWuLuan)
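The two fungibility fixes above can be pictured with a small sketch of flavor selection. This is an illustration only, not Kueue's scheduler code: the function name `pick_flavor`, the per-flavor modes (`"fit"`, `"preempt"`, `"no_fit"`) and the fallback behaviour under `TryNextFlavor` are assumptions made for the example.

```python
def pick_flavor(flavors, when_can_preempt="TryNextFlavor"):
    """Pick a flavor from an ordered list of (name, mode) pairs.

    mode is a hypothetical summary of what the flavor allows:
      "fit"     - the workload fits without preempting anything
      "preempt" - the workload fits only if other workloads are preempted
      "no_fit"  - the workload cannot use this flavor at all
    """
    first_preempt = None
    for name, mode in flavors:
        if mode == "fit":
            return name, "fit"
        if mode == "preempt":
            if when_can_preempt == "Preempt":
                # Admit in the same flavor for which preemptions are issued.
                return name, "preempt"
            if first_preempt is None:
                first_preempt = name
        # "no_fit": preemption is not possible here, so try the next flavor.
    if first_preempt is not None:
        # Assumed fallback: no flavor fits outright, so preempt in the
        # first flavor where preemption would help.
        return first_preempt, "preempt"
    return None, "no_fit"
```

With `Preempt`, a flavor that needs preemption wins over a later flavor that fits outright; a flavor where preemption is impossible is skipped in favour of the next one.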
Kueue v0.5.0
Changes since v0.4.0:
Highlights
- AdmissionChecks: a mechanism for internal or external components to influence whether a Workload can be admitted.
- Integration with cluster-autoscaler's ProvisioningRequest via AdmissionChecks.
- Information about pending workloads in a ClusterQueue status.
- Metrics for resource usage of ClusterQueues and LocalQueues.
- Policy to control whether to preempt or borrow before trying the next flavors.
- Partial admission graduated to Beta.
- Workload priority, independent from Pod priority.
- New integrations:
  - All Kubeflow training APIs
  - Single plain Pods
Changes by Kind
Feature
- A mechanism for AdmissionChecks to provide labels, annotations, tolerations and node selectors to the pod templates when starting a job (#1180, @mimowo)
- A reference standalone controller that can be used to support plain Pods using taints and tolerations, which can be used in Kubernetes versions that don't support scheduling gates. (#1111, @nstogner)
- Add Active condition to AdmissionChecks (#1193, @trasc)
- Add optional cluster queue resource quota and usage metrics. (#982, @trasc)
- Add support for AdmissionChecks, a mechanism for internal or external components to influence whether a Workload can be admitted. (#1045, @trasc)
- Add support for single plain Pods. (#1072, @achernevskii)
- Add support for workload Priority (#1081, @Gekko0114)
- Add tolerations to ResourceFlavor. Kueue injects these tolerations to the jobs that are assigned to the flavor when admitted. (#1248, @trasc)
- Added pprof endpoints for profiling (#978, @stuton)
- Allow the admission of multiple workloads within one scheduling cycle while borrowing. (#1039, @trasc)
- An option to synchronize batch/job.completions with parallelism in case of partial admission (#971, @trasc)
- Expose cluster queue information about pending workloads (#1069, @stuton)
- Expose probe configurations to helm chart (#986, @yyzxw)
- Graduate Partial admission to Beta. (#1221, @trasc)
- Integrate with Cluster Autoscaler's ProvisioningRequest via two stage admission (#1154, @trasc)
- Manage cluster queue active state based on admission checks life cycle. (#1079, @trasc)
- Metrics for usage and reservations in ClusterQueues and LocalQueues. (#1206, @trasc)
- Options to allow workloads to borrow quota or preempt other workloads before trying the next flavor in the list (#849, @KunWuLuan)
- Support kubeflow.org/mxjob (#1183, @tenzen-y)
- Support kubeflow.org/paddlejob (#1142, @tenzen-y)
- Support kubeflow.org/pytorchjob (#995, @tenzen-y)
- Support kubeflow.org/tfjob (#1068, @tenzen-y)
- Support kubeflow.org/xgboostjob (#1114, @tenzen-y)
- Workload objects have the label `kueue.x-k8s.io/job-uid` where the value matches the uid of the parent job, whether that's a Job, MPIJob, RayJob, JobSet (#1032, @achernevskii)
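As an illustration of how the `kueue.x-k8s.io/job-uid` label could be used, the sketch below filters a list of Workload objects by the uid of their parent job. It assumes the Workloads are plain dictionaries, for example parsed from API output; the helper name `workloads_for_job` is hypothetical and not part of Kueue.

```python
JOB_UID_LABEL = "kueue.x-k8s.io/job-uid"


def workloads_for_job(workloads, job_uid):
    """Return the Workload dicts whose job-uid label equals job_uid."""
    matches = []
    for workload in workloads:
        metadata = workload.get("metadata") or {}
        labels = metadata.get("labels") or {}
        if labels.get(JOB_UID_LABEL) == job_uid:
            matches.append(workload)
    return matches
```

Against a live cluster the same filter would normally be expressed as a label selector on the list request rather than done client-side.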
Bug or Regression
- Adjust resources (based on LimitRanges, PodOverhead and resource limits) on existing Workloads when a LocalQueue is created (#1197, @alculquicondor)
- Ensure the ClusterQueue status is updated as the number of pending workloads changes. (#1135, @mimowo)
- Fix resuming of RayJob after preemption. (#1156, @kerthcet)
- Fixed missing create verb for webhook (#1035, @stuton)
- Fixed scheduler to only allow one admission or preemption per cycle within a cohort that has ClusterQueues borrowing quota (#1023, @alculquicondor)
- Helm: Enable the JobSet integration by default (#1184, @tenzen-y)
- Improve job controller to be resilient to API failures during preemption (#1005, @alculquicondor)
- Prevent workloads in ClusterQueue with StrictFIFO from blocking higher priority workloads in other ClusterQueues in the same cohort that require preemption (#1024, @alculquicondor)
- Terminate Kueue when there is an internal failure during setup, so that it can be retried. (#1077, @alculquicondor)
Other (Cleanup or Flake)
- Add client-go library for AdmissionCheck (#1104, @tenzen-y)
- Add mergeStrategy:merge to all conditions of API objects (#1089, @alculquicondor)
- Update ray-operator to v0.6.0 (#1231, @lowang-bh)
Kueue v0.4.2
Changes since v0.4.1:
Bug or Regression
- Adjust resources (based on LimitRanges, PodOverhead and resource limits) on existing Workloads when a LocalQueue is created (#1197, @alculquicondor)
- Fix resuming of RayJob after preemption. (#1190, @kerthcet)
Kueue v0.4.1
Bug or Regression
- Fixed missing create verb for webhook (#1053, @stuton)
- Fixed scheduler to only allow one admission or preemption per cycle within a cohort that has ClusterQueues borrowing quota (#1029, @alculquicondor)
- Prevent workloads in ClusterQueue with StrictFIFO from blocking higher priority workloads in other ClusterQueues in the same cohort that require preemption (#1030, @alculquicondor)
Kueue v0.4.0
Changes since v0.3.0:
API Change
Feature
- Add client-go libraries. (#789, @tenzen-y)
- Add support for Kuberay's RayJobs. (#667, @trasc)
- Add support for dynamic reclaim in the JobSet integration. (#901, @trasc)
- Add support for partial workload admission (#771, @trasc)
- Add the support for dynamic resources reclaim. (#756, @trasc)
- Allow scheduler to admit more jobs when the head job has not reached the PodReady=true status. (#708, @KunWuLuan)
- Allow specifying the manager pod and container security context instead of hardcoded values (#878, @bh-tt)
- Feature gates for alpha/experimental features are introduced to the Kueue project. (#788, @kerthcet)
- Ignore integrations whose CRD isn't installed; otherwise, all integrations are enabled by default (#883, @stuton)
- Integrate JobSet into kueue (#762, @mcariatm)
Bug or Regression
- Add permission to update frameworkjob status. (#797, @tenzen-y)
- Fix a bug where update events for ClusterQueues were created endlessly. (#907, @tenzen-y)
- Fix a bug where a child batch/job of an unmanaged parent (doesn't have queue name) was being suspended. (#835, @tenzen-y)
- Fix panic in cluster queue if resources and coveredResources do not have the same length. (#787, @kannon92)
- Fix: Enforce borrowed=0 if ClusterQueue doesn't belong to a cohort. (#759, @tenzen-y)
- Fix: Potential over-admission within cohort when borrowing. (#805, @trasc)
- Fixed preemption to prefer preempting workloads that were more recently admitted. (#843, @stuton)
- Fixed a bug where the suspend=true added to the job/mpijob by the default webhook did not take effect. (#758, @fjding)
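The preemption fix above (#843) states that more recently admitted workloads are preferred as preemption targets. A minimal sketch of that ordering follows; the dictionary shape and the `admitted_at` field are assumptions for the example, and any other criteria the real scheduler applies are left out.

```python
from datetime import datetime


def order_preemption_candidates(candidates):
    """Sort candidates so the most recently admitted workload comes first.

    Each candidate is a dict with a hypothetical "admitted_at" datetime.
    """
    return sorted(candidates, key=lambda c: c["admitted_at"], reverse=True)


candidates = [
    {"name": "old", "admitted_at": datetime(2023, 6, 1, 8, 0)},
    {"name": "new", "admitted_at": datetime(2023, 6, 1, 12, 0)},
]
ordered = order_preemption_candidates(candidates)
```

Preempting the newest workload first tends to discard the least amount of completed work.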
Other (Cleanup or Flake)
Kueue v0.3.2
Changes since v0.3.1:
Bug or Regression
- Add permission to update frameworkjob status. (#798, @tenzen-y)
- Fix a bug where a child batch/job of an unmanaged parent (doesn't have queue name) was being suspended. (#839, @tenzen-y)
- Fix panic in cluster queue if resources and coveredResources do not have the same length. (#799, @kannon92)
- Fix: Potential over-admission within cohort when borrowing. (#822, @trasc)
- Fixed preemption to prefer preempting workloads that were more recently admitted. (#845, @stuton)
Kueue v0.3.1
Changes since v0.3.0:
Bug fixes
- Fix a bug where the validation webhook didn't validate the queue name set as a label when creating an MPIJob. #711
- Fix a bug that updated the queue name in workloads to an empty value when using framework jobs that use batch/job internally, such as MPIJob. #713
- Fix a bug in which borrowed values are set to a non-zero value even though the ClusterQueue doesn't belong to a cohort. #761
- Fixed adding suspend=true to job/mpijob by the default webhook. #765
Kueue v0.3.0
Changes since v0.2.1:
Features
- Support for kubeflow's MPIJob (v2beta1)
- Upgrade the `config.kueue.x-k8s.io` API version from `v1alpha1` to `v1beta1`. `v1alpha1` is no longer supported. `v1beta1` includes the following changes:
  - Add `namespace` to propagate the namespace where kueue is deployed to the webhook certificate.
  - Add `internalCertManagement` with fields `enable`, `webhookServiceName` and `webhookSecretName`.
  - Remove `enableInternalCertManagement`. Use `internalCertManagement.enable` instead.
- Upgrade the `kueue.x-k8s.io` API version from `v1alpha2` to `v1beta1`. `v1alpha2` is no longer supported. `v1beta1` includes the following changes:
  - `ClusterQueue`:
    - Immutability of `spec.queueingStrategy`.
    - Refactor `quota.min` and `quota.max` into `nominalQuota` and `borrowingLimit`.
    - Swap hierarchy between `resources` and `flavors`.
    - Group flavors and resources into `spec.resourceGroups` to make co-dependent resources explicit.
    - Move `admission` from `spec` to `status`.
    - Add `conditions` field to `status`.
  - `LocalQueue`:
    - Add `admitted` field in `status`.
    - Add `conditions` field to `status`.
  - `Workload`:
    - Add `metadata` to `podSet` templates.
    - Move `admission` into `status`.
  - `ResourceFlavor`:
    - Introduce `spec` to hold all fields.
    - Rename `labels` to `nodeLabels`.
    - Rename `taints` to `nodeTaints`.
- Reduce API calls by setting `.status.admission` and updating the `Admitted` condition in the same API call.
- Obtain queue names from label `kueue.x-k8s.io/queue-name`. The annotation with the same name is still supported, but it's now deprecated.
- Multiplatform support for `linux/amd64` and `linux/arm64`.
- Validating webhook for `batch/v1.Job` validates kueue-specific labels and annotations.
- Sequential admission of jobs https://kueue.sigs.k8s.io/docs/tasks/setup_sequential_admission/
- Preemption within ClusterQueue and cohort https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#preemption
- Support for LimitRanges when calculating jobs usage.
- Library for integrating job-like CRDs (controller and webhooks) https://sigs.k8s.io/kueue/pkg/controller/jobframework
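The queue-name change in the list above (label preferred, annotation of the same name still accepted but deprecated) can be sketched as a small lookup. This is only an illustration of the described precedence: the helper `queue_name` is hypothetical, and returning an empty string for an unmanaged job is an assumption.

```python
QUEUE_NAME_KEY = "kueue.x-k8s.io/queue-name"


def queue_name(labels, annotations):
    """Resolve a job's queue name from its metadata.

    The label takes precedence; the deprecated annotation with the same
    key is used only when the label is absent or empty.
    """
    from_label = (labels or {}).get(QUEUE_NAME_KEY)
    if from_label:
        return from_label
    return (annotations or {}).get(QUEUE_NAME_KEY) or ""
```

Checking the label first means jobs can be migrated gradually: adding the label overrides any annotation that is still present.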
Production Readiness
- E2E tests for kubernetes 1.24, 1.25 and 1.26 on Kind
- Improve readability and code location in logging #14
- Optimized configuration for small size clusters with higher API QPS and number of workers.
- Reproducible load tests https://sigs.k8s.io/kueue/test/performance
- Documentation website https://kueue.sigs.k8s.io/docs/
Bug fixes
- Fix job controller ClusterRole for clusters that enable OwnerReferencesPermissionEnforcement admission control validation #392
- Fix race condition when admission attempt and requeuing happen at the same time #427
- Atomically release quota and requeue previously inadmissible workloads #512
- Fix support for leader election #580
- Fix support for RuntimeClass when calculating jobs usage #565
Acknowledgments
Thanks to our contributors in this release, in no particular order:
@tenzen-y @mcariatm @moficodes @mwielgus @trasc @mimowo @alculquicondor @fjding @kerthcet @ArangoGutierrez @Fish-pro @rbarberop @cortespao @rptaylor @kannon92 @noryev @oginskis @charlieyu1996 @kincl @ahg-g
Kueue v0.2.1
Changes since v0.1.0:
Features
- Upgrade the API version from v1alpha1 to v1alpha2. v1alpha1 is no longer supported. v1alpha2 includes the following changes:
  - Rename Queue to LocalQueue.
  - Remove ResourceFlavor.labels. Use ResourceFlavor.metadata.labels instead.
- Add webhooks to validate and to add defaults to all kueue APIs.
- Add internal cert manager to serve webhooks with TLS.
- Use finalizers to prevent ClusterQueues and ResourceFlavors in use from being deleted prematurely.
- Support codependent resources by assigning the same flavor to codependent resources in a pod set.
- Support pod overhead in Workload pod sets.
- Set requests to limits if requests are not set in a Workload pod set, matching internal defaulting for k8s Pods.
- Add prometheus metrics to monitor health of the system and the status of ClusterQueues.
- Use Server Side Apply for Workload admission to reduce API conflicts.
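The request defaulting mentioned in the list above (requests set to limits when requests are not set) can be sketched as follows. This is a simplified illustration using plain dictionaries of resource name to quantity string; the function name `default_requests` is hypothetical, and quantities are copied as-is rather than parsed.

```python
def default_requests(requests, limits):
    """Fill in missing requests from limits, per resource.

    A request that is already set is kept; a resource that only has a
    limit gets a request equal to that limit.
    """
    merged = dict(requests or {})
    for resource, limit in (limits or {}).items():
        merged.setdefault(resource, limit)
    return merged


container = {"requests": {"cpu": "500m"}, "limits": {"cpu": "1", "memory": "1Gi"}}
effective = default_requests(container["requests"], container["limits"])
```

Here the explicit `cpu` request is kept, while `memory` picks up its limit, so the quota accounted for the pod set covers both resources.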
Bug fixes
- Fix bug that caused Workloads that don't match the ClusterQueue's namespaceSelector to block other Workloads in StrictFIFO ClusterQueues.
- Fix the number of pending workloads in BestEffortFIFO ClusterQueues status.
- Fix a bug in BestEffortFIFO ClusterQueues where a workload might not be retried after a transient error.
- Fix requeuing an out-of-date workload after failing to admit it.
- Fix a bug in BestEffortFIFO ClusterQueues where inadmissible workloads were not removed from the ClusterQueue when removing the corresponding Queue.
Thanks to all our contributors!
In no particular order: @ahg-g @alculquicondor @ArangoGutierrez @cmssczy @denkensk @kerthcet @knight42 @cortespao @shuheiktgw @thisisprasad
Full Changelog: v0.1.0...v0.2.1
Kueue v0.2.0
Do not use. The published container image doesn't match the release.