Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Elasticquota failed to update due to velasticquota verification #2220

Open
qinfustu opened this issue Oct 8, 2024 · 11 comments
Open
Labels
area/koord-manager kind/bug Create a report to help us improve lifecycle/stale

Comments

@qinfustu
Copy link
Contributor

qinfustu commented Oct 8, 2024

What happened:
The update of elasticquota failed, and the error reported was denied the request: checkMinQuotaSum all children's MinQuota > current MinQuota. However, in fact all children's MinQuota < current MinQuota. The following is the debug information

error: elasticquotas.scheduling.sigs.k8s.io "cluster-root-quota" could not be patched: admission webhook "velasticquota.koordinator.sh" denied the request: checkMinQuotaSum all children's MinQuota > current MinQuota, current: cluster-root-quota, childMinSum: cpu: 23838, memory: 87751Gi, gpu: 724, newQuotaInfoMin: cpu: 500k, memory: 100000Gi, gpu: 20k

Locate the problem lies in the LessThanOrEqualCompletely method. There is a problem with the value returned by the Value() method of corev1.ResourceList, which is inconsistent with String().

childMinSum: cpu: 23838, memory: 87751Gi, gpu: 724, newQuotaInfoMin: cpu: 500k, memory: 300000Gi, gpu: 20k
resource.go:274] delta: cpu: -476162, memory: -212249Gi, gpu: -19276
resource.go:276] k: memory, value: 8894628506, valueS: -212249Gi
resource.go:276] k: nvidia.com/gpu, value: -19276, valueS: -19276
resource.go:276] k: cpu, value: -476162, valueS: -476162

The current temporary processing strategy is to change value.Value() > 0 to value.Value() > 0 && !strings.Contains(value.String(), "-"). Hope to find the root cause and give the best solution.

Environment:

  • App version: v1.5
  • Kubernetes version (use kubectl version): 1.20
  • Install details (e.g. helm install args):
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version:
    • Kernal version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
@qinfustu qinfustu added the kind/bug Create a report to help us improve label Oct 8, 2024
@saintube
Copy link
Member

saintube commented Oct 9, 2024

checkMinQuotaSum all children's MinQuota > current MinQuota, current: cluster-root-quota, childMinSum: cpu: 23838, memory: 87751Gi, gpu: 724, newQuotaInfoMin: cpu: 500k, memory: 100000Gi, gpu: 20k

@qinfustu It seems you have modified the quota webhook since the vanilla version does not print the following quantities. Perhaps we should find out the cause first about why the memory.Value() in the example shows an incorrect value, as it is supposed to be calculated by the childMinSum.Memory().Sub(newQuotaInfoMin.Memory()).

@qinfustu
Copy link
Contributor Author

qinfustu commented Oct 9, 2024

@saintube In order to troubleshoot this problem, new log information has been added. The guess is that there is a problem with the memory.Value() method in scenarios where the numerical value is large and negative. I haven't found the reason after looking at the source code. I need your help.

@saintube
Copy link
Member

saintube commented Oct 9, 2024

@saintube In order to troubleshoot this problem, new log information has been added. The guess is that there is a problem with the memory.Value() method in scenarios where the numerical value is large and negative. I haven't found the reason after looking at the source code. I need your help.

@qinfustu How about adding a UT to verify this case?

@qinfustu
Copy link
Contributor Author

qinfustu commented Oct 9, 2024

@qinfustu How about adding a UT to verify this case?

@saintube LessThanOrEqualCompletely's UT did not find the problem.

func TestLessThanOrEqualCompletely(t *testing.T) {
	a := corev1.ResourceList{
		corev1.ResourceCPU:    *resource.NewQuantity(23838, resource.DecimalSI),
		corev1.ResourceMemory: *resource.NewQuantity(87751*1024*1024*1024, resource.BinarySI),
		"nvidia.com/gpu":      *resource.NewQuantity(724, resource.DecimalSI),
	}
	b := corev1.ResourceList{
		corev1.ResourceCPU:    *resource.NewQuantity(500000, resource.DecimalSI),
		corev1.ResourceMemory: *resource.NewQuantity(300000*1024*1024*1024, resource.BinarySI),
		"nvidia.com/gpu":      *resource.NewQuantity(20000, resource.DecimalSI),
	}
	result := LessThanOrEqualCompletely(a, b)
	fmt.Println(result)
	assert.Equal(t, true, result)
}

But there is a problem with the real scene.

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  labels:
    quota.scheduling.koordinator.sh/allow-lent-resource: "false"
    quota.scheduling.koordinator.sh/is-parent: "true"
    quota.scheduling.koordinator.sh/parent: koordinator-root-quota
  name: cluster-root-quota-test
  namespace: koordinator-system
spec:
  max:
    cpu: "500000"
    memory: 300000Gi
    nvidia.com/gpu: "20000"
  min:
    cpu: "500000"
    memory: 300000Gi
    nvidia.com/gpu: "20000"
---

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  labels:
    quota.scheduling.koordinator.sh/allow-lent-resource: "false"
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: cluster-root-quota-test
  name: cluster-root-quota-child-test
  namespace: koordinator-system
spec:
  max:
    cpu: "23838"
    memory: 87751Gi
    nvidia.com/gpu: "724"
  min:
    cpu: "23838"
    memory: 87751Gi
    nvidia.com/gpu: "724"

Modify the memory in cluster-root-quota-test to 200000Gi to reproduce the problem.
Error message:
error: elasticquotas.scheduling.sigs.k8s.io "cluster-root-quota-test" could not be patched: admission webhook "velasticquota.koordinator.sh" denied the request: checkMinQuotaSum all children's MinQuota > current MinQuota, current: cluster-root-quota-test, childMinSum: cpu: 23838, memory: 87751Gi, gpu: 724, newQuotaInfoMin: cpu: 500k, memory: 200000Gi, gpu: 20k

debug information:
resource.go:273] delta: cpu: -476162, memory: -112249Gi, gpu: -19276
resource.go:276] k: cpu, value: -476162, valueS: -476162
resource.go:276] k: memory, value: 4579775443, valueS: -112249Gi

@saintube
Copy link
Member

saintube commented Oct 14, 2024

@qinfustu How about adding a UT to verify this case?

@saintube LessThanOrEqualCompletely's UT did not find the problem.

func TestLessThanOrEqualCompletely(t *testing.T) {
	a := corev1.ResourceList{
		corev1.ResourceCPU:    *resource.NewQuantity(23838, resource.DecimalSI),
		corev1.ResourceMemory: *resource.NewQuantity(87751*1024*1024*1024, resource.BinarySI),
		"nvidia.com/gpu":      *resource.NewQuantity(724, resource.DecimalSI),
	}
	b := corev1.ResourceList{
		corev1.ResourceCPU:    *resource.NewQuantity(500000, resource.DecimalSI),
		corev1.ResourceMemory: *resource.NewQuantity(300000*1024*1024*1024, resource.BinarySI),
		"nvidia.com/gpu":      *resource.NewQuantity(20000, resource.DecimalSI),
	}
	result := LessThanOrEqualCompletely(a, b)
	fmt.Println(result)
	assert.Equal(t, true, result)
}

But there is a problem with the real scene.

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  labels:
    quota.scheduling.koordinator.sh/allow-lent-resource: "false"
    quota.scheduling.koordinator.sh/is-parent: "true"
    quota.scheduling.koordinator.sh/parent: koordinator-root-quota
  name: cluster-root-quota-test
  namespace: koordinator-system
spec:
  max:
    cpu: "500000"
    memory: 300000Gi
    nvidia.com/gpu: "20000"
  min:
    cpu: "500000"
    memory: 300000Gi
    nvidia.com/gpu: "20000"
---

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  labels:
    quota.scheduling.koordinator.sh/allow-lent-resource: "false"
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: cluster-root-quota-test
  name: cluster-root-quota-child-test
  namespace: koordinator-system
spec:
  max:
    cpu: "23838"
    memory: 87751Gi
    nvidia.com/gpu: "724"
  min:
    cpu: "23838"
    memory: 87751Gi
    nvidia.com/gpu: "724"

Modify the memory in cluster-root-quota-test to 200000Gi to reproduce the problem. Error message: error: elasticquotas.scheduling.sigs.k8s.io "cluster-root-quota-test" could not be patched: admission webhook "velasticquota.koordinator.sh" denied the request: checkMinQuotaSum all children's MinQuota > current MinQuota, current: cluster-root-quota-test, childMinSum: cpu: 23838, memory: 87751Gi, gpu: 724, newQuotaInfoMin: cpu: 500k, memory: 200000Gi, gpu: 20k

debug information: resource.go:273] delta: cpu: -476162, memory: -112249Gi, gpu: -19276 resource.go:276] k: cpu, value: -476162, valueS: -476162 resource.go:276] k: memory, value: 4579775443, valueS: -112249Gi

@qinfustu In your example, the root cause is unclear. It seems the LessThanOrEqualCompletely works well (anyway you can retest it to ensure the result is stable) but the calculated quantities in the webhook go wrong. We need more clues. Maybe it is due to other reasons instead of the LessThanOrEqualCompletely, e.g. if the comparing quantities are mutated in another goroutine.

@saintube
Copy link
Member

@qinfustu For more clues, could you please log the details of the childMinSum, newQuotaInfo.CalculatedInfo.Min and the delta when before entering the LessThanOrEqualCompletely and inside the LessThanOrEqualCompletely? BTW, the Quantity does not guarantee the valueS consistent to the latest value.

@qinfustu
Copy link
Contributor Author

@saintube The same problem will occur after adding .DeepCopy(). You can try to reproduce it based on the quota information. code:

	klog.V(4).Infof("childMinSum: %+v, newQuotaInfoMin: %+v", printResource(childMinSum), printResource(newQuotaInfo.CalculateInfo.Min))
	if !util.LessThanOrEqualCompletely(childMinSum.DeepCopy(), newQuotaInfo.CalculateInfo.Min.DeepCopy()) {
		return fmt.Errorf("checkMinQuotaSum all children's MinQuota > current MinQuota, current: %v, childMinSum: %+v, newQuotaInfoMin: %+v", newQuotaInfo.Name, printResource(childMinSum), printResource(newQuotaInfo.CalculateInfo.Min))
	}
func LessThanOrEqualCompletely(a corev1.ResourceList, b corev1.ResourceList) bool {
	result := true
	delta := quotav1.Subtract(a.DeepCopy(), b.DeepCopy())
	klog.V(4).Infof("delta: %+v", printResource(delta))
	for k, value := range delta {
		klog.V(4).Infof("k: %s, value: %+v, valueS: %s", k, value.Value(), value.String())
		if value.Value() > 0 {
			// if value.Value() > 0 && !strings.Contains(value.String(), "-") {
			result = false
			break
		}
	}

	return result
}

debug info:

quota_topology_check.go:287] allChildQuotaSum: cpu: 23838, memory: 87751Gi, gpu: 724
quota_topology_check.go:252] childMinSum: cpu: 23838, memory: 87751Gi, gpu: 724, newQuotaInfoMin: cpu: 500k, memory: 200000Gi, gpu: 20k
esource.go:273] delta: cpu: -476162, memory: -112249Gi, gpu: -19276
resource.go:275] k: cpu, value: -476162, valueS: -476162
resource.go:275] k: memory, value: 4579775443, valueS: -112249Gi

@saintube
Copy link
Member

saintube commented Oct 14, 2024

@saintube The same problem will occur after adding .DeepCopy(). You can try to reproduce it based on the quota information. code:

	klog.V(4).Infof("childMinSum: %+v, newQuotaInfoMin: %+v", printResource(childMinSum), printResource(newQuotaInfo.CalculateInfo.Min))
	if !util.LessThanOrEqualCompletely(childMinSum.DeepCopy(), newQuotaInfo.CalculateInfo.Min.DeepCopy()) {
		return fmt.Errorf("checkMinQuotaSum all children's MinQuota > current MinQuota, current: %v, childMinSum: %+v, newQuotaInfoMin: %+v", newQuotaInfo.Name, printResource(childMinSum), printResource(newQuotaInfo.CalculateInfo.Min))
	}
func LessThanOrEqualCompletely(a corev1.ResourceList, b corev1.ResourceList) bool {
	result := true
	delta := quotav1.Subtract(a.DeepCopy(), b.DeepCopy())
	klog.V(4).Infof("delta: %+v", printResource(delta))
	for k, value := range delta {
		klog.V(4).Infof("k: %s, value: %+v, valueS: %s", k, value.Value(), value.String())
		if value.Value() > 0 {
			// if value.Value() > 0 && !strings.Contains(value.String(), "-") {
			result = false
			break
		}
	}

	return result
}

debug info:

quota_topology_check.go:287] allChildQuotaSum: cpu: 23838, memory: 87751Gi, gpu: 724
quota_topology_check.go:252] childMinSum: cpu: 23838, memory: 87751Gi, gpu: 724, newQuotaInfoMin: cpu: 500k, memory: 200000Gi, gpu: 20k
esource.go:273] delta: cpu: -476162, memory: -112249Gi, gpu: -19276
resource.go:275] k: cpu, value: -476162, valueS: -476162
resource.go:275] k: memory, value: 4579775443, valueS: -112249Gi

the Quantity does not guarantee the valueS consistent to the latest value.

@qinfustu Please try to print the actual value of the quantities instead of calling (Quantity).String().

@qinfustu
Copy link
Contributor Author

@saintube

quota_topology_check.go:287] allChildQuotaSum: cpu: 23838/23838, memory: 87751Gi/94221918797824, gpu: 724
quota_topology_check.go:252] childMinSum: cpu: 23838/23838, memory: 87751Gi/94221918797824, gpu: 724, newQuotaInfoMin: cpu: 500k/500000, memory: 200000Gi/214748364800000, gpu: 20k
resource.go:273] delta: cpu: -476162/-476162, memory: -112249Gi/4579775443, gpu: -19276
resource.go:275] k: cpu, value: -476162, valueS: -476162
resource.go:275] k: memory, value: 4579775443, valueS: -112249Gi

@saintube
Copy link
Member

Will be fixed by #2235.
/cc @qinfustu

Copy link

stale bot commented Jan 23, 2025

This issue has been automatically marked as stale because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed
    You can:
  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close
    Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/koord-manager kind/bug Create a report to help us improve lifecycle/stale
Projects
None yet
Development

No branches or pull requests

2 participants