Partitioner renders malformed device-plugin ConfigMap value which breaks GFD, causing Pods to be Pending forever #41

Open
zerodayyy opened this issue Jun 29, 2023 · 0 comments

In internal/partitioning/mps/partitioner.go, the ToPluginConfig function uses the Config struct from the github.com/NVIDIA/k8s-device-plugin/api/config/v1 package. This struct contains nested structs by value rather than by pointer, so the YAML/JSON Marshal functions render "empty" nested structs as an empty map/object instead of omitting them. This results in the following config value (note timeSlicing):

flags:
  failOnInitError: null
  gdsEnabled: null
  migStrategy: none
  mofedEnabled: null
resources:
  gpus: null
sharing:
  mps:
    failRequestsGreaterThanOne: true
    resources:
    - devices:
      - "0"
      memoryGB: 10
      name: nvidia.com/gpu
      rename: gpu-10gb
      replicas: 2
  timeSlicing: {}
version: v1

This behavior is explained here.
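
To make the difference concrete, here is a minimal sketch using the standard encoding/json package. The types TimeSlicing, SharingByValue and SharingByPointer are hypothetical stand-ins, not the real NVIDIA structs, and the sketch assumes the fields carry an omitempty tag. With encoding/json, omitempty never omits a struct value, while a nil struct pointer is omitted:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TimeSlicing is a hypothetical stand-in for the nested timeSlicing section.
type TimeSlicing struct {
	Resources []string `json:"resources,omitempty"`
}

// SharingByValue nests the struct by value, as the issue describes.
type SharingByValue struct {
	TimeSlicing TimeSlicing `json:"timeSlicing,omitempty"`
}

// SharingByPointer nests the struct by pointer, as the issue proposes.
type SharingByPointer struct {
	TimeSlicing *TimeSlicing `json:"timeSlicing,omitempty"`
}

// render marshals v to JSON and returns it as a string.
func render(v interface{}) string {
	out, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	// The zero-valued nested struct is still emitted as an empty object.
	fmt.Println(render(SharingByValue{})) // {"timeSlicing":{}}
	// The nil pointer is dropped by omitempty.
	fmt.Println(render(SharingByPointer{})) // {}
}
```

Whether the YAML output behaves the same way depends on which YAML library does the marshalling; libraries that convert through JSON inherit the encoding/json behavior shown here.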

There is a custom Unmarshal function that runs whenever the sharing.timeSlicing field exists in the raw config, and it returns an error when that field is empty, which is exactly the case in the config example above. See code here:

	resources, exists := ts["resources"]
	if !exists {
		return fmt.Errorf("no resources specified")
	}

GFD uses this package to read the device-plugin config created by the partitioner. When a new partitioning config is applied, the empty timeSlicing field causes the above code to crash the GFD container with a "no resources specified" error. The crash persists until timeSlicing: {} is removed from the ConfigMap.
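
The failure and one possible tolerant alternative can be sketched as follows. This is not the actual k8s-device-plugin code: ReplicatedResources, parseStrict and parseLenient are hypothetical names, and the sketch uses encoding/json only. parseStrict reproduces the quoted check, while parseLenient treats an empty object as "section not configured":

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReplicatedResources is a hypothetical stand-in for the timeSlicing section.
type ReplicatedResources struct {
	Resources []json.RawMessage `json:"resources,omitempty"`
}

// parseStrict mirrors the quoted check: an object without "resources" is an error.
func parseStrict(raw []byte) (*ReplicatedResources, error) {
	var ts map[string]json.RawMessage
	if err := json.Unmarshal(raw, &ts); err != nil {
		return nil, err
	}
	resources, exists := ts["resources"]
	if !exists {
		return nil, fmt.Errorf("no resources specified")
	}
	out := &ReplicatedResources{}
	if err := json.Unmarshal(resources, &out.Resources); err != nil {
		return nil, err
	}
	return out, nil
}

// parseLenient returns (nil, nil) for an empty object or null, meaning the
// section is not configured, and otherwise defers to parseStrict.
func parseLenient(raw []byte) (*ReplicatedResources, error) {
	var ts map[string]json.RawMessage
	if err := json.Unmarshal(raw, &ts); err != nil {
		return nil, err
	}
	if len(ts) == 0 {
		return nil, nil
	}
	return parseStrict(raw)
}

func main() {
	// The strict parser rejects the empty object rendered by the partitioner.
	_, err := parseStrict([]byte(`{}`))
	fmt.Println("strict error:", err)

	// The lenient parser accepts it and reports no configuration.
	cfg, err := parseLenient([]byte(`{}`))
	fmt.Println("lenient result:", cfg, err)
}
```

Relaxing the reader only hides the symptom; omitting the empty section when the config is rendered addresses the cause, so the two changes are complementary.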

I think it makes sense to fix this issue in nebuly-ai/k8s-device-plugin by removing the check and forking GFD to use that fork, and also to change the structs to use pointers for nested structs so that empty sections are omitted and the rendered YAML is valid.
