
Commit

DRA: Fix the latest feedback
cyclinder committed Apr 19, 2024
1 parent b1501df commit 25160ff
Showing 9 changed files with 203 additions and 119 deletions.
charts/spiderpool/templates/configmap.yaml (4 changes: 1 addition, 3 deletions)
@@ -26,12 +26,10 @@ data:
 {{- else}}
 clusterSubnetDefaultFlexibleIPNumber: 0
 {{- end }}
-{{- if .Values.dra.enabled }}
 dra:
-enabled: true
+enabled: {{ .Values.dra.enabled }}
 cdiRootPath: {{ .Values.dra.cdiRootPath }}
 libraryPath: {{ .Values.dra.libraryPath }}
-{{- end }}
 {{- if .Values.multus.multusCNI.install }}
 ---
 kind: ConfigMap
docs/reference/crd-spiderclaimparameter.md (8 changes: 6 additions, 2 deletions)
@@ -13,7 +13,10 @@ metadata:
annotations:
dra.spidernet.io/cdi-version: 0.6.0
spec:
rdmaAcc: false
netResources:
spidernet.io/shared-rdma-device: 1
ippools:
- pool
```
## Spidercoordinators definition
@@ -30,4 +33,5 @@ This is the Spidercoordinators spec for users to configure.
| Field | Description | Schema | Validation | Values | Default |
|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|------------|-----------------------------------------------|------------------------------|
| rdmaAcc | TODO | bool | optional | true,false | false |
| netResources | Used for device-plugin declaration resources | map[string]string | optional | nil | nil |
ippools | A list of subnets used by the pod for scheduling purposes. | []string | optional | []string{} | empty |
docs/usage/dra.md (87 changes: 31 additions, 56 deletions)
@@ -5,8 +5,6 @@
Dynamic-Resource-Allocation (DRA) is a new feature introduced by Kubernetes that puts resource scheduling in the hands of third-party developers. It provides an API more akin to a storage persistent volume, instead of the countable model (e.g., "nvidia.com/gpu: 2") that device-plugin used to request access to resources, with the main benefit being a more flexible and dynamic allocation of hardware resources, resulting in improved resource utilization. The main benefit is more flexible and dynamic allocation of hardware resources, which improves resource utilization and enhances resource scheduling, enabling Pods to schedule the best nodes. DRA is currently available as an alpha feature in Kubernetes 1.26 (December 2022 release), driven by Nvidia and Intel.
Spiderpool currently integrates with the DRA framework, which allows for the following, but not limited to:

* Enabling RDMA hardware resources.
* Enables the use and scheduling of RDMA hardware resources, mounting key linux so(shared object) files and setting environment variables.
* Automatically scheduling Pods to appropriate nodes based on their subnets and NICs to prevent Pods from failing to start after scheduling to a node.
* Unify the resource declaration of multiple device-plugins.
* Continuously updated, see for details. [RoadMap](../develop/roadmap.md)
@@ -65,11 +63,8 @@ Spiderpool currently integrates with the DRA framework, which allows for the fol
```
helm repo add spiderpool https://spidernet-io.github.io/spiderpool
helm repo update spiderpool
helm install spiderpool spiderpool/spiderpool --namespace kube-system --set dra.enabled=true \
--set dra.librarypath="/usr/lib/libtest.so"
helm install spiderpool spiderpool/spiderpool --namespace kube-system --set dra.enabled=true
> Specify the path to the so file via dra.librarypath, which will be mounted to the Pod's container via CDI. Note that this so file needs to exist on the host.
4. Verify the installation
Check that the Spiderpool pod is running correctly, and check for the presence of the resourceclass resource:
@@ -131,48 +126,46 @@ Spiderpool currently integrates with the DRA framework, which allows for the fol
~# export NAME=demo
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderClaimParameter
metadata.
metadata:
name: ${NAME}
metadata: name: ${NAME}
rdmaAcc: true
---ApiVersion: resource.k8s.io/v1alpha2
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata: ${NAME}
metadata:
name: ${NAME}
spec: ${NAME}
resourceClassName: netresources.k8s.io/valpha2
spec:
spec:
resourceClassName: netresources.spidernet.io
parametersRef: apiGroup: spiderpool.spidernet.io
parametersRef:
apiGroup: spiderpool.spidernet.io
kind: SpiderClaimParameter
name: ${NAME}
---
apiVersion: apps/v1
kind: Deployment
name: ${NAME} --- apiVersion: apps/v1 kind: Deployment
name: ${NAME} --- apiVersion: apps/v1 kind: Deployment
spec: replicas: 2
metadata:
name: ${NAME}
spec:
replicas: 2
selector: ${NAME
matchLabels: app: ${NAME}
selector:
matchLabels:
app: ${NAME}
template: ${NAME}
metadata: ${NAME}
annotations: ${NAME} template: metadata.
template:
metadata:
annotations:
v1.multus-cni.io/default-network: kube-system/macvlan-conf
labels: app: ${NAME}
labels:
app: ${NAME}
spec: ${NAME}
name: ctr: ${NAME} labels: app: ${NAME}
spec:
containers:
- name: ctr
image: nginx
resources: ${NAME}
claims: name: ${NAME}
resources:
claims:
- name: ${NAME}
resourceClaims: name: ${NAME}
resourceClaims:
- name: ${NAME}
resourceClaims: name: ${NAME}
source:
resourceClaimTemplateName: ${NAME}
```
@@ -190,53 +183,35 @@ Spiderpool currently integrates with the DRA framework, which allows for the fol
```
~# kubectl get resourceclaim
NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
demo-745fb4c498-72g7g-demo-7d458 netresources.spidernet.io WaitForFirstConsumer allocated,reserved 20d
NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
demo-745fb4c498-72g7g-demo-7d458 netresources.spidernet.io WaitForFirstConsumer allocated,reserved 20d
~# cat /var/run/cdi/k8s.netresources.spidernet.io-claim_1e15705a-62fe-4694-8535-93a5f0ccf996.yaml
---
cdiVersion: 0.6.0
containerEdits: {}
devices: {}
- {} devices: {} containerEdits: {}
env: {} devices: containerEdits: {} devices: containerEdits: {}
devices:
- containerEdits:
env:
- DRA_CLAIM_UID=1e15705a-62fe-4694-8535-93a5f0ccf996
- LD_PRELOAD=libtest.so
mounts.
- containerPath: /usr/lib/libtest.so
hostPath: /usr/lib/libtest.so
options: /usr/lib/libtest.so
- /usr/lib/libtest.so options: ro
- nosuid
- nodev
- nodev
- containerPath: /usr/lib64/libtest.so
hostPath: /usr/lib/libtest.so
options: /usr/lib64/libtest.so
- nosuid
- nosuid
- nodev
- bind
name: 1e15705a-62fe-4694-8535-93a5f0ccf996
kind: k8s.netresources.spidernet.io/claim
```
This shows that the ResourceClaim has been created, and STATE shows allocated and reserverd, indicating that it has been used by the pod. And spiderpool has generated a CDI file for the ResourceClaim, which describes the files and environment variables to be mounted.
Check that the pod is Running and verify that the so file is mounted and the environment variable (LD_PRELOAD) is declared.
Check that the pod is Running and verify that the the environment variable (DRA_CLAIM_UID) is declared.
```
~# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-745fb4c498-72g7g 1/1 Running 0 20m
nginx-745fb4c498-s92qr 1/1 Running 0 20m
~# kubectl exec -it nginx-745fb4c498-72g7g sh
~# ls /usr/lib/libtest.so
/usr/lib/libtest.so
~# printenv LD_PRELOAD
libtest.so
~# printenv DRA_CLAIM_UID
1e15705a-62fe-4694-8535-93a5f0ccf996
```
You can see that the Pod's containers have correctly mounted the so files and environment variables, and your containers are ready to use the so files you have mounted.
You can see that the Pod's containers have correctly declared environment variables, It shows the dra is works.
## Welcome to try it out
docs/usage/dra_zh_CN.md (36 changes: 6 additions, 30 deletions)
@@ -6,7 +6,6 @@

目前 Spiderpool 已经集成 DRA 框架,基于该功能可实现以下但不限于的能力:

* 实现 RDMA 硬件资源的使用和调度,挂载关键 linux so(shared object) 文件及设置环境变量
* 可根据 Pod 使用的子网和网卡信息,自动调度到合适的节点,避免 Pod 调度到节点之后无法启动
* 统一多个 device-plugin 的资源声明方式
* 持续更新, 详见 [RoadMap](../develop/roadmap.md)
@@ -68,12 +67,9 @@
helm repo update spiderpool
helm install spiderpool spiderpool/spiderpool --namespace kube-system --set dra.enabled=true \
--set dra.librarypath="/usr/lib/libtest.so"
helm install spiderpool spiderpool/spiderpool --namespace kube-system --set dra.enabled=true
```
> 通过 dra.librarypath 指定 so 文件的路径,这将会通过 CDI 挂载到 Pod 的容器中. 注意此 so 文件需要存在于主机上。
4. 验证安装
检查 Spiderpool pod 是否正常 running, 并检查是否存在 resourceclass 资源:
@@ -137,8 +133,6 @@
kind: SpiderClaimParameter
metadata:
name: ${NAME}
spec:
rdmaAcc: true
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
@@ -182,7 +176,7 @@
> 创建一个 ResourceClaimTemplate, K8s 将会根据这个 ResourceClaimTemplate 为每个 Pod 创建自己独有的 Resourceclaim。该 Resourceclaim 的声明周期与该 Pod保持一致。
>
> SpiderClaimParameter 用于扩展 ResourceClaim 的配置参数,将会影响 ResourceClaim 的调度以及其 CDI 文件的生成。本例子中,设置 rdmaAcc 为 true,将会影响是否挂载配置的 so 文件。
> SpiderClaimParameter 用于扩展 ResourceClaim 的配置参数,将会影响 ResourceClaim 的调度以及其 CDI 文件的生成。
>
> Pod 的 container 通过在 Resources 中声明 claims 的使用,这将影响 containerd 所需要的资源。容器运行时会将该 claim 对应的 CDI 文件翻译为 OCI Spec配置,从而决定container的创建。
>
@@ -204,43 +198,25 @@
- containerEdits:
env:
- DRA_CLAIM_UID=1e15705a-62fe-4694-8535-93a5f0ccf996
- LD_PRELOAD=libtest.so
mounts:
- containerPath: /usr/lib/libtest.so
hostPath: /usr/lib/libtest.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib64/libtest.so
hostPath: /usr/lib/libtest.so
options:
- ro
- nosuid
- nodev
- bind
name: 1e15705a-62fe-4694-8535-93a5f0ccf996
kind: k8s.netresources.spidernet.io/claim
```
这里显示 ResourceClaim 已经被创建,并且 STATE 显示 allocated 和 reserverd,说明已经被 pod 使用。并且 spiderpool 已经为该 ResourceClaim 生成了对应的 CDI 文件。CDI 文件描述了需要挂载的文件和环境变量等。
检查 Pod 是否 Running,并且验证是否挂载 so 文件以及声明环境变量(LD_PRELOAD):
检查 Pod 是否 Running,并且验证 Pod 是否指定了环境变量 `DRA_CLAIM_UID`:
```
~# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-745fb4c498-72g7g 1/1 Running 0 20m
nginx-745fb4c498-s92qr 1/1 Running 0 20m
~# kubectl exec -it nginx-745fb4c498-72g7g sh
~# ls /usr/lib/libtest.so
/usr/lib/libtest.so
~# printenv LD_PRELOAD
libtest.so
~# printenv DRA_CLAIM_UID
1e15705a-62fe-4694-8535-93a5f0ccf996
```
可以看到 Pod 的容器已经正确挂载 so 文件和环境变量,您的容器已经可以正常使用你挂载的 so 文件
可以看到 Pod 的容器已经正确写入环境变量,说明 DRA 工作正常
## 欢迎试用
test/Makefile (11 changes: 7 additions, 4 deletions)
@@ -162,8 +162,8 @@ setup_kind:
 if [ "${E2E_SPIDERPOOL_ENABLE_DRA}" == "true" ]; then \
 sed -i '$$ a\ DynamicResourceAllocation: true' $${NEW_KIND_YAML} ; \
 printf 'containerdConfigPatches: \n# Enable CDI as described in https://tags.cncf.io/container-device-interface#containerd-configuration\n- |-\n [plugins."io.containerd.grpc.v1.cri"]\n enable_cdi = true\n ' >> $${NEW_KIND_YAML} ; \
-fi \
-fi
+fi ;\
+fi ; \
 $(QUIET) cat $(CLUSTER_DIR)/$(E2E_CLUSTER_NAME)/kind-config.yaml ; \
 echo "-------------" ; \
 KIND_OPTION="" ; \
@@ -468,9 +468,12 @@ e2e_test:
 export INSTALL_OVERLAY_CNI=$(INSTALL_OVERLAY_CNI) ; \
 export E2E_SPIDERSUBNET_ENABLED=$(E2E_SPIDERPOOL_ENABLE_SUBNET) ; \
 K8S_VERSION=` kubectl version -o json --kubeconfig $(E2E_KUBECONFIG) | jq '.serverVersion.gitVersion' ` ; \
-echo "k8s version: $${K8S_VERSION}" ; \
-if [ $$(echo -e "$${K8S_VERSION}\nv1.29.0" | sort -V | head -n1) == "v1.29.0" ] && [ "${E2E_SPIDERPOOL_ENABLE_DRA}" == "true" ]; then \
+if [ $$(echo -e "$${K8S_VERSION}\nv1.29.0" | sort -V | head -n1) = "v1.29.0" ] && [ "${E2E_SPIDERPOOL_ENABLE_DRA}" == "true" ]; then \
+echo "k8s version: $${K8S_VERSION}" ; \
 export E2E_SPIDERPOOL_ENABLE_DRA=true ; \
+else \
+echo "k8s version1: $${K8S_VERSION}" ; \
+export E2E_SPIDERPOOL_ENABLE_DRA=false ; \
 fi ; \
 rm -f $(E2E_LOG_FILE) || true ; \
 echo "=========== before test `date` ===========" >> $(E2E_LOG_FILE) ; \
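The second Makefile hunk keeps `E2E_SPIDERPOOL_ENABLE_DRA=true` only when the cluster reports at least v1.29.0. `sort -V | head -n1` yields the smaller of the two versions, so the test passes only when that minimum is v1.29.0 itself. The Go sketch below restates the gate with an explicit numeric comparison. The helper names are hypothetical. The sketch assumes `major.minor.patch` versions, ignores any `-`/`+` suffix, and trims the surrounding double quotes that `jq` prints when run without `-r`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns a gitVersion such as "v1.29.2" into [major, minor, patch].
// Surrounding JSON quotes and any suffix after '-' or '+' are ignored.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	s = strings.Trim(strings.TrimSpace(s), `"`)
	s = strings.TrimPrefix(s, "v")
	if i := strings.IndexAny(s, "-+"); i >= 0 {
		s = s[:i]
	}
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("unexpected version %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("bad version component %q: %w", p, err)
		}
		v[i] = n
	}
	return v, nil
}

// atLeast reports whether a >= b, comparing component by component.
func atLeast(a, b [3]int) bool {
	for i := 0; i < 3; i++ {
		if a[i] != b[i] {
			return a[i] > b[i]
		}
	}
	return true
}

// draEnabled mirrors the Makefile gate: DRA stays on only when the server
// is at least v1.29.0 and the E2E_SPIDERPOOL_ENABLE_DRA flag is "true".
func draEnabled(serverVersion, flag string) bool {
	v, err := parseVersion(serverVersion)
	if err != nil {
		return false
	}
	return flag == "true" && atLeast(v, [3]int{1, 29, 0})
}

func main() {
	fmt.Println(draEnabled(`"v1.29.2"`, "true")) // true
	fmt.Println(draEnabled("v1.28.7", "true"))   // false
	fmt.Println(draEnabled("v1.30.0", "false"))  // false
}
```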
test/doc/dra.md (3 changes: 2 additions, 1 deletion)
@@ -2,4 +2,5 @@
 
 | Case ID | Title | Priority | Smoke | Status | Other |
 | ------- | --------------------------------------------------------------------------------- | -------- | ----- | ------ | ----- |
-| Q00001 | Creating a Pod to verify DRA if works | p1 | true | done | |
+| Q00001 | Creating a Pod to verify DRA if works while set rdmaAcc to true | p1 | true | done | |
+| Q00002 | Creating a Pod to verify DRA if works while set rdmaAcc to false | p1 | true | done | |
test/e2e/common/resourceclaim.go (24 changes: 24 additions, 0 deletions)
@@ -10,9 +10,33 @@ import (
 
 resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
 metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+apitypes "k8s.io/apimachinery/pkg/types"
 "sigs.k8s.io/controller-runtime/pkg/client"
 )
 
+func ListResourceClaim(f *frame.Framework, opts ...client.ListOption) (*resourcev1alpha2.ResourceClaimList, error) {
+list := resourcev1alpha2.ResourceClaimList{}
+if err := f.ListResource(&list, opts...); err != nil {
+return nil, err
+}
+
+return &list, nil
+}
+
+func GetResourceClaim(f *frame.Framework, name, ns string) (*resourcev1alpha2.ResourceClaim, error) {
+if name == "" || f == nil {
+return nil, errors.New("wrong input")
+}
+
+v := apitypes.NamespacedName{Name: name, Namespace: ns}
+existing := &resourcev1alpha2.ResourceClaim{}
+e := f.GetResource(v, existing)
+if e != nil {
+return nil, e
+}
+return existing, nil
+}
+
 func CreateResourceClaimTemplate(f *frame.Framework, rct *resourcev1alpha2.ResourceClaimTemplate, opts ...client.CreateOption) error {
 if f == nil || rct == nil {
 return fmt.Errorf("invalid parameters")
test/e2e/common/spiderpool.go (14 changes: 14 additions, 0 deletions)
@@ -45,6 +45,20 @@ type SpiderConfigMap struct {
 ClusterSubnetDefaultFlexibleIPNum int `yaml:"clusterSubnetDefaultFlexibleIPNumber"`
 }
 
+func GetSpiderClaimParameter(f *frame.Framework, name, ns string) (*v1.SpiderClaimParameter, error) {
+if name == "" || f == nil {
+return nil, errors.New("wrong input")
+}
+
+v := apitypes.NamespacedName{Name: name, Namespace: ns}
+existing := &v1.SpiderClaimParameter{}
+e := f.GetResource(v, existing)
+if e != nil {
+return nil, e
+}
+return existing, nil
+}
+
 func CreateSpiderClaimParameter(f *frame.Framework, scp *v1.SpiderClaimParameter, opts ...client.CreateOption) error {
 if f == nil || scp == nil {
 return fmt.Errorf("invalid parameters")
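`GetResourceClaim` and `GetSpiderClaimParameter` share one shape. Each rejects an empty name or a nil framework, builds a namespaced key, and then fetches into a fresh object. The sketch below factors that pattern into a single generic function using only the standard library. `NamespacedName`, `ResourceGetter`, `claim` and the in-memory `fakeStore` are hypothetical stand-ins for `apitypes.NamespacedName`, the e2e framework's `GetResource`, and a real API object, so the example runs without a cluster.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespacedName is a hypothetical stand-in for
// k8s.io/apimachinery/pkg/types.NamespacedName.
type NamespacedName struct {
	Namespace string
	Name      string
}

// ResourceGetter is a hypothetical stand-in for the framework's GetResource.
type ResourceGetter interface {
	GetResource(key NamespacedName, obj any) error
}

// getNamespaced validates its input, then fetches ns/name into a new *T.
func getNamespaced[T any](g ResourceGetter, name, ns string) (*T, error) {
	if name == "" || g == nil {
		return nil, errors.New("wrong input")
	}
	obj := new(T)
	if err := g.GetResource(NamespacedName{Namespace: ns, Name: name}, obj); err != nil {
		return nil, err
	}
	return obj, nil
}

// claim is a toy object type used only by this example.
type claim struct {
	UID string
}

// fakeStore is an in-memory ResourceGetter that only knows about claims.
type fakeStore map[NamespacedName]claim

func (s fakeStore) GetResource(key NamespacedName, obj any) error {
	c, ok := s[key]
	if !ok {
		return fmt.Errorf("%s/%s not found", key.Namespace, key.Name)
	}
	p, ok := obj.(*claim)
	if !ok {
		return fmt.Errorf("unsupported object type %T", obj)
	}
	*p = c
	return nil
}

func main() {
	store := fakeStore{{Namespace: "default", Name: "demo"}: {UID: "1e15705a"}}

	c, err := getNamespaced[claim](store, "demo", "default")
	fmt.Println(c.UID, err) // 1e15705a <nil>

	_, err = getNamespaced[claim](store, "", "default")
	fmt.Println(err) // wrong input
}
```

One caveat of the interface-based version: a typed nil pointer stored in a `ResourceGetter` is not equal to `nil`, so the nil check only catches an untyped nil. The original helpers avoid this by taking a concrete `*frame.Framework`.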