At the moment, we can't use the DPDK mount type.

Using the DPDK mount type requires specifying the NIC address in the mount options. (source)
In our experience deploying WEKA+Slurm with customers, the NIC address naming convention varies with the OS and with the GCE instance type, which can make it challenging to set the NIC address in the cluster blueprint (`network_storage.mount_options`).
We need a way to reference NIC addresses as a variable in the blueprint that is resolved during the startup process of slurm-gcp instances (controller, compute, and login).
As an example, suppose the network storage block could reference `@nic1@` in its mount options, where `@nic1@` is interpreted during setup as the first NIC available for mounting WEKA. Note that we would need to enforce protection of the first NIC for `slurmd` communications.
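One possible shape for this resolution step, sketched as a shell helper a startup script could call. The `@nicN@` placeholder syntax and the function name are assumptions, not an existing slurm-gcp feature; the first NIC passed in is skipped because it must stay reserved for `slurmd` traffic:

```shell
# Hypothetical sketch: expand @nicN@ placeholders in a mount-option string
# at instance startup. Takes the option string followed by the host's NIC
# names in order (e.g. from `ls /sys/class/net`, minus loopback). The first
# NIC is dropped: it is reserved for slurmd communications, so @nic1@ maps
# to the second interface, @nic2@ to the third, and so on.
resolve_nic_placeholders() {
    opts="$1"
    shift            # drop the option string from the positional args
    shift            # drop the first NIC: reserved for slurmd
    i=0
    for nic in "$@"; do
        i=$((i + 1))
        opts=$(printf '%s' "$opts" | sed "s/@nic$i@/$nic/g")
    done
    printf '%s\n' "$opts"
}

# Example: on a host whose NICs are ens4 (slurmd) and ens5:
resolve_nic_placeholders "net=@nic1@,num_cores=1" ens4 ens5
# prints: net=ens5,num_cores=1
```

The actual interface names (`ens4`, `ens5`, `eth1`, …) are exactly what varies by OS and instance type, which is why the substitution has to happen on the instance at boot rather than in the blueprint.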
In addition, we need to reserve cores for the WEKA agent when using DPDK mounts. Through the mount options, the WEKA agent can be pinned to specific core IDs using the `-o core=XX` option (where `XX` is a comma-separated list of core IDs) or confined to `N` cores with `-o num_cores=N`.
Because Slurm manages the `cpuset` cgroup, any cores reserved for the WEKA agent must also be set aside in the Slurm node configuration so jobs are never scheduled onto them. Two cases follow:
If the `-o core=XX` option is used, we need to set the node configuration for each compute nodeset to set aside those same core IDs. For a heterogeneous Slurm cluster, users would have to be advised to set `network_storage` in the definition of the compute nodeset rather than through `controller.network_storage`, since the specific core IDs they want to pin to may vary across nodesets.
If the `-o num_cores=N` option is used, we need to set the node configuration for each compute nodeset to set aside that number of cores for WEKA.
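Both cases map naturally onto Slurm's core-specialization node parameters, which remove the reserved cores from what `slurmd` offers to jobs. A sketch in `slurm.conf` terms; the node names and values are illustrative assumptions, not a tested configuration:

```
# Illustrative slurm.conf fragments (node names and counts are assumptions).

# For -o core=XX mounts: reserve those exact CPU IDs on the nodeset.
# Note CpuSpecList takes logical CPU IDs, which may not line up one-to-one
# with the core IDs WEKA expects on SMT machines.
NodeName=nodeset-a-[0-3] CpuSpecList=0,1

# For -o num_cores=N mounts: reserve the same count of cores per node.
NodeName=nodeset-b-[0-3] CoreSpecCount=2
```

Enforcement of the reservation depends on the configured task plugin (e.g. `task/cgroup`), so that would need to be verified as part of any fix.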