Support DPDK mount type #4

Open · fluidnumerics-joe opened this issue Sep 13, 2024 · 0 comments
Labels: enhancement (New feature or request)

@fluidnumerics-joe (Member) commented:

At the moment, we can't use the DPDK mount type.

Using the DPDK mount type requires specifying the NIC address in the mount options. (source)

In our experience deploying WEKA+Slurm with customers, the NIC address naming convention varies with the OS and across GCE instances, which can make it challenging to set the NIC address in the cluster blueprint (network_storage.mount_options).

We need a way to reference NIC addresses as a variable in the blueprint that is resolved during the startup process of slurm-gcp instances (controller, compute, and login).

As an example, suppose we could have the network storage block:

    remote_mount          = "/default"
    local_mount           = var.local_mount
    fs_type               = "wekafs"
    mount_options         = "net=@nic1@,num_cores=1,remove_after_secs=900"
    server_ip             = local.weka_ips[0]
    client_install_runner = ""
    mount_runner          = ""

where @nic1@ is interpreted as a variable during instance setup that references the first available NIC for mounting WEKA. Note that we would need to protect the first NIC, which is reserved for slurmd communications.
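As a rough illustration only, the Python sketch below shows one way a startup script could resolve @nic1@: skip the NIC holding the default route (left to slurmd) and substitute the next NIC into the mount options. The helper names are hypothetical, nothing here exists in slurm-gcp today, and whether WEKA's net= option expects the interface name or its PCI address depends on the deployment; this sketch substitutes the interface name.

    #!/usr/bin/env python3
    """Hypothetical sketch: resolve @nic1@ in mount_options at instance startup.

    Assumption (not part of slurm-gcp today): the blueprint passes the literal
    string "@nic1@" through, and this runs on the node before mounting WEKA.
    """
    import os


    def default_route_iface():
        """Return the interface holding the default route (reserved for slurmd)."""
        with open("/proc/net/route") as f:
            next(f)  # skip header line
            for line in f:
                fields = line.split()
                if fields[1] == "00000000":  # destination 0.0.0.0
                    return fields[0]
        return None


    def resolve_nic_vars(mount_options):
        """Replace @nic1@ with the name of the first non-primary NIC."""
        primary = default_route_iface()
        candidates = sorted(
            iface for iface in os.listdir("/sys/class/net")
            if iface not in ("lo", primary)
            and os.path.exists(f"/sys/class/net/{iface}/device")
        )
        if not candidates:
            raise RuntimeError("no secondary NIC available for a DPDK mount")
        return mount_options.replace("@nic1@", candidates[0])


    if __name__ == "__main__":
        print(resolve_nic_vars("net=@nic1@,num_cores=1,remove_after_secs=900"))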

In addition, we need to reserve cores for the WEKA agent when using DPDK mounts. From the mount options, the WEKA agent can be pinned to specific core IDs using the -o core=XX option (where XX is a comma-separated list of core IDs) or confined to N cores with -o num_cores=N.

Because Slurm manages the cpuset cgroup:

  • If the -o core=XX option is used, we need to set the node configuration for each compute nodeset to set aside the same core IDs. In a heterogeneous Slurm cluster, users would have to be advised to set network_storage in the definition of each compute nodeset, rather than through controller.network_storage, since the specific core IDs they want to pin to would vary across nodesets.
  • If the -o num_cores=N option is used, we need to set the node configuration for each compute nodeset to set aside that number of cores for WEKA (see the sketch after this list).
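To keep the two configurations consistent, one option would be to derive both the WEKA core pinning and the Slurm reservation from a single setting per nodeset. The sketch below is only illustrative: the function name is hypothetical, it assumes the reserved cores are taken from the top of the core ID range with one thread per core, and it uses the standard slurm.conf node parameter CpuSpecList to withhold those IDs from jobs (for the -o num_cores=N case, CoreSpecCount=N would be the analogue). Mapping core IDs to the CPU IDs Slurm expects would need adjustment on machine types with hyperthreading.

    """Hypothetical sketch: keep WEKA's '-o core=...' pinning and a nodeset's
    Slurm core reservation in sync for one machine type."""


    def weka_core_config(total_cores: int, weka_cores: int) -> dict:
        """Return matching WEKA mount options and Slurm node settings for a nodeset."""
        if weka_cores >= total_cores:
            raise ValueError("must leave at least one core for Slurm jobs")
        # Reserve cores from the top of the ID range, e.g. [30, 31] on 32 cores.
        reserved = list(range(total_cores - weka_cores, total_cores))
        core_list = ",".join(str(c) for c in reserved)
        return {
            # would be set in network_storage.mount_options for this nodeset
            "mount_options": f"net=@nic1@,core={core_list},remove_after_secs=900",
            # would be merged into the nodeset's node configuration so the cpuset
            # cgroup never hands these CPU IDs to jobs
            "node_conf": {"CpuSpecList": core_list},
        }


    # Example: a 32-core nodeset that dedicates 2 cores to the WEKA agent.
    print(weka_core_config(total_cores=32, weka_cores=2))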
@fluidnumerics-joe added the enhancement (New feature or request) label on Sep 13, 2024