Support DPDK mount type #4

Open · fluidnumerics-joe opened this issue Sep 13, 2024 · 0 comments
Labels: enhancement (New feature or request)

@fluidnumerics-joe (Member) commented:

At the moment, we can't use the DPDK mount type.

Using the DPDK mount type requires specifying the NIC address in the mount options. (source)

In our experience deploying WEKA+Slurm with customers, the NIC address naming convention varies with the OS and across GCE instances, which can make it challenging to set the NIC address in the cluster blueprint (network_storage.mount_options).

We need a way to reference NIC addresses as a variable in the blueprint that is resolved during the startup process of slurm-gcp instances (controller, compute, and login).

As an example, suppose we could have the network storage block:

    remote_mount          = "/default"
    local_mount           = var.local_mount
    fs_type               = "wekafs"
    mount_options         = "net=@nic1@,num_cores=1,remove_after_secs=900"
    server_ip             = local.weka_ips[0]
    client_install_runner = ""
    mount_runner          = ""

where @nic1@ is interpreted as a variable during instance setup that references the first available NIC for mounting WEKA. Note that we would need to protect the first NIC, which is reserved for slurmd communications.
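As a rough illustration only, the Python sketch below shows one way a startup script could resolve @nic1@: skip the NIC holding the default route (left to slurmd) and substitute the next NIC into the mount options. The helper names are hypothetical, nothing here exists in slurm-gcp today, and whether WEKA's net= option expects the interface name or its PCI address depends on the deployment; this sketch substitutes the interface name.

    #!/usr/bin/env python3
    """Hypothetical sketch: resolve @nic1@ in mount_options at instance startup.

    Assumption (not part of slurm-gcp today): the blueprint passes the literal
    string "@nic1@" through, and this runs on the node before mounting WEKA.
    """
    import os


    def default_route_iface():
        """Return the interface holding the default route (reserved for slurmd)."""
        with open("/proc/net/route") as f:
            next(f)  # skip header line
            for line in f:
                fields = line.split()
                if fields[1] == "00000000":  # destination 0.0.0.0
                    return fields[0]
        return None


    def resolve_nic_vars(mount_options):
        """Replace @nic1@ with the name of the first non-primary NIC."""
        primary = default_route_iface()
        candidates = sorted(
            iface for iface in os.listdir("/sys/class/net")
            if iface not in ("lo", primary)
            and os.path.exists(f"/sys/class/net/{iface}/device")
        )
        if not candidates:
            raise RuntimeError("no secondary NIC available for a DPDK mount")
        return mount_options.replace("@nic1@", candidates[0])


    if __name__ == "__main__":
        print(resolve_nic_vars("net=@nic1@,num_cores=1,remove_after_secs=900"))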

In addition, we need to reserve cores for the WEKA agent when using DPDK mounts. From the mount options, the WEKA agent can be pinned to specific core IDs using the -o core=XX option (where XX is a comma-separated list of core IDs) or confined to N cores with -o num_cores=N.

Because Slurm manages the cpuset cgroup:

  • If the -o core=XX option is used, we need to set the node configuration for each compute nodeset to set aside the same core IDs. In a heterogeneous Slurm cluster, users would have to be advised to set network_storage in the definition of each compute nodeset, rather than through controller.network_storage, since the specific core IDs they want to pin to would vary across nodesets.
  • If the -o num_cores=N option is used, we need to set the node configuration for each compute nodeset to set aside that number of cores for WEKA (see the sketch after this list).
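To keep the two configurations consistent, one option would be to derive both the WEKA core pinning and the Slurm reservation from a single setting per nodeset. The sketch below is only illustrative: the function name is hypothetical, it assumes the reserved cores are taken from the top of the core ID range with one thread per core, and it uses the standard slurm.conf node parameter CpuSpecList to withhold those IDs from jobs (for the -o num_cores=N case, CoreSpecCount=N would be the analogue). Mapping core IDs to the CPU IDs Slurm expects would need adjustment on machine types with hyperthreading.

    """Hypothetical sketch: keep WEKA's '-o core=...' pinning and a nodeset's
    Slurm core reservation in sync for one machine type."""


    def weka_core_config(total_cores: int, weka_cores: int) -> dict:
        """Return matching WEKA mount options and Slurm node settings for a nodeset."""
        if weka_cores >= total_cores:
            raise ValueError("must leave at least one core for Slurm jobs")
        # Reserve cores from the top of the ID range, e.g. [30, 31] on 32 cores.
        reserved = list(range(total_cores - weka_cores, total_cores))
        core_list = ",".join(str(c) for c in reserved)
        return {
            # would be set in network_storage.mount_options for this nodeset
            "mount_options": f"net=@nic1@,core={core_list},remove_after_secs=900",
            # would be merged into the nodeset's node configuration so the cpuset
            # cgroup never hands these CPU IDs to jobs
            "node_conf": {"CpuSpecList": core_list},
        }


    # Example: a 32-core nodeset that dedicates 2 cores to the WEKA agent.
    print(weka_core_config(total_cores=32, weka_cores=2))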
@fluidnumerics-joe added the enhancement (New feature or request) label on Sep 13, 2024