
Issues with Adding New GPU Servers to Magic Castle Cluster #331

Open · OscarDiez opened this issue Nov 2, 2024 · 17 comments
Labels: azure, bug (Something isn't working), question (Further information is requested)

OscarDiez commented Nov 2, 2024

I’ve been working on adding 3 new GPU servers to the Magic Castle cluster, but unfortunately, I’ve been facing multiple issues with the setup, and I’m at a bit of a standstill.

Issues Encountered:

  1. Puppet Configuration and GPU Drivers
    I’ve been trying to get the NVIDIA drivers and kernel modules properly installed, but Puppet keeps returning the following error:
  Error: Unable to find a match: kmod-nvidia-latest-dkms

As a result, several stages are being skipped due to failed dependencies, including services like nvidia-persistenced and nvidia-dcgm. Despite manually trying to install the correct drivers (such as nvidia-driver-cuda), the error persists.

I’ve checked the logs and Puppet config files but haven’t been able to pinpoint the root cause. Here’s a portion of the error from the Puppet run:

  Error: /Stage[main]/Profile::Gpu::Install::Passthrough/Package[kmod-nvidia-latest-dkms]/ensure: change from 'purged' to 'present' failed.

  2. SLURM Node Availability
    SLURM also seems to have trouble recognizing the new nodes. The nodes (gpu-node[1-3]) are showing up as down# in SLURM:
PARTITION          AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu-node              up   infinite      3  down# gpu-node[1-3]

When I try to submit jobs to these nodes, I get the following error:

Batch job submission failed: Invalid account or account/partition combination specified
Additionally, jobs remain pending with the reason:

(ReqNodeNotAvail, UnavailableNodes:gpu-node[1-3])
  3. Logs and Configuration
    I’ve checked the slurmd service on the nodes and confirmed that it’s running. I’ve also reviewed the following logs and config files:

/var/log/slurmctld.log on the controller shows node availability issues.
/var/log/slurmd.log on the GPU nodes themselves doesn't reveal much beyond the standard communication errors.
The slurm.conf file appears to correctly define the GPU nodes, but they are still marked as down# in SLURM.
Attempts and Outcome
I’ve tried multiple fixes over the last few days, including:

Manually installing drivers and reconfiguring Puppet.
Restarting SLURM and resuming the nodes via scontrol.
Ensuring Munge is running properly on all nodes.
Updating the SLURM node state using scontrol update nodename=gpu-node1 state=RESUME.
Despite my best efforts, the nodes remain unavailable for job scheduling and for spawning sessions via Jupyter, and I'm starting to feel a bit desperate at this point.

I would really appreciate your help with this issue or any pointers to documentation or someone who could assist. It’s been a challenging process, and any guidance you can provide would be invaluable.

@cmd-ntrf cmd-ntrf self-assigned this Nov 5, 2024
@cmd-ntrf cmd-ntrf added the question Further information is requested label Nov 5, 2024
cmd-ntrf (Member) commented Nov 5, 2024

Hi Oscar,

Problems 2 and 3 stem from problem 1. When the GPU drivers cannot be installed properly, slurmd won't start and the node will never become available for jobs.

  1. Which version of Magic Castle are you using?
  2. Which cloud provider are you using?
  3. What image / operating system are you using?

It should be pretty straightforward to find the culprit of your problem once you provide these three pieces of information.

Best,
Felix

OscarDiez (Author) commented Nov 5, 2024 via email

cmd-ntrf (Member) commented Nov 5, 2024

Could you try from scratch but using the latest beta release instead?
https://github.com/ComputeCanada/magic_castle/releases/tag/14.0.0-beta.6

14.0.0 is just days away from being officially released, and it will probably solve your problem.

OscarDiez (Author) commented Nov 6, 2024 via email

cmd-ntrf (Member) commented Nov 6, 2024

Yes, unfortunately you will have to recreate users and upload data if you start from scratch.

You could start a new cluster next to the one you already have and move the data over before destroying it, but you will still have to recreate the users.

OscarDiez (Author) commented Nov 11, 2024 via email

cmd-ntrf (Member) commented:

Hi Oscar,

In the future, I would appreciate it if you could use the GitHub web interface to comment on issues. Replying via email disables the markdown rendering, which makes your comments somewhat tougher to read.

Your assessment of the issue is correct. A recent change in the AzureRM Terraform provider changed the default value of azurerm_public_ip's sku argument from Basic to Standard. Explicitly setting its value to Basic solves the issue, since Basic was the default before Azure changed it. Thank you for reporting this issue; Azure in Magic Castle is underused and this sort of issue often flies under my radar.
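
For reference, a minimal sketch of what that pin looks like in azure/network.tf (the argument names come from the azurerm provider; the surrounding for_each and locals are simplified, and Dynamic allocation is assumed here since that was the usual pairing with the Basic SKU):

resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name

  # Pin the SKU explicitly so the provider's new default ("Standard") is not applied.
  sku                 = "Basic"
  # Basic SKU public IPs can use dynamic allocation; adjust if your module differs.
  allocation_method   = "Dynamic"
}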

To your question:

  • To fix the issue for now, the best way is to modify network.tf as you did.
  • I will publish a new release with the fix today.
  • 14.0.0-beta.7 does not include a fix; 14.0.0-beta.8 will.
  • Unfortunately, there are no mechanisms in Magic Castle at the moment to facilitate the migration of data and users between clusters, as the original intent was disposable clusters for training. I would suggest creating the new cluster before deleting the previous one, then transferring the data via rsync (a sketch follows this list), re-creating the users, and finally deleting the previous cluster.
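
A minimal sketch of that rsync transfer, with placeholder hostnames and paths, run from the old cluster's login node with your key available through an ssh-agent:

# push a filesystem from the old cluster to the new one; -a preserves
# permissions and timestamps, -v and -P show progress and allow resuming
rsync -avP /project/ centos@new-login.example.org:/project/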

Final remark: the tag "gpu-node" for your GPU instance does not exist, so your GPU node will not be properly configured. Replace it with the "node" tag. Puppet will correctly detect whether the compute node has a GPU and configure it accordingly.
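
In the instances map, that would look something like the sketch below (instance types and counts are illustrative; the key name gpu-node is fine, only the tag matters):

instances = {
  mgmt     = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet", "nfs"] },
  login    = { type = "Standard_B2s",  count = 1, tags = ["login", "public", "proxy"] },
  node     = { type = "Standard_B2s",  count = 4, tags = ["node"] },
  # GPU instances also use the "node" tag; Puppet detects the GPU itself.
  gpu-node = { type = "Standard_NV6ads_A10_v5", count = 3, tags = ["node"] }
}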

odiezg commented Nov 14, 2024

Many thanks,
I still get issues with the public_ip and the network.tf.

╷
│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["node1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {
│
╵
╷
│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["mgmt1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {
│
╵
╷
│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["node2"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {
│
╵
╷
│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["gpu-node1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {
│

I tried the following to work around it; it creates the servers, but afterwards I can only connect via SSH to the login1 server.

locals {
  public_ip_skus = {
    for k, v in module.design.instances :
    k => contains(v.tags, "public") ? "Basic" : "Standard"
  }

  public_ip_allocation_methods = {
    for k, v in module.design.instances :
    k => contains(v.tags, "public") || local.public_ip_skus[k] == "Standard" ? "Static" : "Dynamic"
  }
}

resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name

  sku                 = local.public_ip_skus[each.key]
  allocation_method   = local.public_ip_allocation_methods[each.key]
}

The cluster is not working and it is not starting any services.

cmd-ntrf (Member) commented Nov 15, 2024

Hi Oscar,

Sorry, it appears I got the sku wrong in my patch. Since the "Basic" sku is being deprecated, I think the best course of action is to simply remove the option altogether, which gives us this:

resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name

  sku                 = "Standard"
  allocation_method   = "Static"
}

Can you clarify this statement?

but after I only can connect via ssh to the login1 server.

Does it mean you cannot SSH into the other instances from login1, or that you cannot SSH directly from the internet to any instance other than login1?

The absence of running services typically indicates that the configuration with Puppet has either not finished or has encountered a problem. You can look at journalctl -u puppet and potentially provide a copy of it via gist.github.com.
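
For example, something like this (the output file name is arbitrary):

# dump the puppet agent log since the last boot into a file you can paste into a gist
sudo journalctl -u puppet -b --no-pager > puppet-agent.log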

Best,
Felix

odiezg commented Nov 15, 2024

Hi, I have installed version 13.5.0 of Magic Castle again in parallel and it creates the cluster, but I still have the issues with the GPU server. I include the cluster information below. With the new version 14.0.0, the cluster is not being properly created.

module "azure" {
source = "./azure"
config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
config_version = "13.5.0"

cluster_name = "hpcie"
domain = "labs.faculty.ie.edu"

image = {
publisher = "almalinux",
offer = "almalinux-x86_64",
sku = "9-gen2",
version = "9.4.2024050902"
}

instances = {
mgmt = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet", "nfs"] },
login = { type = "Standard_B2s", count = 1, tags = ["login", "public", "proxy"] },
node = { type = "Standard_B2s", count = 4, tags = ["node"] },
gpu-node = { type = "Standard_NV6ads_A10_v5", count = 1, tags = ["node"] }

odiezg commented Nov 15, 2024

Sorry Felix,
I just saw your previous message now. What I meant about not being able to connect to the new servers is that the servers are up but not responding to SSH. It is very strange; it does not happen with login1, or with any server of the cluster running version 13.5, so I think it is related to the network interface. I managed to connect to the mgmt1 server using the Azure Bastion connection, but when I execute journalctl -u puppet I get no entries. I then tried to check Puppet with: sudo systemctl status puppetserver
[screenshot: output of sudo systemctl status puppetserver]

cmd-ntrf (Member) commented:

@OscarDiez: good news, I think I found the issue. The security group was not properly associated with the instances.
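
A minimal sketch of the kind of association that was missing (the resource type and its two arguments are from the azurerm provider; the referenced NIC and NSG names are placeholders for whatever the Azure module actually defines):

resource "azurerm_network_interface_security_group_association" "nic_nsg" {
  # attach the cluster's network security group to every instance's NIC
  for_each                  = module.design.instances
  network_interface_id      = azurerm_network_interface.nic[each.key].id
  network_security_group_id = azurerm_network_security_group.nsg.id
}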

The fix will be included in the next release, 14.1.2, which should come out today.

odiezg commented Nov 20, 2024

Many thanks. I have run it again and initially got an error. This is the error I get when creating the cluster:

Error: file provisioner error
│
│   with module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"],
│   on common/provision/main.tf line 79, in resource "terraform_data" "deploy_puppetserver_files":
│   79:   provisioner "file" {
│
│ Upload failed: Process exited with status 255

But I ran it a second time and it created the cluster. I can connect to the Jupyter environment and launch a session on a normal node, but not on the GPU node. It tries, but I get a lot of messages saying "pending in queue."

[screenshot: JupyterHub spawn page showing the session pending in queue]

Also, I can only connect via SSH from my laptop to the login1 server. It does not work for the others (mgmt1 or the nodes...), and when I try to connect from login1 to mgmt1 it asks me for a password; it is not using the key file.

Info about the config file:

terraform {
  required_version = ">= 1.4.0"
}

variable "pool" {
  description = "Slurm pool of compute nodes"
  default = []
}

module "azure" {
  source         = "./azure"
  config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
  config_version = "14.1.2"

  cluster_name = "hpc"
  domain       = "labs.faculty.ie.edu"


  image        = {
    publisher = "almalinux",
    offer     = "almalinux-x86_64",
    sku       = "9-gen2",
    version   = "9.3.2023111602"
  }

  instances = {
    #mgmt  = { type = "Standard_DS2_v2",  count = 1, tags = ["mgmt", "puppet", "nfs"] },
    #login = { type = "Standard_DS1_v2", count = 1, tags = ["login", "public", "proxy"] },
    #node  = { type = "Standard_DS1_v2",  count = 1, tags = ["node"] }
    mgmt  = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet", "nfs"] },
    login = { type = "Standard_B2s",  count = 1, tags = ["login", "public", "proxy"] },
    node  = { type = "Standard_B2s",  count = 2, tags = ["node"] },
    gpu-node = { type = "Standard_NV6ads_A10_v5", count = 1, tags = ["node"] } 
  }

And when checking the status of Puppet on login1:

[centos@login1 ~]$ puppet agent --test
Error: Connection to https://puppet:8140/puppet-ca/v1 failed, trying next route: Request to https://puppet:8140/puppet-ca/v1 failed after 0.144 seconds: Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)
Wrapped exception:
Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)
Error: No more routes to ca
Error: No more routes to ca

module.azure.module.configuration.tls_private_key.rsa["mgmt"]: Creation complete after 13s [id=a19a025e9a1841ac5412b62ba5c1b58b76a52615]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Creating...
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Creating...
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Creating...
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Creating...
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Creating...
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Still creating... [10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Still creating... [20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Still creating... [30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Still creating... [40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [50s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [50s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [50s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Still creating... [50s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [50s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [1m0s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [1m0s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [1m0s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Still creating... [1m0s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [1m0s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [1m10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [1m10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [1m10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Still creating... [1m10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [1m10s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["gpu-node1"]: Creation complete after 1m19s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-gpu-node1]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [1m20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [1m20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [1m20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [1m20s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [1m30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [1m30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [1m30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [1m30s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Still creating... [1m40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Still creating... [1m40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Still creating... [1m40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Still creating... [1m40s elapsed]
module.azure.azurerm_linux_virtual_machine.instances["mgmt1"]: Creation complete after 1m48s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-mgmt1]
module.azure.azurerm_linux_virtual_machine.instances["node1"]: Creation complete after 1m48s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-node1]
module.azure.azurerm_linux_virtual_machine.instances["login1"]: Creation complete after 1m49s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-login1]
module.azure.azurerm_linux_virtual_machine.instances["node2"]: Creation complete after 1m50s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-node2]
module.dns.module.record_generator.data.external.key2fp["login1"]: Reading...
module.azure.module.provision.data.archive_file.puppetserver_files: Reading...
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-home"]: Creating...
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-scratch"]: Creating...
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-project"]: Creating...
module.azure.module.provision.data.archive_file.puppetserver_files: Read complete after 0s [id=a64f220c292237bfdd75ac9314e560ff9f155f66]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Creating...
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Provisioning with 'file'...
module.dns.module.record_generator.data.external.key2fp["login1"]: Read complete after 1s [id=-]
module.dns.local_file.dns_record: Creating...
module.dns.local_file.dns_record: Creation complete after 0s [id=1f7fbe58d7c3914f5352279f91472dd27801f1b7]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-home"]: Creation complete after 10s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-mgmt1/dataDisks/hpc-mgmt1-nfs-home]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-scratch"]: Still creating... [10s elapsed]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-project"]: Still creating... [10s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [10s elapsed]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-scratch"]: Still creating... [20s elapsed]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-project"]: Still creating... [20s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [20s elapsed]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-scratch"]: Creation complete after 21s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-mgmt1/dataDisks/hpc-mgmt1-nfs-scratch]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-project"]: Still creating... [30s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [30s elapsed]
module.azure.azurerm_virtual_machine_data_disk_attachment.attachments["mgmt1-nfs-project"]: Creation complete after 32s [id=/subscriptions/e0b9cada-61bc-4b5a-bd7a-52c606726b3b/resourceGroups/hpc_resource_group/providers/Microsoft.Compute/virtualMachines/hpc-mgmt1/dataDisks/hpc-mgmt1-nfs-project]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [40s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [50s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [1m0s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [1m10s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [1m20s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [1m30s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [1m40s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [1m50s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [2m0s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [2m10s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [2m20s elapsed]
module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"]: Still creating... [2m30s elapsed]
╷
│ Error: file provisioner error
│
│   with module.azure.module.provision.terraform_data.deploy_puppetserver_files["mgmt1"],
│   on common/provision/main.tf line 79, in resource "terraform_data" "deploy_puppetserver_files":
│   79:   provisioner "file" {
│
│ Upload failed: Process exited with status 255

Please let me know if you need anything else.

odiezg commented Nov 27, 2024

Dear Felix-Antoine, do you need any extra logs or information from my side?

cmd-ntrf (Member) commented:

Sorry Oscar, I was away for Supercomputing when you last wrote.

  1. It is possible that you have to run terraform apply twice. AlmaLinux images on Azure are currently missing rsync, which is essential for the deploy_puppetserver_files resource to complete. I have added the installation of rsync to the cloud-init, but it is possible the provisioner runs before the installation completes and you run into the error. You have to re-run terraform apply until it completes. Wait 30 seconds to 1 minute between applies to minimize the chances of error.
  2. To be able to connect to mgmt1 from login1, you will need to use an ssh-agent and forward your SSH key with the -A flag of the ssh client (see the sketch after this list). It is by design that only login1 is reachable from the internet.
  3. To figure out why you cannot launch a GPU job with Jupyter, you will need to look at the Slurm state with sinfo and potentially SSH to gpu-node1 to look at the puppet log: journalctl -u puppet. If sinfo reports gpu-node1 as idle, the issue is with JupyterHub; if it is reported as down, there was a problem with the configuration and the error should appear in gpu-node1's puppet log.
  4. puppet agent --test fails because it uses the default name for the puppet server. It is not a valid indicator of whether puppet is working properly. If you want to know whether puppet is working properly, look at the logs: journalctl -u puppet.
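
A minimal sketch of that SSH workflow from your laptop (the key path, user name, and login1 address are placeholders):

# load the cluster key into an agent, then forward it when hopping through login1
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa                      # the key used at deployment time
ssh -A centos@login1.example.org           # -A forwards the agent
# from login1, the forwarded key lets you reach internal hosts without a password
ssh mgmt1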

We'll look into scheduling a call if you cannot find out why the GPU node does not work.

odiezg commented Nov 27, 2024

Many thanks for the swift reply, and do not worry. I hope you had a good time at SC24.
I can connect to the other servers using forwarding as you said. I did not have that feature with the previous version.
I have run it again, but I did not destroy the cluster, just ran terraform apply again. I have tried what you told me (see below) but with no success. We can set up a call; I am available tomorrow, Thursday the 28th, or Friday the 29th, from 13:00 Quebec time (19:00 Brussels time). Many thanks again.

Regarding the problem with the server, this is what I get from sinfo:

[centos@login1 ~]$ sinfo
PARTITION          AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cpubase_bycore_b1*    up   infinite      1   down  gpu-node1
cpubase_bycore_b1*    up   infinite      2   idle  node[1-2]
gpu-node              up   infinite      1  down~  gpu-node1
node                  up   infinite      2   idle  node[1-2]
[centos@login1 ~]$

From the gpu-node, when I execute journalctl -u puppet:

Nov 27 18:51:57 gpu-node1.int.hpc.labs.faculty.ie.edu systemd[1]: Started Puppet agent.
Nov 27 18:52:01 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1317]: Starting Puppet client version 7.32.1
Nov 27 18:52:01 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.002 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:01 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:01 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:01 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: No more routes to fileserver
Nov 27 18:52:12 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.001 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:12 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:12 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:12 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Could not retrieve catalog from remote server: No more routes to puppet
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.002 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: (/Stage[main]/Profile::Ssh::Base/File[/etc/ssh/sshd_config.d/49-magic_castle.conf]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/base/opensshserver-9.config: No more routes to files>
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.001 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: (/Stage[main]/Profile::Base/File[/usr/sbin/prepare4image.sh]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/base/prepare4image.sh: No more routes to fileserver
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.002 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:14 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: (/Stage[main]/Profile::Consul::Puppet_watch/File[/usr/bin/puppet_event_handler.sh]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/consul/puppet_event_handler.sh: No more routes to fi>
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.001 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: (/Stage[main]/Profile::Freeipa::Client/File[/sbin/mc-ipa-client-install]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/freeipa/mc-ipa-client-install: No more routes to fileserver
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.007 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:26 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: (/Stage[main]/Profile::Freeipa::Base/File[/etc/rsyslog.d/ignore-systemd-session-slice.conf]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/freeipa/ignore-systemd-session-slice.conf: >
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.001 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: (/Stage[main]/Jupyterhub::Node::Install/File[/opt/jupyterhub/lib/usercustomize/usercustomize.py]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/jupyterhub/usercustomize.py: No more routes to>
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Connection to https://mgmt1:8140/puppet/v3 failed, trying next route: Request to https://mgmt1:8140/puppet/v3 failed after 0.002 seconds: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for >
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Wrapped exception:
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: Failed to open TCP connection to mgmt1:8140 (Connection refused - connect(2) for "mgmt1" port 8140)
Nov 27 18:52:27 gpu-node1.int.hpc.labs.faculty.ie.edu puppet-agent[1387]: (/Stage[main]/Profile::Slurm::Base/File[/etc/slurm/epilog]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/profile/slurm/epilog: No more routes to fileserver

And on the mgmt server, puppetserver looks fine:

[centos@mgmt1 ~]$ sudo systemctl status puppetserver
● puppetserver.service - puppetserver Service
Loaded: loaded (/usr/lib/systemd/system/puppetserver.service; enabled; preset: disabled)
Active: active (running) since Wed 2024-11-27 18:52:42 UTC; 3h 19min ago
Main PID: 1308 (java)
Tasks: 55 (limit: 4915)
Memory: 2.5G
CPU: 5min 3.059s
CGroup: /system.slice/puppetserver.service
└─1308 /usr/lib/jvm/jre-11/bin/java -Xms2g -Xmx2g -Djruby.logger.class=com.puppetlabs.jruby_utils.jruby.Slf4jLogger -Dlogappender=F1 -XX:>

Nov 27 18:52:10 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflect>
Nov 27 18:52:10 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: WARNING: All illegal access operations will be denied in a future release
Nov 27 18:52:14 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: WARNING: abs already refers to: #'clojure.core/abs in namespace: medley.core, be>
Nov 27 18:52:42 mgmt1.int.hpc.labs.faculty.ie.edu systemd[1]: Started puppetserver Service.
Nov 27 18:53:17 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: /etc/puppetlabs/code/environments/main/modules/lvm/lib/puppet/type/logical_volum>
Nov 27 18:53:17 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: /etc/puppetlabs/code/environments/main/modules/lvm/lib/puppet/type/logical_volum>
Nov 27 18:53:17 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: /etc/puppetlabs/code/environments/main/modules/lvm/lib/puppet/type/logical_volum>
Nov 27 18:53:17 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: /etc/puppetlabs/code/environments/main/modules/lvm/lib/puppet/type/logical_volum>
Nov 27 18:53:39 mgmt1.int.hpc.labs.faculty.ie.edu systemd[1]: /usr/lib/systemd/system/puppetserver.service:45: Standard output type syslog is obsolete>
Nov 27 18:53:40 mgmt1.int.hpc.labs.faculty.ie.edu systemd[1]: /usr/lib/systemd/system/puppetserver.service:45: Standard output type syslog is obsolete>
lines 1-20/20 (END)

and listening:
[centos@mgmt1 ~]$ sudo netstat -tuln | grep 8140
tcp6 0 0 :::8140 :::* LISTEN

And from the gpu-node, I can resolve and reach the mgmt1 server:

[centos@gpu-node1 ~]$ nslookup mgmt1
Server: 10.0.1.4
Address: 10.0.1.4#53

Name: mgmt1.int.hpc.labs.faculty.ie.edu
Address: 10.0.1.4

[centos@gpu-node1 ~]$ ^C
[centos@gpu-node1 ~]$
[centos@gpu-node1 ~]$ ping mgmt1
PING mgmt1.int.hpc.labs.faculty.ie.edu (10.0.1.4) 56(84) bytes of data.
64 bytes from mgmt1.int.hpc.labs.faculty.ie.edu (10.0.1.4): icmp_seq=1 ttl=64 time=1.27 ms
64 bytes from mgmt1.int.hpc.labs.faculty.ie.edu (10.0.1.4): icmp_seq=2 ttl=64 time=1.05 ms
^C
--- mgmt1.int.hpc.labs.faculty.ie.edu ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.045/1.159/1.274/0.114 ms

[centos@gpu-node1 ~]$ sudo journalctl -u puppetserver
-- No entries --
[centos@gpu-node1 ~]$ cat /etc/puppetlabs/puppet/puppet.conf
[main]
server = mgmt1
certname = gpu-node1
waitforcert = 15s
report = false
postrun_command = /opt/puppetlabs/bin/postrun

# This file can be used to override the default puppet settings.
#
# See the following links for more details on what settings are available:
# - https://puppet.com/docs/puppet/latest/config_important_settings.html
# - https://puppet.com/docs/puppet/latest/config_about_settings.html
# - https://puppet.com/docs/puppet/latest/config_file_main.html
# - https://puppet.com/docs/puppet/latest/configuration.html

[centos@gpu-node1 ~]$

odiezg commented Dec 6, 2024

Dear Felix-Antoine, do you need any extra logs or information from my side?
