Issues with Adding New GPU Servers to Magic Castle Cluster #331
Comments
Hi Oscar,
Problems 2 and 3 stem from problem 1. When the GPU drivers cannot be installed properly, slurmd won't start and the node will never become available for jobs.
1. Which version of Magic Castle are you using?
2. Which cloud provider are you using?
3. What image / operating system are you using?
It should be pretty straightforward to find the culprit of your problem once you provide these three pieces of information.
Best,
Felix |
Many thanks for the reply, it is much appreciated.
I attach the main.tf file, but the responses to your questions are as follows:
1. Magic Castle is 13.5.0
2. The cloud is Azure
3. The OS is AlmaLinux 9-gen2
module "azure" {
source = "./azure"
config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git
"
config_version = "13.5.0"
image = {
publisher = "almalinux",
offer = "almalinux-x86_64",
sku = "9-gen2",
version = "9.4.2024050902"
}
instances = {
mgmt = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet",
"nfs"] },
login = { type = "Standard_B2s", count = 1, tags = ["login", "public",
"proxy"] },
node = { type = "Standard_B2s", count = 5, tags = ["node"] },
gpu-node = { type = "Standard_NV6ads_A10_v5", count = 3, tags =
["node", "gpu-node"] }
|
Could you try from scratch, but using the latest beta release instead? https://github.com/ComputeCanada/magic_castle/releases/tag/14.0.0-beta.6
14.0.0 is just days away from being officially released, and it will probably solve your problem. |
Many thanks for your reply.
I will test it later today, but if I rebuild the system from scratch, will I keep the users I created and their data, or will everything be deleted so that I have to recreate the users again?
|
Yes, unfortunately you will have to recreate the users and upload the data if you start from scratch. You could start a new cluster next to the one you already have and move the data over before destroying the old one, but you will still have to recreate the users. |
Dear Félix,
Thank you for your assistance with my previous inquiry.
I have destroyed the previous installation and installed the new one, but it is not working. The new configuration is below; I took the latest version of Magic Castle and I am using the newest AlmaLinux version:
```hcl
module "azure" {
  source         = "./azure"
  config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
  config_version = "14.0.0-beta.7"
  cluster_name   = "hpcie"
  domain         = "labs.faculty.ie.edu"
  # Using the Azure CLI, you can list the image versions that are available to use. For example,
  # az vm image list --location eastus --publisher almalinux --offer almalinux-x86_64 --sku 9-gen2 --all --output table
  # az vm image list --location eastus --publisher almalinux --offer almalinux-arm --sku 9-arm-gen2 --all --output table
  # (Note: available versions may be location specific!)
  image = {
    publisher = "almalinux",
    offer     = "almalinux-x86_64",
    sku       = "9-gen2",
    version   = "9.4.2024050902"
  }
  instances = {
    mgmt  = { type = "Standard_DS2_v2", count = 1, tags = ["mgmt", "puppet", "nfs"] },
    login = { type = "Standard_DS1_v2", count = 1, tags = ["login", "public", "proxy"] },
    node  = { type = "Standard_DS1_v2", count = 4, tags = ["node"] },
    gpu   = { type = "Standard_NV6ads_A10_v5", count = 2, tags = ["gpu-node"] }
  }
```
*Issues Encountered:*
When I try to apply the Terraform configuration to deploy the cluster with the new GPU nodes, I receive the following error:
```
Error: static IP allocation must be used when creating Standard SKU public IP addresses

  with module.azure.azurerm_public_ip.public_ip["gpu1"],
  on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
  18: resource "azurerm_public_ip" "public_ip" {
```
This error repeats for each public IP resource being created.
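Since this behaviour depends on which AzureRM provider version Terraform selected (the provider's default public IP SKU is what triggers the static-allocation requirement), a quick check along these lines can confirm it. This is a minimal sketch, assuming the commands are run from the cluster's Terraform folder after `terraform init`:
```bash
# Show which provider versions Terraform selected for this configuration.
terraform providers

# The exact azurerm version is also pinned in the dependency lock file.
grep -A 2 'registry.terraform.io/hashicorp/azurerm' .terraform.lock.hcl
```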
*Troubleshooting Steps and Changes Tried:*
1. *Understanding the Error:*
   - The error suggests that when creating Standard SKU public IP addresses, the allocation_method must be set to "Static", but in the configuration some public IPs are set to "Dynamic".
2. *Examining network.tf:*
   Here's the relevant portion of my network.tf:
```
# Create public IPs
resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name
  allocation_method   = contains(each.value.tags, "public") ? "Static" : "Dynamic"
}
```
3. *Attempted Fixes:*
   - *Option 1:* Explicitly set the sku to "Basic" in the azurerm_public_ip resource to allow "Dynamic" allocation:
```
resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name
  allocation_method   = contains(each.value.tags, "public") ? "Static" : "Dynamic"
  sku                 = "Basic"
}
```
   - *Result:* The error was resolved, but I'm unsure whether using the Basic SKU is appropriate for my use case (see the verification sketch after this list).
4. *Constraints:*
   - I prefer not to modify the module files (network.tf) directly, to keep the deployment process consistent and maintainable.
   - I attempted to make the changes in main.tf to resolve the issue without modifying network.tf, but was unsuccessful.
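To double-check what the Basic-SKU workaround actually created, the deployed public IPs can be inspected with the Azure CLI. A rough sketch, where `<resource-group>` and `<ip-name>` are placeholders for the cluster's resource group and one of the generated public IP names:
```bash
# List the public IPs Terraform created, with their allocation method.
az network public-ip list --resource-group <resource-group> --output table

# Full JSON for a single address, including its "sku" block.
az network public-ip show --resource-group <resource-group> --name <ip-name>
```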
*Questions:*
- Is there a recommended way to address this issue without modifying the module files?
- Is there an updated version of Magic Castle that resolves this problem?
- If I upgrade to version 14.0.0-beta.7 as suggested, will it resolve this issue, and what are the implications for existing users and data?
*Additional Information:*
- I'm concerned about redeploying the cluster from scratch due to the potential loss of existing user data.
- If upgrading to the latest beta version is the best solution, could you advise on the best way to migrate existing data and users?
Thank you for your assistance.
Best regards,
Oscar Diez
|
Hi Oscar,
In the future, I would appreciate it if you could use the GitHub web interface to comment on issues. Replying via email disables the markdown rendering, which makes your comments somewhat tougher to read.
Your assessment of the issue is correct. A recent change in the AzureRM Terraform provider changed the default value of the `sku` attribute of `azurerm_public_ip`.
To your questions:
One final remark about the tag: |
Many thanks.
I tried to use this to solve it; it creates the servers, but afterwards I can only connect via ssh to the login1 server.
The cluster is not working and it is not starting any services. |
Hi Oscar,
Sorry, it appears I got the sku wrong in my patch. Since the
Can you clarify this statement?
> I can only connect via ssh to the login1 server.
Does it mean that you cannot ssh into other instances from login1, or that you cannot SSH directly from the internet to any instance other than login1?
The absence of running services typically indicates that the configuration with Puppet has either not finished or has encountered a problem. You can look at
Best, |
Hi, I have installed version 13.5.0 of Magic Castle again in parallel, and it creates the cluster, but I still have the issues with the GPU server. I include the information about the cluster below. With the new version 14.0.0 the cluster is not being created properly.
```hcl
module "azure" {
  cluster_name = "hpcie"
  image = {
  instances = {
```
|
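Since the original problem was the NVIDIA driver installation, a quick sanity check on the GPU node itself shows whether the A10 is even visible and whether the driver actually loaded. A sketch; `nvidia-smi` only exists once the driver packages installed successfully:
```bash
# On gpu-node1: is the GPU visible on the PCI bus?
lspci -nn | grep -i nvidia

# Is the NVIDIA kernel module loaded?
lsmod | grep -i nvidia

# Does the driver answer? (only available after the NVIDIA packages are installed)
nvidia-smi
```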
@OscarDiez : good news, I think I found the issue. The security group was not properly associated with the instances. The fix will be included in the next release, 14.1.2, which should come out today. |
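To confirm whether an existing deployment is affected, the NIC-to-security-group association can be checked with the Azure CLI. A rough sketch, where `<resource-group>` is a placeholder and the `networkSecurityGroup.id` query path is an assumption about the CLI's output shape:
```bash
# List the cluster's NICs and the security group (if any) attached to each one;
# an empty value in the nsg column means no NSG is associated with that NIC.
az network nic list --resource-group <resource-group> \
  --query "[].{nic:name, nsg:networkSecurityGroup.id}" --output table
```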
Many thanks. I ran it again and initially got an error. This is the error I get when creating the cluster:
But when I ran it a second time, it created the cluster. I can connect to the Jupyter environment and launch a session on a normal node, but not on the GPU node. It tries, but I get a lot of messages saying "pending in queue." Also, I can only connect via ssh from my laptop to the login1 server. It does not work for the others (mgmt1 or the nodes...), and when I try to connect from login1 to mgmt1 it asks me for a password; it is not using the key file.
Info about the config file:
```hcl
terraform {
required_version = ">= 1.4.0"
}
variable "pool" {
description = "Slurm pool of compute nodes"
default = []
}
module "azure" {
source = "./azure"
config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
config_version = "14.1.2"
cluster_name = "hpc"
domain = "labs.faculty.ie.edu"
image = {
publisher = "almalinux",
offer = "almalinux-x86_64",
sku = "9-gen2",
version = "9.3.2023111602"
}
instances = {
#mgmt = { type = "Standard_DS2_v2", count = 1, tags = ["mgmt", "puppet", "nfs"] },
#login = { type = "Standard_DS1_v2", count = 1, tags = ["login", "public", "proxy"] },
#node = { type = "Standard_DS1_v2", count = 1, tags = ["node"] }
mgmt = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet", "nfs"] },
login = { type = "Standard_B2s", count = 1, tags = ["login", "public", "proxy"] },
node = { type = "Standard_B2s", count = 2, tags = ["node"] },
gpu-node = { type = "Standard_NV6ads_A10_v5", count = 1, tags = ["node"] }
}
```
And when checking the status of puppet on login1:
Please let me know if you need anything else. |
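Two checks that would narrow this down, sketched below: asking Slurm why the GPU job stays pending, and using SSH agent forwarding so the login1 to mgmt1 hop reuses the key from the laptop. The `centos` user and the gpu-node1/mgmt1 names come from this cluster; `<login1-public-address>` is a placeholder:
```bash
# From login1: why is the job pending, and what does Slurm say about the node?
squeue -u $USER --long        # the REASON column explains the pending state
sinfo -R                      # reasons recorded for down/drained nodes
scontrol show node gpu-node1

# From the laptop: forward the SSH agent so login1 can hop to mgmt1 with the same key.
ssh -A centos@<login1-public-address>
ssh mgmt1                     # run on login1; should now use the forwarded key instead of a password
```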
Dear Felix-Antoine, do you need any extra logs or information from my side? |
Sorry Oscar, I was away at Supercomputing when you last wrote.
We'll look into scheduling a call if you cannot find out why the GPU node does not work. |
Many thanks for the swift reply, and do not worry. I hope you had a good time at SC24.
Regarding the problem with the server, what I get from sinfo is:
```
[centos@login1
```
From the gpu-node, when I execute `journalctl -u puppet`:
```
Nov 27 18:51:57 gpu-node1.int.hpc.labs.faculty.ie.edu systemd[1]: Started Puppet agent.
```
And on the mgmt server puppet looks fine:
```
[centos@mgmt1 ~]$ sudo systemctl status puppetserver
Nov 27 18:52:10 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflect>
```
and it is listening. And from the gpu-node I can see the mgmt1 server:
```
[centos@gpu-node1 ~]$ nslookup mgmt1
Name: mgmt1.int.hpc.labs.faculty.ie.edu
[centos@gpu-node1 ~]$ ^C
[centos@gpu-node1 ~]$ sudo journalctl -u puppetserver
This file can be used to override the default puppet settings.
See the following links for more details on what settings are available:
- https://puppet.com/docs/puppet/latest/config_important_settings.html
- https://puppet.com/docs/puppet/latest/config_about_settings.html
- https://puppet.com/docs/puppet/latest/config_file_main.html
- https://puppet.com/docs/puppet/latest/configuration.html
[centos@gpu-node1 ~]$
```
|
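A note on that last command: `puppetserver` only runs on mgmt1 (the instance tagged "puppet"), so `journalctl -u puppetserver` on gpu-node1 is not expected to show anything useful. On the GPU node itself, a foreground agent run plus a look at slurmd is usually more telling. A minimal sketch, assuming the stock Puppet install path:
```bash
# On gpu-node1: run the Puppet agent in the foreground and watch for failing resources
# (the NVIDIA driver packages and slurmd are both managed by Puppet on these nodes).
sudo /opt/puppetlabs/bin/puppet agent -t

# Then check whether slurmd came up, and what it logged if it did not.
sudo systemctl status slurmd
sudo journalctl -u slurmd --no-pager | tail -n 50
```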
Dear Felix-Antoine, do you need any extra logs or information from my side? |
I’ve been working on adding 3 new GPU servers to the Magic Castle cluster, but unfortunately, I’ve been facing multiple issues with the setup, and I’m at a bit of a standstill.
Issues Encountered:
I’ve been trying to get the NVIDIA drivers and kernel modules properly installed, but Puppet keeps returning the following error:
As a result, several stages are being skipped due to failed dependencies, including services like nvidia-persistenced and nvidia-dcgm. Despite manually trying to install the correct drivers (such as nvidia-driver-cuda), the error persists.
I’ve checked the logs and Puppet config files but haven’t been able to pinpoint the root cause. Here’s a portion of the error from the Puppet run:
SLURM also seems to be having issues with recognizing the new nodes. The nodes (gpu-node[1-3]) are showing up as down# in SLURM:
When I try to submit jobs to these nodes, I get the following error:
Batch job submission failed: Invalid account or account/partition combination specified
Additionally, jobs remain pending with the reason:
I’ve checked the slurmd service on the nodes and confirmed that it’s running. I’ve also reviewed the following logs and config files:
- /var/log/slurmctld.log on the controller shows node availability issues.
- /var/log/slurmd.log on the GPU nodes themselves doesn’t reveal much beyond the standard communication errors.
- The slurm.conf file appears to correctly define the GPU nodes, but they are still marked as down# in SLURM.
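For the down# state and the account/partition rejection, the following commands show the reason Slurm recorded for a node and which account/partition combinations the submitting user is actually allowed to use. A sketch; the node name is taken from the messages above:
```bash
# Why does Slurm consider the node down? The Reason field usually names the cause
# (for example a GRES/GPU count mismatch, slurmd not reachable, or too little RealMemory).
scontrol show node gpu-node1 | grep -i -E 'state|reason'
sinfo -R

# Which accounts and partitions is the user actually associated with?
sacctmgr show associations format=cluster,account,user,partition
```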
Attempts and Outcome
I’ve tried multiple fixes over the last few days, including:
- Manually installing the drivers and reconfiguring Puppet.
- Restarting SLURM and resuming the nodes via scontrol.
- Ensuring Munge is running properly on all nodes (see the cross-node check sketched after this list).
- Updating the SLURM node state using scontrol update nodename=gpu-node1 state=RESUME.
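For the Munge point above, the classic cross-node test is to encode a credential on one host and decode it on another; matching keys and clocks give a success status. A sketch, assuming SSH from the controller to the GPU node works:
```bash
# Verify munged is running locally.
systemctl status munge

# Encode a credential here and decode it on the GPU node;
# "STATUS: Success (0)" on the remote side means the munge keys and clocks agree.
munge -n | ssh gpu-node1 unmunge
```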
Despite my best efforts, the nodes remain unavailable for job scheduling and for spawning sessions via Jupyter, and I’m starting to feel a bit desperate at this point.
I would really appreciate your help with this issue or any pointers to documentation or someone who could assist. It’s been a challenging process, and any guidance you can provide would be invaluable.