Issues with Adding New GPU Servers to Magic Castle Cluster #331
Comments
Hi Oscar,
Problems 2 and 3 stem from problem 1. When the GPU drivers cannot be installed properly, slurmd won't start and the node will never become available for jobs.
1. Which version of Magic Castle are you using?
2. Which cloud provider are you using?
3. What image / operating system are you using?
It should be pretty straightforward to find the culprit of your problem once you provide these three pieces of information.
Best,
Felix |
Many thanks for the reply, it is much appreciated.
I attach the main.tf file, but the responses to your questions are as follows:
1. Magic Castle is 13.5.0
2. The cloud is Azure
3. The OS is AlmaLinux 9-gen2
module "azure" {
source = "./azure"
config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git
"
config_version = "13.5.0"
image = {
publisher = "almalinux",
offer = "almalinux-x86_64",
sku = "9-gen2",
version = "9.4.2024050902"
}
instances = {
mgmt = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet",
"nfs"] },
login = { type = "Standard_B2s", count = 1, tags = ["login", "public",
"proxy"] },
node = { type = "Standard_B2s", count = 5, tags = ["node"] },
gpu-node = { type = "Standard_NV6ads_A10_v5", count = 3, tags =
["node", "gpu-node"] }
|
Could you try from scratch, but using the latest beta release instead? https://github.com/ComputeCanada/magic_castle/releases/tag/14.0.0-beta.6
14.0.0 is just days away from being officially released, and it will probably solve your problem. |
Many thanks for your reply.
I will test it later today, but if I rebuild the system from scratch, will I keep the users I created and their data, or will everything be deleted so that I have to recreate the users again?
|
Yes, unfortunately you will have to recreate the users and upload the data if you start from scratch. You could start a new cluster next to the one you already have and move the data over before destroying the old one, but you will still have to recreate the users. |
Dear Félix,
Thank you for your assistance with my previous inquiry.
I have destroyed the previous installation and installed the new one, but it is not working. The new configuration is below; I took the latest version of Magic Castle and I am using the newest AlmaLinux version:
```hcl
module "azure" {
  source         = "./azure"
  config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
  config_version = "14.0.0-beta.7"
  cluster_name   = "hpcie"
  domain         = "labs.faculty.ie.edu"
  # Using the Azure CLI, you can list the image versions that are available to use. For example,
  # az vm image list --location eastus --publisher almalinux --offer almalinux-x86_64 --sku 9-gen2 --all --output table
  # az vm image list --location eastus --publisher almalinux --offer almalinux-arm --sku 9-arm-gen2 --all --output table
  # (Note: available versions may be location specific!)
  image = {
    publisher = "almalinux",
    offer     = "almalinux-x86_64",
    sku       = "9-gen2",
    version   = "9.4.2024050902"
  }
  instances = {
    mgmt  = { type = "Standard_DS2_v2", count = 1, tags = ["mgmt", "puppet", "nfs"] },
    login = { type = "Standard_DS1_v2", count = 1, tags = ["login", "public", "proxy"] },
    node  = { type = "Standard_DS1_v2", count = 4, tags = ["node"] },
    gpu   = { type = "Standard_NV6ads_A10_v5", count = 2, tags = ["gpu-node"] }
  }
```
*Issues Encountered:*
When I try to apply the Terraform configuration to deploy the cluster with the new GPU nodes, I receive the following error:
```
Error: static IP allocation must be used when creating Standard SKU public IP addresses

  with module.azure.azurerm_public_ip.public_ip["gpu1"],
  on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
  18: resource "azurerm_public_ip" "public_ip" {
```
This error repeats for each public IP resource being created.
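Since this behaviour depends on which AzureRM provider version Terraform selected (the provider's default public IP SKU is what triggers the static-allocation requirement), a quick check along these lines can confirm it. This is a minimal sketch, assuming the commands are run from the cluster's Terraform folder after `terraform init`:
```bash
# Show which provider versions Terraform selected for this configuration.
terraform providers

# The exact azurerm version is also pinned in the dependency lock file.
grep -A 2 'registry.terraform.io/hashicorp/azurerm' .terraform.lock.hcl
```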
*Troubleshooting Steps and Changes Tried:*
1. *Understanding the Error:*
   - The error suggests that when creating Standard SKU public IP addresses, the allocation_method must be set to "Static", but in the configuration some public IPs are set to "Dynamic".
2. *Examining network.tf:*
   Here's the relevant portion of my network.tf:
```
# Create public IPs
resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name
  allocation_method   = contains(each.value.tags, "public") ? "Static" : "Dynamic"
}
```
3. *Attempted Fixes:*
   - *Option 1:* Explicitly set the sku to "Basic" in the azurerm_public_ip resource to allow "Dynamic" allocation:
```
resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name
  allocation_method   = contains(each.value.tags, "public") ? "Static" : "Dynamic"
  sku                 = "Basic"
}
```
   - *Result:* The error was resolved, but I'm unsure whether using the Basic SKU is appropriate for my use case (see the verification sketch after this list).
4. *Constraints:*
   - I prefer not to modify the module files (network.tf) directly, to keep the deployment process consistent and maintainable.
   - I attempted to make the changes in main.tf to resolve the issue without modifying network.tf, but was unsuccessful.
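To double-check what the Basic-SKU workaround actually created, the deployed public IPs can be inspected with the Azure CLI. A rough sketch, where `<resource-group>` and `<ip-name>` are placeholders for the cluster's resource group and one of the generated public IP names:
```bash
# List the public IPs Terraform created, with their allocation method.
az network public-ip list --resource-group <resource-group> --output table

# Full JSON for a single address, including its "sku" block.
az network public-ip show --resource-group <resource-group> --name <ip-name>
```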
*Questions:*
- Is there a recommended way to address this issue without modifying the module files?
- Is there an updated version of Magic Castle that resolves this problem?
- If I upgrade to version 14.0.0-beta.7 as suggested, will it resolve this issue, and what are the implications for existing users and data?
*Additional Information:*
- I'm concerned about redeploying the cluster from scratch due to the potential loss of existing user data.
- If upgrading to the latest beta version is the best solution, could you advise on the best way to migrate existing data and users?
Thank you for your assistance.
Best regards,
Oscar Diez
|
Hi Oscar,
In the future, I would appreciate it if you could use the GitHub web interface to comment on issues. Replying via email disables the markdown rendering, which makes your comments somewhat tougher to read.
Your assessment of the issue is correct. A recent change in the AzureRM Terraform provider changed the default value of the `sku` attribute of `azurerm_public_ip`.
To your questions:
One final remark about the tag: |
Many thanks.
I tried to use this to solve it; it creates the servers, but afterwards I can only connect via ssh to the login1 server.
The cluster is not working and it is not starting any services. |
Hi Oscar,
Sorry, it appears I got the sku wrong in my patch. Since the
Can you clarify this statement?
> I can only connect via ssh to the login1 server.
Does it mean that you cannot ssh into other instances from login1, or that you cannot SSH directly from the internet to any instance other than login1?
The absence of running services typically indicates that the configuration with Puppet has either not finished or has encountered a problem. You can look at
Best, |
Hi, I have installed version 13.5.0 of Magic Castle again in parallel, and it creates the cluster, but I still have the issues with the GPU server. I include the information about the cluster below. With the new version 14.0.0 the cluster is not being created properly.
```hcl
module "azure" {
  cluster_name = "hpcie"
  image = {
  instances = {
```
|
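Since the original problem was the NVIDIA driver installation, a quick sanity check on the GPU node itself shows whether the A10 is even visible and whether the driver actually loaded. A sketch; `nvidia-smi` only exists once the driver packages installed successfully:
```bash
# On gpu-node1: is the GPU visible on the PCI bus?
lspci -nn | grep -i nvidia

# Is the NVIDIA kernel module loaded?
lsmod | grep -i nvidia

# Does the driver answer? (only available after the NVIDIA packages are installed)
nvidia-smi
```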
@OscarDiez : good news, I think I found the issue. The security group was not properly associated with the instances. The fix will be included in the next release, 14.1.2, which should come out today. |
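To confirm whether an existing deployment is affected, the NIC-to-security-group association can be checked with the Azure CLI. A rough sketch, where `<resource-group>` is a placeholder and the `networkSecurityGroup.id` query path is an assumption about the CLI's output shape:
```bash
# List the cluster's NICs and the security group (if any) attached to each one;
# an empty value in the nsg column means no NSG is associated with that NIC.
az network nic list --resource-group <resource-group> \
  --query "[].{nic:name, nsg:networkSecurityGroup.id}" --output table
```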
Many thanks. I ran it again and initially got an error. This is the error I get when creating the cluster:
But when I ran it a second time, it created the cluster. I can connect to the Jupyter environment and launch a session on a normal node, but not on the GPU node. It tries, but I get a lot of messages saying "pending in queue." Also, I can only connect via ssh from my laptop to the login1 server. It does not work for the others (mgmt1 or the nodes...), and when I try to connect from login1 to mgmt1 it asks me for a password; it is not using the key file.
Info about the config file:
```hcl
terraform {
required_version = ">= 1.4.0"
}
variable "pool" {
description = "Slurm pool of compute nodes"
default = []
}
module "azure" {
source = "./azure"
config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
config_version = "14.1.2"
cluster_name = "hpc"
domain = "labs.faculty.ie.edu"
image = {
publisher = "almalinux",
offer = "almalinux-x86_64",
sku = "9-gen2",
version = "9.3.2023111602"
}
instances = {
#mgmt = { type = "Standard_DS2_v2", count = 1, tags = ["mgmt", "puppet", "nfs"] },
#login = { type = "Standard_DS1_v2", count = 1, tags = ["login", "public", "proxy"] },
#node = { type = "Standard_DS1_v2", count = 1, tags = ["node"] }
mgmt = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet", "nfs"] },
login = { type = "Standard_B2s", count = 1, tags = ["login", "public", "proxy"] },
node = { type = "Standard_B2s", count = 2, tags = ["node"] },
gpu-node = { type = "Standard_NV6ads_A10_v5", count = 1, tags = ["node"] }
}
```
And when checking the status of puppet on login1:
Please let me know if you need anything else. |
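Two checks that would narrow this down, sketched below: asking Slurm why the GPU job stays pending, and using SSH agent forwarding so the login1 to mgmt1 hop reuses the key from the laptop. The `centos` user and the gpu-node1/mgmt1 names come from this cluster; `<login1-public-address>` is a placeholder:
```bash
# From login1: why is the job pending, and what does Slurm say about the node?
squeue -u $USER --long        # the REASON column explains the pending state
sinfo -R                      # reasons recorded for down/drained nodes
scontrol show node gpu-node1

# From the laptop: forward the SSH agent so login1 can hop to mgmt1 with the same key.
ssh -A centos@<login1-public-address>
ssh mgmt1                     # run on login1; should now use the forwarded key instead of a password
```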
Dear Felix-Antoine, do you need any extra logs or information from my side? |
Sorry Oscar, I was away at Supercomputing when you last wrote.
We'll look into scheduling a call if you cannot find out why the GPU node does not work. |
Many thanks for the swift reply, and do not worry. I hope you had a good time at SC24.
Regarding the problem with the server, what I get from sinfo is:
```
[centos@login1
```
From the gpu-node, when I execute `journalctl -u puppet`:
```
Nov 27 18:51:57 gpu-node1.int.hpc.labs.faculty.ie.edu systemd[1]: Started Puppet agent.
```
And on the mgmt server puppet looks fine:
```
[centos@mgmt1 ~]$ sudo systemctl status puppetserver
Nov 27 18:52:10 mgmt1.int.hpc.labs.faculty.ie.edu puppetserver[1308]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflect>
```
and it is listening. And from the gpu-node I can see the mgmt1 server:
```
[centos@gpu-node1 ~]$ nslookup mgmt1
Name: mgmt1.int.hpc.labs.faculty.ie.edu
[centos@gpu-node1 ~]$ ^C
[centos@gpu-node1 ~]$ sudo journalctl -u puppetserver
This file can be used to override the default puppet settings.
See the following links for more details on what settings are available:
- https://puppet.com/docs/puppet/latest/config_important_settings.html
- https://puppet.com/docs/puppet/latest/config_about_settings.html
- https://puppet.com/docs/puppet/latest/config_file_main.html
- https://puppet.com/docs/puppet/latest/configuration.html
[centos@gpu-node1 ~]$
```
|
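A note on that last command: `puppetserver` only runs on mgmt1 (the instance tagged "puppet"), so `journalctl -u puppetserver` on gpu-node1 is not expected to show anything useful. On the GPU node itself, a foreground agent run plus a look at slurmd is usually more telling. A minimal sketch, assuming the stock Puppet install path:
```bash
# On gpu-node1: run the Puppet agent in the foreground and watch for failing resources
# (the NVIDIA driver packages and slurmd are both managed by Puppet on these nodes).
sudo /opt/puppetlabs/bin/puppet agent -t

# Then check whether slurmd came up, and what it logged if it did not.
sudo systemctl status slurmd
sudo journalctl -u slurmd --no-pager | tail -n 50
```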
Dear Felix-Antoine, do you need any extra logs or information from my side? |
I’ve been working on adding 3 new GPU servers to the Magic Castle cluster, but unfortunately, I’ve been facing multiple issues with the setup, and I’m at a bit of a standstill.
Issues Encountered:
I’ve been trying to get the NVIDIA drivers and kernel modules properly installed, but Puppet keeps returning the following error:
As a result, several stages are being skipped due to failed dependencies, including services like nvidia-persistenced and nvidia-dcgm. Despite manually trying to install the correct drivers (such as nvidia-driver-cuda), the error persists.
I’ve checked the logs and Puppet config files but haven’t been able to pinpoint the root cause. Here’s a portion of the error from the Puppet run:
SLURM also seems to be having issues with recognizing the new nodes. The nodes (gpu-node[1-3]) are showing up as down# in SLURM:
When I try to submit jobs to these nodes, I get the following error:
Batch job submission failed: Invalid account or account/partition combination specified
Additionally, jobs remain pending with the reason:
I’ve checked the slurmd service on the nodes and confirmed that it’s running. I’ve also reviewed the following logs and config files:
- /var/log/slurmctld.log on the controller shows node availability issues.
- /var/log/slurmd.log on the GPU nodes themselves doesn’t reveal much beyond the standard communication errors.
- The slurm.conf file appears to correctly define the GPU nodes, but they are still marked as down# in SLURM.
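For the down# state and the account/partition rejection, the following commands show the reason Slurm recorded for a node and which account/partition combinations the submitting user is actually allowed to use. A sketch; the node name is taken from the messages above:
```bash
# Why does Slurm consider the node down? The Reason field usually names the cause
# (for example a GRES/GPU count mismatch, slurmd not reachable, or too little RealMemory).
scontrol show node gpu-node1 | grep -i -E 'state|reason'
sinfo -R

# Which accounts and partitions is the user actually associated with?
sacctmgr show associations format=cluster,account,user,partition
```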
Attempts and Outcome
I’ve tried multiple fixes over the last few days, including:
- Manually installing the drivers and reconfiguring Puppet.
- Restarting SLURM and resuming the nodes via scontrol.
- Ensuring Munge is running properly on all nodes (see the cross-node check sketched after this list).
- Updating the SLURM node state using scontrol update nodename=gpu-node1 state=RESUME.
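For the Munge point above, the classic cross-node test is to encode a credential on one host and decode it on another; matching keys and clocks give a success status. A sketch, assuming SSH from the controller to the GPU node works:
```bash
# Verify munged is running locally.
systemctl status munge

# Encode a credential here and decode it on the GPU node;
# "STATUS: Success (0)" on the remote side means the munge keys and clocks agree.
munge -n | ssh gpu-node1 unmunge
```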
Despite my best efforts, the nodes remain unavailable for job scheduling and for spawning sessions via Jupyter, and I’m starting to feel a bit desperate at this point.
I would really appreciate your help with this issue or any pointers to documentation or someone who could assist. It’s been a challenging process, and any guidance you can provide would be invaluable.