Lustreinfiniband #289

chadnar2 (Contributor) commented Jun 5, 2020

  • lustre-ipoib: an implementation of Lustre using IP over InfiniBand (IPoIB)
  • lustre-rdma: an implementation of Lustre using native Remote Direct Memory Access (RDMA)

lustre_rdma_nvmedrives:
Changes to files to enable Infiniband functionality:
lfsmaster.sh
lfsoss.sh
lfsclient.sh
lfsrepo.sh
lfspkgs.sh

Added to install the new Mellanox OFED (MOFED) for the Lustre kernel: installMOFED.sh

Added to place the OSS drives correctly: installdrives.sh
*installdrives.sh takes about 15 minutes to run, so please either remove this entity or wait it out.

Added to install the correct Lustre kernel:
lustreinstall1.sh
lustreinstall2.sh

Added to pause after the MDS/OSS reboot: waitforreboot.sh

Narjit Chadha and others added 30 commits May 11, 2020 08:02
…over infiniband (IPoIB)

Changes to files to enable Infiniband functionality:
lfsmaster.sh
lfsoss.sh
lfsclient.sh
lfsrepo.sh

Addition for correct drives placement of OSSes : installdrives.sh
*installdrives.sh takes about 15 minutes to run, so please either remove this entity or wait it out.
…e using IP over infiniband (IPoIB) using the existing 700GB NVMe drives in the H series nodes

Changes to files to enable Infiniband functionality:
lfsmaster.sh
lfsoss.sh
lfsclient.sh
lfsrepo.sh
…over infiniband (IPoIB)

- lustre-rdma - This is a created implementation of Lustre using native Remote Direct Memory Access (RDMA)

Changes to files to enable Infiniband functionality:
lfsmaster.sh
lfsoss.sh
lfsclient.sh
lfsrepo.sh
lfspkgs.sh

Addition for the installation of new OFED : installOFED.sh

Addition for correct Lustre kernel : lustreinstall1.sh
Lustre packages : lustreinstall2.sh

Addition for rebooting of Lustre MDS/OSS: rebootlustre.sh
Addition for pause after MDS/OSS reboot : waitforreboot.sh

hmeiland (Collaborator) left a comment

  • Please use something smaller than hc44 for the headnode (it has no HPC/IB requirements).
  • Do not rely on sleep while waiting for az resources to be created; actively check instead.
  • Why modify the .ssh dir in the lsf scripts? That should already be set up.
  • waagent isn't the only one trying to manage sdb; cloud-init does so as well.
  • Why install OFED on a centos-hpc image?
  • waagent.conf is modified, but waagent is not restarted?
  • Is waagent.conf modified twice?
  • sakey for saskey?

Please update so I can start functional tests. Thanks!
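
For the "actively check instead of sleep" point, a minimal sketch of a polling helper is below. It is not taken from this PR: the function names `wait_until` and `vm_is_provisioned`, the timeout values, and the resource names are all hypothetical. It assumes the resource is a VM whose state can be read with `az vm show`; other resource types would need a different check.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 as soon as COMMAND succeeds, 1 on timeout.
wait_until() {
    local timeout="$1" interval="$2" deadline
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Hypothetical check: succeeds once the VM reports provisioningState
# "Succeeded". Arguments: resource group, VM name.
vm_is_provisioned() {
    local state
    state=$(az vm show --resource-group "$1" --name "$2" \
                --query provisioningState --output tsv 2>/dev/null) || return 1
    [ "$state" = "Succeeded" ]
}

# Example call (resource group and VM name are placeholders):
#   wait_until 900 15 vm_is_provisioned my-rg lustre-oss-0 || exit 1
```

Compared with a fixed sleep, this returns as soon as the resource is ready and fails loudly if it never becomes ready.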

chadnar2 (Contributor, Author) commented Jul 14, 2020

I have reduced the size of the headnode to 'Standard_D8s_v3', since there is no InfiniBand connectivity with the Lustre servers anyway.
These are not HPC images, hence the need to install OFED. MOFED never worked for IB Lustre when we tried to use the HPC image. This may be something to look at.
.ssh is now required for the root user too, not just hpcuser, because of the InfiniBand addition. The sudo functionality for hpcuser did not work for me, but this was a while ago.
The waagent restart comes from another Lustre script.
What do you suggest to actively check that the Lustre kernel has been installed and the node is up? (ssh -q hpcuser@hostname exit ??)
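
One possible shape for such a check, building on the `ssh -q hpcuser@hostname exit` idea, is sketched below. Instead of only testing that ssh answers, it asks the node for `uname -r` and waits until the reported release contains "lustre". That naming is an assumption about the installed kernel package and should be verified against the actual kernel; the function names, the default timeout, and the polling interval are hypothetical.

```shell
# remote_kernel HOST: prints the kernel release of HOST, or fails if the node
# is unreachable. BatchMode avoids hanging on a password prompt and
# ConnectTimeout bounds each attempt while the node is still rebooting.
remote_kernel() {
    ssh -q -o BatchMode=yes -o ConnectTimeout=5 "hpcuser@$1" uname -r
}

# lustre_kernel_up HOST: succeeds only if HOST answers over ssh AND its
# kernel release contains "lustre" (assumed naming of the Lustre kernel).
lustre_kernel_up() {
    local release
    release=$(remote_kernel "$1" 2>/dev/null) || return 1
    case "$release" in
        *lustre*) return 0 ;;
        *)        return 1 ;;
    esac
}

# wait_for_lustre_node HOST [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls until the node is back up on the Lustre kernel, or gives up.
wait_for_lustre_node() {
    local host="$1" timeout="${2:-900}" interval="${3:-10}" deadline
    deadline=$(( $(date +%s) + timeout ))
    until lustre_kernel_up "$host"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "$host: Lustre kernel not up after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Example call (hostname is a placeholder):
#   wait_for_lustre_node lustre-oss-0 1200 15 || exit 1
```

Checking the kernel release rather than plain reachability avoids a false positive when ssh answers before the reboot has started, while the node is still running the old kernel.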

3 participants