feat: support for launching an MAPDL instance in an SLURM HPC cluster #3497

Merged
merged 145 commits into from
Oct 29, 2024
145 commits
ce11a7e
feat: adding env vars needed for multinode
germa89 Oct 4, 2024
61ad61b
feat: adding env vars needed for multinode
germa89 Oct 4, 2024
b50eeb6
Merge branch 'feat/passing-tight-integration-env-vars-to-MAPDL' of ht…
germa89 Oct 4, 2024
e9b91d4
feat: renaming hpc detection argument
germa89 Oct 7, 2024
c714d39
docs: adding documentation
germa89 Oct 7, 2024
492345b
chore: adding changelog file 3466.documentation.md
pyansys-ci-bot Oct 7, 2024
a289dab
feat: adding env vars needed for multinode
germa89 Oct 4, 2024
604bbf8
feat: renaming hpc detection argument
germa89 Oct 7, 2024
1d29651
docs: adding documentation
germa89 Oct 7, 2024
96929a8
chore: adding changelog file 3466.documentation.md
pyansys-ci-bot Oct 7, 2024
9b8e0e9
Merge branch 'feat/passing-tight-integration-env-vars-to-MAPDL' of ht…
germa89 Oct 7, 2024
6ab1d65
fix: vale issues
germa89 Oct 7, 2024
e45d2e5
chore: To fix sphinx build
germa89 Oct 7, 2024
bb2b90a
docs: expanding a bit troubleshooting advices and small format fix
germa89 Oct 7, 2024
330f33c
docs: fix vale
germa89 Oct 7, 2024
26f6dbd
Merge branch 'feat/passing-tight-integration-env-vars-to-MAPDL' of ht…
germa89 Oct 7, 2024
ac54f2c
fix: nproc tests
germa89 Oct 7, 2024
6985ee4
feat: adding env vars needed for multinode
germa89 Oct 4, 2024
03a05e6
feat: renaming hpc detection argument
germa89 Oct 7, 2024
d9e3b0d
docs: adding documentation
germa89 Oct 7, 2024
34bcfc4
chore: adding changelog file 3466.documentation.md
pyansys-ci-bot Oct 7, 2024
3bc1cc6
fix: vale issues
germa89 Oct 7, 2024
0f1606b
docs: fix vale
germa89 Oct 7, 2024
89552c9
docs: expanding a bit troubleshooting advices and small format fix
germa89 Oct 7, 2024
c3c6506
fix: nproc tests
germa89 Oct 7, 2024
db963c4
revert: "chore: To fix sphinx build"
germa89 Oct 7, 2024
7b386d0
chore: Merge branch 'feat/passing-tight-integration-env-vars-to-MAPDL…
germa89 Oct 7, 2024
1e31519
docs: clarifying where everything is running.
germa89 Oct 7, 2024
f8177a1
Merge branch 'main' into feat/passing-tight-integration-env-vars-to-M…
germa89 Oct 7, 2024
5c7967c
docs: expanding bash example
germa89 Oct 8, 2024
880a6b8
tests: fix
germa89 Oct 15, 2024
3cd005c
chore: merge remote-tracking branch 'origin/main' into feat/passing-t…
germa89 Oct 17, 2024
7514c31
docs: adding `PYMAPDL_NPROC` to env var section
germa89 Oct 17, 2024
4ccb146
feat: adding 'pymapdl_proc' to non-slurm run. Adding tests too.
germa89 Oct 17, 2024
fdf00d1
docs: fix vale issue
germa89 Oct 17, 2024
4aa477d
docs: fix vale issue
germa89 Oct 17, 2024
4dadc1d
fix: replacing env var name
germa89 Oct 17, 2024
f8c4994
chore: merge branch 'feat/accepting-nproc-env-var-even-if-we-are-not-…
germa89 Oct 17, 2024
5de0ab5
feat: first 'launch_mapdl_on_cluster` draft
germa89 Oct 17, 2024
fec3113
feat: added arguments to 'launch_mapdl_on_cluster'.
germa89 Oct 17, 2024
de403fd
feat: better error messages. Created 'generate_sbatch_command'.
germa89 Oct 17, 2024
d8348c4
refactor: rename 'detect_HPC' to 'detect_hpc'. Introducing 'launch_on…
germa89 Oct 17, 2024
7a6f7f0
refactor: move all the functionality to launch_mapdl
germa89 Oct 17, 2024
d0c3f25
feat: launched is fixed now in 'launcher' silently.
germa89 Oct 17, 2024
bd4606c
refactor: using `PYMAPDL_RUNNING_ON_HPC` as env var.
germa89 Oct 18, 2024
75280c6
chore: adding changelog file 3497.documentation.md [dependabot-skip]
pyansys-ci-bot Oct 18, 2024
f075702
refactor: rename to `scheduler_args`
germa89 Oct 18, 2024
3136616
fix: launching issues
germa89 Oct 18, 2024
80b96d5
fix: tests
germa89 Oct 18, 2024
e9ba446
docs: formatting changes.
germa89 Oct 18, 2024
993f7ff
docs: more cosmetic changes.
germa89 Oct 18, 2024
03aab7e
tests: adding 'launch_grpc' testing.
germa89 Oct 18, 2024
e90a9cb
Merge branch 'main' into feat/passing-tight-integration-env-vars-to-M…
germa89 Oct 21, 2024
22953a5
tests: adding some unit tests
germa89 Oct 21, 2024
60bf932
fix: unit tests
germa89 Oct 21, 2024
d027edd
chore: adding changelog file 3466.documentation.md [dependabot-skip]
pyansys-ci-bot Oct 21, 2024
2073f5a
chore: merge branch 'feat/passing-tight-integration-env-vars-to-MAPDL…
germa89 Oct 21, 2024
c41aa1e
chore: merge remote-tracking branch 'origin/main' into feat/adding-sb…
germa89 Oct 21, 2024
83a1d79
fix: adding missing import
germa89 Oct 21, 2024
6fb698d
refactoring: `check_mapdl_launch_on_hpc` and addressing codacity issues
germa89 Oct 21, 2024
524c4b4
fix: test
germa89 Oct 21, 2024
58549fb
refactor: exit method. Externalising to _exit_mapdl function.
germa89 Oct 23, 2024
327538e
fix: not running all tests.
germa89 Oct 23, 2024
d8f77a9
tests: adding test to __del__.
germa89 Oct 23, 2024
775c893
refactor: patching exit to avoid raising exception. I need to fix thi…
germa89 Oct 23, 2024
5e14add
refactor: not asking for version or checking exec_file path if 'launc…
germa89 Oct 23, 2024
b577f64
tests: increasing coverage
germa89 Oct 23, 2024
ee60582
test: adding stack for patching MAPDL launching.
germa89 Oct 23, 2024
f7f9572
refactor: to allow more coverage
germa89 Oct 23, 2024
7479173
feat: avoid checking the underlying processes when running on HPC
germa89 Oct 23, 2024
338e8ac
tests: increasing coverage
germa89 Oct 23, 2024
cf45184
chore: adding coverage to default pytesting. Adding _commands for che…
germa89 Oct 23, 2024
685640b
chore: merge remote-tracking branch 'origin/main' into feat/adding-sb…
germa89 Oct 23, 2024
715e3a7
fix: remote launcher
germa89 Oct 23, 2024
8ae518e
fix: raising exceptions in __del__ method
germa89 Oct 23, 2024
c7b9ede
fix: weird missing reference (import) when exiting
germa89 Oct 23, 2024
7e12b2e
chore/making sure we regress to the right state after the tests
germa89 Oct 23, 2024
7407411
test: fix test
germa89 Oct 23, 2024
76316c4
chore: merge remote-tracking branch 'origin/main' into feat/adding-sb…
germa89 Oct 23, 2024
a61f649
fix: not checking the mode
germa89 Oct 24, 2024
a14da83
refactor: reorg ip section on init. Adding better str representation …
germa89 Oct 24, 2024
b17ec37
feat: avoid killing MAPDL if not `finish_job_on_exit`. Adding also a …
germa89 Oct 24, 2024
b8898a8
feat: raising error if specifying IP when `launch_on_hpc`.
germa89 Oct 24, 2024
24cf555
feat: increasing grpc error handling options to 3s or 5 attempts.
germa89 Oct 24, 2024
4614b6e
feat: renaming to scheduler_options. Using variable default start_tim…
germa89 Oct 24, 2024
dc801f9
refactor: added types
germa89 Oct 24, 2024
ffb3ea8
refactor: launcher args order
germa89 Oct 24, 2024
519d4bb
refactor: tests
germa89 Oct 24, 2024
9269c27
chore: merge branch 'feat/adding-sbatch-support' of https://github.co…
germa89 Oct 24, 2024
71feaad
fix: reusing connection attr.
germa89 Oct 24, 2024
9989d20
chore: merge remote-tracking branch 'origin/main' into feat/adding-sb…
germa89 Oct 24, 2024
6461a2b
fix: pass start_timeout to `get_job_info`.
germa89 Oct 24, 2024
8fb5103
fix: test
germa89 Oct 24, 2024
64322a8
chore: Merge branch 'feat/adding-sbatch-support' of https://github.co…
germa89 Oct 24, 2024
00b1faa
fix: test
germa89 Oct 25, 2024
64f6e98
tests: not requiring warning if on minimal since ATP is not present.
germa89 Oct 25, 2024
55e09fc
feat: simplifying directory property
germa89 Oct 25, 2024
837e331
feat: using cached version of directory.
germa89 Oct 25, 2024
0ad8512
feat: simplifying directory property
germa89 Oct 25, 2024
07a4cf9
chore: adding changelog file 3517.miscellaneous.md [dependabot-skip]
pyansys-ci-bot Oct 25, 2024
4c6d122
test: adding test
germa89 Oct 25, 2024
f21ac20
Merge branch 'refactor--simplyfing-directory-setter' of https://githu…
germa89 Oct 25, 2024
f0a1423
feat: caching directory in cwd
germa89 Oct 25, 2024
faed340
refactor: mapdl patcher
germa89 Oct 25, 2024
f3438b5
feat: caching directory in cwd
germa89 Oct 25, 2024
6c7f718
feat: caching directory for sure.
germa89 Oct 25, 2024
d2e70be
feat: caching dir at the cwd level.
germa89 Oct 25, 2024
393d70d
feat: retry mechanism inside /INQUIRE
germa89 Oct 25, 2024
51f528b
feat: changing exception message
germa89 Oct 25, 2024
bc7a005
feat: adding tests
germa89 Oct 25, 2024
87711b7
feat: caching directory
germa89 Oct 25, 2024
bf141c6
chore: merge branch 'refactor--simplyfing-directory-setter' into feat…
germa89 Oct 25, 2024
b2faf93
chore: adding changelog file 3517.added.md [dependabot-skip]
pyansys-ci-bot Oct 25, 2024
f294058
refactor: avoid else in while.
germa89 Oct 25, 2024
02212c2
Merge branch 'refactor--simplyfing-directory-setter' of https://githu…
germa89 Oct 25, 2024
6811cc2
refactor: using a temporary variable to avoid overwrite self._path
germa89 Oct 25, 2024
4b9648f
fix: not keeping state between tests
germa89 Oct 25, 2024
ed714bd
fix: making sure the state is reset between tests
germa89 Oct 25, 2024
4071a71
chore: merge branch 'main' into refactor--simplyfing-directory-setter
germa89 Oct 25, 2024
d303997
chore: Merge branch 'main' into feat/adding-sbatch-support
germa89 Oct 25, 2024
42a3c92
chore: merge branch 'refactor--simplyfing-directory-setter' into feat…
germa89 Oct 25, 2024
ed2eb77
fix: warning when exiting.
germa89 Oct 25, 2024
521473f
fix: test
germa89 Oct 25, 2024
31fc90c
chore:merge remote-tracking branch 'origin/refactor--simplyfing-direc…
germa89 Oct 25, 2024
0aa6d0c
feat: using a trimmed version for delete.
germa89 Oct 25, 2024
6da6c0e
chore: merge branch 'main' into feat/adding-sbatch-support
germa89 Oct 25, 2024
a460227
refactor: test to pass
germa89 Oct 25, 2024
21236e3
refactor: removing all cleaning from __del__ except ending HPC job.
germa89 Oct 25, 2024
7298519
refactor: changing `detect_hpc` with `running_on_hpc`.
germa89 Oct 25, 2024
737757f
docs: adding-sbatch-support (#3513)
germa89 Oct 25, 2024
78fe1bd
feat: avoid exceptions on `__del__`
germa89 Oct 28, 2024
9bcdc9d
tests: adding tests for get_port and get_ip
germa89 Oct 28, 2024
c2c666e
feat: using a submitter function for grouping.
germa89 Oct 28, 2024
89e510d
tests: attempting clean exit
germa89 Oct 28, 2024
b047e12
feat: externalising to function getting the batchhost
germa89 Oct 28, 2024
db0c394
tests: increasing coverage
germa89 Oct 28, 2024
aac941d
tests: fix
germa89 Oct 28, 2024
c5f54a5
fix: doc builds
germa89 Oct 28, 2024
3063221
tests: increasing coverage
germa89 Oct 28, 2024
cddb76b
fix: not passing args
germa89 Oct 28, 2024
a08b378
tests: increase coverage
germa89 Oct 28, 2024
e3cbf01
fix: tests
germa89 Oct 28, 2024
3b799a2
fix: fixture
germa89 Oct 29, 2024
f89df70
ci: uploading bandit reports as artifact.
germa89 Oct 29, 2024
ac202b4
docs: adding descriptor to phrase
germa89 Oct 29, 2024
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -147,6 +147,7 @@ jobs:
token: ${{ secrets.PYANSYS_CI_BOT_TOKEN }}
python-package-name: ${{ env.PACKAGE_NAME }}
dev-mode: ${{ github.ref != 'refs/heads/main' }}
upload-reports: True

docs-build:
name: "Build documentation"
@@ -774,6 +775,7 @@ jobs:
env:
ON_LOCAL: true
ON_UBUNTU: true
TESTING_MINIMAL: true

steps:
- name: "Install Git and checkout project"
1 change: 1 addition & 0 deletions doc/changelog.d/3497.documentation.md
@@ -0,0 +1 @@
feat: support for launching an MAPDL instance in an SLURM HPC cluster
1 change: 1 addition & 0 deletions doc/changelog.d/3513.documentation.md
@@ -0,0 +1 @@
docs: adding-sbatch-support
229 changes: 229 additions & 0 deletions doc/source/user_guide/hpc/launch_mapdl_entrypoint.rst
@@ -0,0 +1,229 @@

Interactive MAPDL instance launched from the login node
=======================================================

Starting the instance
---------------------

If you are already logged in to a login node, you can launch an MAPDL instance as a SLURM job and
connect to it.
To do this, run these commands on the login node:

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> mapdl = launch_mapdl(launch_on_hpc=True)

PyMAPDL submits a job to the scheduler using the appropriate commands.
For SLURM, it uses the ``sbatch`` command with the ``--wrap`` argument
to pass the command line that starts MAPDL.
You can specify other scheduler arguments with the ``scheduler_options``
argument as a Python :class:`dict`:

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> scheduler_options = {"nodes": 10, "ntasks-per-node": 2}
    >>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=20, scheduler_options=scheduler_options)


.. note::
    PyMAPDL cannot infer the number of CPUs that you are requesting from the scheduler.
    Hence, you must specify this value using the ``nproc`` argument.
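
Because ``nproc`` must agree with the resources you request, one option is to derive it
from the same dictionary. The following is a minimal sketch, assuming that each MAPDL
process corresponds to one SLURM task; ``nproc_from_scheduler_options`` is a hypothetical
helper, not part of PyMAPDL:

```python
from typing import Mapping, Union


def nproc_from_scheduler_options(options: Mapping[str, Union[int, str]]) -> int:
    """Return nodes * ntasks-per-node from a ``scheduler_options`` dict.

    Hypothetical helper: missing keys default to 1, and any other
    scheduler option is ignored.
    """
    nodes = int(options.get("nodes", 1))
    tasks_per_node = int(options.get("ntasks-per-node", 1))
    return nodes * tasks_per_node


scheduler_options = {"nodes": 10, "ntasks-per-node": 2}
nproc = nproc_from_scheduler_options(scheduler_options)
print(nproc)  # 20
```

You can then pass both values together, as in
``launch_mapdl(launch_on_hpc=True, nproc=nproc, scheduler_options=scheduler_options)``.
If you request CPUs in other ways (for example, with ``ntasks`` alone), this helper does
not account for them.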

The double dash (``--``) used in the long form of scheduler options
is added automatically if PyMAPDL detects that it is missing and the option
name is longer than one character.
For instance, the ``ntasks-per-node`` argument is submitted as ``--ntasks-per-node``.

Alternatively, you can pass the options as a single Python string (:class:`str`):

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> scheduler_options = "-N 10"
    >>> mapdl = launch_mapdl(launch_on_hpc=True, scheduler_options=scheduler_options)

.. warning::
    Because PyMAPDL is already using the ``--wrap`` argument, this argument
    cannot be used again.

The value of each scheduler argument is wrapped in single quotes (``'``).
This might cause parsing issues that can make the job fail after it has been
submitted successfully.
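
The rules above (adding missing dashes, quoting values, and reserving ``--wrap``) can be
sketched as follows. This is a hypothetical ``build_sbatch_command`` helper written for
illustration, not PyMAPDL's implementation. Prefixing single-character keys with one dash
and joining long options with ``=`` are assumptions, and ``mapdl -grpc`` is a placeholder
command line:

```python
import shlex
from typing import Mapping, Union


def build_sbatch_command(
    scheduler_options: Union[str, Mapping[str, object]], mapdl_cmd: str
) -> str:
    """Build an ``sbatch`` call that wraps ``mapdl_cmd`` (illustrative only)."""
    if isinstance(scheduler_options, str):
        # A string is passed through unchanged.
        if "--wrap" in scheduler_options:
            raise ValueError("'--wrap' is reserved for the MAPDL command line.")
        options = scheduler_options.strip()
    else:
        parts = []
        for key, value in scheduler_options.items():
            if key.lstrip("-") == "wrap":
                raise ValueError("'--wrap' is reserved for the MAPDL command line.")
            if key.startswith("-"):
                flag = key  # Dashes already supplied by the user.
            elif len(key) > 1:
                flag = f"--{key}"  # Long option, such as --ntasks-per-node.
            else:
                flag = f"-{key}"  # Short option, such as -N.
            separator = "=" if flag.startswith("--") else " "
            # Each value is wrapped in single quotes.
            parts.append(f"{flag}{separator}'{value}'")
        options = " ".join(parts)

    pieces = ["sbatch"]
    if options:
        pieces.append(options)
    pieces += ["--wrap", shlex.quote(mapdl_cmd)]
    return " ".join(pieces)


print(build_sbatch_command({"nodes": 10, "ntasks-per-node": 2}, "mapdl -grpc"))
# sbatch --nodes='10' --ntasks-per-node='2' --wrap 'mapdl -grpc'
```

A value that itself contains a single quote, such as the job name ``it's``, produces an
unbalanced quote in the generated command. This is one way the parsing issues mentioned
above can appear.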

PyMAPDL passes all of the user's environment variables
to the new job and to the MAPDL instance.
This is usually convenient because many environment variables are
needed to run the job or the MAPDL command.
For instance, the license server is normally stored in the :envvar:`ANSYSLMD_LICENSE_FILE` environment variable.
If you prefer not to pass all these environment variables to the job, use the SLURM
``--export`` argument to specify which ones to pass.
For more information, see the `SLURM documentation <slurm_docs_>`_.
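
SLURM's ``--export`` option takes ``ALL``, ``NONE``, or a comma-separated list of variable
names. Assuming you pass it through ``scheduler_options`` like any other option, a
hypothetical helper such as ``export_option`` can build that entry from the variables you
want to keep:

```python
import os
from typing import Dict, Iterable, Mapping


def export_option(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Return a ``scheduler_options`` entry that exports only ``names``.

    Names missing from ``environ`` are skipped. If none remain, ``NONE``
    is used so that no user environment variables are exported.
    """
    present = [name for name in names if name in environ]
    return {"export": ",".join(present) if present else "NONE"}


print(export_option(["ANSYSLMD_LICENSE_FILE", "PATH"]))
```

You can merge the result into your other options, for example
``scheduler_options = {"nodes": 2, **export_option(["ANSYSLMD_LICENSE_FILE"])}``.
Make sure to keep every variable that MAPDL needs, such as the license server variable.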


Working with the instance
-------------------------

Once the :class:`Mapdl <ansys.mapdl.core.mapdl.MapdlBase>` object has been created,
it behaves like any other :class:`Mapdl <ansys.mapdl.core.mapdl.MapdlBase>`
instance.
You can retrieve the IP address and the hostname of the MAPDL instance:

.. code:: pycon

    >>> mapdl.ip
    '123.45.67.89'
    >>> mapdl.hostname
    'node0'

You can also retrieve the SLURM job ID:

.. code:: pycon

    >>> mapdl.jobid
    10001

If you want to check whether the instance has been launched using a scheduler,
you can use the :attr:`mapdl_on_hpc <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.mapdl_on_hpc>`
attribute:

.. code:: pycon

    >>> mapdl.mapdl_on_hpc
    True
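
Because ``mapdl.jobid`` is a regular SLURM job ID, you can also inspect the job with
standard SLURM tools. A minimal sketch, assuming ``squeue`` is available on the login
node; ``squeue_command`` and ``job_status`` are hypothetical helpers:

```python
import shutil
import subprocess
from typing import List, Optional


def squeue_command(jobid: int) -> List[str]:
    """Return the ``squeue`` call that shows the state of a single job."""
    return ["squeue", "--jobs", str(jobid)]


def job_status(jobid: int) -> Optional[str]:
    """Run ``squeue`` for ``jobid`` and return its output.

    Returns ``None`` when ``squeue`` is not on the PATH (for example,
    when not running on a SLURM node).
    """
    if shutil.which("squeue") is None:
        return None
    result = subprocess.run(
        squeue_command(jobid), capture_output=True, text=True, check=False
    )
    return result.stdout
```

For example, ``print(job_status(mapdl.jobid))`` shows the job state as reported by
``squeue``, or ``None`` when ``squeue`` cannot be found.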


Sharing files
^^^^^^^^^^^^^

Most HPC clusters share the login node file system with the compute nodes,
which means that you do not need to do extra work to upload files to or download files
from the MAPDL instance. You only need to copy them to the directory where MAPDL is running.
You can obtain this location with the
:attr:`directory <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.directory>` attribute.
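
On a shared file system, placing a file in the MAPDL working directory is an ordinary file
copy. A minimal sketch, assuming the login node and the compute nodes share the file
system; ``stage_file`` is a hypothetical helper:

```python
import shutil
from pathlib import Path


def stage_file(source: str, run_directory: str) -> Path:
    """Copy ``source`` into ``run_directory`` and return the new path.

    Relies on ``run_directory`` being visible from the compute nodes.
    """
    destination = Path(run_directory) / Path(source).name
    shutil.copy2(source, destination)  # also preserves timestamps
    return destination
```

For example, ``stage_file("model.inp", mapdl.directory)`` makes ``model.inp`` available in
the MAPDL working directory.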

If no location is specified in the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>`
function, a temporary location is selected.
It is a good idea to set the ``run_location`` argument to a directory that is accessible
from all the compute nodes.
Normally, anything under ``/home/user`` is available to all compute nodes.
If you are unsure where you should launch MAPDL, contact your cluster administrator.
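
To follow this advice, you can create a fresh directory under a shared location for each
session and pass it as ``run_location``. This sketch assumes that your home directory is
mounted on every compute node; ``make_run_location`` and the ``~/mapdl_runs`` base
directory are hypothetical:

```python
import tempfile
from pathlib import Path


def make_run_location(base: str = "~/mapdl_runs") -> str:
    """Create and return a new, uniquely named directory under ``base``."""
    base_path = Path(base).expanduser()
    base_path.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates the directory itself, so two sessions never share it.
    return tempfile.mkdtemp(prefix="mapdl_", dir=str(base_path))
```

For example:
``mapdl = launch_mapdl(launch_on_hpc=True, run_location=make_run_location())``.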

Additionally, you can use the :meth:`upload <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.upload>`
and :meth:`download <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.download>` methods to
upload files to and download files from the MAPDL instance, respectively.
These methods do not require ``ssh`` or a similar connection.
However, for large files, you might want to consider alternatives, such as copying
them directly through the shared file system.


Exiting MAPDL
-------------

Exiting MAPDL, either intentionally or unintentionally, stops the job.
This behavior occurs because MAPDL is the main process of the job. Thus, when MAPDL finishes,
the scheduler considers the job done.

To exit MAPDL, you can use the :meth:`exit() <ansys.mapdl.core.Mapdl.exit>` method.
This method exits MAPDL and sends a signal to the scheduler to cancel the job.

.. code-block:: python

    mapdl.exit()
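
To make sure that ``exit()`` runs, and therefore that the job is cancelled, even if your
script raises an exception, you can wrap the session in ``try``/``finally``. A minimal
sketch using a hypothetical ``exiting`` context manager, similar in spirit to
``contextlib.closing`` but calling ``exit()`` instead of ``close()``:

```python
from contextlib import contextmanager
from typing import Iterator, Protocol, TypeVar


class SupportsExit(Protocol):
    def exit(self) -> None: ...


T = TypeVar("T", bound=SupportsExit)


@contextmanager
def exiting(session: T) -> Iterator[T]:
    """Yield ``session`` and call ``session.exit()`` even if an error occurs."""
    try:
        yield session
    finally:
        session.exit()
```

For example, ``with exiting(launch_mapdl(launch_on_hpc=True)) as mapdl:`` followed by your
MAPDL commands has the same effect as a ``try``/``finally`` block around your script.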

If the Python process running PyMAPDL finishes without errors and you have not
called the :meth:`exit() <ansys.mapdl.core.Mapdl.exit>` method, the garbage collector
kills the MAPDL instance and its job to save resources.

If you prefer that the job is not killed, set the following attribute on the
:class:`Mapdl <ansys.mapdl.core.mapdl.MapdlBase>` object:

.. code-block:: python

    mapdl.finish_job_on_exit = False


In this case, you should set a time limit on your job, for example with the SLURM
``--time`` option, to avoid having the job run longer than needed.
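
SLURM's ``--time`` option accepts formats such as ``hours:minutes:seconds`` and
``days-hours:minutes:seconds``. The following hypothetical ``slurm_time_limit`` helper
formats a ``datetime.timedelta`` accordingly, so that you can add the limit to
``scheduler_options``, assuming that the ``time`` key is turned into ``--time`` as
described earlier:

```python
import math
from datetime import timedelta


def slurm_time_limit(limit: timedelta) -> str:
    """Format ``limit`` as ``HH:MM:SS`` or, for a day or more, ``D-HH:MM:SS``.

    Seconds are rounded up so the limit is never shorter than requested.
    """
    total = math.ceil(limit.total_seconds())
    if total <= 0:
        raise ValueError("The time limit must be positive.")
    days, rest = divmod(total, 86_400)
    hours, rest = divmod(rest, 3_600)
    minutes, seconds = divmod(rest, 60)
    clock = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    return f"{days}-{clock}" if days else clock


scheduler_options = {"time": slurm_time_limit(timedelta(hours=2))}
print(scheduler_options)  # {'time': '02:00:00'}
```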


Handling crashes on an HPC
^^^^^^^^^^^^^^^^^^^^^^^^^^

If MAPDL crashes while running on an HPC cluster, the job finishes right away.
In this case, PyMAPDL loses its connection to MAPDL.
PyMAPDL tries to reconnect to the MAPDL instance up to 5 times, waiting
for up to 5 seconds.
If it is unsuccessful, you might get an error like this:

.. code-block:: text

    MAPDL server connection terminated unexpectedly while running:
    /INQUIRE,,DIRECTORY,,
    called by:
    _send_command

    Suggestions:
    MAPDL *might* have died because it executed a not-allowed command or ran out of memory.
    Check the MAPDL command output for more details.
    Open an issue on GitHub if you need assistance: https://github.com/ansys/pymapdl/issues
    Error:
    failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)
    Full error:
    <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"
    debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-10-24T08:25:04.054559811+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"}"
    >

The data of that job is available in the location given by the
:attr:`directory <ansys.mapdl.core.Mapdl.directory>` attribute.
To make this location easy to find, you should set it using the ``run_location`` argument.

While handling this exception, PyMAPDL also cancels the job to avoid leaking resources.
Therefore, the only option is to start a new instance by launching a new job using
the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>` function.
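
The retry parameters that this pull request sets in ``src/ansys/mapdl/core/errors.py`` are
``n_attempts = 5``, ``initial_backoff = 0.1``, and ``multiplier_backoff = 2``. The retry
loop itself is not shown here, so the following sketch assumes that the wait before retry
``i`` is ``initial_backoff * multiplier_backoff**i``. Under that assumption, the waits add
up to roughly 3.1 seconds; the actual timing depends on PyMAPDL's implementation.
``backoff_delays`` is a hypothetical helper:

```python
from typing import List


def backoff_delays(
    n_attempts: int = 5, initial_backoff: float = 0.1, multiplier_backoff: float = 2
) -> List[float]:
    """Return the assumed wait, in seconds, before each retry attempt."""
    return [initial_backoff * multiplier_backoff**i for i in range(n_attempts)]


delays = backoff_delays()
print(delays)  # [0.1, 0.2, 0.4, 0.8, 1.6]
print(round(sum(delays), 2))  # 3.1
```

With the previous values (``n_attempts = 3``, ``initial_backoff = 0.05``,
``multiplier_backoff = 3``), the same formula gives a total of about 0.65 seconds.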

User case on a SLURM cluster
----------------------------

Assume a user wants to start a remote MAPDL instance in an HPC cluster
to interact with it.
The user wants to request 10 nodes with 1 task per node (to avoid clashes
between MAPDL instances).
The user also wants to request 64 GB of RAM.
Because of administration logistics, the user must use the machines in
the ``supercluster01`` partition.
To make PyMAPDL launch such an instance on SLURM, run the following code:

.. code-block:: python

    from ansys.mapdl.core import launch_mapdl
    from ansys.mapdl.core.examples import vmfiles

    scheduler_options = {
        "nodes": 10,
        "ntasks-per-node": 1,
        "partition": "supercluster01",
        "mem": "64G",  # SLURM --mem option (memory per node); plain numbers are read as MB
    }
    mapdl = launch_mapdl(launch_on_hpc=True, nproc=10, scheduler_options=scheduler_options)

    num_cpu = mapdl.get_value("ACTIVE", 0, "NUMCPU")  # It should be equal to 10

    mapdl.clear()  # Not strictly needed.
    mapdl.prep7()

    # Run an MAPDL script
    mapdl.input(vmfiles["vm1"])

    # Let's solve again to get the solve printout
    mapdl.solution()
    output = mapdl.solve()
    print(output)

    mapdl.exit()  # Kill the MAPDL instance


PyMAPDL automatically sets MAPDL to read the job configuration (including machines,
number of CPUs, and memory), which allows MAPDL to use all the resources allocated
to that job.
62 changes: 47 additions & 15 deletions doc/source/user_guide/hpc/pymapdl.rst
@@ -19,35 +19,34 @@ on whether or not you run them both on the HPC compute nodes.
Additionally, you might be able to interact with them (``interactive`` mode)
or not (``batch`` mode).

For information on supported configurations, see :ref:`ref_pymapdl_batch_in_cluster_hpc`.
PyMAPDL takes advantage of HPC clusters to launch MAPDL instances
with increased resources.
PyMAPDL automatically sets these MAPDL instances to read the
scheduler job configuration (which includes machines, number
of CPUs, and memory), which allows MAPDL to use all the resources
allocated to that job.
For more information, see :ref:`ref_tight_integration_hpc`.

The following configurations are supported:

Since v0.68.5, PyMAPDL can take advantage of the tight integration
between the scheduler and MAPDL to read the job configuration and
launch an MAPDL instance that can use all the resources allocated
to that job.
For instance, if a SLURM job has allocated 8 nodes with 4 cores each,
then PyMAPDL launches an MAPDL instance which uses 32 cores
spawning across those 8 nodes.
This behavior can turn off if passing the :envvar:`PYMAPDL_ON_SLURM`
environment variable or passing the ``detect_HPC=False`` argument
to the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>` function.
* :ref:`ref_pymapdl_batch_in_cluster_hpc`.
* :ref:`ref_pymapdl_interactive_in_cluster_hpc_from_login`


.. _ref_pymapdl_batch_in_cluster_hpc:

Submit a PyMAPDL batch job to the cluster from the entrypoint node
==================================================================
Batch job submission from the login node
========================================

Many HPC clusters allow their users to log into a machine using
``ssh``, ``vnc``, ``rdp``, or similar technologies and then submit a job
to the cluster from there.
This entrypoint machine, sometimes known as the *head node* or *entrypoint node*,
This login machine, sometimes known as the *head node* or *entrypoint node*,
might be a virtual machine (VDI/VM).

In such cases, once the Python virtual environment with PyMAPDL is already
set and is accessible to all the compute nodes, launching a
PyMAPDL job from the entrypoint node is very easy to do using the ``sbatch`` command.
PyMAPDL job from the login node is very easy to do using the ``sbatch`` command.
When the ``sbatch`` command is used, PyMAPDL runs and launches an MAPDL instance in
the compute nodes.
No changes are needed to a PyMAPDL script to run it on a SLURM cluster.
@@ -98,6 +97,8 @@ job by setting the :envvar:`PYMAPDL_NPROC` environment variable to the desired v

(venv) user@entrypoint-machine:~$ PYMAPDL_NPROC=4 sbatch main.py

For more applicable environment variables, see :ref:`ref_environment_variables`.

You can also add ``sbatch`` options to the command:

.. code-block:: console
@@ -181,3 +182,34 @@ This bash script performs tasks such as creating environment variables,
moving files to different directories, and printing to ensure your
configuration is correct.


.. _ref_pymapdl_interactive_in_cluster_hpc:


.. _ref_pymapdl_interactive_in_cluster_hpc_from_login:

.. include:: launch_mapdl_entrypoint.rst


.. _ref_tight_integration_hpc:

Tight integration between MAPDL and the HPC scheduler
=====================================================

Since v0.68.5, PyMAPDL can take advantage of the tight integration
between the scheduler and MAPDL to read the job configuration and
launch an MAPDL instance that can use all the resources allocated
to that job.
For instance, if a SLURM job has allocated 8 nodes with 4 cores each,
then PyMAPDL launches an MAPDL instance that uses 32 cores
spread across those 8 nodes.
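
As an illustration of the arithmetic in this example, the following sketch estimates the
total core count from standard SLURM environment variables. It assumes one MAPDL process
per SLURM task, multiplied by the CPUs per task. ``cores_from_slurm_env`` is a hypothetical
helper, and the variables that PyMAPDL actually reads might differ:

```python
import os
from typing import Mapping


def cores_from_slurm_env(environ: Mapping[str, str] = os.environ) -> int:
    """Return SLURM_NTASKS * SLURM_CPUS_PER_TASK, defaulting each to 1."""
    ntasks = int(environ.get("SLURM_NTASKS", "1"))
    cpus_per_task = int(environ.get("SLURM_CPUS_PER_TASK", "1"))
    return ntasks * cpus_per_task


# 8 nodes with 4 tasks each, as in the example above.
job_env = {"SLURM_JOB_NUM_NODES": "8", "SLURM_NTASKS": "32"}
print(cores_from_slurm_env(job_env))  # 32
```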

You can turn off this behavior by setting the
:envvar:`PYMAPDL_RUNNING_ON_HPC` environment variable
to ``'false'`` or by passing the ``detect_hpc=False`` argument
to the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>` function.
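
A minimal sketch of how such a switch could be evaluated, assuming the following
precedence: an explicit ``detect_hpc`` argument wins, then a ``'false'`` value
(case-insensitive) in :envvar:`PYMAPDL_RUNNING_ON_HPC` turns detection off, and otherwise
detection depends on whether a SLURM job ID is present. ``hpc_detection_enabled`` and the
use of ``SLURM_JOB_ID`` are assumptions for illustration:

```python
import os
from typing import Mapping, Optional


def hpc_detection_enabled(
    detect_hpc: Optional[bool] = None, environ: Mapping[str, str] = os.environ
) -> bool:
    """Decide whether to read the scheduler job configuration (assumed logic)."""
    if detect_hpc is not None:
        return detect_hpc  # An explicit argument takes precedence.
    value = environ.get("PYMAPDL_RUNNING_ON_HPC")
    if value is not None and value.strip().lower() == "false":
        return False
    # Assumption: a SLURM job ID signals that we are inside a SLURM job.
    return "SLURM_JOB_ID" in environ
```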

Alternatively, you can override these settings by either specifying
custom settings in the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>`
function's arguments or using specific environment variables.
For more information, see :ref:`ref_environment_variables`.
3 changes: 2 additions & 1 deletion doc/source/user_guide/mapdl.rst
@@ -1092,6 +1092,7 @@ are unsupported.
| * ``LSWRITE`` | |:white_check_mark:| Available (Internally running in :attr:`Mapdl.non_interactive <ansys.mapdl.core.Mapdl.non_interactive>`) | |:white_check_mark:| Available | |:exclamation:| Only in :attr:`Mapdl.non_interactive <ansys.mapdl.core.Mapdl.non_interactive>` | |
+---------------+---------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+

.. _ref_environment_variables:

Environment variables
=====================
@@ -1189,7 +1190,7 @@ environment variable. The following table describes all arguments.
| | user@machine:~$ export PYMAPDL_MAPDL_VERSION=22.2 |
| | |
+---------------------------------------+----------------------------------------------------------------------------------+
| :envvar:`PYMAPDL_ON_SLURM` | With this environment variable set to ``FALSE``, you can avoid |
| :envvar:`PYMAPDL_RUNNING_ON_HPC` | With this environment variable set to ``FALSE``, you can prevent |
| | PyMAPDL from detecting that it is running on a SLURM HPC cluster. |
+---------------------------------------+----------------------------------------------------------------------------------+
| :envvar:`PYMAPDL_MAX_MESSAGE_LENGTH` | Maximum gRPC message length. If your |
1 change: 1 addition & 0 deletions doc/styles/config/vocabularies/ANSYS/accept.txt
@@ -53,6 +53,7 @@ CentOS7
Chao
ci
container_layout
CPUs
datas
delet
Dependabot
2 changes: 0 additions & 2 deletions pyproject.toml
@@ -148,8 +148,6 @@ src_paths = ["doc", "src", "tests"]
[tool.coverage.run]
source = ["ansys/pymapdl"]
omit = [
# omit commands
"ansys/mapdl/core/_commands/*",
# ignore legacy interfaces
"ansys/mapdl/core/mapdl_console.py",
"ansys/mapdl/core/jupyter.py",
6 changes: 3 additions & 3 deletions src/ansys/mapdl/core/errors.py
@@ -307,9 +307,9 @@ def wrapper(*args, **kwargs):
old_handler = signal.signal(signal.SIGINT, handler)

# Capture gRPC exceptions
n_attempts = 3
initial_backoff = 0.05
multiplier_backoff = 3
n_attempts = 5
initial_backoff = 0.1
multiplier_backoff = 2

i_attemps = 0
