Backend information #148

Open
jameshcorbett opened this issue Apr 26, 2022 · 2 comments

@jameshcorbett
Collaborator

The libEnsemble team asked in a meeting today whether we could provide an interface for getting the list of nodes from within a job. That sounds reasonable to me, and I wonder whether we should generalize it into a backend interface for retrieving job information. The node list may be the only item we need, but the job ID could be useful too. I expect most of this would come down to reading environment variables.
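A minimal sketch of what such a backend interface could look like, assuming it only needs the job ID and the node list and reads both from environment variables. `JobInfo`, `get_job_info`, and the `_BACKENDS` table are hypothetical names for illustration. The variable names are ones these schedulers commonly set inside a job, but they should be checked against each scheduler's documentation. The node list is returned raw, because its format differs per backend (a compressed host-list expression for Slurm, a file path for PBS, a space-separated host list for LSF).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class JobInfo:
    """Hypothetical container for job information read from the environment."""
    backend: str
    job_id: str
    nodelist: Optional[str]  # raw, backend-specific format (not expanded)


# Hypothetical table: (backend name, job-ID variable, node-list variable).
# Assumption: these are the variables each scheduler sets inside a running job.
_BACKENDS = [
    ("slurm", "SLURM_JOB_ID", "SLURM_JOB_NODELIST"),  # e.g. "node[01-04]"
    ("pbs", "PBS_JOBID", "PBS_NODEFILE"),  # path to a node file
    ("lsf", "LSB_JOBID", "LSB_HOSTS"),  # space-separated host names
    ("flux", "FLUX_JOB_ID", None),  # no node-list variable assumed here
]


def get_job_info(env: Optional[Mapping[str, str]] = None) -> Optional[JobInfo]:
    """Return info for the first backend whose job-ID variable is set.

    Returns None when no known variable is present, e.g. when the code is
    not running inside a batch job.
    """
    if env is None:
        env = os.environ
    for backend, id_var, nodes_var in _BACKENDS:
        job_id = env.get(id_var)
        if job_id is not None:
            nodelist = env.get(nodes_var) if nodes_var else None
            return JobInfo(backend, job_id, nodelist)
    return None
```

Passing `env` explicitly keeps the lookup testable; with no argument it falls back to `os.environ`. Expanding Slurm's compressed node list (for example via `scontrol show hostnames`) would be a separate, backend-specific step.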

@jameshcorbett changed the title from "Backend interformation" to "Backend information" on Apr 26, 2022
@andre-merzky
Collaborator

andre-merzky commented May 6, 2022

FYI, RP (RADICAL-Pilot) exposes the following environment variables to tasks, in addition to other RP-specific variables:

```
$RP_TASK_ID          : unique task id
$RP_TASK_NAME        : user-specified task name
$RP_TASK_SANDBOX     : user-specified or derived task workdir
$RP_RANK             : task rank
$RP_RANKS            : number of task ranks
```
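A minimal sketch of how a task could read these variables, assuming `RP_RANK` and `RP_RANKS` hold decimal integers and that any variable may be absent (for example when the code runs outside RP). `RPTaskEnv` and `read_rp_task_env` are hypothetical helpers, not part of the RP API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RPTaskEnv:
    """Hypothetical view of the RP task variables listed above."""
    task_id: Optional[str]
    task_name: Optional[str]
    sandbox: Optional[str]
    rank: Optional[int]
    ranks: Optional[int]


def read_rp_task_env(env: Optional[Mapping[str, str]] = None) -> RPTaskEnv:
    """Read the RP task variables; missing ones become None."""
    if env is None:
        env = os.environ

    def as_int(name: str) -> Optional[int]:
        # Assumption: rank values are plain decimal integers.
        value = env.get(name)
        return int(value) if value is not None else None

    return RPTaskEnv(
        task_id=env.get("RP_TASK_ID"),
        task_name=env.get("RP_TASK_NAME"),
        sandbox=env.get("RP_TASK_SANDBOX"),
        rank=as_int("RP_RANK"),
        ranks=as_int("RP_RANKS"),
    )
```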

We also provide the equivalent of a nodefile: a slots file that tells the application which nodes, cores, and GPUs are used to run the task. Here is an example for an MPI task that uses one core and no GPUs per rank:

```
{'partition_id': None,
 'ranks': [{'core_map': [3],
            'gpu_map': [],
            'lfs': 0,
            'mem': 0,
            'node_id': 'c206-024',
            'node_name': 'c206-024'},
           {'core_map': [4],
            'gpu_map': [],
            'lfs': 0,
            'mem': 0,
            'node_id': 'c206-024',
            'node_name': 'c206-024'},
           {'core_map': [5],
            'gpu_map': [],
            'lfs': 0,
            'mem': 0,
            'node_id': 'c206-024',
            'node_name': 'c206-024'},
            ...
```

- `lfs`: local file storage allocated for the rank
- `mem`: memory allocated for the rank
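The node list libEnsemble asked for could be derived from this structure. A minimal sketch, assuming the slots data has already been loaded into a Python dict shaped like the example (how the file is stored on disk is not covered here) and that `core_map` is a flat list of core indices as shown. `nodes_from_slots`, `ranks_per_node`, and `cores_per_node` are hypothetical helpers.

```python
from collections import Counter
from typing import Any, Dict, List


def nodes_from_slots(slots: Dict[str, Any]) -> List[str]:
    """Distinct node names, in the order they first appear in 'ranks'."""
    seen: Dict[str, None] = {}
    for rank in slots["ranks"]:
        seen.setdefault(rank["node_name"], None)
    return list(seen)


def ranks_per_node(slots: Dict[str, Any]) -> Dict[str, int]:
    """Number of ranks placed on each node."""
    return dict(Counter(rank["node_name"] for rank in slots["ranks"]))


def cores_per_node(slots: Dict[str, Any]) -> Dict[str, List[int]]:
    """Core indices used on each node, collected across ranks."""
    cores: Dict[str, List[int]] = {}
    for rank in slots["ranks"]:
        cores.setdefault(rank["node_name"], []).extend(rank["core_map"])
    return cores
```

For the three ranks shown in the example, these would yield the single node `c206-024`, three ranks on it, and cores `[3, 4, 5]`.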

That said, we mostly use this information for our internal operation and have very few use cases outside of that. Applications do sometimes use RP_TASK_ID and RP_TASK_NAME. Anyway, I just wanted to document the current state of affairs in RCT in the context of this ticket.

I should add that exporting environment variables is very cheap, whereas writing node files is not when large numbers of jobs are involved.

@hategan
Collaborator

hategan commented Sep 13, 2023

This is related to ExaWorks/psij-python#200

3 participants