
DaCe VRAM pooling #295

Merged
merged 134 commits into main from dace_transient_pooled
Aug 29, 2022

Conversation

FlorianDeconinck
Contributor

@FlorianDeconinck FlorianDeconinck commented Aug 23, 2022

Purpose

Add VRAM pooling by moving all Persistent arrays to Transient within the DaCe pipeline, then using the new DaCe auto-pooling of transients

Code changes:

  • Orchestration pipeline change
  • Added SDFG tooling (NaN checker, memory counter)
  • Deactivated distributed caching due to a potential bug with Grid

Requirements changes:

  • New python -m driver.tools tooling entry point
  • DaCe requirement bumped to >= 0.14

Infrastructure changes:

  • CUDA_CRAY_MPS needs to be deactivated because the mallocAsync allocator fails when MPS is turned on
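The MPS constraint above might translate to a job-script tweak along these lines (a hedged sketch; the exact mechanism for disabling MPS depends on the Cray launch environment):

```shell
# Sketch: ensure Cray MPS is off before launching, since the
# mallocAsync-backed pool fails when MPS is active (assumed mechanism).
export CUDA_CRAY_MPS=0
echo "CUDA_CRAY_MPS=${CUDA_CRAY_MPS}"
```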

Checklist

  • You have followed the coding standards guidelines established at Code Review Checklist.
  • Docstrings and type hints are added to new and updated routines, as appropriate
  • All relevant documentation has been updated or added (e.g. README, CONTRIBUTING docs)

Prevent patch class from being badly renamed
Make DaCeOrchestration.Run resilient to no communicator, defaulting to `.gt_cache`
Change conftest to adjust for changes in DaceConfig reqs
Remove custom lazy compile
Restore proper restart config save
Fix OOB with passing origin to the __sdfg__
Orchestrate dyncore: delnflux
Remove unused parameter in Remap
Orchestrate: dyn_core
Fix translate parallel test comm passing to dace_config
Bad fix for multi-process yaml load
@FlorianDeconinck FlorianDeconinck marked this pull request as ready for review August 26, 2022 14:20
@FlorianDeconinck
Contributor Author

launch jenkins

@jdahm
Contributor

jdahm commented Aug 26, 2022

launch jenkins

@FlorianDeconinck FlorianDeconinck enabled auto-merge (squash) August 29, 2022 13:26
@@ -34,7 +34,7 @@ attrs==21.2.0
  # pytest
babel==2.9.1
  # via sphinx
-backports.entry-points-selectable==1.1.1
+backports-entry-points-selectable==1.1.1
Contributor

Do we know why this changed?

Contributor Author

@FlorianDeconinck FlorianDeconinck Aug 29, 2022

-_o_-

Comment on lines -79 to -80
cmake==3.22.4
# via dace
Contributor

I don't see this dependency added back anywhere. Was that removed?

Contributor Author

-_o_-
I am guessing DaCe changed their dependency tree

dsl/pace/dsl/dace/utils.py (outdated)
f"\t {detail.name}\n"
)

return report
Contributor

Just a suggestion: it would be more flexible to separate counting the memory from generating the report, i.e. split this into two functions. The first would create something like [(name, size)] for an SDFG, then the second would build the English representation.

Contributor Author

I am prepping another PR that renames that tool and adds a kernel timing analysis. I'll break it up there.

Comment on lines +158 to +164
memory_pooled = 0.0
for _sd, _aname, arr in sdfg.arrays_recursive():
if arr.lifetime == dace.AllocationLifetime.Persistent:
arr.pool = True
memory_pooled += arr.total_size * arr.dtype.bytes
arr.lifetime = dace.AllocationLifetime.Scope
memory_pooled = float(memory_pooled) / (1024 * 1024)
Contributor

Is my understanding correct that this is the moment when the arrays are switched to using pooled memory?

Contributor Author

Yes. At code generation, DaCe will automatically pool all Scope arrays that are flagged. So we swap Persistent arrays (e.g. arrays in a sub-SDFG that are not passed as parameters to the top SDFG) to Scope lifetime and flag them.
The actual pooling is done at code generation time.
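As a stand-alone illustration of that swap (mock enum and dataclass replacing the dace objects; the real pass iterates sdfg.arrays_recursive() as in the diff above):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class AllocationLifetime(Enum):
    # Stand-ins for the dace.AllocationLifetime members used in the PR.
    Persistent = auto()
    Scope = auto()


@dataclass
class Array:
    total_size: int  # number of elements
    bytes_per_element: int
    lifetime: AllocationLifetime
    pool: bool = False


def flag_for_pooling(arrays: List[Array]) -> float:
    """Swap Persistent arrays to Scope lifetime and flag them for the memory
    pool; return the pooled size in MiB. The pool allocation itself happens
    later, at DaCe code generation."""
    pooled_bytes = 0
    for arr in arrays:
        if arr.lifetime is AllocationLifetime.Persistent:
            arr.pool = True
            pooled_bytes += arr.total_size * arr.bytes_per_element
            arr.lifetime = AllocationLifetime.Scope
    return pooled_bytes / (1024 * 1024)
```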

Contributor Author

I am clarifying that in the comment

@FlorianDeconinck FlorianDeconinck merged commit 2bfefc5 into main Aug 29, 2022
@FlorianDeconinck FlorianDeconinck deleted the dace_transient_pooled branch August 29, 2022 21:13