
imported os and added ckpt scripts #41

Open
wants to merge 4 commits into base: accuracy_workstream_trn
Conversation

dgourab-aws (Collaborator)

No description provided.

@dgourab-aws (Collaborator Author)

Tested it locally using: pytest seed_test.py

@HahTK (Collaborator) left a comment

Most of the stuff we need is missing.

  1. Most features are missing (setting the seed, parameterizing fuji.py, all features from GPU).
  2. Why do we only have GPU training scripts in the TRN repo? Where is the TRN script?
  3. This is likely completely untested. Did we run a TRN job?


# export JAX_PLATFORMS=cpu

#Perf Tuning Guideline here : https://github.com/NVIDIA/JAX-Toolbox/blob/main/rosetta/docs/PGLE.md
Collaborator

Why are GPU flags in TRN runs?

Collaborator Author

This is just the checkpointing script; I did not do any cleanup here.

###export NCCL_DEBUG_SUBSYS=COLL

#HAH quick fix
export XLA_FLAGS="--xla_dump_hlo_as_text --xla_dump_to=${HLO_DUMP_PATH} --xla_dump_hlo_pass_re='.*' --xla_dump_hlo_as_proto --xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_while_loop_double_buffering=true --xla_gpu_enable_pipelined_all_gather=true --xla_gpu_enable_pipelined_reduce_scatter=true --xla_gpu_enable_pipelined_all_reduce=true --xla_gpu_multi_streamed_windowed_einsum=true --xla_gpu_enable_custom_fusions=true" # --xla_gpu_enable_address_computation_fusion=true"
Collaborator

Why are GPU flags in TRN runs?

Collaborator Author

This is just the checkpointing script.
@HahTK I need a walkthrough of this script to actually clean it up. I haven't used it to launch TRN jobs; I have been using Apoorv's launch script.

echo "ERROR : ${TEST_SETUP} for ${N_EXPECTED_NODES} was launched with ${num_nodes}"
exit 1
fi
MESH_SELECTOR="gpu-${num_nodes}node-baseline"
Collaborator

Again, everything is GPU here.

@@ -0,0 +1,150 @@
#!/usr/bin/env bash
Collaborator

This needs to be fixed. This is a GPU script being used to run TRN. We need to use the TRN script and just add checkpoint resume to it.

import importlib


class SeedTest(test_utils.TestCase):
Collaborator

Where do we actually set the seed?

Collaborator Author

The seed has to be set as an environment variable from any launch script like this:
export DATA_SEED=42
The launch script has not been added to the PR.
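
As a reference, here is a minimal sketch of how a launch or test script could consume that variable. Only the DATA_SEED name comes from this thread; the get_data_seed helper and the use of jax.random.PRNGKey are illustrative assumptions, not code from this PR.

# Hypothetical sketch (not part of this PR): read DATA_SEED from the environment
# and use it to seed the JAX PRNG. Helper name and defaults are illustrative.
import os

import jax


def get_data_seed(default: int = 0) -> int:
    """Return the data seed from the environment, falling back to a default."""
    return int(os.environ.get("DATA_SEED", default))


if __name__ == "__main__":
    seed = get_data_seed()
    key = jax.random.PRNGKey(seed)  # deterministic key for data shuffling, etc.
    print(f"DATA_SEED={seed}")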

@HahTK (Collaborator) commented Dec 27, 2024

Also, the branch was created from the wrong commit ID. It should have been
33ec152

but it seems to have been branched from this instead:
c20387c

@dgourab-aws (Collaborator Author)

> Also the branch was created from the wrong commit id. It should have been 33ec152
>
> but it seems to be branched from this instead c20387c

The GPU branch was created from 33ec152; the TRN branch was to be created from the AXLearn upstream branch.
