reorg main function; add 'hpu' option w/o implementation #329
base: main
Conversation
The `main` training function needed to be broken down into smaller functions for readability/testability. WORLD_SIZE, LOCAL_RANK, and RANK have also been extracted and made global constants since they are set by the administrating multiprocessing launcher (torchrun, in our case). HPU configuration options and checks are also added. Signed-off-by: James Kunstle <[email protected]>
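The extraction of those constants could look roughly like the following sketch. This is a hypothetical illustration, not the PR's actual code: torchrun exports `WORLD_SIZE`, `LOCAL_RANK`, and `RANK` as environment variables for each worker process, so they can be read once at module scope.

```python
import os

# torchrun (the administrating multiprocessing launcher) sets these
# environment variables per worker; defaults cover single-process runs.
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", "1"))  # total worker count
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))  # rank within this node
RANK = int(os.environ.get("RANK", "0"))              # global rank
```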
disable_flash_attn=args.disable_flash_attn, use_dolomite=args.use_dolomite
)

tokenizer = configure_tokenizer(
I would have named this `setup_tokenizer`, but that name would shadow the imported function.
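The shadowing concern can be illustrated with a minimal sketch (the names here are stand-ins, not the module's real imports): a later `def` with the same name as an imported helper rebinds it in the module namespace.

```python
# Stand-in for the imported helper, e.g. `from utils import setup_tokenizer`.
def setup_tokenizer(path):
    return f"imported:{path}"

imported = setup_tokenizer("model")

# A local definition with the same name silently shadows the import;
# every later call in the module resolves to this one instead.
def setup_tokenizer(path):
    return f"local:{path}"

local = setup_tokenizer("model")
```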
train_loader = setup_dataloader(
    dataset,
    tokenizer.pad_token_id,
this happens sometimes when we have more GPUs than data to process. In this case
will fix this docstring
    mock_len=args.mock_len,
)
# will try to make multipack work if possible.
sampler_type: str = "multipack"
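The "make multipack work if possible" fallback might be sketched like this. The function name and the fallback sampler name are assumptions for illustration; the real criterion in the PR may differ.

```python
def choose_sampler(num_samples: int, world_size: int) -> str:
    """Prefer the multipack sampler, but fall back when it can't be used."""
    sampler_type = "multipack"
    if num_samples < world_size:
        # e.g. more GPUs than data to process: multipack cannot give
        # every rank a batch, so fall back to a plain distributed sampler.
        sampler_type = "distributed"
    return sampler_type
```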
The `: str` here is redundant; the type is inferred from the assignment.
From a high level these changes look good, we'll just need to test them to verify.