[Iluvatar] Support tensor parallel heterogeneous training #134

njuerect · 2024-06-07T01:19:42Z

This PR adds support for training LLMs on heterogeneous devices with different tensor parallel sizes.

Currently only the following configuration is supported: cp = 1, vp = None, ep = 1, decoder-only model type, and 2 tensor parallel model groups.
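
As a rough illustration of the restrictions above, here is a minimal sketch of a pre-flight check. Everything in it is an assumption made for illustration: the `HeteroParallelConfig` class, its field names, and the `unsupported_reasons` helper are hypothetical and are not part of FlagScale or this PR. It also assumes that cp, vp, and ep refer to the context, virtual pipeline, and expert parallel sizes, and that "2 tensor parallel model group" means exactly two groups.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HeteroParallelConfig:
    """Hypothetical container; names are illustrative, not FlagScale's API."""
    tp_sizes: List[int]              # tensor parallel size of each model group
    cp_size: int = 1                 # assumed: context parallel size
    vp_size: Optional[int] = None    # assumed: virtual pipeline parallel size
    ep_size: int = 1                 # assumed: expert parallel size
    model_type: str = "decoder-only"


def unsupported_reasons(cfg: HeteroParallelConfig) -> List[str]:
    """Return one message per restriction that cfg violates (empty if none)."""
    reasons = []
    if cfg.cp_size != 1:
        reasons.append("cp must be 1")
    if cfg.vp_size is not None:
        reasons.append("vp must be None")
    if cfg.ep_size != 1:
        reasons.append("ep must be 1")
    if cfg.model_type != "decoder-only":
        reasons.append("only decoder-only models are supported")
    if len(cfg.tp_sizes) != 2:
        reasons.append("exactly 2 tensor parallel model groups are required")
    return reasons


if __name__ == "__main__":
    # Two model groups with different tensor parallel sizes: accepted.
    print(unsupported_reasons(HeteroParallelConfig(tp_sizes=[2, 4])))
    # Three groups and cp = 2: two restrictions violated.
    print(unsupported_reasons(HeteroParallelConfig(tp_sizes=[2, 4, 8], cp_size=2)))
```

Returning a list of reasons instead of raising on the first failure lets a launcher report every unsupported setting at once.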

zhaoyinglia · 2024-06-07T03:56:48Z

flagscale/train/hetero/train_llama.py

@@ -0,0 +1,292 @@
+# Copyright (c) 2023, NVIDIA CORPORATION.  All rights reserved.


Do not copy train_llama.py; it should remain the unified training entry point. Put only the functional changes that must be rewritten in the hetero folder.

aoyulong · 2024-06-07

Great idea! This will definitely be improved in the next pull request.

zhaoyinglia · 2024-06-07T03:58:00Z

flagscale/train/hetero/training.py

@@ -0,0 +1,680 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.


Same as the train_llama.py above.

aoyulong

LGTM

yu.song and others added 6 commits May 30, 2024 18:14

new feature: support different tp setting for llm training.

6752ace

now only support: cp = 1 vp = None ep = 1 decode-only modeltype 2 tensor parallel model group

Code Refactor

76c1e70

relocate hetero file

f20a9b4

fix import error

9ab8270

rename train_llama_hetero.py

d5c38a9

Merge branch 'main' into hetero_tp

27577d9

zhaoyinglia reviewed Jun 7, 2024

yu.song added 2 commits June 7, 2024 14:14

[Iluvatar] add example of training llama2-7b with tp hetero mode enabled

3579803

[Iluvatar] relocate config_hetero.yaml

227c14b

aoyulong approved these changes Jun 7, 2024

aoyulong merged commit 9838ede into FlagOpen:main Jun 7, 2024
2 checks passed
