Add GKE A3 Ultra support #940
Closed
21 commits
08d5a03 Add a3 ultra support (samos123)
e842543 update Dockerfile (samos123)
565a187 remove original gpu dockerfile contents (samos123)
2e01da0 remove comments (samos123)
56fe7c5 remove jax version override (samos123)
dc1e7cd Merge branch 'main' into a3u-support (samos123)
65e979f Merge branch 'main' into a3u-support (samos123)
19e9a10 revert job_test.py to main (samos123)
58a85ce fix tests (samos123)
dd6096b fix env variables (samos123)
324294b cleanup Dockerfile (samos123)
8939c70 consistent Dockerfile whitespace (samos123)
beb5e86 Merge branch 'main' into a3u-support (samos123)
6321110 class method avoid in-place update (samos123)
c02885f fix tests (samos123)
ea6b301 fix tests (samos123)
d1e6a61 fix A3HighReplicatedJobTests (samos123)
7dc480b add A3UltraReplicatedJob tests (samos123)
f198481 address pr comments (samos123)
4602276 remove "NVTE_FUSED_ATTN": "1" (samos123)
fda65e1 comment out PGLE (samos123)
@@ -48,10 +48,14 @@
 from axlearn.cloud.gcp.bundler import ArtifactRegistryBundler
 from axlearn.cloud.gcp.config import gcp_settings
 from axlearn.cloud.gcp.event_queue import event_queue_from_config
-from axlearn.cloud.gcp.job import GCPJob, GKEJob, GPUGKEJob, TPUGKEJob
+from axlearn.cloud.gcp.job import GCPJob, GKEJob, TPUGKEJob
 from axlearn.cloud.gcp.jobs import runner_utils
 from axlearn.cloud.gcp.jobs.tpu_runner import with_tpu_training_defaults
-from axlearn.cloud.gcp.jobset_utils import BASTION_JOB_VERSION_LABEL
+from axlearn.cloud.gcp.jobset_utils import (
+    BASTION_JOB_VERSION_LABEL,
+    A3HighReplicatedJob,
+    A3UltraReplicatedJob,
+)
 from axlearn.cloud.gcp.node_pool import (
     PRE_PROVISIONER_LABEL,
     delete_node_pools,

@@ -143,6 +147,10 @@ def validate_inner(cls):
         if cls.inner is None:
             raise ValueError(f"A GKERunnerJob should subclass {cls} and define `inner`.")

+    @classmethod
+    def with_inner(cls, inner: type[GKEJob]):
+        return type(f"{cls.__name__}_{inner.__name__}", (cls,), {"inner": inner})
+
     @classmethod
     def define_flags(cls, fv: flags.FlagValues = FLAGS):
         super().define_flags(fv)

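The `with_inner` classmethod in the hunk above builds a new subclass with the three-argument form of `type()` instead of assigning to `cls.inner`, which appears to be what the commit "class method avoid in-place update" refers to. A minimal sketch of that pattern follows. The classes here are simplified stand-ins, not the real axlearn classes, and `FakeGPUJob` is a hypothetical name used only for illustration.

```python
class GKEJob:
    """Stand-in for axlearn's GKEJob (hypothetical, illustration only)."""


class FakeGPUJob(GKEJob):
    """Hypothetical inner job type, not part of axlearn."""


class GKERunnerJob:
    """Simplified stand-in for the runner class touched by the diff."""

    inner = None

    @classmethod
    def with_inner(cls, inner):
        # type(name, bases, namespace) creates a fresh subclass whose
        # `inner` attribute is set, leaving `cls` itself untouched.
        return type(f"{cls.__name__}_{inner.__name__}", (cls,), {"inner": inner})


runner_cls = GKERunnerJob.with_inner(FakeGPUJob)
print(runner_cls.__name__)       # name is derived from both class names
print(runner_cls.inner)          # the subclass carries the inner job type
print(GKERunnerJob.inner)        # the base class keeps its original value
```

Because each call returns a distinct subclass, two runners with different inner job types can coexist without overwriting each other's `inner`.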
@@ -531,19 +539,13 @@ def from_flags(cls, fv: flags.FlagValues, **kwargs):
         return cfg


-class GPUGKERunnerJob(GKERunnerJob):
-    """A GKERunnerJob that uses GPUGKEJob."""
-
-    inner = GPUGKEJob
-
-
 def _get_runner_or_exit(instance_type: str):
     if instance_type.startswith("tpu"):
         return TPUGKERunnerJob
-    elif instance_type.startswith("gpu-a3"):
-        # TODO(markblee): We can directly construct:
-        # GKERunnerJob.with_inner(GKEJob.with_jobset(A3ReplicatedJob))
-        return GPUGKERunnerJob
+    elif instance_type.startswith("gpu-a3-ultra"):
+        return GKERunnerJob.with_inner(GKEJob.with_builder(A3UltraReplicatedJob))
+    elif instance_type.startswith("gpu-a3-high"):
+        return GKERunnerJob.with_inner(GKEJob.with_builder(A3HighReplicatedJob))

Comment on lines +545 to +548
Neat!
This is like praising yourself since you came up with it lol

     else:
         raise app.UsageError(f"Unknown instance_type {instance_type}")

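One consequence of the `_get_runner_or_exit` change is that the broad `gpu-a3` prefix no longer matches on its own: only `gpu-a3-ultra` and `gpu-a3-high` are recognized, and any other `gpu-a3...` string falls through to the error branch. The sketch below mirrors that dispatch with plain strings in place of the real runner classes. The function name `pick_runner_name`, the returned labels, the example instance type strings, and the use of `ValueError` instead of `app.UsageError` are all assumptions made to keep the example self-contained.

```python
def pick_runner_name(instance_type: str) -> str:
    """Hypothetical mirror of the prefix dispatch; returns labels, not classes."""
    prefixes = (
        ("tpu", "TPUGKERunnerJob"),
        ("gpu-a3-ultra", "GKERunnerJob with A3UltraReplicatedJob"),
        ("gpu-a3-high", "GKERunnerJob with A3HighReplicatedJob"),
    )
    for prefix, name in prefixes:
        if instance_type.startswith(prefix):
            return name
    # The PR raises app.UsageError here; ValueError keeps this sketch stdlib-only.
    raise ValueError(f"Unknown instance_type {instance_type}")


# Example instance type strings below are made up for illustration.
print(pick_runner_name("gpu-a3-ultra-example"))
print(pick_runner_name("gpu-a3-high-example"))
try:
    pick_runner_name("gpu-a3-other-example")
except ValueError as err:
    print(err)
```

The two GPU prefixes do not overlap, so their relative order in the table does not affect the result.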
Nice!