Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for Pathways proxy #690

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jesus-orozco
Copy link
Contributor

Enable JAX+Pathways single-controller architecture for coordination of accelerators.

  • Set jax_backend to "proxy"
  • Additional logic to handle new jax_backend type
  • Import previewutilities library to identify pathways-enabled workloads

@jesus-orozco jesus-orozco marked this pull request as ready for review October 2, 2024 15:49
pyproject.toml Outdated
@@ -92,6 +92,7 @@ gcp = [
"google-cloud-build==3.24.1",
"ml_goodput_measurement==0.0.2",
"pyOpenSSL>=22.1.0", # compat with cryptography version.
"pathwaysutils@git+https://github.com/google/pathways-utils", # for JAX+Pathways single-controller accelerator coordinator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about depending a github repo directly without specifying a specific version, as any changes to the repo will be automatically reflected. Does pathways-utils have versioned releases that can be installed via pip install?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @ruomingp ! The short-term roadmap for the pathways-utils library doesn't account for a Pypi release. Adding a tag to the dependency so we can control which version gets installed with Axlearn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jesus-orozco , is there an ETA for a pypi release? IIRC, introducing git hashes to pyproject causes issues for our own pypi release. We prefer to avoid this if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm following up with the product team to get a better picture of the roadmap. At the moment Github is the only path forward to install the package, unless we bake the whl directly into the docker image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markblee Pypi release ETA is Oct 18

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markblee Pathwaysutils was released on Pypi this week. We have tested and integrated its latest version into this PR.

pyproject.toml Outdated Show resolved Hide resolved
axlearn/common/launch_trainer_main.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ruomingp ruomingp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

axlearn/common/launch_trainer_main.py Outdated Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
axlearn/common/utils_spmd.py Show resolved Hide resolved
pyproject.toml Outdated
@@ -92,6 +92,7 @@ gcp = [
"google-cloud-build==3.24.1",
"ml_goodput_measurement==0.0.2",
"pyOpenSSL>=22.1.0", # compat with cryptography version.
"pathwaysutils@git+https://github.com/google/pathways-utils", # for JAX+Pathways single-controller accelerator coordinator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jesus-orozco , is there an ETA for a pypi release? IIRC, introducing git hashes to pyproject causes issues for our own pypi release. We prefer to avoid this if possible.

@Ethanlm Ethanlm self-requested a review October 25, 2024 19:28
Copy link
Contributor

@ruomingp ruomingp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 455 to 457
flags.DEFINE_boolean(
"use_pathways", False, "Wether the workload is pathways-enabled.", **common_kwargs
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a boolean flag, could we define a list flag specifying the additional modules to import, potentially including "pathwaysutils"?

See https://github.com/apple/axlearn/blob/main/docs/ml_api_style.md#avoid-booleans-in-apis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this makes sense. Refactored the flag and using dynamic import instead.

jax.distributed.initialize(**init_kwargs)
_jax_distributed_initialized = True
if jax_backend != "proxy":
jax.distributed.initialize(**init_kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So using pathways does not require jax.distributed.initialize? Please add a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased and added comment for pathways

axlearn/cloud/gcp/job.py Outdated Show resolved Hide resolved
@@ -457,6 +459,9 @@ def define_flags(cls, fv: flags.FlagValues):
"not all TPU types support this flag.",
**common_kwargs,
)
flags.DEFINE_list(
"import_pathways", [], "Modules to enable pathways proxy.", **common_kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here and below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants