
[1/N] Initial implementation of local SPMD support #8810

Open
wants to merge 16 commits into master from lsiyuan/local-spmd-impl

Conversation

@lsy323 lsy323 commented Mar 9, 2025

Before this PR, users had to use all global devices in their SPMD program, which limits the flexibility of running MPMD + SPMD in a multi-host environment.

This PR enables local SPMD via the environment variable XLA_USE_LOCAL_SPMD. Local SPMD and global SPMD can be switched within the same Python program by toggling the environment variable on and off.

Usage

Example usage is:

import os

import torch_xla.runtime as xr
from torch_xla.distributed.spmd import Mesh

xr.use_spmd()

# Enable local SPMD and build a mesh over this host's devices only.
os.environ['XLA_USE_LOCAL_SPMD'] = '1'
local_mesh = Mesh(local_device_ids, mesh_shape, axis_names)
# Run local SPMD program
...

# Switch back to global SPMD with a mesh over all global devices.
os.environ['XLA_USE_LOCAL_SPMD'] = '0'
global_mesh = Mesh(global_device_ids, global_mesh_shape, axis_names)
# Run global SPMD program
...

An example script is here

To make local SPMD work in a multi-host setting, we need to ensure:

  1. In the lowered HLO graph, the global ordinals are logical indices (starting from zero) instead of physical device ids.
  2. During XLA compilation, XLA is configured to target only the local devices. Both requirements are illustrated in the sketch below.
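
For illustration, a minimal sketch of both requirements, assuming a host whose addressable devices have physical ids 8..15 (the exact ids depend on the topology):

physical_ids = list(range(8, 16))

# Requirement 1: the HLO sharding annotation uses logical indices starting at zero.
logical_ids = [d - min(physical_ids) for d in physical_ids]  # [0, 1, ..., 7]

# Requirement 2: the compiled program targets only these local devices, i.e.
# num_partitions equals the local device count rather than the global one.
num_partitions = len(physical_ids)  # 8, not the global device count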

Implementation:

SPMD Python API:

  • Allow creating a mesh with only local devices.
  • Update _get_tile_assignment to support a local device mesh, so that the device ids start from 0 in the HLO sharding annotation (a normalization sketch follows this list).
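
A minimal sketch of that normalization, assuming a NumPy device mesh (the helper name is hypothetical; the real change lives in _get_tile_assignment):

import numpy as np

def _normalize_device_ids(device_mesh: np.ndarray) -> np.ndarray:
    """Shift physical device ids so the tile assignment starts at 0."""
    device_id_min = np.min(device_mesh)
    if device_id_min == 0:
        # Already logical indices (global SPMD or host 0); avoid the copy.
        return device_mesh
    return device_mesh - device_id_min

# e.g. a 2x4 local mesh on the second host of a two-host v4-16:
print(_normalize_device_ids(np.arange(8, 16).reshape(2, 4)))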

Sharding Utilities:

  • Support local SPMD in GetShardReplicaAndIndicesForDevices. This is achieved by deriving the global ordinal from the physical device and the SPMD sharding annotation (see the sketch below).
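
Conceptually (a rough Python sketch of the idea, not the C++ implementation in GetShardReplicaAndIndicesForDevices):

# Hypothetical device list for a host that owns TPU:8..TPU:11 under local SPMD.
devices = ["TPU:8", "TPU:9", "TPU:10", "TPU:11"]

# The ordinal used to index into the sharding's tile assignment is the device's
# position among the participating devices, not its physical id.
ordinal = {device: index for index, device in enumerate(devices)}
print(ordinal["TPU:10"])  # 2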

PJRT computation client:

  • (The env var is used here.) Configure the XLA compilation options to use only the local devices for the compiled program when local SPMD is enabled (see the sketch below).
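
In Python terms, the partition-count selection amounts to something like the following sketch (the actual logic lives in the C++ PjRtComputationClient::Compile):

import os

import torch_xla.runtime as xr

use_local_spmd = os.environ.get('XLA_USE_LOCAL_SPMD', '0') == '1'

# Compile for only the addressable devices under local SPMD,
# otherwise for all global devices.
num_partitions = (xr.addressable_runtime_device_count()
                  if use_local_spmd else xr.global_runtime_device_count())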

Test:

  • Added a C++ test for the sharding util change for the local SPMD use case.
  • Verified on a multi-host TPU VM with an example script running a small VAE under local SPMD, using different input resolutions on different hosts. Note that global SPMD and local SPMD are not bridged; we need an API to create a global SPMD tensor from local SPMD tensors.

Future work:

  • Avoid using an env var for local SPMD control.

@lsy323 lsy323 force-pushed the lsiyuan/local-spmd-impl branch from 0833b24 to 3865f67 on March 10, 2025 02:58
@lsy323 lsy323 marked this pull request as ready for review March 10, 2025 05:24
@lsy323 lsy323 changed the title local spmd impl [1/N] Initial implementation of local SPMD support Mar 10, 2025

lsy323 commented Mar 10, 2025

I tried hard to avoid using an env var to control local SPMD enablement (in this commit).

My thinking was:

  1. We need to find a way in PjrtComputationClient::Compile to determine the target devices of the XLA program we want to compile and execute.

  2. The sharding information is stored in two places:
    a) the xla::OpSharding on the lowered xla::XlaOp that carries the sharding annotation
    b) the SPMD mesh

Determining the target devices from the SPMD mesh is much more straightforward. However, the mesh is not passed down to LTC, and LTC cannot easily be extended to pass the SPMD mesh from the Python layer (because torch_xla works on tensors, attaching a mesh to each sharded tensor seems redundant).

Therefore we have to derive the target devices from the xla::OpSharding.

  3. Each XLA computation is constructed from a dedicated lowering context, so I planned to derive the target device information from the xla::OpSharding of the lowered XlaOp by checking the tile_assignment_devices field.

  4. After XLA lowering, we create a compilation instance; we store the target device info in the compilation instance object and retrieve it in the computation client before compilation.

The issue I hit is in step 3:
It's not a valid solution to derive the target SPMD devices from the tile_assignment_devices of xla::OpSharding; in some cases it's empty.

Since many test cases were failing with the above attempt, and there are many flavors of sharding annotation, I decided to use the env var to control local SPMD enablement, which is clean since it's only used in PjRtComputationClient.


lsy323 commented Mar 10, 2025

The GPU test failure is unrelated; head is failing with the same error.

@lsy323 lsy323 requested review from tengyifei and qihqi March 10, 2025 17:20
@miladm miladm assigned miladm and lsy323 and unassigned miladm Mar 11, 2025
@miladm miladm requested a review from pgmoka March 11, 2025 22:51
@miladm miladm added the distributed SPMD and other distributed things. label Mar 11, 2025
assert num_devices > 0, "This requires XLA supported device(s)."
assert mesh.size() == num_devices, \
assert mesh.size() == num_devices or mesh.size() == num_local_devices, \
Collaborator

This seems like a smell. Why do we only check the mesh size during mark_sharding? Furthermore, should we check that the mesh is a local mesh only when XLA_USE_LOCAL_SPMD==1?

Collaborator

Seems that these checks should be inside the Mesh constructor IMO

Collaborator Author

This is the mesh constructor : )

@rpsilva-aws rpsilva-aws Mar 13, 2025

I agree, should we branch the local and global variants? These asserts are technically alleviating the constraints for both cases - e.g., I can define a mesh size of 32 for 64 global devices, if that host only has 32 addressable devices.

On the downside, I can see how this will complicate in how we get rid of the env var.

Collaborator Author

We should check that the mesh contains only local devices under local SPMD; updated the condition.

Collaborator Author

Done

Collaborator

> This is the mesh constructor : )

Wait, this is not the mesh constructor. This is the mark_sharding function. It seems very strange that we're doing these localized checks in the mark_sharding function. IMO this check should be moved to the Mesh constructor.

@tengyifei
Collaborator

> We need to find a way in PjrtComputationClient::Compile to determine the target devices

Do you know how JAX figures out which target devices to run the graph on? E.g., if I create two smaller meshes [0, 1] and [2, 3] in JAX on a v6e-4 and jit a computation that uses both meshes, I suppose JAX can figure out that all 4 devices are involved in the computation and launch the graph on all 4 devices (although some ops will only use 2 out of 4 devices)?

std::vector<std::string> devices = {"TPU:8", "TPU:9", "TPU:10", "TPU:11",
"TPU:12", "TPU:13", "TPU:14", "TPU:15"};

// 1D tiled
Collaborator

For the different tiling methods, could we not do parameterized testing to significantly save code and make things more readable?

INSTANTIATE_TEST_CASE_P reference: https://www.sandordargo.com/blog/2019/04/24/parameterized-testing-with-gtest

Collaborator Author

Thank you for the suggestion! But I personally think it's hard to parameterize in this case: we would need to programmatically derive the mesh and the expected output from the test parameters, which is hard to generalize across the cases (1D tile, 2D tile, etc.). I feel the current implementation, even if it's long, is more readable.

Collaborator

I suppose making parameterized functions in Python might be easier than in C++. If we want to keep the current approach, splitting these into separate tests might make things easier to test.

Currently if this test fails, it is hard to tell which case it is related to.


lsy323 commented Mar 13, 2025

> We need to find a way in PjrtComputationClient::Compile to determine the target devices
>
> Do you know how JAX figures out which target devices to run the graph on? E.g., if I create two smaller meshes [0, 1] and [2, 3] in JAX on a v6e-4 and jit a computation that uses both meshes, I suppose JAX can figure out that all 4 devices are involved in the computation and launch the graph on all 4 devices (although some ops will only use 2 out of 4 devices)?

I'm not sure if it's possible to use multiple meshes in a single computation. I think within the same SPMD program we have to use only one mesh. I can give it a try and update here later.


rpsilva-aws commented Mar 13, 2025

Nice, this is great! Thanks for following up on it :)

A few other general questions:

> Note that global SPMD and local SPMD are not bridged; we need an API to create a global SPMD tensor from local SPMD tensors.

This is interesting, do we have a plan for this? We were stumbling on this too.

nit: We could consider adding a helper xs method for generating local SPMD meshes (similar to the one we have for a 1D mesh) that includes the following, taken from your ref above, given a partition_spec and mesh_shape:

import numpy as np
import torch_xla.runtime as xr
from torch_xla.distributed.spmd import Mesh

def local_spmd_mesh(mesh_shape, partition_spec):  # hypothetical helper name
    # Assumes every host exposes the same number of addressable devices.
    process_id = xr.process_index()
    num_local_devices = xr.addressable_runtime_device_count()
    device_id_start = process_id * num_local_devices
    device_ids = np.arange(device_id_start, device_id_start + num_local_devices)
    return Mesh(device_ids, mesh_shape, partition_spec)

> It's not a valid solution to derive the target SPMD devices from the tile_assignment_devices of xla::OpSharding; in some cases it's empty.

The tile_assignment_devices is empty, for an existing OpSharding of type OTHER?


@@ -559,7 +559,10 @@ std::vector<ComputationClient::ComputationPtr> PjRtComputationClient::Compile(
.set_allow_spmd_sharding_propagation_to_output(
{instance.allow_spmd_sharding_propagation_to_output});

int num_partitions = client_->device_count();
int num_partitions = GetNumGlobalDevices();
if (runtime::sys_util::GetEnvBool("XLA_USE_LOCAL_SPMD", false)) {
Collaborator

nit: We could do bool use_local_spmd = runtime::sys_util::GetEnvBool("XLA_USE_LOCAL_SPMD", false); and use in both places below.

@@ -589,11 +592,18 @@ std::vector<ComputationClient::ComputationPtr> PjRtComputationClient::Compile(
}

// TODO(244391366) verify this is correct for the collectives ops
xla::DeviceAssignment device_assignment(1, client_->device_count());
xla::DeviceAssignment device_assignment(1, num_partitions);
// DeviceAssignment values must be the PjRtDevice ID, so we need to
// unwind the global ordinal mapping.
Collaborator

We can adapt the comment, since we don't only "unwind the global ordinal mapping" now.

for (const auto& [device_id, global_ordinal] : global_ordinals_) {
device_assignment(0, global_ordinal) = device_id;
if (runtime::sys_util::GetEnvBool("XLA_USE_LOCAL_SPMD", false)) {
auto local_pjrt_devices = client_->addressable_devices();
Collaborator

nit: const auto&

@@ -613,10 +613,14 @@ IfrtComputationClient::ExecuteReplicated(
return data_handles;
}

size_t IfrtComputationClient::GetNumDevices() const {
size_t IfrtComputationClient::GetNumLocalDevices() const {
@rpsilva-aws rpsilva-aws Mar 13, 2025

nit: We could also consider naming to include "addressable"/"visible". Technically, we can have N addressable devices for a process, but not necessarily all the host's devices - which could be considered to be local devices too. If we have 2 processes in one host, each with one addressable device, we can think of how we want the terminology to play out here.

@@ -57,7 +57,7 @@ struct XLAGuardImpl : public c10::impl::DeviceGuardImplInterface {
return 0;
}

return client->GetNumDevices();
return client->GetNumLocalDevices();
Collaborator

Q: Where do we use this?

Collaborator Author

I just renamed the API to say explicitly that it's the number of local devices; the original name was confusing.

we need to normalize the physical device ids to generate the correct HLO
sharding annotation.
"""
device_id_min = np.min(device_mesh)
@rpsilva-aws rpsilva-aws Mar 13, 2025

nit: We could avoid the copy if device_id_min == 0 by exiting early (e.g. thousands of hosts).

# device ids are continous
if os.environ['XLA_USE_LOCAL_SPMD'] == '1':
# In local SPMD mesh only contains local devices.
min_device_idx = xr.process_index() * xr.addressable_runtime_device_count(
@rpsilva-aws rpsilva-aws Mar 13, 2025

This assumes homogeneous distribution of addressable devices - will this be a requirement? Say we have a different amount of addressable devices per MPMD, e.g. [0,1], [2,3,4,5,6,7].

Collaborator Author

Yes, this is a requirement for now
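
For reference, a minimal sketch of the assumption this relies on (every host exposes the same number of addressable devices):

import torch_xla.runtime as xr

# The first local device id is recovered as process_index * addressable count,
# which only holds if the addressable device count is the same on every host.
min_device_idx = xr.process_index() * xr.addressable_runtime_device_count()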

@pgmoka pgmoka left a comment

Not much to add. I am interested in follow-ups to other comments


@tengyifei
Collaborator

> I'm not sure if it's possible to use multiple meshes in a single computation. I think within the same SPMD program we have to use only one mesh. I can give it a try and update here later.

OK, also to answer my own question, JAX does these things:

  • When compiling, it looks at the input/output/intermediate shardings to find the device assignment in all of those, and validates that they're the same devices [1]. That means one cannot use two non-overlapping meshes in a single jitted computation.
  • Once it has done that, it just determines the number of partitions based on the number of devices [2].

So this roughly translates to every sharded tensor having a mesh and the device assignment is derived from their meshes.

> (Because torch_xla works on tensors, attaching a mesh to each sharded tensor seems redundant)
> Therefore we have to derive the target devices from the xla::OpSharding.

Is it possible to put a DeviceAssignment object next to the xla::OpSharding object? The DeviceAssignment object would hold a vector of device IDs. It sounds like we already store xla::OpSharding for each sharded tensor, so storing an extra field shouldn't be that much overhead and can handle the situations where xla::OpSharding is insufficient.
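
A rough sketch of that pairing, written in Python for brevity and with hypothetical names (the real objects would be the C++ xla::OpSharding and xla::DeviceAssignment):

from dataclasses import dataclass
from typing import Any, List

@dataclass
class ShardingWithDevices:  # hypothetical wrapper
    op_sharding: Any        # the lowered sharding annotation
    device_ids: List[int]   # PjRt device ids the computation should target

    @property
    def num_partitions(self) -> int:
        return len(self.device_ids)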

sharding.tile_assignment_dimensions().end(), 1,
[](int a, int b) { return a * b; });
std::unordered_map<int, int> device_index =
build_index_map(devices, num_tiles);
Collaborator

If I don't fully shard the tensor over the mesh, is num_tiles the correct value to provide as num_mesh_devices to build_index_map? It seems that we'd need to actually plumb in the mesh (or at least the mesh device count) somehow.

For example, if I'm doing local SPMD with a 2D mesh of 2x2 and axis name 'x', 'y', then later I shard a tensor only over the x axis. Would num_tiles be 2 or 4?

// devices, host 1 has devices TPU:{4, 5, 6, 7}. In this case
// the global ordinal of TPU:4 is 0, TPU:5 is 1, and so on.

int global_ordinal =
Collaborator

nit: should this still be called global_ordinal? or maybe mesh_ordinal, based on your explanation? "global" sounds like it's talking about all devices across different hosts.

@tengyifei
Collaborator

tengyifei commented Mar 17, 2025

Sorry for the late comment -- I'm honestly a bit worried about the subtle distinction between local device IDs and global device IDs, particularly the normalize_logical_mesh function and the logic scattered across Python and C++. I think I understand what's happening well enough to have a suggestion. Here's my understanding -- LMK if I got any part wrong:

  • In order to do "local SPMD", we need to tell XLA that num_partitions == 4, or whatever the number of local addressable devices is. We also need to set the IDs in any OpSharding proto to be local IDs, i.e. begin at 0. Finally, we need to create a xla::DeviceAssignment that maps these device IDs to PjRtDevice::id()s. It's a bit confusing because the PJRT device ID is what everyone else thinks of as "device IDs" when it comes to XLA, and the PJRT device ID is in fact sparse (e.g. it can be 100001 in the case of multi-slice). So some parts of our code refer to the "local/global device ID" as "global ordinals" instead, which is really just the index of a device in the mesh. XLA thinks in terms of the PjRtDevice::id()s, and our mesh ordinal is just an intermediate contract. In the extreme case, we could shuffle them arbitrarily, and as long as we provide an appropriate xla::DeviceAssignment, that will result in the correct collectives.
  • Currently, the mark_sharding call implementations also lack abstraction. They directly build a xla::OpSharding proto, and the same xla::OpSharding proto is later given to the XLA compiler, which requires that the IDs in there start from 0. IIUC that's what forces us to normalize_logical_mesh. If XLA didn't have this requirement, we could avoid normalize_logical_mesh and would just need to have the right xla::DeviceAssignment.
  • As a result, we have many kinds of device ID concepts. There's a "global device ID" that is dense and goes from 0 to the number of chips in the environment. There's a "local device ID" that's derived by subtracting their minimum from a set of device IDs, whose correctness is guaranteed by a subtle check in xs.Mesh that requires device IDs to be locally addressable iff in local SPMD mode. Not to mention there's a third device ID, the non-dense PjRTDevice::id(). The worst part is that they're all untyped integers.

I'm wondering if it makes sense to do a prefactor, to hide the xla::OpSharding proto from the public API. I think instead of storing xla::OpSharding, we'd probably create a torch_xla::OpSharding type that stores the same things, except that it always stores global IDs (i.e. the ones we use to index into xr.global_runtime_device_attributes()). We could also store any other necessary information that lets us recover the number of partitions.
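
For illustration only, a rough sketch of such a wrapper (hypothetical names, written in Python for brevity even though the proposed type would live in C++):

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OpShardingSpec:  # hypothetical stand-in for the proposed torch_xla::OpSharding
    tile_assignment_dims: Tuple[int, ...]
    # Dense global device ids, i.e. indices into xr.global_runtime_device_attributes().
    global_device_ids: List[int]

    def num_partitions(self) -> int:
        return len(self.global_device_ids)

    def to_xla_tile_assignment_devices(self) -> List[int]:
        # Only at compile time are the ids rebased to start from 0, as the XLA
        # compiler requires; the Python layer never needs to normalize them.
        base = min(self.global_device_ids)
        return [d - base for d in self.global_device_ids]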

Once we've done this refactor, the Python layer can reason in terms of torch_xla::OpSharding objects instead of xla::OpSharding, so it's not forced to normalize the device IDs. We'll still have two kinds of device IDs (the global dense ID and the sparse PjRTDevice::id()), but that's better than having three.

If this feature is no longer urgently required, I wonder if it's possible to do this refactor. I think it could also let us support other kinds of SPMD (e.g. using 2 chips out of 4 chips) instead of being restricted to either full local SPMD or global SPMD.
