Adapt Splash Attention from TorchPrime #8911
base: master
Conversation
Nice!
from torch_xla.experimental.custom_kernel import requires_jax

@dataclasses.dataclass
We can add eq=True and hash=True to the dataclasses decorator call. That way any instance of this config is hashable and can be passed directly as an argument to call_jax, avoiding the LRU cache. (I left a comment with more details later.)
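For example, a minimal sketch of that suggestion (the field names here are illustrative only; also note that dataclasses spells the option unsafe_hash=True, or frozen=True together with eq=True, rather than hash=True):

import dataclasses

@dataclasses.dataclass(eq=True, frozen=True)  # frozen + eq makes dataclasses generate __hash__
class SplashAttentionConfig:
  # Hypothetical fields; the real config also carries mesh / block-size info.
  attn_logits_soft_cap: float | None = None
  qkv_partition_spec: tuple = (("data", "fsdp"), None, None, None)

cfg = SplashAttentionConfig()
assert hash(cfg) == hash(SplashAttentionConfig())  # hashable, so usable as a call_jax/jit cache key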
)

mesh = config.maybe_convert_and_get_jax_mesh()
# input q,k,v shape: [batch, #head, seq_len, kv] |
Should the last dim be called "head" or "head_dim"?
query.shape[2] == decoder_segment_ids.q.shape[1]
), "Sharding along sequence dimension not allowed in tpu kernel attention"
block_sizes = splash_attention_kernel.BlockSizes(
block_q=min(global_block_q, query.shape[2]), |
Could we factor out a seq_len variable?
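Something along these lines (a sketch reusing the names from the snippet above; block_kv, global_block_kv, and key are assumed):

seq_len = query.shape[2]  # query: [batch, num_heads, seq_len, head_dim]
assert seq_len == decoder_segment_ids.q.shape[1], (
    "Sharding along sequence dimension not allowed in tpu kernel attention")
block_sizes = splash_attention_kernel.BlockSizes(
    block_q=min(global_block_q, seq_len),
    block_kv=min(global_block_kv, key.shape[2]),
)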
("data", "fsdp"),
None,
)
AttentionType_LOCAL_SLIDING: bool = False |
nit: field names should be simple snake case
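i.e. roughly:

# before
AttentionType_LOCAL_SLIDING: bool = False
# after (plain snake_case)
attention_type_local_sliding: bool = False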

def maybe_reduce_kv_grad(self, hidden_state_grad):
# For GQA, the kv grad shape is [BATCH_SIZE, NUM_Q_HEADS, SEQ_LEN,
# HEAD_DIM]. We need to convert it back to [BATCH_SIZE, NUM_Q_HEADS, |
Convert back to NUM_KV_HEADS?
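For reference, a standalone sketch of the reduction being described (names are assumed, not the PR's exact code): the grad arrives with one slice per query head, and summing over each group of query heads that shares a kv head recovers the kv-head shape.

import torch

def reduce_kv_grad(grad: torch.Tensor, num_kv_heads: int) -> torch.Tensor:
  # grad: [BATCH_SIZE, NUM_Q_HEADS, SEQ_LEN, HEAD_DIM]
  b, num_q_heads, s, d = grad.shape
  group = num_q_heads // num_kv_heads  # query heads per kv head (GQA)
  # -> [BATCH_SIZE, NUM_KV_HEADS, SEQ_LEN, HEAD_DIM]
  return grad.reshape(b, num_kv_heads, group, s, d).sum(dim=2)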
self.k_grad = self.maybe_reduce_kv_grad(k_grad)
self.v_grad = self.maybe_reduce_kv_grad(v_grad)

def maybe_expend_kv(self, hidden_state):
def maybe_expend_kv(self, hidden_state): |
nit: expand, or repeat
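The forward-side counterpart can be spelled with repeat_interleave, e.g. (again only a sketch with assumed names, consistent with the reduce_kv_grad sketch above):

def expand_kv(hidden_state: torch.Tensor, num_q_heads: int) -> torch.Tensor:
  # hidden_state: [BATCH_SIZE, NUM_KV_HEADS, SEQ_LEN, HEAD_DIM]
  num_kv_heads = hidden_state.shape[1]
  # -> [BATCH_SIZE, NUM_Q_HEADS, SEQ_LEN, HEAD_DIM]
  return hidden_state.repeat_interleave(num_q_heads // num_kv_heads, dim=1)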
@@ -41,6 +41,7 @@ run_xla_hlo_debug python3 "$TEST_CDIR/scan/test_scan_debug.py"
python3 "$TEST_CDIR/test_pallas.py" -v
python3 "$TEST_CDIR/test_pallas_spmd.py"
XLA_DISABLE_FUNCTIONALIZATION=1 python3 "$TEST_CDIR/test_pallas_spmd.py"
python3 "$TEST_CDIR/test_splash_attention_jax.py" |
nit: Jax is just an implementation detail. We could simply call this file test_splash_attention.py
@@ -0,0 +1,397 @@
import dataclasses |
nit: I think we could call this splash_attention.py to be more specific. If there are generic Jax utilities, those could be put in a custom_kernels_from_jax.py or similar
k: torch.Tensor,
v: torch.Tensor,
config: str,
decoder_segment_ids: torch.Tensor | None = None, |
Could we at least document the segment IDs and the soft cap in triple-quoted docstrings?
Maybe also document how splash attention differs from flash attention, e.g. splash attention can be faster when the attention mask is sparse because it skips fully masked blocks.
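A possible shape for that docstring (the function name, the q parameter, and the soft-cap formula are assumptions, not text from the PR):

def splash_attention(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    config: str,
    decoder_segment_ids: torch.Tensor | None = None,
):
  """Runs splash (sparse flash) attention via the JAX Pallas TPU kernel.

  Unlike flash attention, splash attention exploits sparsity in the attention
  mask: blocks that are fully masked out (e.g. under causal or local
  sliding-window masks) are skipped entirely, so it can be faster when the
  mask is sparse.

  Args:
    q, k, v: [batch, num_heads, seq_len, head_dim] tensors.
    config: serialized attention config (mesh, block sizes, attention type,
      and an optional logit soft cap; when set, logits are squashed as
      soft_cap * tanh(logits / soft_cap) before the softmax).
    decoder_segment_ids: optional [batch, seq_len] segment ids for packed
      sequences; a token only attends to keys in the same segment.
  """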
Adapts the PR AI-Hypercomputer/torchprime#145 from TorchPrime into PTXLA, and simplifies the code to use the jit hashing from #8878.