
Add subclass based method for inference w/ MXFP8 #2132


Open
wants to merge 1 commit into main from drisspg/stack/50

Conversation

Contributor
@drisspg drisspg commented Apr 25, 2025

Stacked PRs:


add subclass based method for inference

Perf comparisons

Micro

https://fburl.com/whh557d1

I am seeing more overhead than expected.

Macro

Running the non-quantized baseline:

python benchmarks/benchmark_serving.py \
 --backend vllm \
 --model "Qwen/Qwen2-7B-Instruct" \
 --endpoint /v1/completions \
 --dataset-name sharegpt \
 --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
 --num-prompts 1024

============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  13.14     
Total input tokens:                      225502    
Total generated tokens:                  185804    
Request throughput (req/s):              77.91     
Output token throughput (tok/s):         14137.59  
Total Token throughput (tok/s):          31295.75  
---------------Time to First Token----------------
Mean TTFT (ms):                          2659.95   
Median TTFT (ms):                        2613.38   
P99 TTFT (ms):                           4457.53   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.95     
Median TPOT (ms):                        28.56     
P99 TPOT (ms):                           167.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.52     
Median ITL (ms):                         15.42     
P99 ITL (ms):                            148.78    
==================================================

Running against MXFP8:

python benchmarks/benchmark_serving.py \  
  --backend vllm \
  --model "/home/drisspg/meta/scripts/data/mxfp8-Qwen2-7B-Instruct" \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024
============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  13.43     
Total input tokens:                      225502    
Total generated tokens:                  185297    
Request throughput (req/s):              76.26     
Output token throughput (tok/s):         13800.30  
Total Token throughput (tok/s):          30594.93  
---------------Time to First Token----------------
Mean TTFT (ms):                          1119.68   
Median TTFT (ms):                        1100.86   
P99 TTFT (ms):                           1721.80   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.11     
Median TPOT (ms):                        27.07     
P99 TPOT (ms):                           46.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.38     
Median ITL (ms):                         17.28     
P99 ITL (ms):                            49.45     
==================================================

Trace NonQuant:
https://fburl.com/sput3bmn

Trace Quant:
https://fburl.com/0pgmyrge
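
For context, a minimal sketch of how an MXFP8 checkpoint like the one benchmarked above could be produced with a subclass-based quantize_ flow. The config name MXFPInferenceConfig and its import path are assumptions for illustration, not necessarily the exact API added in this PR:

# Hypothetical sketch: quantize a BF16 checkpoint to MXFP8 via torchao's
# module-swap API, then save it for serving with vLLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_
# Assumed config name / import path, for illustration only:
from torchao.prototype.mx_formats import MXFPInferenceConfig

model_id = "Qwen/Qwen2-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Swap nn.Linear weights for MXFP8 tensor subclasses
# (MXFP8 weight storage plus dynamic MXFP8 activation quantization).
quantize_(model, MXFPInferenceConfig())

out_dir = "data/mxfp8-Qwen2-7B-Instruct"
model.save_pretrained(out_dir, safe_serialization=False)
tokenizer.save_pretrained(out_dir)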

pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2132

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 44a878b with merge base 955ebb0:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg added a commit that referenced this pull request Apr 25, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 25, 2025
@@ -53,4 +53,5 @@
"swizzle",
"testing",
"ops",
"quantization",
Contributor
why is swizzle top level?

Contributor Author
That's a really good question.

return f"in_features={self.weight.shape[1]}, out_features={self.weight.shape[0]}, weight={repr(self.weight)}"


def _input_activation_quant_func_mxfp(
Contributor
nit: add to safe globals for serialization?

torch.serialization.add_safe_globals(
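
A minimal sketch of that suggestion, assuming _input_activation_quant_func_mxfp is the callable referenced by serialized quantized weights; the import path below is a guess for illustration:

# Sketch: allowlist the activation-quant callable so checkpoints that reference
# it can be loaded with torch.load(..., weights_only=True).
import torch
# Assumed import path, for illustration only:
from torchao.prototype.mx_formats.mx_linear import _input_activation_quant_func_mxfp

torch.serialization.add_safe_globals([_input_activation_quant_func_mxfp])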

@drisspg
Contributor Author

drisspg commented Apr 26, 2025

Sorry, this should still be a draft; I am just dumping all my commits for now.

@jerryzh168

@drisspg drisspg changed the base branch from drisspg/stack/49 to main April 26, 2025 01:05
@drisspg drisspg changed the base branch from main to drisspg/stack/49 April 26, 2025 01:05
@drisspg drisspg mentioned this pull request Apr 26, 2025
@drisspg drisspg changed the base branch from drisspg/stack/49 to main April 26, 2025 01:18
@drisspg drisspg changed the base branch from main to drisspg/stack/49 April 26, 2025 01:18
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 03:09
@drisspg drisspg changed the base branch from main to drisspg/stack/49 May 2, 2025 03:10
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 17:12
@drisspg drisspg changed the base branch from main to drisspg/stack/49 May 2, 2025 17:12
@drisspg drisspg added the quantize and topic: new feature labels May 2, 2025
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 19:54
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 34fa252 to 4304218 Compare May 2, 2025 19:54
@drisspg drisspg mentioned this pull request May 2, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 4304218 to 44a878b Compare May 2, 2025 21:04
@drisspg drisspg changed the title add subclass based method for inference Add subclass based method for inference w/ MXFP8 May 2, 2025
@drisspg drisspg closed this May 2, 2025
@drisspg drisspg reopened this May 2, 2025
Labels
CLA Signed · quantize · topic: new feature
3 participants