
Add subclass based method for inference w/ MXFP8 #2132


Open
wants to merge 1 commit into main from drisspg/stack/50

Conversation

Contributor
@drisspg drisspg commented Apr 25, 2025

Stacked PRs:


add subclass based method for inference

Perf comparisons

Micro

https://fburl.com/whh557d1

I am seeing more overhead than expected.

Macro

Running the non-quantized baseline:

python benchmarks/benchmark_serving.py \
 --backend vllm \
 --model "Qwen/Qwen2-7B-Instruct" \
 --endpoint /v1/completions \
 --dataset-name sharegpt \
 --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
 --num-prompts 1024

============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  13.14     
Total input tokens:                      225502    
Total generated tokens:                  185804    
Request throughput (req/s):              77.91     
Output token throughput (tok/s):         14137.59  
Total Token throughput (tok/s):          31295.75  
---------------Time to First Token----------------
Mean TTFT (ms):                          2659.95   
Median TTFT (ms):                        2613.38   
P99 TTFT (ms):                           4457.53   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.95     
Median TPOT (ms):                        28.56     
P99 TPOT (ms):                           167.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.52     
Median ITL (ms):                         15.42     
P99 ITL (ms):                            148.78    
==================================================

Running against MXFP8:

python benchmarks/benchmark_serving.py \  
  --backend vllm \
  --model "/home/drisspg/meta/scripts/data/mxfp8-Qwen2-7B-Instruct" \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024
============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  13.43     
Total input tokens:                      225502    
Total generated tokens:                  185297    
Request throughput (req/s):              76.26     
Output token throughput (tok/s):         13800.30  
Total Token throughput (tok/s):          30594.93  
---------------Time to First Token----------------
Mean TTFT (ms):                          1119.68   
Median TTFT (ms):                        1100.86   
P99 TTFT (ms):                           1721.80   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.11     
Median TPOT (ms):                        27.07     
P99 TPOT (ms):                           46.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.38     
Median ITL (ms):                         17.28     
P99 ITL (ms):                            49.45     
==================================================

Trace NonQuant:
https://fburl.com/sput3bmn

Trace Quant:
https://fburl.com/0pgmyrge
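
For context, a minimal sketch of how an MXFP8 checkpoint like the one benchmarked above could be produced with a subclass-based quantize_ flow. The config name MXFPInferenceConfig and its import path are assumptions for illustration, not necessarily the exact API added in this PR:

# Hypothetical sketch: quantize a BF16 checkpoint to MXFP8 via torchao's
# module-swap API, then save it for serving with vLLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_
# Assumed config name / import path, for illustration only:
from torchao.prototype.mx_formats import MXFPInferenceConfig

model_id = "Qwen/Qwen2-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Swap nn.Linear weights for MXFP8 tensor subclasses
# (MXFP8 weight storage plus dynamic MXFP8 activation quantization).
quantize_(model, MXFPInferenceConfig())

out_dir = "data/mxfp8-Qwen2-7B-Instruct"
model.save_pretrained(out_dir, safe_serialization=False)
tokenizer.save_pretrained(out_dir)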

pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2132

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 44a878b with merge base 955ebb0:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg added a commit that referenced this pull request Apr 25, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 25, 2025
@@ -53,4 +53,5 @@
"swizzle",
"testing",
"ops",
"quantization",
Contributor
why is swizzle top level?

Contributor Author
That's a really good question.

return f"in_features={self.weight.shape[1]}, out_features={self.weight.shape[0]}, weight={repr(self.weight)}"


def _input_activation_quant_func_mxfp(
Contributor
nit: add to safe globals for serialization?

torch.serialization.add_safe_globals(
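
A minimal sketch of that suggestion, assuming _input_activation_quant_func_mxfp is the callable referenced by serialized quantized weights; the import path below is a guess for illustration:

# Sketch: allowlist the activation-quant callable so checkpoints that reference
# it can be loaded with torch.load(..., weights_only=True).
import torch
# Assumed import path, for illustration only:
from torchao.prototype.mx_formats.mx_linear import _input_activation_quant_func_mxfp

torch.serialization.add_safe_globals([_input_activation_quant_func_mxfp])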

@drisspg
Contributor Author

drisspg commented Apr 26, 2025

Sorry, this should still be a draft; I am just dumping all my commits for now.

@jerryzh168

@drisspg drisspg changed the base branch from drisspg/stack/49 to main April 26, 2025 01:05
@drisspg drisspg changed the base branch from main to drisspg/stack/49 April 26, 2025 01:05
@drisspg drisspg mentioned this pull request Apr 26, 2025
@drisspg drisspg changed the base branch from drisspg/stack/49 to main April 26, 2025 01:18
@drisspg drisspg changed the base branch from main to drisspg/stack/49 April 26, 2025 01:18
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 03:09
@drisspg drisspg changed the base branch from main to drisspg/stack/49 May 2, 2025 03:10
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 17:12
@drisspg drisspg changed the base branch from main to drisspg/stack/49 May 2, 2025 17:12
@drisspg drisspg added the quantize and topic: new feature labels May 2, 2025
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 19:54
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 34fa252 to 4304218 Compare May 2, 2025 19:54
@drisspg drisspg mentioned this pull request May 2, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 4304218 to 44a878b Compare May 2, 2025 21:04
@drisspg drisspg changed the title add subclass based method for inference Add subclass based method for inference w/ MXFP8 May 2, 2025
@drisspg drisspg closed this May 2, 2025
@drisspg drisspg reopened this May 2, 2025
Labels
CLA Signed · quantize · topic: new feature
3 participants