Add subclass based method for inference w/ MXFP8 #2132

Merged
merged 1 commit into from
May 12, 2025

drisspg
Contributor

@drisspg drisspg commented Apr 25, 2025

Stacked PRs:


add subclass based method for inference

Perf comparisons

Micro

https://fburl.com/whh557d1

I am seeing more overhead than expected.

Macro

Running against the non-quantized baseline:

python benchmarks/benchmark_serving.py \
 --backend vllm \
 --model "Qwen/Qwen2-7B-Instruct" \
 --endpoint /v1/completions \
 --dataset-name sharegpt \
 --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
 --num-prompts 1024
============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  13.14     
Total input tokens:                      225502    
Total generated tokens:                  185804    
Request throughput (req/s):              77.91     
Output token throughput (tok/s):         14137.59  
Total Token throughput (tok/s):          31295.75  
---------------Time to First Token----------------
Mean TTFT (ms):                          2659.95   
Median TTFT (ms):                        2613.38   
P99 TTFT (ms):                           4457.53   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.95     
Median TPOT (ms):                        28.56     
P99 TPOT (ms):                           167.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.52     
Median ITL (ms):                         15.42     
P99 ITL (ms):                            148.78    
==================================================

Running against MXFP8:

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model "/home/drisspg/meta/scripts/data/mxfp8-Qwen2-7B-Instruct" \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024
============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  13.43     
Total input tokens:                      225502    
Total generated tokens:                  185297    
Request throughput (req/s):              76.26     
Output token throughput (tok/s):         13800.30  
Total Token throughput (tok/s):          30594.93  
---------------Time to First Token----------------
Mean TTFT (ms):                          1119.68   
Median TTFT (ms):                        1100.86   
P99 TTFT (ms):                           1721.80   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.11     
Median TPOT (ms):                        27.07     
P99 TPOT (ms):                           46.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.38     
Median ITL (ms):                         17.28     
P99 ITL (ms):                            49.45     
==================================================
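To make the two runs easier to compare, here is a small sketch that computes the percent change of a few metrics from the non-quantized run to the MXFP8 run. The numbers are copied from the two result tables above; the dictionary keys and the `percent_change` helper are illustrative names, not part of the benchmark script. Each configuration was run once, so treat the deltas as indicative only.

```python
# Selected metrics copied from the two "Serving Benchmark Result" tables above.
# Key names are made up for this sketch; they are not emitted by the benchmark.
baseline = {
    "req_per_s": 77.91,
    "output_tok_per_s": 14137.59,
    "mean_ttft_ms": 2659.95,
    "p99_ttft_ms": 4457.53,
    "mean_tpot_ms": 51.95,
    "median_tpot_ms": 28.56,
    "p99_tpot_ms": 167.73,
    "median_itl_ms": 15.42,
    "p99_itl_ms": 148.78,
}
mxfp8 = {
    "req_per_s": 76.26,
    "output_tok_per_s": 13800.30,
    "mean_ttft_ms": 1119.68,
    "p99_ttft_ms": 1721.80,
    "mean_tpot_ms": 29.11,
    "median_tpot_ms": 27.07,
    "p99_tpot_ms": 46.91,
    "median_itl_ms": 17.28,
    "p99_itl_ms": 49.45,
}


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Negative values mean the MXFP8 run reported a smaller number than the baseline.
changes = {name: percent_change(baseline[name], mxfp8[name]) for name in baseline}

for name, delta in changes.items():
    print(f"{name:>18}: {baseline[name]:>10.2f} -> {mxfp8[name]:>10.2f} ({delta:+.1f}%)")
```

On these numbers, mean TTFT drops by roughly 58% and P99 TPOT by roughly 72%, while request throughput is about 2% lower and median ITL is about 12% higher.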

Trace NonQuant:
https://fburl.com/sput3bmn

Trace Quant:
https://fburl.com/0pgmyrge

pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2132

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg added a commit that referenced this pull request Apr 25, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 25, 2025
@drisspg
Contributor Author

drisspg commented Apr 26, 2025

Sorry, this should still be a draft; I am just dumping all my commits for now.

@jerryzh168

@drisspg drisspg changed the base branch from drisspg/stack/49 to main April 26, 2025 01:05
@drisspg drisspg changed the base branch from main to drisspg/stack/49 April 26, 2025 01:05
@drisspg drisspg mentioned this pull request Apr 26, 2025
@drisspg drisspg changed the base branch from drisspg/stack/49 to main April 26, 2025 01:18
@drisspg drisspg changed the base branch from main to drisspg/stack/49 April 26, 2025 01:18
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 03:09
@drisspg drisspg changed the base branch from main to drisspg/stack/49 May 2, 2025 03:10
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 17:12
@drisspg drisspg changed the base branch from main to drisspg/stack/49 May 2, 2025 17:12
@drisspg drisspg added quantize topic: new feature Use this tag if this PR adds a new feature labels May 2, 2025
@drisspg drisspg changed the base branch from drisspg/stack/49 to main May 2, 2025 19:54
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 34fa252 to 4304218 Compare May 2, 2025 19:54
@drisspg drisspg mentioned this pull request May 2, 2025
drisspg added a commit that referenced this pull request May 2, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 4304218 to 44a878b Compare May 2, 2025 21:04
@drisspg drisspg changed the title add subclass based method for inference Add subclass based method for inference w/ MXFP8 May 2, 2025
@drisspg drisspg closed this May 2, 2025
@drisspg drisspg reopened this May 2, 2025
@drisspg drisspg requested a review from jerryzh168 May 3, 2025 23:49
@drisspg drisspg force-pushed the drisspg/stack/50 branch 3 times, most recently from 1ef5526 to 7147243 Compare May 7, 2025 23:08
drisspg added a commit that referenced this pull request May 8, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 7147243 to ef2490e Compare May 8, 2025 17:52
@drisspg drisspg changed the base branch from main to drisspg/stack/53 May 8, 2025 17:52
@drisspg drisspg changed the base branch from drisspg/stack/53 to main May 8, 2025 18:05
drisspg added a commit that referenced this pull request May 8, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg force-pushed the drisspg/stack/50 branch from ef2490e to 204f230 Compare May 8, 2025 18:05
@drisspg drisspg changed the base branch from main to drisspg/stack/53 May 8, 2025 18:05
@drisspg drisspg changed the base branch from drisspg/stack/53 to main May 8, 2025 20:30
drisspg added a commit that referenced this pull request May 8, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 204f230 to c244b32 Compare May 8, 2025 20:30
@drisspg drisspg changed the base branch from main to drisspg/stack/53 May 8, 2025 20:30
@drisspg drisspg changed the base branch from drisspg/stack/53 to main May 8, 2025 20:35
@drisspg drisspg changed the base branch from main to drisspg/stack/53 May 8, 2025 20:35
@drisspg drisspg changed the base branch from drisspg/stack/53 to main May 8, 2025 20:40
drisspg added a commit that referenced this pull request May 8, 2025
stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg force-pushed the drisspg/stack/50 branch from c244b32 to 65d6729 Compare May 8, 2025 20:40
@drisspg drisspg changed the base branch from main to drisspg/stack/53 May 8, 2025 20:40
Contributor

@vkuzo vkuzo left a comment

lg!

stack-info: PR: #2132, branch: drisspg/stack/50
@drisspg drisspg changed the base branch from drisspg/stack/53 to main May 12, 2025 17:00
@drisspg drisspg force-pushed the drisspg/stack/50 branch from 65d6729 to 17f928d Compare May 12, 2025 17:00
@drisspg drisspg merged commit 0607aa1 into main May 12, 2025
17 checks passed
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. quantize topic: new feature Use this tag if this PR adds a new feature
4 participants