Add pod_type categorical feature to latency prediction models#1993

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from RishabhSaini:slo-pd-support
Feb 2, 2026
Conversation

@RishabhSaini
Contributor

@RishabhSaini RishabhSaini commented Dec 12, 2025

Add pod_type support for prefill-decode (PD) disaggregated serving:

  • Added pod_type categorical feature (''=monolithic, 'prefill', 'decode') to both TTFT and TPOT prediction models
  • Implemented pod_type_cat encoding in feature preparation with proper categorical handling
  • Updated Bayesian Ridge models to one-hot encode pod_type (can't handle categoricals directly)
  • Modified XGBoost monotone constraints from 5 -> 6 features, with no constraint on pod_type
  • Added pod_type field to API models (PredictionRequest, TrainingEntry) in Python and Go
  • Ensured backward compatibility by defaulting to '' (monolithic) when pod_type is missing

Resolves: #1923

Links to related PRs:
llm-d/llm-d#596
llm-d/llm-d-inference-scheduler#564
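
As a rough illustration of the encoding described in the bullets above (helper and column names here are hypothetical, not the PR's actual code — the real logic lives in `_prepare_features_with_interaction`), a sketch might look like:

```python
import pandas as pd

POD_TYPES = ["", "prefill", "decode"]  # '' = monolithic (the default)

def encode_pod_type(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: map pod_type to an integer category code, defaulting
    missing values to '' (monolithic) for backward compatibility."""
    df = df.copy()
    if "pod_type" not in df.columns:
        df["pod_type"] = ""
    df["pod_type"] = df["pod_type"].fillna("")
    df["pod_type_cat"] = pd.Categorical(
        df["pod_type"], categories=POD_TYPES
    ).codes  # '' -> 0, 'prefill' -> 1, 'decode' -> 2
    return df

def one_hot_pod_type(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: Bayesian Ridge is linear and cannot consume raw category
    codes directly, so expand pod_type_cat into indicator columns."""
    for code, name in enumerate(POD_TYPES):
        df[f"pod_type_{name or 'monolithic'}"] = (
            df["pod_type_cat"] == code
        ).astype(int)
    return df
```

Tree models (XGBoost/LightGBM) can consume the integer code column directly; only the linear Bayesian Ridge path needs the one-hot expansion.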

@netlify

netlify bot commented Dec 12, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 1fd3277
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69812e3c87c6af0008468e65
😎 Deploy Preview https://deploy-preview-1993--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 12, 2025
@k8s-ci-robot
Contributor

Hi @RishabhSaini. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 12, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 12, 2025
@kfswain kfswain added the gie-area/predictor Categorizes an issue or PR as relevant to GIE predictor label Dec 15, 2025
@kfswain kfswain requested review from kaushikmitr and removed request for kfswain and nirrozenbaum December 15, 2025 23:13
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 16, 2025
@kaushikmitr
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 10, 2026
@RishabhSaini RishabhSaini changed the title SLO Aware Routing PD Disaggregation Support Added pod_type categorical feature to latency prediction models Jan 13, 2026
@RishabhSaini RishabhSaini changed the title Added pod_type categorical feature to latency prediction models Add pod_type categorical feature to latency prediction models Jan 13, 2026
@kaushikmitr
Contributor

@RishabhSaini this looks good; can you post the updated test_dual_server outputs here?

@RishabhSaini
Contributor Author

> kubectl logs -n llm-d-pd-r pods/latency-predictor-test-5gnnq
============================= test session starts ==============================
platform linux -- Python 3.9.25, pytest-8.4.2, pluggy-1.6.0 -- /usr/local/bin/python3.9
cachedir: .pytest_cache
rootdir: /app
plugins: asyncio-1.2.0, anyio-4.12.0
asyncio: mode=strict, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 32 items

test_dual_server_client.py::test_prediction_server_healthz Waiting for prediction server...
Waiting for training server...
PASSED
test_dual_server_client.py::test_training_server_healthz PASSED
test_dual_server_client.py::test_prediction_server_readyz PASSED
test_dual_server_client.py::test_training_server_readyz PASSED
test_dual_server_client.py::test_prediction_server_status Prediction server using model type: xgboost
Quantile: 0.9
Models ready: True
Models exist: {'ttft_model': True, 'tpot_model': True}
PASSED
test_dual_server_client.py::test_training_server_model_info Training server using model type: xgboost
PASSED
test_dual_server_client.py::test_training_server_models_list Model ttft: exists=True, size=143527 bytes
Model tpot: exists=True, size=143279 bytes
PASSED
test_dual_server_client.py::test_model_download_from_training_server Successfully downloaded ttft model (143527 bytes)
Successfully downloaded tpot model (143279 bytes)
PASSED
test_dual_server_client.py::test_lightgbm_endpoints_on_training_server Skipping LightGBM endpoint tests - not using LightGBM model
PASSED
test_dual_server_client.py::test_add_training_data_to_training_server Successfully sent training data to training server
PASSED
test_dual_server_client.py::test_prediction_server_model_sync Model reload result: synced=False, loaded=True
Prediction server models are ready!
PASSED
test_dual_server_client.py::test_prediction_via_prediction_server Prediction successful: TTFT=10.00ms, TPOT=10.00ms
Model type: xgboost
PASSED
test_dual_server_client.py::test_bulk_prediction_strict Testing bulk prediction strict endpoint...
✓ Bulk prediction strict endpoint test passed
PASSED
test_dual_server_client.py::test_bulk_prediction_with_validation_errors Testing bulk prediction validation error handling...
✓ Bulk prediction correctly failed when any request had validation errors
PASSED
test_dual_server_client.py::test_bulk_prediction_all_valid Testing bulk prediction with all valid requests...
✓ Bulk prediction succeeded with all valid requests
PASSED
test_dual_server_client.py::test_prediction_missing_prefix_cache_score ✓ Prediction correctly failed when prefix_cache_score was missing
PASSED
test_dual_server_client.py::test_prediction_with_pod_type_prefill Testing prediction with pod_type='prefill'...
✓ Prefill prediction: TTFT=10.00ms, TPOT=10.00ms
PASSED
test_dual_server_client.py::test_prediction_with_pod_type_decode Testing prediction with pod_type='decode'...
✓ Decode prediction: TTFT=10.00ms, TPOT=10.00ms
PASSED
test_dual_server_client.py::test_bulk_prediction_with_pod_type Testing bulk prediction with pod_type...
  Prefill: TTFT=10.00ms, TPOT=10.00ms
  Decode: TTFT=10.00ms, TPOT=10.00ms
  Legacy: TTFT=10.00ms, TPOT=10.00ms
✓ Bulk prediction with mixed pod types passed
PASSED
test_dual_server_client.py::test_training_data_with_pod_type Testing training data with pod_type...
✓ Successfully sent 20 training samples with pod_type
PASSED
test_dual_server_client.py::test_invalid_pod_type Testing invalid pod_type handling...
✓ Invalid pod_type accepted with fallback behavior (permissive validation)
PASSED
test_dual_server_client.py::test_training_server_metrics Training server metrics endpoint working correctly
✓ Prefix cache score feature found in metrics
PASSED
test_dual_server_client.py::test_model_consistency_between_servers Model type consistent across servers: xgboost
PASSED
test_dual_server_client.py::test_model_specific_endpoints_on_training_server Testing XGBoost tree endpoints on training server...
✓ TTFT XGBoost trees available: 200 trees
✓ TPOT XGBoost trees available: 200 trees
PASSED
test_dual_server_client.py::test_dual_server_quantile_regression_learns_distribution Relative-err accuracy (≤15%): 88.5%
Coverage: TTFT=0.875, TPOT=0.905 (target 0.900 ± 0.05)
PASSED
test_dual_server_client.py::test_prediction_server_stress_test Running prediction server stress test...
Waiting for 100007 prediction requests to complete...
Target QPS: 1000.0, Actual QPS: 1000.1

==================================================
PREDICTION SERVER STRESS TEST RESULTS
==================================================
Total Requests: 100007
Successful: 100007 (100.0%)
Failed: 0 (0.0%)
Average Response Time: 15.30ms

Model Types in Predictions:
  xgboost: 100007

Status Code Distribution:
  200: 100007

Response Time Percentiles:
  P50: 4.77ms
  P95: 75.67ms
  P99: 179.83ms
Prediction server stress test completed with 100.0% success rate
PASSED
test_dual_server_client.py::test_bulk_prediction_stress_test Running bulk prediction stress test...

Testing with batch size 5...
Waiting for 100000 bulk prediction requests to complete...
Target RPS: 1000.0, Actual RPS: 1000.0
Total Predictions: 500000, Predictions/sec: 5000.0

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 100000
Successful: 100000 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 500000
Total Batch Size: 500000
Average Response Time: 16.49ms
Average Batch Size: 5.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 100000

Response Time Percentiles:
  P50: 8.86ms
  P95: 67.90ms
  P99: 149.37ms
Bulk prediction stress test (batch size 5) completed with 100.0% success rate

Testing with batch size 10...
Waiting for 100001 bulk prediction requests to complete...
Target RPS: 1000.0, Actual RPS: 1000.0
Total Predictions: 1000010, Predictions/sec: 10000.1

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 100001
Successful: 100001 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 1000010
Total Batch Size: 1000010
Average Response Time: 33.80ms
Average Batch Size: 10.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 100001

Response Time Percentiles:
  P50: 8.99ms
  P95: 202.07ms
  P99: 299.30ms
Bulk prediction stress test (batch size 10) completed with 100.0% success rate

Testing with batch size 25...
Waiting for 99998 bulk prediction requests to complete...
Target RPS: 1000.0, Actual RPS: 1000.0
Total Predictions: 2499950, Predictions/sec: 24999.5

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 99998
Successful: 99998 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 2499950
Total Batch Size: 2499950
Average Response Time: 58.63ms
Average Batch Size: 25.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 99998

Response Time Percentiles:
  P50: 13.77ms
  P95: 277.73ms
  P99: 369.84ms
Bulk prediction stress test (batch size 25) completed with 100.0% success rate
PASSED
test_dual_server_client.py::test_large_batch_prediction_stress_test Running bulk prediction stress test...

Testing with batch size 1000...
Waiting for 9999 bulk prediction requests to complete...
Target RPS: 100.0, Actual RPS: 100.0
Total Predictions: 9999000, Predictions/sec: 99990.0

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 9999
Successful: 9999 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 9999000
Total Batch Size: 9999000
Average Response Time: 38.42ms
Average Batch Size: 1000.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 9999

Response Time Percentiles:
  P50: 30.80ms
  P95: 82.13ms
  P99: 160.23ms
Bulk prediction stress test (batch size 1000) completed with 100.0% success rate
PASSED
test_dual_server_client.py::test_end_to_end_workflow Testing end-to-end workflow...
Step 1: Sending training data to training server...
Step 2: Waiting for training...
Step 3: Syncing models to prediction server...
Step 4: Making predictions...
  Prediction 1: TTFT=1001.76ms, TPOT=300.85ms (prefix_cache=0.44)
  Prediction 2: TTFT=646.67ms, TPOT=242.58ms (prefix_cache=0.48)
  Prediction 3: TTFT=1384.65ms, TPOT=414.87ms (prefix_cache=0.55)
  Prediction 4: TTFT=1388.67ms, TPOT=412.35ms (prefix_cache=0.28)
  Prediction 5: TTFT=1513.01ms, TPOT=452.89ms (prefix_cache=0.33)
✓ End-to-end workflow completed successfully!
PASSED
test_dual_server_client.py::test_server_configuration Testing server configuration...
Prediction server: HTTP-based Quantile Latency Predictor is running
  Model type: xgboost
  Is ready: True
  Sync interval: 10s
  Training server URL: http://training-service:8000
Training server: Latency Predictor is running.
  Model type: xgboost
PASSED
test_dual_server_client.py::test_training_server_flush_api Testing training server flush API...
Step 1: Checking initial data status...
  Initial training samples: TTFT=2772, TPOT=2763
  Initial test samples: TTFT=318, TPOT=317
Step 2: Adding training data...
  Added 100 training samples
Step 3: Verifying data was added...
  After adding - Training: 5721, Test: 649, Total: 6370
Step 4: Testing flush with only training data...
  Flushed 2865 TTFT training samples
  Flushed 2856 TPOT training samples
  Test samples flushed: 0 TTFT, 0 TPOT (should be 0)
  After training flush - Training: 0, Test: 649
Step 5: Adding more training data...
Step 6: Testing flush everything...
  Complete flush message: Successfully flushed: 44 TTFT and 44 TPOT training samples, 331 TTFT and 330 TPOT test samples, all metric scores
  After complete flush - Training: 0, Test: 0
Step 7: Testing default flush (no body)...
  Default flush result: Successfully flushed: 16 TTFT and 16 TPOT training samples, 4 TTFT and 4 TPOT test samples, all metric scores
Step 8: Testing flush with only test data...
  Test data flush: 3 TTFT, 3 TPOT
  After test flush - Training: 94, Test: 0
Step 9: Testing bucket distribution in status...
  Bucket distribution available: 0 buckets with data
✓ Flush API tests passed!
PASSED
test_dual_server_client.py::test_training_server_flush_error_handling Testing flush API error handling...
✓ Invalid JSON handled correctly
✓ Flush error handling tests passed!
PASSED

======================== 32 passed in 563.81s (0:09:23) ========================

to both TTFT and TPOT prediction models

- Add pod_type field to PredictionRequest and TrainingEntry models
- Encode pod_type as categorical in _prepare_features_with_interaction
- Handle pod_type_cat in both TTFT and TPOT feature columns
- One-hot encode pod_type_cat for Bayesian Ridge models
- Add pod_type to XGBoost/LightGBM feature orders with monotone constraints
- Add comprehensive tests for pod_type functionality
- Update Go types to include PodType field
@kaushikmitr
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2026
@kaushikmitr
Contributor

/approve

@RishabhSaini
Contributor Author

cc maintainers: @danehans @kfswain @ahg-g @nirrozenbaum for approval

@kfswain
Collaborator

kfswain commented Feb 2, 2026

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaushikmitr, kfswain, RishabhSaini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot merged commit fd2d71b into kubernetes-sigs:main Feb 2, 2026
13 checks passed
@RishabhSaini RishabhSaini deleted the slo-pd-support branch February 2, 2026 23:53
vishbhat pushed a commit to vishbhat/gateway-api-inference-extension that referenced this pull request Feb 3, 2026
…') (kubernetes-sigs#1993)


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. gie-area/predictor Categorizes an issue or PR as relevant to GIE predictor lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Predicted Latency based routing - support disagg scenarios

4 participants