Add pod_type categorical feature to latency prediction models#1993

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from RishabhSaini:slo-pd-support
Feb 2, 2026
Conversation

@RishabhSaini
Contributor

@RishabhSaini RishabhSaini commented Dec 12, 2025

Add pod_type support for prefill-decode (PD) disaggregated serving:

  • Added pod_type categorical feature (''=monolithic, 'prefill', 'decode') to both TTFT and TPOT prediction models
  • Implemented pod_type_cat encoding in feature preparation with proper categorical handling
  • Updated Bayesian Ridge models to one-hot encode pod_type (can't handle categoricals directly)
  • Modified XGBoost monotone constraints from 5 -> 6 features, with no constraint on pod_type
  • Added pod_type field to API models (PredictionRequest, TrainingEntry) in Python and Go
  • Ensured backward compatibility by defaulting to '' (monolithic) when pod_type is missing

Resolves: #1923

Links to related PRs:
llm-d/llm-d#596
llm-d/llm-d-inference-scheduler#564
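
As a rough illustration of the encoding described in the bullets above (helper and column names here are hypothetical, not the PR's actual code — the real logic lives in `_prepare_features_with_interaction`), a sketch might look like:

```python
import pandas as pd

POD_TYPES = ["", "prefill", "decode"]  # '' = monolithic (the default)

def encode_pod_type(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: map pod_type to an integer category code, defaulting
    missing values to '' (monolithic) for backward compatibility."""
    df = df.copy()
    if "pod_type" not in df.columns:
        df["pod_type"] = ""
    df["pod_type"] = df["pod_type"].fillna("")
    df["pod_type_cat"] = pd.Categorical(
        df["pod_type"], categories=POD_TYPES
    ).codes  # '' -> 0, 'prefill' -> 1, 'decode' -> 2
    return df

def one_hot_pod_type(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: Bayesian Ridge is linear and cannot consume raw category
    codes directly, so expand pod_type_cat into indicator columns."""
    for code, name in enumerate(POD_TYPES):
        df[f"pod_type_{name or 'monolithic'}"] = (
            df["pod_type_cat"] == code
        ).astype(int)
    return df
```

Tree models (XGBoost/LightGBM) can consume the integer code column directly; only the linear Bayesian Ridge path needs the one-hot expansion.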

@netlify

netlify bot commented Dec 12, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 1fd3277
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69812e3c87c6af0008468e65
😎 Deploy Preview https://deploy-preview-1993--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 12, 2025
@k8s-ci-robot
Contributor

Hi @RishabhSaini. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 12, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 12, 2025
@kfswain kfswain added the gie-area/predictor Categorizes an issue or PR as relevant to GIE predictor label Dec 15, 2025
@kfswain kfswain requested review from kaushikmitr and removed request for kfswain and nirrozenbaum December 15, 2025 23:13
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 16, 2025
@kaushikmitr
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 10, 2026
@RishabhSaini RishabhSaini changed the title SLO Aware Routing PD Disaggregation Support Added pod_type categorical feature to latency prediction models Jan 13, 2026
@RishabhSaini RishabhSaini changed the title Added pod_type categorical feature to latency prediction models Add pod_type categorical feature to latency prediction models Jan 13, 2026
@kaushikmitr
Contributor

@RishabhSaini this looks good; can you post the updated test_dual_server outputs here?

@RishabhSaini
Contributor Author

> kubectl logs -n llm-d-pd-r pods/latency-predictor-test-5gnnq
============================= test session starts ==============================
platform linux -- Python 3.9.25, pytest-8.4.2, pluggy-1.6.0 -- /usr/local/bin/python3.9
cachedir: .pytest_cache
rootdir: /app
plugins: asyncio-1.2.0, anyio-4.12.0
asyncio: mode=strict, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 32 items

test_dual_server_client.py::test_prediction_server_healthz Waiting for prediction server...
Waiting for training server...
PASSED
test_dual_server_client.py::test_training_server_healthz PASSED
test_dual_server_client.py::test_prediction_server_readyz PASSED
test_dual_server_client.py::test_training_server_readyz PASSED
test_dual_server_client.py::test_prediction_server_status Prediction server using model type: xgboost
Quantile: 0.9
Models ready: True
Models exist: {'ttft_model': True, 'tpot_model': True}
PASSED
test_dual_server_client.py::test_training_server_model_info Training server using model type: xgboost
PASSED
test_dual_server_client.py::test_training_server_models_list Model ttft: exists=True, size=143527 bytes
Model tpot: exists=True, size=143279 bytes
PASSED
test_dual_server_client.py::test_model_download_from_training_server Successfully downloaded ttft model (143527 bytes)
Successfully downloaded tpot model (143279 bytes)
PASSED
test_dual_server_client.py::test_lightgbm_endpoints_on_training_server Skipping LightGBM endpoint tests - not using LightGBM model
PASSED
test_dual_server_client.py::test_add_training_data_to_training_server Successfully sent training data to training server
PASSED
test_dual_server_client.py::test_prediction_server_model_sync Model reload result: synced=False, loaded=True
Prediction server models are ready!
PASSED
test_dual_server_client.py::test_prediction_via_prediction_server Prediction successful: TTFT=10.00ms, TPOT=10.00ms
Model type: xgboost
PASSED
test_dual_server_client.py::test_bulk_prediction_strict Testing bulk prediction strict endpoint...
✓ Bulk prediction strict endpoint test passed
PASSED
test_dual_server_client.py::test_bulk_prediction_with_validation_errors Testing bulk prediction validation error handling...
✓ Bulk prediction correctly failed when any request had validation errors
PASSED
test_dual_server_client.py::test_bulk_prediction_all_valid Testing bulk prediction with all valid requests...
✓ Bulk prediction succeeded with all valid requests
PASSED
test_dual_server_client.py::test_prediction_missing_prefix_cache_score ✓ Prediction correctly failed when prefix_cache_score was missing
PASSED
test_dual_server_client.py::test_prediction_with_pod_type_prefill Testing prediction with pod_type='prefill'...
✓ Prefill prediction: TTFT=10.00ms, TPOT=10.00ms
PASSED
test_dual_server_client.py::test_prediction_with_pod_type_decode Testing prediction with pod_type='decode'...
✓ Decode prediction: TTFT=10.00ms, TPOT=10.00ms
PASSED
test_dual_server_client.py::test_bulk_prediction_with_pod_type Testing bulk prediction with pod_type...
  Prefill: TTFT=10.00ms, TPOT=10.00ms
  Decode: TTFT=10.00ms, TPOT=10.00ms
  Legacy: TTFT=10.00ms, TPOT=10.00ms
✓ Bulk prediction with mixed pod types passed
PASSED
test_dual_server_client.py::test_training_data_with_pod_type Testing training data with pod_type...
✓ Successfully sent 20 training samples with pod_type
PASSED
test_dual_server_client.py::test_invalid_pod_type Testing invalid pod_type handling...
✓ Invalid pod_type accepted with fallback behavior (permissive validation)
PASSED
test_dual_server_client.py::test_training_server_metrics Training server metrics endpoint working correctly
✓ Prefix cache score feature found in metrics
PASSED
test_dual_server_client.py::test_model_consistency_between_servers Model type consistent across servers: xgboost
PASSED
test_dual_server_client.py::test_model_specific_endpoints_on_training_server Testing XGBoost tree endpoints on training server...
✓ TTFT XGBoost trees available: 200 trees
✓ TPOT XGBoost trees available: 200 trees
PASSED
test_dual_server_client.py::test_dual_server_quantile_regression_learns_distribution Relative-err accuracy (≤15%): 88.5%
Coverage: TTFT=0.875, TPOT=0.905 (target 0.900 ± 0.05)
PASSED
test_dual_server_client.py::test_prediction_server_stress_test Running prediction server stress test...
Waiting for 100007 prediction requests to complete...
Target QPS: 1000.0, Actual QPS: 1000.1

==================================================
PREDICTION SERVER STRESS TEST RESULTS
==================================================
Total Requests: 100007
Successful: 100007 (100.0%)
Failed: 0 (0.0%)
Average Response Time: 15.30ms

Model Types in Predictions:
  xgboost: 100007

Status Code Distribution:
  200: 100007

Response Time Percentiles:
  P50: 4.77ms
  P95: 75.67ms
  P99: 179.83ms
Prediction server stress test completed with 100.0% success rate
PASSED
test_dual_server_client.py::test_bulk_prediction_stress_test Running bulk prediction stress test...

Testing with batch size 5...
Waiting for 100000 bulk prediction requests to complete...
Target RPS: 1000.0, Actual RPS: 1000.0
Total Predictions: 500000, Predictions/sec: 5000.0

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 100000
Successful: 100000 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 500000
Total Batch Size: 500000
Average Response Time: 16.49ms
Average Batch Size: 5.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 100000

Response Time Percentiles:
  P50: 8.86ms
  P95: 67.90ms
  P99: 149.37ms
Bulk prediction stress test (batch size 5) completed with 100.0% success rate

Testing with batch size 10...
Waiting for 100001 bulk prediction requests to complete...
Target RPS: 1000.0, Actual RPS: 1000.0
Total Predictions: 1000010, Predictions/sec: 10000.1

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 100001
Successful: 100001 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 1000010
Total Batch Size: 1000010
Average Response Time: 33.80ms
Average Batch Size: 10.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 100001

Response Time Percentiles:
  P50: 8.99ms
  P95: 202.07ms
  P99: 299.30ms
Bulk prediction stress test (batch size 10) completed with 100.0% success rate

Testing with batch size 25...
Waiting for 99998 bulk prediction requests to complete...
Target RPS: 1000.0, Actual RPS: 1000.0
Total Predictions: 2499950, Predictions/sec: 24999.5

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 99998
Successful: 99998 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 2499950
Total Batch Size: 2499950
Average Response Time: 58.63ms
Average Batch Size: 25.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 99998

Response Time Percentiles:
  P50: 13.77ms
  P95: 277.73ms
  P99: 369.84ms
Bulk prediction stress test (batch size 25) completed with 100.0% success rate
PASSED
test_dual_server_client.py::test_large_batch_prediction_stress_test Running bulk prediction stress test...

Testing with batch size 1000...
Waiting for 9999 bulk prediction requests to complete...
Target RPS: 100.0, Actual RPS: 100.0
Total Predictions: 9999000, Predictions/sec: 99990.0

==================================================
BULK PREDICTION STRESS TEST RESULTS
==================================================
Total Bulk Requests: 9999
Successful: 9999 (100.0%)
Failed: 0 (0.0%)
Total Individual Predictions: 9999000
Total Batch Size: 9999000
Average Response Time: 38.42ms
Average Batch Size: 1000.0
Prediction Success Rate: 100.0%

Status Code Distribution:
  200: 9999

Response Time Percentiles:
  P50: 30.80ms
  P95: 82.13ms
  P99: 160.23ms
Bulk prediction stress test (batch size 1000) completed with 100.0% success rate
PASSED
test_dual_server_client.py::test_end_to_end_workflow Testing end-to-end workflow...
Step 1: Sending training data to training server...
Step 2: Waiting for training...
Step 3: Syncing models to prediction server...
Step 4: Making predictions...
  Prediction 1: TTFT=1001.76ms, TPOT=300.85ms (prefix_cache=0.44)
  Prediction 2: TTFT=646.67ms, TPOT=242.58ms (prefix_cache=0.48)
  Prediction 3: TTFT=1384.65ms, TPOT=414.87ms (prefix_cache=0.55)
  Prediction 4: TTFT=1388.67ms, TPOT=412.35ms (prefix_cache=0.28)
  Prediction 5: TTFT=1513.01ms, TPOT=452.89ms (prefix_cache=0.33)
✓ End-to-end workflow completed successfully!
PASSED
test_dual_server_client.py::test_server_configuration Testing server configuration...
Prediction server: HTTP-based Quantile Latency Predictor is running
  Model type: xgboost
  Is ready: True
  Sync interval: 10s
  Training server URL: http://training-service:8000
Training server: Latency Predictor is running.
  Model type: xgboost
PASSED
test_dual_server_client.py::test_training_server_flush_api Testing training server flush API...
Step 1: Checking initial data status...
  Initial training samples: TTFT=2772, TPOT=2763
  Initial test samples: TTFT=318, TPOT=317
Step 2: Adding training data...
  Added 100 training samples
Step 3: Verifying data was added...
  After adding - Training: 5721, Test: 649, Total: 6370
Step 4: Testing flush with only training data...
  Flushed 2865 TTFT training samples
  Flushed 2856 TPOT training samples
  Test samples flushed: 0 TTFT, 0 TPOT (should be 0)
  After training flush - Training: 0, Test: 649
Step 5: Adding more training data...
Step 6: Testing flush everything...
  Complete flush message: Successfully flushed: 44 TTFT and 44 TPOT training samples, 331 TTFT and 330 TPOT test samples, all metric scores
  After complete flush - Training: 0, Test: 0
Step 7: Testing default flush (no body)...
  Default flush result: Successfully flushed: 16 TTFT and 16 TPOT training samples, 4 TTFT and 4 TPOT test samples, all metric scores
Step 8: Testing flush with only test data...
  Test data flush: 3 TTFT, 3 TPOT
  After test flush - Training: 94, Test: 0
Step 9: Testing bucket distribution in status...
  Bucket distribution available: 0 buckets with data
✓ Flush API tests passed!
PASSED
test_dual_server_client.py::test_training_server_flush_error_handling Testing flush API error handling...
✓ Invalid JSON handled correctly
✓ Flush error handling tests passed!
PASSED

======================== 32 passed in 563.81s (0:09:23) ========================

to both TTFT and TPOT prediction models

- Add pod_type field to PredictionRequest and TrainingEntry models
- Encode pod_type as categorical in _prepare_features_with_interaction
- Handle pod_type_cat in both TTFT and TPOT feature columns
- One-hot encode pod_type_cat for Bayesian Ridge models
- Add pod_type to XGBoost/LightGBM feature orders with monotone constraints
- Add comprehensive tests for pod_type functionality
- Update Go types to include PodType field
@kaushikmitr
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2026
@kaushikmitr
Contributor

/approve

@RishabhSaini
Contributor Author

cc maintainers: @danehans @kfswain @ahg-g @nirrozenbaum for approval

@kfswain
Collaborator

kfswain commented Feb 2, 2026

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaushikmitr, kfswain, RishabhSaini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot merged commit fd2d71b into kubernetes-sigs:main Feb 2, 2026
13 checks passed
@RishabhSaini RishabhSaini deleted the slo-pd-support branch February 2, 2026 23:53
vishbhat pushed a commit to vishbhat/gateway-api-inference-extension that referenced this pull request Feb 3, 2026
…') (kubernetes-sigs#1993)


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. gie-area/predictor Categorizes an issue or PR as relevant to GIE predictor lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Predicted Latency based routing - support disagg scenarios

4 participants