
Add program aware plugin#707

Draft
praveingk wants to merge 41 commits into llm-d:main from praveingk:program-aware-plugin

Conversation

@praveingk

This PR introduces a program-aware plugin that identifies requests from an agentic program by their program ID (the x-gateway-inference-fairness-id header) and makes scheduling decisions based on program-level metrics. The plugin also captures program-level metrics and exports them to the Prometheus endpoint.

The program-aware plugin implements the following plugin interfaces:

  1. Prepare Data Interface: Extracts program information from request headers, subsequently updates the relevant program metrics and request metadata. The plugin assumes that a request arrives with a fairness ID (x-gateway-inference-fairness-id) to identify an agentic program.

  2. Fairness Interface (Flow Control): Implements the Pick interface of the flow-control fairness plugin. Multiple configurable strategies are supported; two are currently implemented: a simple EWMA-based strategy and Deficit Round Robin (DRR).

  3. Pre Request Interface: Updates program metrics immediately before dispatch. For example, this hook calculates the per-program wait time of requests (time spent in the EPP queue) and keeps track of requests sent to the vLLM inference pod. In the future, this hook could also be used to add a vLLM priority to the request.

  4. Response Received Interface: Updates program metrics like deficit counters (for DRR) for tokens used.
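As a rough illustration of the first and third hooks above, the following sketch reads the fairness ID from request headers and accumulates per-program wait time at dispatch. Only the header name comes from this PR; `tracker`, `programStats`, `programID`, the `"default"` fallback, and the method names are hypothetical and do not reflect the plugin's actual types.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// fairnessHeader is the header named in the PR description.
const fairnessHeader = "x-gateway-inference-fairness-id"

// programStats is a hypothetical per-program record.
type programStats struct {
	requests  int
	totalWait time.Duration
}

// tracker is a hypothetical stand-in for the plugin's per-program state.
type tracker struct {
	mu       sync.Mutex
	programs map[string]*programStats
}

func newTracker() *tracker {
	return &tracker{programs: make(map[string]*programStats)}
}

// programID looks up the fairness ID case-insensitively.
// Falling back to "default" is an assumption made for this sketch.
func programID(headers map[string]string) string {
	for name, value := range headers {
		if strings.EqualFold(name, fairnessHeader) && value != "" {
			return value
		}
	}
	return "default"
}

// onPreRequest records how long a request waited before dispatch.
func (t *tracker) onPreRequest(id string, wait time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s, ok := t.programs[id]
	if !ok {
		s = &programStats{}
		t.programs[id] = s
	}
	s.requests++
	s.totalWait += wait
}

// averageWait returns the mean wait for a program, or 0 if it is unknown.
func (t *tracker) averageWait(id string) time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	s, ok := t.programs[id]
	if !ok || s.requests == 0 {
		return 0
	}
	return s.totalWait / time.Duration(s.requests)
}

func main() {
	id := programID(map[string]string{"X-Gateway-Inference-Fairness-ID": "agent-42"})
	tr := newTracker()
	tr.onPreRequest(id, 10*time.Millisecond)
	tr.onPreRequest(id, 30*time.Millisecond)
	fmt.Println(id, tr.averageWait(id))
}
```

The mutex is there because hooks of this kind are typically invoked from concurrent request goroutines; whether the plugin uses a mutex, atomics, or something else is not stated in the description.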

We are currently evaluating the scheduling strategies using inference-perf benchmarks.

@github-actions

🚨 Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

@github-actions github-actions bot requested review from elevran and nilig March 11, 2026 09:46
@praveingk praveingk force-pushed the program-aware-plugin branch from c170ff9 to bea0ea5 Compare March 11, 2026 09:54
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
@praveingk praveingk force-pushed the program-aware-plugin branch from bea0ea5 to 5883271 Compare March 11, 2026 10:02
D-Sai-Venkatesh and others added 6 commits March 11, 2026 15:38
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
…plete

The x-program-id response header is not consumed by any downstream code.
Remove the ResponseReceived implementation, its interface assertion, and
associated tests. Add a nil-request guard to ResponseComplete to match
the upstream framework pattern.

Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
@praveingk praveingk force-pushed the program-aware-plugin branch from 5883271 to 6b202e2 Compare March 11, 2026 10:12
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
// For each queue in the band, the configured ScoringStrategy is given a chance
// to update its per-program state (OnPickStart), then the queue with the highest
// score is selected for dispatch.
func (p *ProgramAwarePlugin) Pick(_ context.Context, band flowcontrol.PriorityBandAccessor) (flowcontrol.FlowQueueAccessor, error) {
Contributor


question: while it's true you have no consumers of the ctx in this method, should we still perform this work if ctx.Err() is non-nil?
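One way the guard raised in this question could look is sketched below. This is not code from the PR: `pickGuarded`, `runPick`, and the score map are hypothetical, and whether an early return is appropriate depends on the fairness-policy contract.

```go
package main

import (
	"context"
	"fmt"
)

// pickGuarded is a hypothetical picker that skips scoring once ctx is done.
// It returns the name with the highest score, or ctx.Err() if cancelled.
func pickGuarded(ctx context.Context, scores map[string]float64) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err // cancelled or timed out: do no scoring work
	}
	best, bestScore, found := "", 0.0, false
	for name, score := range scores {
		if !found || score > bestScore {
			best, bestScore, found = name, score, true
		}
	}
	return best, nil
}

// runPick builds a context, optionally cancels it, then calls pickGuarded.
func runPick(cancelFirst bool) (string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	return pickGuarded(ctx, map[string]float64{"prog-a": 1.5, "prog-b": 3.0})
}

func main() {
	fmt.Println(runPick(false))
	fmt.Println(runPick(true))
}
```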


	if band == nil {
		return nil, nil //nolint:nilnil
	}
Contributor


question: a nil band is not an error? This means the caller must nil-check the returned queue even when err == nil. Also, won't this give inaccurate latency metrics for work that was not done?



We return (nil, nil) instead of (nil, err) because that is how the contract for fairness policies is defined. The existing fairness plugins follow the same convention, for example round-robin.

Behaviourally, the two cases are the same: no item is dispatched from that band. The only difference is that in the error case the error is logged.
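The convention described in this reply can be sketched from the caller's side as follows. The `queue` and `band` types are simplified, hypothetical stand-ins for `flowcontrol.FlowQueueAccessor` and `flowcontrol.PriorityBandAccessor`, and the "longest queue wins" rule is only a placeholder for the real scoring.

```go
package main

import (
	"errors"
	"fmt"
)

// queue and band are hypothetical, simplified stand-ins for the framework types.
type queue struct {
	name   string
	length int
}

type band struct {
	queues []*queue
}

var errBrokenBand = errors.New("band has a nil queue")

// pick returns (nil, nil) when there is nothing to dispatch and
// (nil, err) only when something is actually wrong.
func pick(b *band) (*queue, error) {
	if b == nil {
		return nil, nil // not an error: simply nothing to pick
	}
	var best *queue
	for _, q := range b.queues {
		if q == nil {
			return nil, errBrokenBand
		}
		// Placeholder policy: prefer the longest non-empty queue.
		if q.length > 0 && (best == nil || q.length > best.length) {
			best = q
		}
	}
	return best, nil
}

// dispatch shows the caller side: both nil cases dispatch nothing,
// but only the error case is reported.
func dispatch(b *band) string {
	q, err := pick(b)
	switch {
	case err != nil:
		return "error logged: " + err.Error()
	case q == nil:
		return "nothing dispatched"
	default:
		return "dispatched from " + q.name
	}
}

func main() {
	fmt.Println(dispatch(nil))
	fmt.Println(dispatch(&band{queues: []*queue{{name: "a", length: 1}, {name: "b", length: 3}}}))
	fmt.Println(dispatch(&band{queues: []*queue{nil}}))
}
```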

"sync/atomic"
)

const ewmaAlpha = 0.2
Contributor


question: how did you arrive at 0.2 for the EWMA alpha? is this based on benchmarking, or should it be configurable via the plugin config alongside strategy?

Author


We are still experimenting with the parameters. Ideally, we can make it a plugin config once we finalize the programmable parameters for each strategy.
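For reference, here is a minimal sketch of an EWMA update with a configurable smoothing factor that defaults to the 0.2 discussed in this thread. The `ewma` type, `newEWMA`, and treating 0 as "use the default" are assumptions for illustration, not the plugin's config schema.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultEWMAAlpha mirrors the constant under discussion in this thread.
const defaultEWMAAlpha = 0.2

// ewma is a hypothetical holder for a smoothed per-program metric.
type ewma struct {
	alpha  float64
	value  float64
	primed bool
}

// newEWMA treats alpha == 0 as "not configured" and falls back to the default.
// Any other value must lie in (0, 1].
func newEWMA(alpha float64) (*ewma, error) {
	if alpha == 0 {
		alpha = defaultEWMAAlpha
	}
	if !(alpha > 0 && alpha <= 1) {
		return nil, errors.New("ewma alpha must be in (0, 1]")
	}
	return &ewma{alpha: alpha}, nil
}

// observe folds a sample into the average:
// value = alpha*sample + (1-alpha)*value.
// The first sample seeds the average directly.
func (e *ewma) observe(sample float64) float64 {
	if !e.primed {
		e.value, e.primed = sample, true
		return e.value
	}
	e.value = e.alpha*sample + (1-e.alpha)*e.value
	return e.value
}

func main() {
	e, err := newEWMA(0) // falls back to 0.2
	if err != nil {
		panic(err)
	}
	e.observe(100)
	fmt.Printf("%.1f\n", e.observe(200))
}
```

With alpha at 0.2, a jump from 100 to 200 moves the average to about 120, so each new sample contributes one fifth of its difference from the running value. A larger alpha reacts faster to bursts but is noisier.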

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
D-Sai-Venkatesh and others added 11 commits April 8, 2026 09:49
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
…d-inference-scheduler into program-aware-plugin-test
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
elevran pushed a commit to elevran/llm-d-inference-scheduler that referenced this pull request Apr 8, 2026
Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
D-Sai-Venkatesh and others added 6 commits April 8, 2026 16:06
Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
Add cycle-aware quantum allocation to DRRStrategy. The first Pick() in
a dispatch cycle allocates quantum to all non-empty queues; subsequent
Pick() calls only allocate to unseen programs. OnPreRequest resets the
cycle flag so the next dispatch gets fresh quantum.

Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
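The cycle-aware allocation described in the commit message above might be sketched as follows. This is a reading of the commit text, not the PR's code: `drrState`, its fields, and the fixed integer quantum are hypothetical.

```go
package main

import "fmt"

// drrState is a hypothetical, simplified model of the DRR bookkeeping.
type drrState struct {
	quantum int
	deficit map[string]int  // per-program deficit counters
	seen    map[string]bool // programs already topped up in this cycle
	inCycle bool            // true between the first Pick and the next dispatch
}

func newDRRState(quantum int) *drrState {
	return &drrState{
		quantum: quantum,
		deficit: map[string]int{},
		seen:    map[string]bool{},
	}
}

// onPickStart receives the programs whose queues are non-empty.
// The first call in a cycle tops up every one of them; later calls in
// the same cycle only top up programs not yet seen.
func (d *drrState) onPickStart(nonEmpty []string) {
	if !d.inCycle {
		d.seen = map[string]bool{}
		d.inCycle = true
	}
	for _, program := range nonEmpty {
		if d.seen[program] {
			continue
		}
		d.deficit[program] += d.quantum
		d.seen[program] = true
	}
}

// onPreRequest marks a real dispatch, so the next Pick starts a new cycle.
func (d *drrState) onPreRequest() {
	d.inCycle = false
}

func main() {
	d := newDRRState(100)
	d.onPickStart([]string{"a", "b"})      // new cycle: a and b get quantum
	d.onPickStart([]string{"a", "b", "c"}) // same cycle: only c is new
	d.onPreRequest()                       // dispatch ends the cycle
	d.onPickStart([]string{"a"})           // new cycle: a gets quantum again
	fmt.Println(d.deficit["a"], d.deficit["b"], d.deficit["c"]) // prints "200 100 100"
}
```

Without the cycle flag, repeated Pick() calls that do not lead to a dispatch would keep adding quantum to the same programs and inflate their deficit counters.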
Move token extraction from ResponseComplete into each strategy's
OnCompleted, making the hook signature symmetric with OnPreRequest.

Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
When enabled, Pick() stores the selected program in pendingCursor
instead of advancing lastSelected. OnPreRequest() commits it after
a real dispatch, preventing speculative picks from moving the cursor.

Signed-off-by: Dasari Surya Sai Venkatesh <suryasai.venkatesh@gmail.com>
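A minimal sketch of the deferred cursor commit described in the commit message above, assuming a plain round-robin order over program names. Only `pendingCursor` and `lastSelected` are names from the commit text; the `cursor` type, the `deferCommit` flag, and the selection rule are hypothetical.

```go
package main

import "fmt"

// cursor is a hypothetical round-robin cursor with an optional deferred commit.
type cursor struct {
	deferCommit   bool   // when true, pick does not advance lastSelected
	lastSelected  string // last program that was actually dispatched
	pendingCursor string // speculative selection awaiting a real dispatch
}

// pick returns the program that follows lastSelected in the given order,
// or the first program if lastSelected is not present.
func (c *cursor) pick(programs []string) string {
	if len(programs) == 0 {
		return ""
	}
	next := programs[0]
	for i, p := range programs {
		if p == c.lastSelected {
			next = programs[(i+1)%len(programs)]
			break
		}
	}
	if c.deferCommit {
		c.pendingCursor = next // remember, but do not move the cursor yet
	} else {
		c.lastSelected = next
	}
	return next
}

// onPreRequest commits the pending selection after a real dispatch.
func (c *cursor) onPreRequest() {
	if c.deferCommit && c.pendingCursor != "" {
		c.lastSelected = c.pendingCursor
		c.pendingCursor = ""
	}
}

func main() {
	programs := []string{"a", "b", "c"}
	c := &cursor{deferCommit: true}
	fmt.Println(c.pick(programs)) // speculative pick
	fmt.Println(c.pick(programs)) // same result: cursor has not moved
	c.onPreRequest()              // real dispatch commits the cursor
	fmt.Println(c.pick(programs)) // now advances to the next program
}
```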


3 participants