Commit 751c465

docs: update epp design logics (#895)
**Description**

1. This PR updates the EPP integration design logic.
2. This PR adds the request flow diagram for EPP integration.

Signed-off-by: bitliu <[email protected]>
Signed-off-by: Xunzhuo <[email protected]>
1 parent 1f6961d commit 751c465


docs/proposals/003-epp-integration-proposal/proposal.md

Lines changed: 229 additions & 12 deletions
@@ -32,6 +32,8 @@ This is a core functionality in EAGW's vision, making the routing more intelligent
## Goals

+ Integrate with EPP to expand Envoy AI Gateway's capabilities
+ Integrate well with the existing CRDs and features
+ Support InferencePool in AIGatewayRoute
+ Support InferencePool in HTTPRoute

## Background

@@ -260,7 +262,6 @@ We will adopt **Option 1: Add InferencePool as a backendRef on AIGatewayRoute Le

This approach is preferred because InferencePool resources do not require BackendSecurityPolicy or schema configuration. The implementation assumes OpenAI format compatibility, which aligns with the Gateway API Inference Extension (GAIE) design principles.

##### Example

+ When the model matches gpt-4o-mini, the request goes to AIServiceBackend `envoy-ai-gateway-basic-openai`
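
For illustration, such a rule can be sketched as an AIGatewayRoute fragment (the field layout follows the full examples later in this proposal; the backend name is taken from the bullet above):

```yaml
# Sketch: route requests whose x-ai-eg-model header matches gpt-4o-mini
# to the AIServiceBackend named envoy-ai-gateway-basic-openai.
rules:
  - matches:
      - headers:
          - type: Exact
            name: x-ai-eg-model
            value: gpt-4o-mini
    backendRefs:
      - name: envoy-ai-gateway-basic-openai
```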
@@ -350,12 +351,13 @@ For the initial implementation, we will adopt the **static approach** to manage

This approach aligns with industry practices where external inference framework controllers typically manage EPP deployment logic. For reference, KServe implements EPP deployment through their `LLMInferenceService` API, demonstrating that EPP lifecycle management is better handled at the inference framework level rather than within Envoy AI Gateway. See [KServe LLMInferenceService](https://github.com/kserve/kserve/blob/master/pkg/apis/serving/v1alpha1/llm_inference_service_types.go#L171) for implementation details.

#### Working with Envoy Gateway

There are several work-in-progress PRs upstream:

+ https://github.com/envoyproxy/gateway/pull/6271
+ https://github.com/envoyproxy/gateway/pull/6342
+ https://github.com/envoyproxy/gateway/pull/6524

##### Backend + EnvoyExtensionPolicy

@@ -488,7 +490,9 @@ spec:
messageTimeout: 5s
```

This direction is to reuse the capabilities of Envoy Gateway and generate the Backend and EnvoyExtensionPolicy to manage the InferencePool.

However, it cannot provide rule-level InferencePool support: EnvoyExtensionPolicy can only target `HTTPRoute`/`Gateway`, so it cannot support multiple `InferencePool` resources or mixed use of `AIServiceBackend` within a single AIGatewayRoute.
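
For illustration, a sketch of the generated policy under this direction (field names follow Envoy Gateway's EnvoyExtensionPolicy API; the policy name is hypothetical, while the EPP service name, port, and timeout are taken from examples elsewhere in this proposal). Because the policy attaches to a whole `HTTPRoute`, the extproc applies to every rule under it:

```yaml
# Sketch: EnvoyExtensionPolicy wiring the EPP in as an extProc
# for the whole HTTPRoute (no per-rule targeting is possible).
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: epp-extproc            # illustrative name
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute          # route-level only
      name: inference-pool-with-httproute
  extProc:
    - backendRefs:
        - name: vllm-llama3-8b-instruct-epp
          port: 9002
      messageTimeout: 5s
```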

##### EnvoyExtensionServer

@@ -500,15 +504,18 @@ Cluster Modify Workflow is like:

**Envoy Gateway**

1. Enable the XDSCluster- and XDSRoute-level XDSTranslatorHooks, and define the custom backend resource (the InferencePool CRD) in the Envoy Gateway configuration
2. Envoy Gateway starts watching InferencePools
3. If an HTTPRoute references a resource with the matching GVK, carry it in the ExtensionRefs IR
4. During xDS translation, Envoy Gateway checks whether ExtensionRefs > 0; if so, it calls the PostClusterModifyHook and passes the unstructuredResources (InferencePool) to Envoy AI Gateway

**Envoy AI Gateway**

PostClusterModify Hook logic:

1. Implement the PostClusterModifyHook, which iterates over the unstructuredResources to group the InferencePools (only one InferencePool per route rule is supported)
2. Modify the cluster type to ORIGINAL_DST and add the original_dst_lb_config
3. Send it back to Envoy Gateway

```yaml
type: ORIGINAL_DST
@@ -519,8 +526,27 @@ Cluster Modify Workflow is like:
lb_policy: CLUSTER_PROVIDED
```
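
A minimal sketch of the resulting cluster, assuming the picked endpoint is carried in the `x-gateway-destination-endpoint` header (as described in the workflow section; the cluster name and timeout are illustrative):

```yaml
# Sketch: Original Destination cluster that reads the upstream address
# from the header populated by the EPP.
name: inference-pool-cluster        # illustrative name
type: ORIGINAL_DST
lb_policy: CLUSTER_PROVIDED
connect_timeout: 10s
original_dst_lb_config:
  use_http_header: true
  http_header_name: x-gateway-destination-endpoint
```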

PostRouteModify Hook logic:

1. Modify the route metadata with the InferencePool info
2. Send it back to Envoy Gateway

```yaml
metadata:
  filter_metadata:
    aigateway.envoy.io:
      per_route_rule_inference_pool: default/vllm-llama3-8b-instruct/vllm-llama3-8b-instruct-epp/9002
```

PostTranslateModify Hook logic:

1. Find the InferencePool-relevant listeners based on route metadata.
2. Insert the EPP extproc configuration into those listeners.
3. Find unrelated routes under the relevant listeners.
4. Insert per-route extproc configuration to disable extproc on those routes.
5. Send it back to Envoy Gateway.

After all hook logic runs, the Envoy Gateway xDS server pushes the configuration to EnvoyProxy.
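
The listener-level insertion and per-route disabling above can be sketched with Envoy's ext_proc filter and its per-route override (the filter and type names are Envoy's; the cluster name is illustrative):

```yaml
# Sketch: EPP extproc inserted into the listener's HTTP filter chain...
http_filters:
  - name: envoy.filters.http.ext_proc
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
      grpc_service:
        envoy_grpc:
          cluster_name: epp-extproc-cluster   # illustrative
# ...and disabled again on routes that do not use the InferencePool:
typed_per_filter_config:
  envoy.filters.http.ext_proc:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExtProcPerRoute
    disabled: true
```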

#### Conclusion

@@ -531,7 +557,81 @@ We will adopt the **EnvoyExtensionServer approach** for integrating with Envoy G
+ **Conformance**: Enables passing Gateway API conformance tests without requiring modifications ([#648](https://github.com/envoyproxy/ai-gateway/issues/648))
+ **Maintainability**: Reduces coupling with upstream Envoy Gateway API changes

We can natively support both HTTPRoute + InferencePool and AIGatewayRoute + InferencePool.

An example configuration for AIGatewayRoute + InferencePool:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: inference-pool-with-aigwroute
  namespace: default
spec:
  schema:
    name: OpenAI
  targetRefs:
    - name: inference-pool-with-aigwroute
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: meta-llama/Llama-3.1-8B-Instruct
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: mistral:latest
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mistral
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: some-cool-self-hosted-model
      backendRefs:
        - name: envoy-ai-gateway-basic-testupstream
```

An example configuration for HTTPRoute + InferencePool:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-pool-with-httproute
  namespace: default
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-pool-with-httproute
      namespace: default
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
          namespace: default
          weight: 1
      matches:
        - path:
            type: PathPrefix
            value: /
      timeouts:
        request: 60s
```
## Logic Workflow

The complete integration workflow follows these steps:

@@ -540,13 +640,130 @@ The complete integration workflow follows these steps:
3. **InferencePool Configuration**: Create InferencePool resource referencing the external processing service
4. **Route Configuration**: Configure InferencePool as AIGatewayRoute backend (limited to one InferencePool per route rule)
5. **HTTPRoute Generation**: Envoy AI Gateway synchronizes configuration to managed HTTPRoute with InferencePool BackendRef
6. **Invoke Cluster/Route/Listener Hooks**: Envoy Gateway invokes the PostClusterModify/PostRouteModify/PostTranslateModify hooks
7. **Cluster/Route/Listener Configuration**: Envoy AI Gateway configures the Original Destination cluster with `x-gateway-destination-endpoint` header matching, and modifies listeners and routes in a reverse approach (enabling the EPP extproc at the listener level and disabling it on unrelated routes to prevent clearRouteCache's impact)
8. **Request Processing**: Client requests flow through EnvoyProxy to the EPP service, which adds destination headers and metadata for endpoint selection

The logic flow for AIGatewayRoute + InferencePool:

```mermaid
sequenceDiagram
    participant AIServer as Kubernetes API Server
    participant EAGC as Envoy AI Gateway Controller
    participant EG as Envoy Gateway
    participant EAGES as Envoy AI Gateway Extension Server
    participant Proxy as EnvoyProxy

    AIServer->>EAGC: List/Watch Envoy AI Gateway CRDs (AIGatewayRoute, AIServiceBackend)
    EAGC-->>AIServer: Generate Envoy Gateway CRDs (HTTPRoute, EnvoyExtensionPolicy)
    AIServer->>EG: List/Watch Envoy Gateway CRDs (Gateway/HTTPRoute/EnvoyExtensionPolicy/Backend)
    Note over EG: Translate CRDs to xDS configuration
    Note over EAGES: Implements PostRouteModify/PostClusterModify/PostTranslateModify Hooks
    EG->>EAGES: Invoke PostRouteModify Hook
    EAGES-->>EG: Modify Envoy Route config (add EPP metadata)
    EG->>EAGES: Invoke PostClusterModify Hook
    EAGES-->>EG: Modify Envoy Cluster config (HostOverride LB policy or Original Dst)
    EG->>EAGES: Invoke PostTranslateModify Hook
    EAGES-->>EG: Modify Envoy Listener and Route config (insert EPP extproc in relevant listeners, disable it on unrelated routes)
    EG->>Proxy: Push final xDS configuration
    Note over Proxy: Ready to forward downstream requests
```

The logic flow for HTTPRoute + InferencePool:

```mermaid
sequenceDiagram
    participant AIServer as Kubernetes API Server
    participant EG as Envoy Gateway
    participant EAGES as Envoy AI Gateway Extension Server
    participant Proxy as EnvoyProxy

    AIServer->>EG: List/Watch Envoy Gateway CRDs (Gateway/HTTPRoute/EnvoyExtensionPolicy/Backend)
    Note over EG: Translate CRDs to xDS configuration
    Note over EAGES: Implements PostRouteModify/PostClusterModify/PostTranslateModify Hooks
    EG->>EAGES: Invoke PostRouteModify Hook
    EAGES-->>EG: Modify Envoy Route config (add EPP metadata)
    EG->>EAGES: Invoke PostClusterModify Hook
    EAGES-->>EG: Modify Envoy Cluster config (HostOverride LB policy or Original Dst)
    EG->>EAGES: Invoke PostTranslateModify Hook
    EAGES-->>EG: Modify Envoy Listener and Route config (insert EPP extproc in relevant listeners, disable it on unrelated routes)
    EG->>Proxy: Push final xDS configuration
    Note over Proxy: Ready to forward downstream requests
```

## Request Flow

The request flow for AIGatewayRoute + InferencePool:

```mermaid
sequenceDiagram
    participant Client as Client (OpenAI SDK)
    participant Envoy as Envoy Proxy
    participant RLS as Rate Limit Service
    participant Processor as AI Gateway External Processor
    participant EPP as EndpointPicker
    participant SelfHosted as Self-Hosted Models

    Client->>Envoy: Request
    Envoy->>RLS: Check Rate Limit
    RLS-->>Envoy: ;
    Envoy->>Processor: Router-level ExtProc Request
    Note over Processor: Extract Model Name & Routing
    Processor-->>Envoy: ClearRouteCache
    Envoy->>EPP: Router-level ExtProc Request
    Note over EPP: Pick Endpoint in InferencePool
    EPP-->>Envoy: Add Picked Endpoint in Header and Metadata

    loop Retry/Fallback loop
        Note over Envoy: Forward Based on Picked Endpoint (Original Dst or HostOverride LB Policy)
        Envoy->>Processor: Upstream-level ExtProc Request
        Note over Processor: Request Transform
        Processor-->>Envoy: ;
        Envoy->>SelfHosted: Forward Request
        SelfHosted-->>Envoy: Response
    end

    Envoy->>Processor: Process Response
    Note over Processor: Response Transform & Extract Token Usage
    Processor-->>Envoy: Add Usage Metadata
    Envoy->>RLS: Reduce Rate Limit Budget
    RLS-->>Envoy: ;
    Envoy->>Client: Response
```

The request flow for HTTPRoute + InferencePool:

```mermaid
sequenceDiagram
    participant Client as Client (OpenAI SDK)
    participant Envoy as Envoy Proxy
    participant EPP as EndpointPicker
    participant SelfHosted as Self-Hosted Models

    Client->>Envoy: Request
    Envoy->>EPP: Router-level ExtProc Request
    Note over EPP: Pick Endpoint in InferencePool
    EPP-->>Envoy: Add Picked Endpoint in Header and Metadata

    loop Retry/Fallback loop
        Note over Envoy: Forward Based on Picked Endpoint (Original Dst or HostOverride LB Policy)
        Envoy->>SelfHosted: Forward Request
        SelfHosted-->>Envoy: Response
    end

    Envoy->>Client: Response
```

## Implementation Considerations and Limitations

The current implementation supports InferencePool with both AIGatewayRoute and HTTPRoute.

However, AIGatewayRoute + InferencePool has some advantages:

+ **Native OpenAI Support**: we can use the native OpenAI spec to call different InferencePools on the same listener, because we parse the model from the request body and refresh the route cache with the route-level AI Gateway extproc. Plain HTTPRoute cannot do that: it has to add extra route matches (e.g. headerMatch) on the same listener, or split the pools across separate listeners.
+ **Token Rate Limit**: for self-hosted models, we can extract the token usage from the response and reduce the rate-limit budget accordingly.
+ **Advanced Observability**: we can expose more metrics in the AI Gateway ext-proc.
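
As a sketch of the token rate-limit direction, a cost can be declared on the AIGatewayRoute so a rate limiter can consume it from dynamic metadata (the metadata key name below is an illustrative assumption, as is the downstream rate-limit wiring, which is omitted):

```yaml
# Hypothetical fragment: declare a token-usage cost on the AIGatewayRoute.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: inference-pool-with-aigwroute
  namespace: default
spec:
  llmRequestCosts:
    - metadataKey: llm_output_token   # illustrative key name
      type: OutputToken               # count output tokens from the response
```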
550767
### Load Balancing Policy
551768

552769
The initial implementation will use **Original Destination** cluster configuration for endpoint selection. Future iterations may consider **Host Override** policy as an alternative approach based on performance and operational requirements.
