This is a core functionality in EAGW's vision, making the routing more intelligent.
## Goals

- Integrate with EPP to expand Envoy AI Gateway's capabilities
- Integrate well with the existing CRDs and features
- Support InferencePool in AIGatewayRoute
- Support InferencePool in HTTPRoute
## Background
We will adopt **Option 1: Add InferencePool as a backendRef on AIGatewayRoute Level**.
This approach is preferred because InferencePool resources do not require BackendSecurityPolicy or schema configuration. The implementation assumes OpenAI format compatibility, which aligns with the Gateway API Inference Extension (GAIE) design principles.
##### Example
When the request matches `gpt-4o-mini`, it goes to the AIServiceBackend `envoy-ai-gateway-basic-openai`.
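To illustrate, a rough sketch of such a route, following the upstream basic-usage examples. The second rule and all resource names are illustrative, and attachment field names such as `parentRefs` may differ across versions:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: envoy-ai-gateway-basic
spec:
  schema:
    name: OpenAI
  parentRefs:
    - name: envoy-ai-gateway-basic
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    # gpt-4o-mini is served by the OpenAI AIServiceBackend.
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: gpt-4o-mini
      backendRefs:
        - name: envoy-ai-gateway-basic-openai
    # A self-hosted model is served by an InferencePool instead (illustrative).
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: llama-3-8b
      backendRefs:
        - name: llama-3-8b-pool
          group: inference.networking.x-k8s.io
          kind: InferencePool
```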
For the initial implementation, we will adopt the **static approach** to manage the EPP deployment.
This approach aligns with industry practices where external inference framework controllers typically manage EPP deployment logic. For reference, KServe implements EPP deployment through their `LLMInferenceService` API, demonstrating that EPP lifecycle management is better handled at the inference framework level rather than within Envoy AI Gateway. See [KServe LLMInferenceService](https://github.com/kserve/kserve/blob/master/pkg/apis/serving/v1alpha1/llm_inference_service_types.go#L171) for implementation details.
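For context, in the static approach the user (or an external controller) deploys the EPP themselves, and the InferencePool simply references that EPP Service. A minimal sketch following the GAIE v1alpha2 schema, with all names illustrative:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  # Pods serving the model; the pool selects them directly.
  selector:
    app: vllm-llama3-8b-instruct
  targetPortNumber: 8000
  # The EPP Deployment/Service is managed outside Envoy AI Gateway.
  extensionRef:
    name: vllm-llama3-8b-instruct-epp
```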
#### Working with Envoy Gateway
There are some work-in-progress PRs upstream:
- https://github.com/envoyproxy/gateway/pull/6271
- https://github.com/envoyproxy/gateway/pull/6342
- https://github.com/envoyproxy/gateway/pull/6524
##### Backend + EEP
```yaml
# (EnvoyExtensionPolicy example truncated; only the tail survives)
spec:
  # ...
  messageTimeout: 5s
```
This direction reuses the capabilities of Envoy Gateway, generating a Backend and an EnvoyExtensionPolicy to handle the InferencePool.
However, it cannot provide rule-level InferencePool support: EnvoyExtensionPolicy can only target `HTTPRoute`/`Gateway`, so it cannot support multiple `InferencePool` resources, or a mix of `InferencePool` and `AIServiceBackend`, within one AIGatewayRoute.
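For illustration, the generated policy could look roughly like the sketch below. Resource names are hypothetical; `extProc`, `processingMode`, and `messageTimeout` are Envoy Gateway EnvoyExtensionPolicy fields:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: epp-for-inference-route
spec:
  # Can only target a whole HTTPRoute or Gateway, hence no rule-level support.
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: inference-route
  extProc:
    - backendRefs:
        - name: vllm-llama3-8b-instruct-epp
          port: 9002
      processingMode:
        request:
          body: Buffered
      messageTimeout: 5s
```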
##### EnvoyExtensionServer
The Cluster Modify workflow is as follows:
**Envoy Gateway**
1. Enable the Cluster- and Route-level XDSTranslatorHooks, and register the custom backend resource (the InferencePool CRD) in the Envoy Gateway configuration (see the configuration sketch after this list).
2. Envoy Gateway starts to watch InferencePools.
3. If an HTTPRoute refers to any resource with the same GVK, it is carried in the ExtensionRefs IR.
4. When Envoy Gateway does the xDS translation, it checks whether ExtensionRefs > 0; if so, it calls the PostClusterModifyHook and passes the unstructured resources (InferencePool) to Envoy AI Gateway.
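A sketch of that Envoy Gateway configuration. The `Cluster`/`Route` hook names assume the upstream WIP PRs land as proposed, and `backendResources`, the service hostname, and the port are illustrative:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyGateway
gateway:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
extensionManager:
  # Watch InferencePool as a custom backend resource.
  backendResources:
    - group: inference.networking.x-k8s.io
      version: v1alpha2
      kind: InferencePool
  hooks:
    xdsTranslator:
      post:
        - Cluster   # assumed to be added by the upstream WIP PRs
        - Route
        - Translation
  service:
    fqdn:
      hostname: envoy-ai-gateway-extension-server.envoy-ai-gateway-system.svc
      port: 5005
```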
**Envoy AI Gateway**
PostClusterModify Hook logic:
1. Implement the PostClusterModifyHook, which iterates over the unstructured resources to group the InferencePools (only one InferencePool per route rule is supported).
2. Modify the cluster type to ORIGINAL_DST and add the original_dst_lb_config.
3. Send it back to Envoy Gateway.
```yaml
type: ORIGINAL_DST
# Route to the endpoint picked by the EPP, carried in the request header.
original_dst_lb_config:
  use_http_header: true
  http_header_name: x-gateway-destination-endpoint
lb_policy: CLUSTER_PROVIDED
```
PostRouteModify Hook logic:
1. Modify the route metadata with the InferencePool info.
7. **Cluster/Route/Listener Configuration**: Envoy AI Gateway configures the Original Destination cluster with `x-gateway-destination-endpoint` header matching, and modifies the Listeners and Routes in a reverse approach: the EPP extproc is enabled at the Listener level and disabled on unrelated routes, preventing ClearRouteCache from re-triggering it (see the per-route override sketch below).
8. **Request Processing**: Client requests flow through EnvoyProxy to the EPP service, which adds the destination header and metadata for endpoint selection.
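A sketch of the per-route override from step 7, using Envoy's `ExtProcPerRoute`; the EPP filter name here is hypothetical:

```yaml
# Applied to routes that do not target an InferencePool, so the
# listener-level EPP extproc never runs for them.
typed_per_filter_config:
  envoy.filters.http.ext_proc/epp:   # hypothetical name of the EPP extproc filter
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExtProcPerRoute
    disabled: true
```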
The logic flow for AIGatewayRoute + InferencePool is as follows:
```mermaid
sequenceDiagram
    participant AIServer as Kubernetes API Server
    participant EAGC as Envoy AI Gateway Controller
    participant EG as Envoy Gateway
    participant EAGES as Envoy AI Gateway Extension Server
    participant Proxy as EnvoyProxy

    AIServer->>EAGC: List/Watch Envoy AI Gateway CRDs (AIGatewayRoute, AIServiceBackend)
    %% ... (translation steps elided) ...
    EG->>EAGES: Invoke PostClusterModify Hook
    EAGES-->>EG: Modify Envoy Cluster Config (Modify Cluster with HostOverride LBPolicy or Original Dst)
    EG->>EAGES: Invoke PostTranslateModify Hook
    EAGES-->>EG: Modify Envoy Listeners and Routes Config (Insert EPP extproc in relevant listeners and disable EPP extproc in unrelated routes)
    EG->>Proxy: Generate final xDS configuration
    Note over Proxy: Ready to forward downstream requests
```
## Request Flow
The request flow for AIGatewayRoute + InferencePool is as follows:
```mermaid
sequenceDiagram
    participant Client as Client (OpenAI SDK)
    participant Envoy as Envoy Proxy
    participant RLS as Rate Limit Service
    participant Processor as AI Gateway External Processor
    participant EPP as EndpointPicker
    participant SelfHosted as Self-Hosted Models

    Client->>Envoy: Request
    Envoy->>RLS: Check Rate Limit
    RLS-->>Envoy: ;
    Envoy->>Processor: Router-level ExtProc Request
    Note over Processor: Extract Model Name & Routing
    Processor-->>Envoy: ClearRouteCache;
    Envoy->>EPP: Router-level ExtProc Request
    Note over EPP: Pick Endpoint in InferencePool
    EPP-->>Envoy: Add Picked Endpoint in Header and Metadata;

    loop Retry/Fallback loop
        Note over Envoy: Forward Based on Picked Endpoint (Original Dst or HostOverride LbPolicy)
        Envoy->>Processor: Upstream-level ExtProc Request
        Note over Processor: Request Transform
        Processor-->>Envoy: ;
        Envoy->>SelfHosted: Forward Request
        SelfHosted-->>Envoy: Response
    end

    Envoy->>Processor: Process Response
    Note over Processor: Response Transform & Extract Token Usage
    Processor-->>Envoy: Add Usage Metadata
    Envoy->>RLS: Reduce Rate Limit budget
    RLS-->>Envoy: ;
    Envoy->>Client: Response
```
The request flow for HTTPRoute + InferencePool is as follows:

```mermaid
sequenceDiagram
    participant Client as Client (OpenAI SDK)
    participant Envoy as Envoy Proxy
    participant EPP as EndpointPicker
    participant SelfHosted as Self-Hosted Models

    Client->>Envoy: Request
    Envoy->>EPP: Router-level ExtProc Request
    Note over EPP: Pick Endpoint in InferencePool
    EPP-->>Envoy: Add Picked Endpoint in Header and Metadata;

    loop Retry/Fallback loop
        Note over Envoy: Forward Based on Picked Endpoint (Original Dst or HostOverride LbPolicy)
        Envoy->>SelfHosted: Forward Request
        SelfHosted-->>Envoy: Response
    end

    Envoy->>Client: Response
```
## Implementation Considerations and Limitations
The current implementation supports InferencePool with AIGatewayRoute as well as with HTTPRoute.
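For reference, the HTTPRoute usage follows the standard Gateway API Inference Extension pattern; a minimal sketch with illustrative names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        # InferencePool referenced directly as a backend.
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
```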
But AIGatewayRoute + InferencePool offers some advantages:
- Native OpenAI Support: we can use the native OpenAI spec to call different InferencePools on the same listener, because the route-level AI Gateway extproc parses the model name from the request body and refreshes the route cache (as in the AIGatewayRoute example earlier). Plain HTTPRoute cannot do that on a single listener: it has to add extra route matches (e.g., a headerMatch) or split the pools across listeners.
- Token Rate Limiting: for self-hosted models, we can extract the token usage from the response and reduce the rate-limit budget accordingly (see the sketch after this list).
- Advanced Observability: we can expose more metrics from the AI Gateway extproc.
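A sketch of the token rate-limit advantage, combining AIGatewayRoute's `llmRequestCosts` with an Envoy Gateway BackendTrafficPolicy; the metadata namespace, key, and limits are illustrative:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: inference-route
spec:
  # Extract token usage from each response into dynamic metadata.
  llmRequestCosts:
    - metadataKey: llm_total_token
      type: TotalToken
  # rules: ... (as in the earlier examples)
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: token-limit
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway
  rateLimit:
    type: Global
    global:
      rules:
        - limit:
            requests: 100000   # token budget per hour, illustrative
            unit: Hour
          # Charge the budget by extracted token usage, not request count.
          cost:
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```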
### Load Balancing Policy
The initial implementation will use **Original Destination** cluster configuration for endpoint selection. Future iterations may consider **Host Override** policy as an alternative approach based on performance and operational requirements.
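If Host Override is adopted later, the cluster-level policy could look roughly like the following sketch, assuming Envoy's `override_host` load-balancing policy and the same EPP-provided header; this is not a committed design:

```yaml
load_balancing_policy:
  policies:
    - typed_extension_config:
        name: envoy.load_balancing_policies.override_host
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.load_balancing_policies.override_host.v3.OverrideHost
          # Prefer the endpoint picked by the EPP...
          override_host_sources:
            - header: x-gateway-destination-endpoint
          # ...and fall back to normal load balancing if it is unavailable.
          fallback_policy:
            policies:
              - typed_extension_config:
                  name: envoy.load_balancing_policies.least_request
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.load_balancing_policies.least_request.v3.LeastRequest
```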