LLM-D Conformance Test Framework — Architecture Guide

This document describes the test framework architecture, test case topologies, deployment modes, and component interactions. All diagrams use Mermaid syntax.

Table of Contents

Framework Overview
9-Phase Test Lifecycle
Package Architecture
Test Case Topologies
Deployment Modes
Platform Differences
Manifest Patching Pipeline
Metrics Validation Matrix
Suggested Improvements
Quick Reference

Framework Overview

graph TB
    subgraph "User Interface"
        MK[Makefile Targets]
        CLI[CLI Flags]
    end

    subgraph "Configuration"
        TC[Test Cases<br/>configs/testcases/*.yaml]
        PR[Profiles<br/>configs/profiles/*.yaml]
        MF[Manifests<br/>deploy/manifests/*.yaml]
    end

    subgraph "Test Runner"
        ST[suite_test.go<br/>Flag parsing]
        CT[conformance_test.go<br/>9-phase lifecycle]
    end

    subgraph "Framework Packages"
        CFG[config<br/>Load & resolve]
        DEP[deployer<br/>Deploy & manage]
        CL[client<br/>OpenAI API]
        MT[metrics<br/>Prometheus scrape]
        DL[model<br/>PVC download]
        CLN[cleanup<br/>Resource tracking]
        RPT[reporter<br/>JSON & HTML]
        RTY[retry<br/>Polling utils]
    end

    subgraph "Kubernetes Cluster"
        CRD[LLMInferenceService CRD]
        POD[vLLM Pods]
        SCH[Scheduler / EPP]
        SVC[Service + HTTPRoute]
        GW[Gateway]
        IP[InferencePool]
    end

    MK --> CLI --> ST --> CT
    ST --> CFG
    CFG --> TC & PR
    CT --> DEP & CL & MT & DL & CLN & RPT
    DEP --> MF
    DEP --> CRD & POD & SVC
    CL --> GW
    MT --> POD & SCH
    DL --> POD
    CLN --> DEP

9-Phase Test Lifecycle

Each test case follows this ordered sequence. A failure in any phase skips all subsequent phases.

flowchart TD
    START([Test Case Start]) --> P1

    P1[Phase 1: PREP<br/>Download model to PVC]
    P1 -->|"skip if hf:// or discover"| P2

    P2[Phase 2: PREREQ<br/>Check CRD installed]
    P2 -->|"skip if discover"| P3

    P3[Phase 3: DEPLOY<br/>Patch manifest + kubectl apply]
    P3 -->|"skip if discover"| P4

    P4[Phase 4: READY<br/>Wait for all sub-resources]
    P4 --> P4a & P4b & P4c & P4d & P4e & P4f

    P4a[4a: Service exists]
    P4b[4b: HTTPRoute exists]
    P4c[4c: Gateway Programmed=True]
    P4d[4d: HTTPRoute Accepted=True]
    P4e[4e: InferencePool created<br/>skip if no scheduler]
    P4f[4f: Pods Running<br/>no CrashLoopBackOff]

    P4a & P4b & P4c & P4d & P4e & P4f --> P5

    P5[Phase 5: MODEL FILES<br/>Check /mnt/models<br/>config.json + weights]
    P5 -->|"skip if mock or no init container"| P6

    P6[Phase 6: HEALTH + MODEL<br/>GET /health → 200<br/>GET /v1/models → list]
    P6 --> P7

    P7[Phase 7: INFERENCE<br/>POST /v1/chat/completions<br/>Validate non-empty response]
    P7 --> P7b

    P7b{Multi-pool<br/>enabled?}
    P7b -->|yes| P7c[Phase 7b: MULTI-POOL<br/>Route to each pool endpoint]
    P7b -->|no| P8
    P7c --> P8

    P8{Metrics<br/>enabled?}
    P8 -->|yes| P8a[Phase 8: METRICS<br/>Scrape vLLM + EPP<br/>Validate counters]
    P8 -->|no| P9
    P8 -->|"mock mode"| P9
    P8a --> P9

    P9[Phase 9: CLEANUP<br/>kubectl delete manifest<br/>Wait for pod termination]
    P9 -->|"skip if NO_CLEANUP"| DONE

    DONE([Test Case End])

    style P3 fill:#e1f5fe
    style P7 fill:#e8f5e9
    style P8a fill:#fff3e0
    style P9 fill:#fce4ec

Phase Details

Phase	Timeout	Skip Conditions
1. PREP	90m (configurable)	`hf://` source, discover mode, cache mode
2. PREREQ	30s	discover mode
3. DEPLOY	2m	discover mode
4. READY	10-90m (per test case)	discover mode
5. MODEL FILES	30s	mock mode, discover mode, no storage-initializer
6. HEALTH	2m (retries)	—
7. INFERENCE	2m per prompt	—
7b. MULTI-POOL	2m per pool	skipped unless `multiPool.enabled`
8. METRICS	2m	mock mode, `metricsCheck.enabled=false`
9. CLEANUP	2m	`NO_CLEANUP=1`, discover mode

Package Architecture

classDiagram
    class config {
        +LoadProfile(path) TestProfile
        +LoadTestCase(path) TestCase
        +ResolveProfileTestCases() []TestCase
        +FilterTestCasesByNames()
    }

    class deployer {
        +Deploy(ctx, tc) DeployResult
        +WaitForReady(ctx, tc)
        +Cleanup(ctx, tc)
        +GetServiceEndpoint(ctx, tc) string
        +GetPlatformInfo(ctx) map
        +GetVLLMVersion(ctx) string
        -patchManifest(ctx, path, tc) string
        -ensurePullSecrets(ctx, path, ns)
        -findSecret(ctx, names, namespaces)
        -copySecret(ctx, src, srcNS, dest, destNS)
    }

    class client {
        +HealthCheck(ctx) error
        +Completions(ctx, req) Response
        +ChatCompletions(ctx, req) Response
        +ListModels(ctx) ModelsResponse
    }

    class metrics {
        +ScrapePod(ctx, pod, port) ScrapeResult
        +ScrapeVLLMPods(ctx, name) []ScrapeResult
        +ScrapeEPPPods(ctx, name) []ScrapeResult
        +ValidateCacheAwareMetrics()
        +ValidatePDMetrics()
        +ValidateSchedulerMetrics()
    }

    class model {
        +DownloadModel(ctx, tc) CacheResult
        +Cleanup(ctx, tc)
        +ResolveSnapshotPath() string
    }

    class cleanup {
        +Track(tc)
        +CleanupAll(ctx) []error
        +CleanupOne(ctx, name) error
    }

    class reporter {
        +AddResult(result)
        +Finalize()
        +SetEnvironment(env)
    }

    deployer --> config : uses TestCase
    client --> config : uses TestCase
    metrics --> deployer : kubectl exec
    model --> deployer : kubectl apply
    cleanup --> deployer : Cleanup()
    reporter --> config : uses TestResult

Test Case Topologies

1. Single-GPU Smoke

Purpose: Fast baseline validation. Confirms that the framework and the scheduler both work.

graph LR
    subgraph "Client"
        U[Test Runner]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway<br/>OSSM/Istio]
        HR[HTTPRoute<br/>single-gpu-smoke]
        SVC[Service<br/>single-gpu-smoke]
        IP[InferencePool]

        subgraph "Pod"
            VLLM[vLLM<br/>Qwen3-0.6B<br/>1 GPU]
            EPP[Scheduler<br/>sidecar]
        end
    end

    U -->|"POST /v1/chat/completions"| GW
    GW --> HR --> SVC --> EPP
    EPP -->|route| VLLM
    VLLM -.->|register| IP

    style VLLM fill:#4caf50,color:#fff
    style EPP fill:#2196f3,color:#fff

Property	Value
Replicas	1
GPUs	1
Scheduler	Yes (EPP sidecar)
Metrics	Disabled (smoke)
Prompts	1 chat prompt
Ready Timeout	10m

2. Single-GPU

Purpose: Full baseline with metrics validation — vLLM + EPP + scheduler metrics.

graph LR
    subgraph "Client"
        U[Test Runner]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway]
        HR[HTTPRoute]
        SVC[Service<br/>single-gpu]
        IP[InferencePool]

        subgraph "Pod"
            VLLM["vLLM<br/>Qwen3-0.6B<br/>1 GPU<br/>📊 /metrics:8000"]
            EPP["Scheduler EPP<br/>📊 /metrics:9090"]
        end
    end

    U -->|"POST /v1/chat/completions"| GW
    GW --> HR --> SVC --> EPP
    EPP -->|route| VLLM
    VLLM -.->|register| IP

    U -.->|"scrape metrics"| VLLM
    U -.->|"scrape metrics"| EPP

    style VLLM fill:#4caf50,color:#fff
    style EPP fill:#2196f3,color:#fff

Property	Value
Replicas	1
GPUs	1
Scheduler	Yes
Metrics	vLLM ✓ EPP ✓ Scheduler ✓
Prompts	2 chat prompts

3. Single-GPU No-Scheduler

Purpose: Validates K8s native load balancing without LLM-D scheduler. No InferencePool.

graph LR
    subgraph "Client"
        U[Test Runner]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway]
        HR[HTTPRoute]
        SVC["Service<br/>single-gpu-no-scheduler<br/>Round-Robin LB"]

        VLLM1[vLLM Pod 1<br/>1 GPU]
        VLLM2[vLLM Pod 2<br/>1 GPU]
        VLLM3[vLLM Pod 3<br/>1 GPU]
    end

    U -->|"POST /v1/chat/completions"| GW
    GW --> HR --> SVC
    SVC -->|"round-robin"| VLLM1
    SVC -->|"round-robin"| VLLM2
    SVC -->|"round-robin"| VLLM3

    style VLLM1 fill:#4caf50,color:#fff
    style VLLM2 fill:#4caf50,color:#fff
    style VLLM3 fill:#4caf50,color:#fff
    style SVC fill:#ff9800,color:#fff

Property	Value
Replicas	3
GPUs	3 total
Scheduler	No — K8s Service round-robin
InferencePool	No
Metrics	Disabled

4. Cache-Aware

Purpose: Validates prefix KV cache-aware routing: the scheduler routes requests that share a system prefix to the same pod so that the cached KV blocks are reused.

graph TB
    subgraph "Client"
        U["Test Runner<br/>3 requests with<br/>same system prefix<br/>(5s delay between)"]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway]
        HR[HTTPRoute]
        SVC[Service<br/>cache-aware]
        IP[InferencePool]

        subgraph "Scheduler EPP"
            PCS["precise-prefix-cache-scorer<br/>weight=3 (highest)"]
            QS[queue-scorer]
            KVS[kv-cache-utilization-scorer]
            MSP[max-score-picker]
            PI["Prefix Indexer<br/>pod→prefix mapping"]
        end

        subgraph "Pod A"
            VLLM_A["vLLM Replica A<br/>1 GPU<br/>prefix-cache enabled<br/>block-size=16"]
        end

        subgraph "Pod B"
            VLLM_B["vLLM Replica B<br/>1 GPU<br/>prefix-cache enabled<br/>block-size=16"]
        end
    end

    U -->|"req 1: system+Q1"| GW --> HR --> SVC --> PCS
    PCS --> QS --> KVS --> MSP
    MSP -->|"route to A (random first time)"| VLLM_A

    VLLM_A -->|"ZMQ: prefix cached on A"| PI
    PI -->|"req 2: same prefix → route to A"| VLLM_A
    PI -.->|"different prefix → route to B"| VLLM_B

    VLLM_A -.-> IP
    VLLM_B -.-> IP

    style VLLM_A fill:#4caf50,color:#fff
    style VLLM_B fill:#4caf50,color:#fff
    style PCS fill:#9c27b0,color:#fff
    style PI fill:#ff9800,color:#fff

Cache-Aware Request Flow:

sequenceDiagram
    participant C as Client
    participant S as Scheduler EPP
    participant PI as Prefix Indexer
    participant A as vLLM Pod A
    participant B as vLLM Pod B

    C->>S: Request 1 (system prefix + Q1)
    S->>PI: Lookup prefix hash
    PI-->>S: No match (first time)
    S->>A: Route to Pod A (random)
    A->>A: Compute prefix KV cache
    A->>PI: ZMQ: "prefix X cached on Pod A"
    A-->>C: Response 1

    Note over C: 5 second delay (indexer sync)

    C->>S: Request 2 (same prefix + Q2)
    S->>PI: Lookup prefix hash
    PI-->>S: Match! Pod A has this prefix
    S->>A: Route to Pod A (cache hit)
    A->>A: Reuse cached KV (skip recompute)
    A-->>C: Response 2 (faster)

    C->>S: Request 3 (same prefix + Q3)
    S->>PI: Lookup prefix hash
    PI-->>S: Match! Pod A
    S->>A: Route to Pod A (cache hit)
    A-->>C: Response 3 (faster)

Property	Value
Replicas	2
GPUs	2 total
Scheduler	Yes (precise-prefix-cache-scorer, weight=3)
Key Config	`block-size=16`, `hash-algo=sha256`
ZMQ	vLLM → EPP (prefix cache events)
Prompts	3 chat prompts, same system prefix, 5s delay
Metrics	vLLM ✓ EPP ✓ Scheduler ✓ PrefixCache ✓
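
The scorer chain in the diagram (several scorers feeding a max-score picker) can be illustrated with a weighted sum. This is a simplified sketch rather than the EPP implementation. The `Scorer` type and `PickMaxScore` are hypothetical, and only the prefix scorer's weight of 3 comes from this document.

```go
package main

// Scorer returns a score in [0, 1] for a pod; higher is better.
// The type is hypothetical and only models the idea of weighted scorers.
type Scorer struct {
	Name   string
	Weight float64
	Score  func(pod string) float64
}

// PickMaxScore sums weight*score across all scorers for each pod and
// returns the pod with the highest total. Ties keep the earliest pod.
func PickMaxScore(pods []string, scorers []Scorer) (best string, bestTotal float64) {
	for i, pod := range pods {
		total := 0.0
		for _, s := range scorers {
			total += s.Weight * s.Score(pod)
		}
		if i == 0 || total > bestTotal {
			best, bestTotal = pod, total
		}
	}
	return best, bestTotal
}
```

With a prefix scorer at weight 3 and a queue scorer at an assumed weight of 1, a pod that already holds the prefix (prefix score 1.0) wins even when its queue score is lower than its peer's. This is the behavior that requests 2 and 3 rely on in the sequence diagram.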

5. Prefill/Decode (P/D)

Purpose: Validates P/D disaggregation: prefill pods compute the KV cache and transfer it via NIXL to the decode pod, which generates the tokens.

graph TB
    subgraph "Client"
        U[Test Runner]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway]
        HR[HTTPRoute]
        SVC[Service<br/>pd]
        IP[InferencePool]

        subgraph "Decode Pod (1 replica)"
            DECODE["vLLM Decode<br/>1 GPU<br/>Token generation<br/>NIXL receiver"]
            EPP["Scheduler EPP"]
        end

        subgraph "Prefill Pod A"
            PF_A["vLLM Prefill A<br/>1 GPU<br/>Prompt processing<br/>NIXL sender"]
        end

        subgraph "Prefill Pod B"
            PF_B["vLLM Prefill B<br/>1 GPU<br/>Prompt processing<br/>NIXL sender"]
        end
    end

    U -->|"POST /v1/chat/completions"| GW
    GW --> HR --> SVC --> EPP

    EPP -->|"route prompt to prefill"| PF_A
    EPP -.->|"or"| PF_B

    PF_A ==>|"NIXL/UCX/TCP<br/>KV cache transfer"| DECODE
    PF_B ==>|"NIXL/UCX/TCP<br/>KV cache transfer"| DECODE

    DECODE -->|"generated tokens"| U

    DECODE -.-> IP
    PF_A -.-> IP
    PF_B -.-> IP

    style DECODE fill:#e91e63,color:#fff
    style PF_A fill:#4caf50,color:#fff
    style PF_B fill:#4caf50,color:#fff
    style EPP fill:#2196f3,color:#fff

P/D Request Flow:

sequenceDiagram
    participant C as Client
    participant D as Decode Pod
    participant S as Scheduler EPP
    participant P as Prefill Pod A

    C->>D: POST /v1/chat/completions
    D->>S: Route prompt
    S->>P: Forward to Prefill A
    P->>P: Tokenize + compute KV cache
    P->>D: NIXL: Transfer KV cache (TCP/UCX)
    D->>D: Continue token generation
    D-->>C: Stream response tokens

Property	Value
Decode Replicas	1
Prefill Replicas	2
GPUs	3 total (1 decode + 2 prefill)
KV Transfer	NIXL over TCP/UCX
Scheduler	Yes — routes prompts to prefill pods
Metrics	vLLM ✓ PD ✓ NIXL ✓ Scheduler ✓

6. P/D + Cache-Aware

Purpose: Most complex topology — combines P/D disaggregation with prefix cache-aware routing. Scheduler routes to the prefill pod that already has the prefix cached.

graph TB
    subgraph "Client"
        U["Test Runner<br/>3 chat prompts<br/>same system prefix"]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway]
        SVC[Service]

        subgraph "Decode Pod"
            DECODE["vLLM Decode<br/>1 GPU<br/>NIXL receiver"]
            EPP["Scheduler EPP<br/>prefix-cache-scorer"]
            PI["Prefix Indexer"]
        end

        subgraph "Prefill Pod A"
            PF_A["vLLM Prefill A<br/>1 GPU<br/>prefix-cache ON<br/>NIXL sender"]
        end

        subgraph "Prefill Pod B"
            PF_B["vLLM Prefill B<br/>1 GPU<br/>prefix-cache ON<br/>NIXL sender"]
        end
    end

    U -->|requests| GW --> SVC --> EPP

    EPP -->|"route to prefill<br/>(cache-aware)"| PF_A
    EPP -.->|"or"| PF_B

    PF_A ==>|"NIXL: KV cache"| DECODE
    PF_B ==>|"NIXL: KV cache"| DECODE

    PF_A -->|"ZMQ: prefix cached"| PI
    PF_B -->|"ZMQ: prefix cached"| PI
    PI -->|"lookup: which pod<br/>has this prefix?"| EPP

    DECODE -->|response| U

    style DECODE fill:#e91e63,color:#fff
    style PF_A fill:#4caf50,color:#fff
    style PF_B fill:#4caf50,color:#fff
    style EPP fill:#2196f3,color:#fff
    style PI fill:#ff9800,color:#fff

Combined Flow:

sequenceDiagram
    participant C as Client
    participant EPP as Scheduler + Prefix Indexer
    participant PA as Prefill Pod A
    participant D as Decode Pod

    C->>EPP: Request 1 (prefix + Q1)
    EPP->>PA: Route to Prefill A (random first time)
    PA->>PA: Compute prefix KV cache
    PA->>D: NIXL: transfer KV cache
    PA->>EPP: ZMQ: "prefix X on Pod A"
    D-->>C: Response 1

    Note over C: 5s delay

    C->>EPP: Request 2 (same prefix + Q2)
    Note over EPP: Prefix indexer → Pod A has prefix
    EPP->>PA: Route to Prefill A (cache hit!)
    PA->>PA: Reuse prefix KV (fast)
    PA->>D: NIXL: transfer KV cache
    D-->>C: Response 2 (faster)

Property	Value
Decode Replicas	1
Prefill Replicas	2
GPUs	3 total
KV Transfer	NIXL over TCP/UCX
Prefix Cache	Yes — on prefill pods
ZMQ Events	Prefill → EPP (cache state)
Metrics	All: vLLM, PD, NIXL, EPP, Scheduler, PrefixCache

7. Mixture of Experts (MoE)

Purpose: Validates distributed MoE inference with expert/tensor parallelism across multi-GPU nodes connected via RoCE/InfiniBand.

graph TB
    subgraph "Client"
        U[Test Runner]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway]
        SVC[Service<br/>moe]
        IP[InferencePool]

        subgraph "Main Pod (8 GPUs)"
            VLLM_M["vLLM Main<br/>8× GPU<br/>Expert subset"]
            EPP[Scheduler EPP]
        end

        subgraph "Worker Pod 1 (8 GPUs)"
            W1["vLLM Worker<br/>8× GPU<br/>Expert subset"]
        end

        subgraph "Worker Pod N (8 GPUs)"
            WN["vLLM Worker<br/>8× GPU<br/>Expert subset"]
        end

        NET["RoCE / InfiniBand<br/>NCCL all-reduce<br/>GPU Direct RDMA"]
    end

    U -->|request| GW --> SVC --> EPP --> VLLM_M

    VLLM_M <-->|"NCCL collective ops"| NET
    W1 <-->|"NCCL collective ops"| NET
    WN <-->|"NCCL collective ops"| NET

    VLLM_M -.-> IP

    style VLLM_M fill:#9c27b0,color:#fff
    style W1 fill:#673ab7,color:#fff
    style WN fill:#673ab7,color:#fff
    style NET fill:#ff5722,color:#fff

Property	Value
Model	DeepSeek-R1-0528
GPUs	8 per pod (main + N workers)
Parallelism	data=32, expert=true, tensor=1
Network	RoCE/IB (NCCL, NVSHMEM, UCX)
Storage	500Gi PVC
Ready Timeout	90m
Metrics	vLLM ✓ Scheduler ✓

8. Multi-Pool

Purpose: Validates multiple independent InferencePools sharing a single Gateway — tests VirtualService merge (OSSM-12585 fix).

graph TB
    subgraph "Client"
        U[Test Runner]
    end

    subgraph "Kubernetes Cluster"
        GW[Gateway<br/>shared]
        VS["VirtualService<br/>merges both pools"]

        subgraph "Pool A"
            HR_A[HTTPRoute A]
            SVC_A[Service A]
            IP_A[InferencePool A]
            subgraph "Pod A"
                VLLM_A[vLLM<br/>1 GPU]
                EPP_A[Scheduler A]
            end
        end

        subgraph "Pool B"
            HR_B[HTTPRoute B]
            SVC_B[Service B]
            IP_B[InferencePool B]
            subgraph "Pod B"
                VLLM_B[vLLM<br/>1 GPU]
                EPP_B[Scheduler B]
            end
        end
    end

    U -->|"/multi-pool-a/..."| GW
    U -->|"/multi-pool-b/..."| GW
    GW --> VS
    VS -->|"prefix: /multi-pool-a/"| HR_A --> SVC_A --> EPP_A --> VLLM_A
    VS -->|"prefix: /multi-pool-b/"| HR_B --> SVC_B --> EPP_B --> VLLM_B
    VLLM_A -.-> IP_A
    VLLM_B -.-> IP_B

    style VLLM_A fill:#4caf50,color:#fff
    style VLLM_B fill:#ff9800,color:#fff
    style EPP_A fill:#2196f3,color:#fff
    style EPP_B fill:#2196f3,color:#fff
    style VS fill:#9c27b0,color:#fff

Property	Value
Pools	2 (multi-pool-a, multi-pool-b)
GPUs	2 total (1 per pool)
Schedulers	2 (independent per pool)
Routing	VirtualService prefix-based
Validates	OSSM-12585 multi-pool merge
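
Prefix-based routing to the two pools can be sketched as a longest-prefix match. `Route` and `MatchPool` are illustrative names. Stripping the prefix before forwarding is an assumption, since this document does not say whether the merged VirtualService rewrites the path.

```go
package main

import "strings"

// Route maps a URL path prefix to a pool name.
type Route struct {
	Prefix string
	Pool   string
}

// MatchPool returns the pool whose prefix matches the request path, plus
// the path with that prefix removed. The longest matching prefix wins.
// ok is false when no route matches.
func MatchPool(routes []Route, path string) (pool, rest string, ok bool) {
	bestLen := -1
	for _, r := range routes {
		if strings.HasPrefix(path, r.Prefix) && len(r.Prefix) > bestLen {
			pool = r.Pool
			rest = "/" + strings.TrimPrefix(path, r.Prefix)
			ok = true
			bestLen = len(r.Prefix)
		}
	}
	return pool, rest, ok
}
```

For the two pools above, a request to `/multi-pool-b/v1/chat/completions` resolves to `multi-pool-b`, and a path outside both prefixes matches nothing.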

Deployment Modes

Mock Mode

Replaces vLLM containers with a lightweight mock server — no GPU, no model download. The scheduler, EPP, and InferencePool remain real.

graph LR
    subgraph "Normal Mode"
        N_VLLM["vLLM Container<br/>GPU required<br/>Model download<br/>Real inference"]
    end

    subgraph "Mock Mode (MOCK=image)"
        M_VLLM["Mock Container<br/>CPU only (100m)<br/>No model download<br/>Fake /v1/* endpoints"]
        M_SCH["Scheduler EPP<br/>REAL (unchanged)"]
        M_IP["InferencePool<br/>REAL (unchanged)"]
    end

    N_VLLM -->|"replaced by"| M_VLLM
    M_VLLM --- M_SCH --- M_IP

What changes in mock mode:

Component	Normal	Mock
vLLM image	`vllm/vllm:latest`	Mock image
GPU	Required	Not needed
CPU/Memory	4 CPU, 32Gi	100m CPU, 64Mi
storageInitializer	Downloads model	Disabled
Scheduler/EPP	Real	Real (unchanged)
InferencePool	Real	Real (unchanged)
Phase 5 (model files)	Checked	Skipped
Phase 8 (metrics)	Scraped	Skipped
Prefill pods (P/D)	Real vLLM	Mock (both replaced)
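
The container swap described in the table can be sketched as follows. The `Container` type is a simplified stand-in for a pod spec entry, and identifying scheduler containers by a name substring is an assumption; the framework's actual matching rule is not stated in this document.

```go
package main

import "strings"

// Container is a simplified stand-in for a pod spec container.
type Container struct {
	Name  string
	Image string
	GPUs  int
}

// ApplyMock swaps every non-scheduler container to the mock image and
// drops its GPU request. Scheduler containers are returned unchanged.
// Matching on the substring "scheduler" is an assumption for this sketch.
func ApplyMock(containers []Container, mockImage string) []Container {
	out := make([]Container, len(containers))
	for i, c := range containers {
		if !strings.Contains(c.Name, "scheduler") {
			c.Image = mockImage
			c.GPUs = 0
		}
		out[i] = c
	}
	return out
}
```

The function returns a copy, so the original container list is left untouched and can still be used for a normal-mode deployment.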

Usage:

make test TESTCASE=single-gpu MOCK=ghcr.io/aneeshkp/vllm-mock:latest
make test-profile-all MOCK=ghcr.io/aneeshkp/vllm-mock:latest  # all topologies

Model Sources

flowchart TD
    MS{MODEL_SOURCE?}

    MS -->|"hf (default)"| HF["hf:// Direct Download<br/>Model downloaded at pod startup<br/>via storage-initializer"]

    MS -->|"pvc"| PVC["pvc:// Pre-cached<br/>Download to PVC first<br/>via make cache-model"]

    MS -->|"pvc-snapshot"| SNAP["pvc-snapshot://<br/>HF cache layout<br/>snapshot_download()"]

    HF --> HF_FLOW["Pod starts → init container<br/>downloads from HuggingFace<br/>→ /mnt/models/ModelName/"]

    PVC --> PVC_FLOW["1. make cache-model<br/>2. Job downloads to PVC<br/>3. URI: pvc://pvc-name/ModelName<br/>4. Pod mounts PVC directly"]

    SNAP --> SNAP_FLOW["1. Download with HF cache<br/>2. models--Org--Model/snapshots/hash/<br/>3. URI: pvc://pvc-name"]

    style HF fill:#4caf50,color:#fff
    style PVC fill:#2196f3,color:#fff
    style SNAP fill:#9c27b0,color:#fff
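
The three branches of the flowchart map to URI shapes that can be sketched in a few lines. `ModelURI` is a hypothetical helper. The `hf://<repo>` form and the example repository name used in the note below are assumptions, while the two `pvc://` shapes follow the diagram.

```go
package main

import (
	"fmt"
	"path"
)

// ModelURI builds the model URI for a MODEL_SOURCE value.
// repo is a Hugging Face style "Org/Model" identifier (assumed format).
func ModelURI(source, repo, pvcName string) (string, error) {
	switch source {
	case "", "hf":
		// Default: the model is fetched when the pod starts.
		return "hf://" + repo, nil
	case "pvc":
		// Flat layout: pvc://pvc-name/ModelName
		return "pvc://" + pvcName + "/" + path.Base(repo), nil
	case "pvc-snapshot":
		// HF cache layout: the URI points at the PVC root.
		return "pvc://" + pvcName, nil
	default:
		return "", fmt.Errorf("unknown model source %q", source)
	}
}
```

For example, `ModelURI("pvc", "Qwen/Qwen3-0.6B", "model-cache")` yields `pvc://model-cache/Qwen3-0.6B`, where both the organization prefix and the PVC name are placeholder values.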

Discover Mode

Validates an existing deployment without deploying or cleaning up.

flowchart LR
    subgraph "Phases SKIPPED"
        P1[1. PREP]
        P2[2. PREREQ]
        P3[3. DEPLOY]
        P4[4. READY]
        P5[5. MODEL FILES]
    end

    subgraph "Phases EXECUTED"
        P6[6. HEALTH ✓]
        P6b[6b. MODEL LIST ✓]
        P7[7. INFERENCE ✓]
        P8[8. METRICS ✓]
    end

    P1 ~~~ P6

    style P1 fill:#ccc
    style P2 fill:#ccc
    style P3 fill:#ccc
    style P4 fill:#ccc
    style P5 fill:#ccc
    style P6 fill:#4caf50,color:#fff
    style P6b fill:#4caf50,color:#fff
    style P7 fill:#4caf50,color:#fff
    style P8 fill:#4caf50,color:#fff

Usage:

make test TESTCASE=single-gpu DISCOVER=true NAMESPACE=my-existing-ns

Platform Differences

Feature	OCP	AKS	GKE
CLI tool	`oc` (preferred)	`kubectl`	`kubectl`
Pull secrets	Global (skipped)	Copied from source NS	Copied from source NS
Auth disable	Annotation injection	Annotation injection	Annotation injection
Gateway discovery	`opendatahub`, `istio-system`	`kserve`, `istio-system`	`kserve`, `istio-system`
Storage default	RWO	Azure Disk (RWO)	Persistent Disk (RWO)
Multi-replica PVC	RWX via NFS	`azurefile-csi-premium`	Custom StorageClass

Manifest Patching Pipeline

Every manifest goes through this transformation pipeline before kubectl apply:

flowchart TD
    MF[Original Manifest<br/>deploy/manifests/*.yaml]

    MF --> A{DisableAuth?}
    A -->|yes| A1["Inject annotation:<br/>security.opendatahub.io/<br/>enable-auth: false"]
    A -->|no| B

    A1 --> B{Model URI<br/>override?}
    B -->|yes| B1["Patch spec.model.uri<br/>and spec.model.name"]
    B -->|no| C

    B1 --> C{PullSecret<br/>override?}
    C -->|yes| C1["Patch imagePullSecrets<br/>name entries"]
    C -->|no| D

    C1 --> D{Mock mode?}
    D -->|yes| D1["Disable storageInitializer"]
    D1 --> D2["Replace vLLM containers<br/>with mock image<br/>(skip scheduler containers)"]
    D -->|no| E

    D2 --> E[Patched Manifest<br/>temp file]
    E --> F["kubectl apply -n ns -f patched.yaml"]
    F --> G["Delete temp file"]

    style MF fill:#e3f2fd
    style E fill:#fff3e0
    style F fill:#e8f5e9
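
The first two decision points of the pipeline can be sketched on a decoded manifest. This is an illustration under stated assumptions: the manifest is held as a generic map, `PatchOptions` and its fields are hypothetical, and the annotation is written to `metadata.annotations`. The annotation key and the `spec.model.uri` and `spec.model.name` paths come from the flowchart.

```go
package main

// Manifest is a decoded YAML document. Using a generic map is an
// assumption made for this sketch.
type Manifest map[string]any

// PatchOptions mirrors decision points in the flowchart; the field names
// are hypothetical.
type PatchOptions struct {
	DisableAuth bool
	ModelURI    string // empty means "no override"
	ModelName   string
}

// child returns m[key] as a nested map, creating it if it is missing.
func child(m map[string]any, key string) map[string]any {
	if c, ok := m[key].(map[string]any); ok {
		return c
	}
	c := map[string]any{}
	m[key] = c
	return c
}

// Patch applies the enabled steps in the same order as the flowchart.
func Patch(m Manifest, opts PatchOptions) {
	if opts.DisableAuth {
		ann := child(child(m, "metadata"), "annotations")
		ann["security.opendatahub.io/enable-auth"] = "false"
	}
	if opts.ModelURI != "" {
		model := child(child(m, "spec"), "model")
		model["uri"] = opts.ModelURI
		model["name"] = opts.ModelName
	}
}
```

Each step checks its own option and touches only its own keys, so a disabled step leaves the manifest exactly as it was loaded.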

Metrics Validation Matrix

Which metrics are checked per test case topology:

Metric	Smoke	Single	No-Sched	Cache	P/D	P/D+Cache	MoE	Multi
`vllm:request_success_total`	—	✓	—	✓	✓	✓	✓	✓
`vllm:prompt_tokens_total`	—	✓	—	—	✓	✓	✓	✓
`vllm:generation_tokens_total`	—	✓	—	—	✓	✓	✓	✓
`vllm:gpu_cache_usage_perc`	—	✓	—	✓	—	✓	—	✓
`vllm:prefix_cache_queries`	—	—	—	✓	—	✓	—	—
`vllm:prefix_cache_hits`	—	—	—	✓	—	✓	—	—
`prefix_cache_hit_rate`	—	—	—	✓	—	✓	—	—
`nixl:kv_transfer_count`	—	—	—	—	✓	✓	—	—
`nixl:kv_transfer_failures`	—	—	—	—	✓	✓	—	—
`scheduler_e2e_duration`	—	✓	—	✓	✓	✓	✓	✓
`request_total`	—	✓	—	✓	✓	✓	✓	✓
`request_error_total`	—	✓	—	✓	✓	✓	✓	✓
`pool_ready_pods`	—	✓	—	✓	✓	✓	✓	✓
`prefix_indexer_size`	—	—	—	✓	—	✓	—	—
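
A check against this matrix needs to read scraped metrics and confirm that the expected ones are present. The sketch below parses the Prometheus text format with the standard library only. `ParseMetrics` and `MissingOrZero` are hypothetical helpers. Requiring a value greater than zero is an assumption that suits success counters such as `vllm:request_success_total`, but not error counters.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// ParseMetrics reads Prometheus text exposition format and sums the sample
// values per metric name, ignoring labels, comments, and timestamps.
func ParseMetrics(text string) map[string]float64 {
	totals := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, rest := line, ""
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name, rest = line[:i], line[i:]
		}
		// Skip past the label set, if any, to reach the value.
		if j := strings.LastIndex(rest, "}"); j >= 0 {
			rest = rest[j+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		totals[name] += v
	}
	return totals
}

// MissingOrZero returns the required metrics that are absent or not positive.
func MissingOrZero(totals map[string]float64, required []string) []string {
	var bad []string
	for _, name := range required {
		if totals[name] <= 0 {
			bad = append(bad, name)
		}
	}
	return bad
}
```

Summing across label sets lets one row of the matrix be checked without knowing which labels a particular vLLM or EPP build attaches to the metric.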

Suggested Improvements

High Priority

#	Improvement	Impact
1	Increase deployer test coverage (19% → 60%+)	Catch manifest patching edge cases
2	Pre-flight `kubectl apply --dry-run=client`	Fail fast on invalid manifests
3	Per-phase configurable timeouts in test case YAML	Different clusters need different timeouts
4	Labeled break in loops and better error context	Easier debugging

Medium Priority

#	Improvement	Impact
5	Test parallelism with per-test namespaces	2h → 30m runtime
6	Separate download timeout from ready timeout	Large models need 2h+ download
7	Pre-flight cluster checks (node capacity, GPU availability)	Fail fast on capacity issues
8	Extract magic numbers (ports, labels, timeouts) to constants	Maintainability

Low Priority

#	Improvement	Impact
9	Plugin system for custom validators	Extensibility
10	Performance tracking (latency trends over time)	Regression detection
11	Manifest versioning (v1, v2 patching strategies)	Backwards compatibility
12	Per-test namespace isolation	Prevent state leakage

Quick Reference

Common Commands

# Setup
make setup                              # Clone manifests (main branch)
make setup MANIFEST_REF=3.4-ea2         # Clone specific branch
make manifests                          # List available branches

# Run tests
make test TESTCASE=single-gpu           # Single test case
make test TESTCASE=single-gpu,pd        # Multiple test cases
make test-profile-all                   # All conformance tests

# Mock mode (no GPU)
make test TESTCASE=single-gpu MOCK=ghcr.io/aneeshkp/vllm-mock:latest
make test-profile-all MOCK=ghcr.io/aneeshkp/vllm-mock:latest

# Validate existing deployment
make test TESTCASE=single-gpu DISCOVER=true NAMESPACE=my-ns

# Cache model to PVC
make cache-model TESTCASE=single-gpu STORAGE_CLASS=azurefile-rwx

# Debug (keep resources)
make test TESTCASE=pd NO_CLEANUP=1

# Unit tests
make unittest

# Discovery
make profiles                           # List test profiles
make testcases                          # List test cases
make models                             # List models

File Layout

├── configs/
│   ├── testcases/          # 8 test case definitions
│   │   ├── single-gpu-smoke.yaml
│   │   ├── single-gpu.yaml
│   │   ├── single-gpu-no-scheduler.yaml
│   │   ├── cache-aware.yaml
│   │   ├── pd.yaml
│   │   ├── pd-cache-aware.yaml
│   │   ├── moe.yaml
│   │   └── multi-pool.yaml
│   └── profiles/           # 5 test profiles
│       ├── smoke.yaml
│       ├── all.yaml
│       ├── cache-aware.yaml
│       ├── pd.yaml
│       └── deepseek.yaml
├── deploy/manifests/       # Cloned from manifest repo (gitignored)
├── framework/
│   ├── cleanup/            # Resource lifecycle tracking
│   ├── client/             # OpenAI-compatible API client
│   ├── config/             # YAML config loading (95% coverage)
│   ├── deployer/           # K8s deployment management (31% coverage)
│   ├── metrics/            # Prometheus scraping (71% coverage)
│   ├── model/              # Model download & PVC caching
│   ├── reporter/           # JSON & HTML reporting
│   └── retry/              # Polling utilities
├── tests/
│   ├── conformance_test.go # 9-phase test lifecycle (1400+ lines)
│   ├── suite_test.go       # Ginkgo suite + CLI flags
│   └── smoke/              # Framework validation (no cluster)
├── docs/
│   ├── architecture.md     # This document
│   ├── adding-test-cases.md
│   └── disconnected-ocp-setup.md
└── Makefile                # All targets and flag mappings
