Batch Gateway is a high-performance system for processing large-scale batch inference jobs in Kubernetes environments. It provides an OpenAI-compatible API for submitting, tracking, and managing batch inference jobs.
The system is designed to process batch workloads efficiently alongside interactive workloads, minimizing interference with interactive traffic while satisfying batch jobs' service level objectives (SLOs).
- Running inference over large datasets.
- Generating embeddings for large corpora.
- Model evaluations and testing.
- Offline analysis and batch processing.
- Cost-optimized inference using differential billing for batch vs. interactive workloads.
- OpenAI API Compatibility: Full schema parity with OpenAI's `/v1/batches` and `/v1/files` endpoints.
- Large-Scale Processing: Support for up to 50,000 requests per job.
- Progress Tracking: Real-time job status and progress updates.
- Job Management and Control: Manage and control batch jobs before, during, and after processing.
- Model-Aware Scheduling: Groups and orders requests by model and system prompt for optimal downstream utilization.
- Intelligent Request Dispatching: Monitors downstream metrics to regulate the flow of batch inference requests.
- Deployment Flexibility: Separate API server, batch processor, and request dispatcher components for independent scaling.
- Pluggable Storage Backends: Supports pluggable storage backends for files, metadata, and queues.
- Fault Tolerance: Automatic recovery from batch processor crashes.
- Kubernetes Native: Helm charts with OpenShift compatibility.
- Observability: Prometheus metrics and OpenTelemetry integration.
- Health Checks: Liveness and readiness probes for the system components.
- Security: TLS support, non-root execution, capability dropping, read-only filesystem.
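Each line of a batch input file is a single request in the OpenAI batch JSONL format. An illustrative `batch_requests.jsonl` line (the model name and prompts here are placeholders):

```jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "my-model", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}]}}
```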
- API Server (`batch-gateway-apiserver`)
  - Handles REST API requests for batch job submission, management, and tracking, as well as file management.
  - Exposes OpenAI-compatible `/v1/batches` and `/v1/files` endpoints.
- Batch Processor (`batch-gateway-processor`)
  - Pulls a batch job from a priority queue and retrieves its associated file of inference requests.
  - Pre-processes the batch file and builds per-model execution plans.
  - Sends individual inference requests from the batch file downstream, with per-model and global concurrency control.
  - Writes results to an output file.
  - Updates job status.
  - Listens for job events (e.g. cancellation) during job processing.
- Data Layer
  - Manages batch input and output files, job and file metadata, the priority queue, and event and status mechanisms.
  - Supports pluggable backends.
  - Backends available out of the box:
    - Job and file metadata storage: PostgreSQL.
    - Priority queue, event channels, and status updates: Redis.
    - File storage: S3, filesystem.
- Batch Dispatcher
  - Implements intelligent flow control to balance batch and interactive workloads.
  - Monitors downstream inference system metrics (e.g. queue depth, latency, utilization).
  - Dynamically adjusts the dispatch rate of batch requests based on downstream system load, minimizing interference with interactive requests while meeting batch job SLOs.
  - Provides backpressure mechanisms to prevent overwhelming downstream inference engines.
```
User → API Server → PostgreSQL (metadata) + Redis (queue) + S3 (input file)
         ↓
   Priority Queue
         ↓
Batch Processor (pulls jobs)
         ↓
Ingestion
  - Obtain input file
  - Parse model IDs and system prompts
  - Build per-model plans
  - Write plans to local disk
         ↓
Execution
  - Launch per-model goroutines
  - Acquire global & per-model semaphores
  - Read requests from plan files
  - Send to inference gateway
  - Write results to output file
         ↓
Upload Results to S3
         ↓
Update Job Status
```
For detailed architecture information, see the documents in the `docs/design/` directory.
```
batch-gateway/
├── cmd/                       # Application entry points
│   ├── apiserver/             # API server binary
│   └── batch-processor/       # Batch processor binary
├── internal/                  # Private application code
│   ├── apiserver/             # API server implementation
│   │   ├── batch/             # Batch job handlers
│   │   ├── file/              # File handlers
│   │   ├── common/            # Shared handler utilities
│   │   ├── health/            # Health check handler
│   │   ├── middleware/        # HTTP middleware
│   │   ├── readiness/         # Readiness handler
│   │   ├── metrics/           # Metrics mechanism
│   │   └── server/            # Server initialization
│   ├── processor/             # Batch processor implementation
│   │   ├── worker/            # Worker pool, planning, and execution
│   │   ├── config/            # Processor configuration
│   │   └── metrics/           # Prometheus metrics
│   ├── database/              # Database clients
│   │   ├── api/               # Database interfaces
│   │   ├── redis/             # Redis implementation
│   │   └── postgresql/        # PostgreSQL implementation
│   ├── files_store/           # File storage clients (S3, FS)
│   ├── inference/             # Inference gateway HTTP client
│   ├── shared/                # Shared types and utilities
│   └── util/                  # Common utilities (logging, TLS, etc.)
├── charts/                    # Helm charts
│   └── batch-gateway/         # Kubernetes deployment manifests
├── docs/                      # Documentation
│   ├── design/                # Architecture and design documents
│   └── guides/                # Developer and user guides
├── test/                      # Test suites
│   └── e2e/                   # End-to-end tests
├── docker/                    # Dockerfiles
│   ├── Dockerfile.apiserver   # API server container image
│   └── Dockerfile.processor   # Processor container image
├── scripts/                   # Development and deployment scripts
├── Makefile                   # Build and development targets
└── go.mod                     # Go module dependencies
```
- `cmd/`: Contains `main.go` entry points for the components' binaries.
- `internal/`: All private application code, organized by component.
- `charts/`: Helm chart for deploying the components in Kubernetes.
- `docs/`: Architecture documents and development and usage guides.
- `test/`: Integration and E2E test suites for validating the full system.
- Go 1.25 or later.
- PostgreSQL 12+ (for metadata storage).
- Redis 6+ (for job queue).
- S3-compatible object storage or local filesystem.
- Docker or Podman (for containerized deployment).
- Kubernetes 1.19+ and Helm 3.0+ (for Kubernetes deployment).
```shell
# Build all the components
make build

# Or build individually
make build-apiserver
make build-processor
make build-gc
```

```shell
# Run unit tests
make test

# Run unit tests with coverage
make test-coverage

# Run integration tests
make test-integration

# Run unit tests and integration tests
make test-all

# Run E2E tests (requires a kind cluster)
make test-e2e
```

Configure the components via YAML configuration files (see `cmd/apiserver/config.yaml` and `cmd/batch-processor/config.yaml` for examples).
```shell
# Run API server
make run-apiserver

# Run processor (in another terminal)
make run-processor

# Run gc (in another terminal)
make run-gc

# Or with verbose logging
make run-apiserver-dev
make run-processor-dev
make run-gc-dev
```

Prerequisites:

- Docker or Podman
- kind v0.20+ (Kubernetes in Docker)

Deploy to a local kind cluster for development:

```shell
# Creates cluster, builds images, and deploys with Helm
make dev-deploy
```

For detailed instructions, see the Development Guide.
```shell
# Install API server only (default)
helm install batch-gateway ./charts/batch-gateway

# Install with processor enabled
helm install batch-gateway ./charts/batch-gateway \
  --set processor.enabled=true \
  --set processor.replicaCount=3
```

See the Helm Chart README for full configuration options.
```shell
# Build all images
make image-build

# Or build individually
make image-build-apiserver
make image-build-processor
make image-build-gc
```

Images are published to:

- `ghcr.io/llm-d-incubation/batch-gateway-apiserver`
- `ghcr.io/llm-d-incubation/batch-gateway-processor`
- `ghcr.io/llm-d-incubation/batch-gateway-gc`
```shell
# 1. Upload input file
curl -X POST http://localhost:8000/v1/files \
  -H "Content-Type: multipart/form-data" \
  -F "file=@batch_requests.jsonl" \
  -F "purpose=batch"
# Response: {"id": "file_abc123", ...}

# 2. Create batch job
curl -X POST http://localhost:8000/v1/batches \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file_abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }'
# Response: {"id": "batch_xyz789", "status": "validating", ...}
```

```shell
# Check job status
curl http://localhost:8000/v1/batches/batch_xyz789
# Response includes status: validating, in_progress, finalizing, completed, failed, expired, cancelled
```

```shell
# Get output file ID from batch status
curl http://localhost:8000/v1/batches/batch_xyz789 | jq -r '.output_file_id'

# Download results
curl http://localhost:8000/v1/files/file-output123/content > results.jsonl
```

For complete API documentation, see the OpenAI Batch API reference.
See configuration example in cmd/apiserver/config.yaml.
See configuration example in cmd/batch-processor/config.yaml.
See configuration example in cmd/batch-gc/config.yaml.
The batch gateway components expose Prometheus metrics for monitoring. For a complete list of available metrics, see docs/guides/metrics.md.
API Server:

- Health: `GET /health` (port 8000).
- Readiness: `GET /ready` (port 8000).

Processor:

- Health: `GET /health` (port 9090).
- Readiness: `GET /ready` (port 9090).
Both the API server and processor support Go pprof profiling endpoints on their observability ports. Pprof is controlled by the enable_pprof config option (or config.enablePprof in Helm values) and is disabled by default.
Enable via Helm:
```yaml
apiserver:
  config:
    enablePprof: true
processor:
  config:
    enablePprof: true
```

Usage (with dev-deploy port-forwards):

```shell
# API Server (port 8081)
go tool pprof http://localhost:8081/debug/pprof/profile?seconds=30  # CPU
go tool pprof http://localhost:8081/debug/pprof/heap                # Heap
go tool pprof http://localhost:8081/debug/pprof/allocs              # Allocs
go tool pprof http://localhost:8081/debug/pprof/goroutine           # Goroutine

# Processor (port 9090)
go tool pprof http://localhost:9090/debug/pprof/profile?seconds=30  # CPU
go tool pprof http://localhost:9090/debug/pprof/heap                # Heap
go tool pprof http://localhost:9090/debug/pprof/allocs              # Allocs
go tool pprof http://localhost:9090/debug/pprof/goroutine           # Goroutine
```

All pprof endpoints are served on the observability port (not the API port), so they are not exposed to external traffic.
```shell
# Run all pre-commit checks (formatting, linting, tests, security)
make pre-commit

# Or run individual checks:
make fmt   # Format code only
make lint  # Run linter only (requires golangci-lint)
make vet   # Run static analysis only
make ci    # Run fmt + vet + lint + test
```

Install the development tools:

```shell
make install-tools
```

This installs:

- `golangci-lint` - Linting and static analysis
- `goimports` - Import formatting and organization
- `gosec` - Security vulnerability scanner
- Use `internal/` for all private code (not intended for external import).
- Place shared types in `internal/shared/`.
- Keep component-specific code in dedicated subdirectories (`internal/apiserver/`, `internal/processor/`).
- Write unit tests alongside implementation files (`*_test.go`).
- Place E2E tests in `test/e2e/`.
Contributions are welcome! Please ensure:
- New features include tests and documentation.
- Pre-commit checks pass: `make pre-commit`.
- E2E tests pass: `make test-e2e`.
- Commits are signed off (`git commit -s`) and follow the conventional commit format.
- Code follows the project contributing guidelines.
This project follows security best practices:
- Non-root container execution (UID 65532).
- Read-only root filesystem.
- All Linux capabilities dropped.
- No privilege escalation.
- Seccomp profile enabled.
- TLS support for all network communication.
- OpenShift SCC compatibility.
To report security vulnerabilities, please contact the maintainers privately.
Copyright 2026 The llm-d Authors
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
- llm-d-inference-scheduler - Inference request scheduler.
- gateway-api-inference-extension - Kubernetes Gateway API extensions for inference workloads.
For help and support:
- Open an issue on GitHub.
- Review the design documentation.
- Review the development and usage guides.
