tokio-prompt-orchestrator

A production-ready, cost-optimized orchestrator for multi-stage LLM pipelines with comprehensive reliability features, observability, and enterprise-grade capabilities.

What Is This?

An intelligent orchestrator that manages the complete lifecycle of LLM inference requests through a 5-stage pipeline:

Request → RAG → Assemble → Inference → Post-Process → Stream → Response
            ↓       ↓          ↓            ↓            ↓
         Context  Format    AI Model     Filter       Output

Built for production with features that save costs, improve reliability, and provide enterprise-grade observability.

Key Features

High-Impact Features

Request Deduplication - Save 60-80% on inference costs by caching duplicate requests
** Circuit Breaker** - Prevent cascading failures, fast-fail when services are down
** Retry Logic** - Automatic retry with exponential backoff, 99%+ reliability

Real Model Integrations

OpenAI - GPT-4, GPT-3.5-turbo, and all OpenAI models
Anthropic - Claude 3.5 Sonnet, Claude 3 Opus, Haiku
llama.cpp - Local inference with any GGUF model
vLLM - High-throughput GPU inference with PagedAttention

Web API Layer

REST API - Full HTTP API for inference requests
WebSocket - Real-time bidirectional streaming
CORS Support - Cross-origin request handling
Request Tracking - UUID-based request tracking

Advanced Metrics & Observability

Prometheus Integration - Production-grade metrics
Grafana Dashboard - Pre-built visualization with 9 panels
Alert Rules - 5 critical alerts included
Tracing - Comprehensive structured logging

Enhanced Features

Caching - In-memory + Redis support, configurable TTL
Rate Limiting - Per-session limits, token bucket algorithm
Priority Queues - 4-level priority (Critical, High, Normal, Low)
Backpressure - Graceful degradation under load

Production Ready

Bounded Channels - Prevent memory exhaustion
Session Affinity - Hash-based sharding for optimization
Health Checks - Detailed component monitoring
Docker Support - Complete docker-compose stack

ROI & Impact

Cost Savings

Typical Production Scenario (1M requests/month):

Feature	Without	With	Savings
Deduplication	$30,000/mo	$9,000/mo	$21,000/mo
Caching	-	Additional 10-20%	$1,800-3,600/mo
Total	$30,000/mo	$7,200/mo	$22,800/mo

Annual savings: $273,600

Reliability Improvements

Without retries: 95% success rate
With retries: 99.75% success rate
Improvement: 4.75% fewer user-visible errors

Performance

Request deduplication: <1ms overhead, 60-80% cost savings
Circuit breaker: <10μs overhead, prevents cascading failures
Retry logic: Automatic, 4-5% reliability improvement

Quick Start

Installation

git clone <repository-url>
cd tokio-prompt-orchestrator

Basic Demo (No API Keys)

cargo run --bin orchestrator-demo

Uses EchoWorker for testing the pipeline without API keys.

With OpenAI

export OPENAI_API_KEY="sk-..."
cargo run --example openai_worker

With Anthropic Claude

export ANTHROPIC_API_KEY="sk-ant-..."
cargo run --example anthropic_worker

Web API Server (All Features)

cargo run --example web_api_demo --features full

Starts REST API on http://localhost:8080

High-Impact Features Demo

cargo run --example high_impact_demo --features full

See deduplication, circuit breaker, and retry logic in action.

Usage Examples

Basic Usage

use tokio_prompt_orchestrator::{
    spawn_pipeline, EchoWorker, ModelWorker, PromptRequest, SessionId,
};
use std::sync::Arc;

#[tokio::main]
async fn main() {
    // Create worker
    let worker: Arc<dyn ModelWorker> = Arc::new(EchoWorker::new());
    
    // Spawn pipeline
    let handles = spawn_pipeline(worker);
    
    // Send request
    let request = PromptRequest {
        session: SessionId::new("user-123"),
        input: "Hello, world!".to_string(),
        meta: Default::default(),
    };
    
    handles.input_tx.send(request).await.unwrap();
}

Using OpenAI

use tokio_prompt_orchestrator::{spawn_pipeline, OpenAiWorker};

let worker = Arc::new(
    OpenAiWorker::new("gpt-4")
        .with_max_tokens(512)
        .with_temperature(0.7)
);

let handles = spawn_pipeline(worker);

Using Anthropic Claude

use tokio_prompt_orchestrator::{spawn_pipeline, AnthropicWorker};

let worker = Arc::new(
    AnthropicWorker::new("claude-3-5-sonnet-20241022")
        .with_max_tokens(1024)
        .with_temperature(1.0)
);

let handles = spawn_pipeline(worker);

With Deduplication (Save 60-80% Costs!)

use tokio_prompt_orchestrator::enhanced::{Deduplicator, dedup};

let dedup = Deduplicator::new(Duration::from_secs(300)); // 5 min cache

let key = dedup::dedup_key(prompt, &metadata);

match dedup.check_and_register(&key).await {
    DeduplicationResult::Cached(result) => {
        // Return cached result - FREE!
        return Ok(result);
    }
    DeduplicationResult::New(token) => {
        // Process new request
        let result = process_request(prompt).await?;
        dedup.complete(token, result.clone()).await;
        Ok(result)
    }
    DeduplicationResult::InProgress => {
        // Wait for in-progress request
        dedup.wait_for_result(&key).await
    }
}

With Circuit Breaker

use tokio_prompt_orchestrator::enhanced::CircuitBreaker;

let breaker = CircuitBreaker::new(
    5,                          // 5 failures opens circuit
    0.8,                        // 80% success rate closes it
    Duration::from_secs(60),    // 60 second timeout
);

match breaker.call(|| async {
    worker.infer(prompt).await
}).await {
    Ok(result) => Ok(result),
    Err(CircuitBreakerError::Open) => {
        // Service is down, fail fast
        Err("Service temporarily unavailable")
    }
    Err(CircuitBreakerError::Failed(e)) => Err(e),
}

With Retry Logic

use tokio_prompt_orchestrator::enhanced::RetryPolicy;

let policy = RetryPolicy::exponential(3, Duration::from_millis(100));

let result = policy.retry(|| async {
    worker.infer(prompt).await
}).await?;

REST API Usage

Submit request:

curl -X POST http://localhost:8080/api/v1/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is Rust?",
    "session_id": "user-123",
    "metadata": {"priority": "high"}
  }'

Response:

{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing"
}

Get result:

curl http://localhost:8080/api/v1/result/550e8400-e29b-41d4-a716-446655440000

WebSocket Streaming

const ws = new WebSocket('ws://localhost:8080/api/v1/stream');

ws.send(JSON.stringify({
  prompt: "Tell me a story",
  session_id: "user-123"
}));

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.result);
};

Architecture

Pipeline Stages

┌─────────────┐  512  ┌──────────────┐  512  ┌──────────────┐
│  RAG Stage  │──────→│ Assemble     │──────→│  Inference   │
│  (5ms)      │       │ (format)     │       │  (Model)     │
└─────────────┘       └──────────────┘       └──────────────┘
                                                     │ 1024
                                                     ↓
                    ┌──────────────┐       ┌──────────────┐
                    │   Stream     │←──────│    Post      │
                    │   (output)   │  512  │  (filter)    │
                    └──────────────┘       └──────────────┘

Channel sizes: 512 → 512 → 1024 → 512 → 256

Key Design Decisions

Bounded Channels - Prevent memory exhaustion under load
Backpressure - Graceful shedding when queues are full
Object-Safe Workers - Hot-swappable inference backends
Session Affinity - Deterministic sharding for optimization

Monitoring & Observability

Prometheus Metrics

Start metrics server:

cargo run --example metrics_demo --features metrics-server

Access metrics: http://localhost:9090/metrics

Available metrics:

orchestrator_requests_total - Request throughput
orchestrator_stage_duration_seconds - Latency per stage
orchestrator_queue_depth - Current queue depth
orchestrator_requests_shed_total - Backpressure events
orchestrator_errors_total - Error rates

Grafana Dashboard

# Start full monitoring stack
docker-compose up -d

Access Grafana: http://localhost:3000 (admin/admin)

Dashboard includes:

Request rate graphs
Latency percentiles (p50, p95, p99)
Queue depth monitoring
Error rate tracking
9 pre-configured panels

Alerts

5 critical alerts included:

High error rate (>5% for 2 min)
High latency (p95 >5s for 5 min)
High backpressure (>10% shed for 5 min)
Queue depth high (>900 for 2 min)
Service down (>1 min)

Configuration

Feature Flags

[features]
default = []
web-api = ["axum", "tower", "tower-http", "tokio-stream", "futures", "uuid"]
metrics-server = ["axum", "tower", "tower-http"]
caching = ["redis"]
rate-limiting = ["governor", "nonzero_ext"]
full = ["web-api", "metrics-server", "caching", "rate-limiting"]

Build with all features:

cargo build --features full

Environment Variables

# Model API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Local model servers
export LLAMA_CPP_URL="http://localhost:8080"
export VLLM_URL="http://localhost:8000"

# Redis (optional)
export REDIS_URL="redis://localhost:6379"

# Server config
export SERVER_PORT="8080"
export METRICS_PORT="9090"

Testing

# Run all tests
cargo test

# Run specific feature tests
cargo test enhanced::dedup
cargo test enhanced::circuit_breaker
cargo test enhanced::retry

# Run with output
cargo test -- --nocapture

# Run examples
cargo run --example openai_worker
cargo run --example web_api_demo --features full
cargo run --example high_impact_demo --features full

Docker Deployment

Quick Start

docker-compose up -d

Includes:

Orchestrator (ports 8080, 9090)
Prometheus (port 9091)
Grafana (port 3000)
Redis (port 6379)
Alertmanager (port 9093)

Custom Deployment

FROM rust:1.79 as builder
WORKDIR /app
COPY . .
RUN cargo build --release --features full

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/orchestrator /app/
EXPOSE 8080 9090
CMD ["/app/orchestrator"]

Documentation

Comprehensive Guides

WORKERS.md - Model integration guide (400+ lines)
METRICS.md - Observability setup (600+ lines)
ARCHITECTURE.md - Design deep-dive
HIGH_IMPACT.md - Cost optimization guide
QUICKSTART.md - Getting started

API Documentation

cargo doc --open

Navigate to tokio_prompt_orchestrator for full API docs.

Use Cases

Cost-Sensitive Applications

Scenario: Chat application with 1M requests/month

Without optimization: $30,000/month
With deduplication: $9,000/month (70% hit rate)
Savings: $21,000/month

High-Reliability Services

Scenario: Customer-facing API requiring 99.9% uptime

Retry logic: 95% → 99.75% success rate
Circuit breaker: Fast-fail during outages
Result: 4.75% fewer user-visible errors

Multi-Model Routing

Scenario: Route requests to different models based on complexity

let worker = if prompt.len() < 100 {
    Arc::new(OpenAiWorker::new("gpt-3.5-turbo-instruct")) // Cheap
} else {
    Arc::new(OpenAiWorker::new("gpt-4")) // Powerful
};

A/B Testing

Scenario: Compare model performance

let worker = if rand::random::<f32>() < 0.5 {
    Arc::new(OpenAiWorker::new("gpt-4"))
} else {
    Arc::new(AnthropicWorker::new("claude-3-5-sonnet-20241022"))
};

Roadmap

Completed

5-stage pipeline with backpressure
4 production model workers (OpenAI, Anthropic, llama.cpp, vLLM)
Prometheus metrics + Grafana dashboards
REST API + WebSocket streaming
Caching, rate limiting, priority queues
Request deduplication (60-80% cost savings)
Circuit breaker (failure protection)
Retry logic with exponential backoff

Coming Soon

OpenAPI/Swagger documentation
Structured JSON logging
Distributed tracing (Jaeger/Zipkin)
Multi-node deployment (NATS/Kafka)
Declarative pipeline configuration (YAML/TOML)
Advanced health checks

Future

Streaming inference support
Batch processing mode
Request deduplication across nodes
Auto-scaling based on queue depth
Model fine-tuning integration

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new features
Ensure cargo test passes
Run cargo fmt and cargo clippy
Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

Built with:

Tokio - Async runtime
Axum - Web framework
Prometheus - Metrics
Tracing - Structured logging

Project Stats

Lines of Code: ~15,400
Test Coverage: Comprehensive unit + integration tests
Dependencies: Minimal, production-grade
Performance: <1ms overhead for all features
Reliability: 99%+ with default configuration

Star History

If this project helps you, please consider giving it a star!

Built for production. Optimized for cost. Ready for scale.

Made with by the Rust community

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
examples		examples
src		src
ARCHITECTURE.md		ARCHITECTURE.md
BUILD.md		BUILD.md
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
ENHANCEMENT_SUMMARY.md		ENHANCEMENT_SUMMARY.md
HIGH_IMPACT.md		HIGH_IMPACT.md
LICENSE		LICENSE
METRICS.md		METRICS.md
MODEL_WORKERS.md		MODEL_WORKERS.md
PHASE2_COMPLETE.md		PHASE2_COMPLETE.md
PHASE34_COMPLETE.md		PHASE34_COMPLETE.md
PHASE3_COMPLETE.md		PHASE3_COMPLETE.md
PROJECT_OVERVIEW.txt		PROJECT_OVERVIEW.txt
PROJECT_SUMMARY.md		PROJECT_SUMMARY.md
QUICKSTART.md		QUICKSTART.md
README.md		README.md
Screenshot 2025-11-02 161903.png		Screenshot 2025-11-02 161903.png
UPGRADE.md		UPGRADE.md
WEB_API.md		WEB_API.md
WHATS_NEW.md		WHATS_NEW.md
WORKERS.md		WORKERS.md
docker-compose.yml		docker-compose.yml
grafana-dashboard.json		grafana-dashboard.json
grafana-datasource.yml		grafana-datasource.yml
pipeline.example.toml		pipeline.example.toml
prometheus.yml		prometheus.yml

License

Mattbusel/tokio-prompt-orchestrator

Folders and files

Latest commit

History

Repository files navigation

tokio-prompt-orchestrator

What Is This?

Key Features

High-Impact Features

Real Model Integrations

Web API Layer

Advanced Metrics & Observability

Enhanced Features

Production Ready

ROI & Impact

Cost Savings

Reliability Improvements

Performance

Quick Start

Installation

Basic Demo (No API Keys)

With OpenAI

With Anthropic Claude

Web API Server (All Features)

High-Impact Features Demo

Usage Examples

Basic Usage

Using OpenAI

Using Anthropic Claude

With Deduplication (Save 60-80% Costs!)

With Circuit Breaker

With Retry Logic

REST API Usage

WebSocket Streaming

Architecture

Pipeline Stages

Key Design Decisions

Monitoring & Observability

Prometheus Metrics

Grafana Dashboard

Alerts

Configuration

Feature Flags

Environment Variables

Testing

Docker Deployment

Quick Start

Custom Deployment

Documentation

Comprehensive Guides

API Documentation

Use Cases

Cost-Sensitive Applications

High-Reliability Services

Multi-Model Routing

A/B Testing

Roadmap

Completed

Coming Soon

Future

Contributing

License

Acknowledgments

Project Stats

Star History

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages