RFC-0017: Dynamic Webhook Middleware for ToolHive

Note: This was originally THV-2300 in the toolhive repository.

Status: Draft
Author(s): Juan Antonio Osorio (@JAORMX)
Created: 2025-10-23
Last Updated: 2025-10-23
Target Repository: toolhive
Related Issues: toolhive#2300

Summary

This proposal introduces a dynamic webhook middleware system for ToolHive, inspired by Kubernetes' admission webhook pattern. The system enables external services to participate in the MCP request pipeline through two webhook types: validating webhooks (accept/deny requests) and mutating webhooks (transform requests). This provides a powerful extension mechanism that allows organizations to plug arbitrary systems into ToolHive's middleware chain without modifying ToolHive code.

Problem Statement

ToolHive's current middleware architecture provides built-in middleware components (authentication, parsing, authorization, audit, etc.), but adding new functionality requires code changes to ToolHive itself. Organizations often need to:

Integrate with external policy engines or decision systems
Enrich requests with data from external sources (HR, CMDB, project databases)
Log to enterprise SIEM systems with custom formats
Apply business-specific validation rules beyond Cedar policies
Transform requests based on organizational policies
Integrate with ticketing or approval workflows
Enforce rate limits using external state stores

Today, addressing these use cases means:

Forking ToolHive and adding custom middleware
Maintaining custom builds
Complex deployment pipelines
Difficulty upgrading to new ToolHive versions
Duplicating organizational logic across systems

Goals

Enable external HTTP services to participate in MCP request processing
Support both validation (accept/deny) and mutation (transform) use cases
Support multiple webhook instances (multiple validating, multiple mutating)
Maintain ToolHive's security and observability guarantees
Follow HTTP best practices for webhook communication
Provide clear failure mode semantics
Integrate seamlessly with existing middleware chain
Enable stateless, horizontally scalable architecture

Non-Goals

Real-time streaming or bidirectional communication (webhooks are request/response)
Plugin systems (WASM, shared libraries, etc.)
Message queue integration (async processing patterns)
Built-in webhook server implementations (organizations provide their own)
Webhook response caching (out of scope for the initial implementation; possible future enhancement)
Circuit breaker patterns (out of scope for the initial implementation; possible future enhancement)

Proposed Solution

Introduce two new middleware types that call external HTTP services during request processing:

Validating Webhook Middleware: Sends request metadata to an external service for an accept/deny decision
Mutating Webhook Middleware: Sends the full request to an external service for transformation

Both webhook types:

Support multiple instances (multiple webhooks of each type can be configured)
Operate as standard middleware in ToolHive's existing chain
Can be configured alongside existing middleware like auth, authz, and audit
Execute in configuration order when multiple webhooks are present

High-Level Design

sequenceDiagram
    participant Client
    participant Auth as Authentication
    participant Parser as MCP Parser
    participant ToolFilter as Tool Filter
    participant Mutating1 as Mutating<br/>Webhook 1
    participant Mutating2 as Mutating<br/>Webhook 2
    participant Validating1 as Validating<br/>Webhook 1
    participant Validating2 as Validating<br/>Webhook 2
    participant Authz as Authorization
    participant Audit as Audit
    participant MCP as MCP Server

    Client->>Auth: MCP Request + JWT
    Auth->>Parser: Request + Claims
    Parser->>ToolFilter: Parsed Request
    ToolFilter->>Mutating1: Filtered Request

    Mutating1->>Mutating1: Transform (e.g., enrich from HR)
    Mutating1->>Mutating2: Mutated Request
    Mutating2->>Mutating2: Transform (e.g., enrich from CMDB)

    Mutating2->>Validating1: Fully Mutated Request
    Validating1->>Validating1: Validate (e.g., check approval)
    Validating1->>Validating2: Continue
    Validating2->>Validating2: Validate (e.g., rate limit)

    Validating2->>Authz: All Checks Passed
    Authz->>Audit: Authorized
    Audit->>MCP: Process Request
    MCP-->>Client: Response

Detailed Design

HTTP Protocol Design

Design Decisions:

Use POST for both webhook types (standard HTTP method for data processing)
Use 200 OK with a JSON body for decisions, so that policy denials are never confused with operational errors
Use 5xx status codes for webhook operational errors (trigger failure policies)

Validating Webhook Request (POST to webhook URL):

{
  "version": "v0.1.0",
  "uid": "unique-request-id-for-idempotency",
  "timestamp": "2025-10-23T10:30:00Z",
  "principal": {
    "sub": "user123",
    "email": "user@example.com",
    "name": "John Doe",
    "groups": ["engineering", "admins"],
    "claims": {
      "department": "platform",
      "role": "sre"
    }
  },
  "mcp_request": {
    "mcp_version": "2024-11-05",
    "method": "tools/call",
    "resource_id": "database_query",
    "arguments": {
      "query": "SELECT * FROM users",
      "database": "production"
    }
  },
  "context": {
    "server_name": "my-vmcp-instance",
    "backend_server": "postgres-mcp",
    "namespace": "production",
    "source_ip": "192.0.2.1",
    "transport": "sse"
  }
}

Context Fields:

server_name: The ToolHive/vMCP instance name handling the request

backend_server: (Optional) The actual MCP server being proxied, when using vMCP

namespace: Kubernetes namespace (if applicable)

source_ip: Client IP address

transport: Connection transport type (sse, stdio, etc.)
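
The request envelope above can be assembled from an already-parsed JSON-RPC message. The following is a minimal sketch, not ToolHive's implementation: the function name `build_validating_payload` is hypothetical, and the mapping of `params.name` to `resource_id` is an assumption inferred from comparing the validating and mutating examples in this document.

```python
import uuid
from datetime import datetime, timezone


def build_validating_payload(jsonrpc_msg, principal, context,
                             mcp_version="2024-11-05"):
    """Flatten a parsed JSON-RPC message into a validating-webhook envelope.

    Assumption: resource_id is taken from params.name, as the examples suggest.
    """
    params = jsonrpc_msg.get("params", {})
    return {
        "version": "v0.1.0",
        # A fresh UID per request lets the webhook deduplicate retries.
        "uid": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "principal": principal,
        "mcp_request": {
            "mcp_version": mcp_version,
            "method": jsonrpc_msg["method"],
            "resource_id": params.get("name"),
            "arguments": params.get("arguments", {}),
        },
        "context": context,
    }
```

Only metadata is forwarded here; the JSON-RPC `id` and `jsonrpc` fields are left out, matching the validating example rather than the mutating one.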

Validating Webhook Response (200 OK with JSON body):

{
  "version": "v0.1.0",
  "uid": "unique-request-id-for-idempotency",
  "allowed": true
}

Validating Webhook Response (Denied):

{
  "version": "v0.1.0",
  "uid": "unique-request-id-for-idempotency",
  "allowed": false,
  "code": 403,
  "message": "Production writes require approval",
  "reason": "RequiresApproval",
  "details": {
    "ticket_url": "https://tickets.example.com/PROD-1234",
    "instructions": "Please request approval from security-team"
  }
}
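
To illustrate the webhook side of this exchange, here is a minimal sketch of a decision function that a validating webhook server might wrap in any HTTP framework. The function name `decide` and the rule itself (non-admins may not touch the `production` database) are hypothetical; only the response fields come from the examples above.

```python
def decide(review: dict) -> dict:
    """Return a validating-webhook response for one request envelope."""
    request = review["mcp_request"]
    arguments = request.get("arguments", {})
    groups = review["principal"].get("groups", [])

    # Echo version and uid so the caller can correlate the response.
    response = {
        "version": review["version"],
        "uid": review["uid"],
        "allowed": True,
    }

    # Hypothetical policy: production tool calls require membership in "admins".
    targets_production = arguments.get("database") == "production"
    if (request["method"] == "tools/call"
            and targets_production
            and "admins" not in groups):
        response.update(
            allowed=False,
            code=403,
            message="Production access requires approval",
            reason="RequiresApproval",
        )
    return response
```

The server would return this dictionary as the JSON body of a 200 OK response in both the allowed and the denied case, reserving 5xx for its own operational failures.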

Mutating Webhook Request (POST to webhook URL):

{
  "version": "v0.1.0",
  "uid": "unique-request-id-for-idempotency",
  "timestamp": "2025-10-23T10:30:00Z",
  "principal": {
    "sub": "user123",
    "email": "user@example.com",
    "name": "John Doe",
    "groups": ["engineering"]
  },
  "mcp_request": {
    "mcp_version": "2024-11-05",
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "database_query",
      "arguments": {
        "query": "SELECT * FROM users",
        "database": "production"
      }
    }
  },
  "context": {
    "server_name": "my-vmcp-instance",
    "backend_server": "postgres-mcp",
    "namespace": "production",
    "source_ip": "192.0.2.1",
    "transport": "sse"
  }
}

Mutating Webhook Response (JSONPatch):

{
  "version": "v0.1.0",
  "uid": "unique-request-id-for-idempotency",
  "allowed": true,
  "patch_type": "json_patch",
  "patch": [
    {
      "op": "add",
      "path": "/mcp_request/params/arguments/audit_user",
      "value": "user@example.com"
    },
    {
      "op": "add",
      "path": "/mcp_request/params/arguments/department",
      "value": "engineering"
    }
  ]
}

Note: Mutation patches are scoped to the mcp_request container. This prevents accidental or malicious modification of principal, context, or other immutable fields.

Failure Modes

Each webhook is configured with a failure policy:

failure_policy: "fail" (Fail-Closed): Deny request on webhook error
- Recommended for security-critical validating webhooks
- Ensures policies are always enforced
- Risk: availability impact if webhook is down

failure_policy: "ignore" (Fail-Open): Allow request on webhook error
- Recommended for non-critical webhooks and mutating webhooks
- Maintains availability when webhook is unavailable
- Risk: policy bypass during outages

Scenario	Validating (fail)	Validating (ignore)	Mutating (fail)	Mutating (ignore)
Network error	Deny (403)	Allow	Deny (500)	Use original
Timeout (408)	Deny (403)	Allow	Deny (500)	Use original
Webhook 5xx	Deny (403)	Allow	Deny (500)	Use original
Invalid JSON response	Deny (403)	Allow	Deny (500)	Use original
200 {allowed: false}	Deny (403)	N/A	N/A	N/A
422	N/A	N/A	Deny (422)	Deny (422)
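
The operational-error rows and the 422 row of the table can be expressed as a small lookup. This is a sketch for illustration: the scenario labels and the `resolve` function are hypothetical names, and the explicit `allowed: false` row is deliberately left out because it is a policy decision rather than a failure.

```python
OPERATIONAL_ERRORS = {"network_error", "timeout", "webhook_5xx", "invalid_json"}


def resolve(webhook_type: str, failure_policy: str, scenario: str):
    """Map a failed webhook call to (action, http_status) per the table above."""
    if webhook_type not in ("validating", "mutating"):
        raise ValueError(f"unknown webhook type: {webhook_type}")
    if failure_policy not in ("fail", "ignore"):
        raise ValueError(f"unknown failure policy: {failure_policy}")

    # A 422 from a mutating webhook denies the request under either policy.
    if webhook_type == "mutating" and scenario == "422":
        return ("deny", 422)
    if scenario not in OPERATIONAL_ERRORS:
        raise ValueError(f"scenario not covered by this sketch: {scenario}")

    if failure_policy == "ignore":
        # Fail-open: validating webhooks allow, mutating ones keep the original.
        action = "allow" if webhook_type == "validating" else "use_original"
        return (action, None)
    # Fail-closed: validating webhooks deny with 403, mutating ones with 500.
    return ("deny", 403 if webhook_type == "validating" else 500)
```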

Middleware Ordering

Webhooks are inserted at specific points in the existing middleware chain:

Authentication Middleware - Validates JWT tokens, extracts claims
Token Exchange Middleware (optional) - Exchanges tokens
MCP Parser Middleware - Parses JSON-RPC requests
Tool Filter Middleware (optional) - Filters available tools
Mutating Webhook Middleware (if configured, multiple in order) - Transforms requests
Validating Webhook Middleware (if configured, multiple in order) - Validates requests
Telemetry Middleware (optional) - OpenTelemetry instrumentation
Authorization Middleware (optional) - Cedar policy evaluation
Audit Middleware (optional) - Request logging
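
The relative ordering of the two webhook stages can be sketched with plain callables standing in for HTTP calls. The function `run_webhooks` is hypothetical, and stopping at the first denial is an assumption; the proposal states only that webhooks execute in configuration order.

```python
def run_webhooks(request: dict, mutating: list, validating: list):
    """Run mutating webhooks in order, then validating webhooks in order.

    Each mutating entry is a callable returning the transformed request.
    Each validating entry is a callable returning {"allowed": bool, ...}.
    """
    # Stage 1: every mutating webhook sees the output of the previous one.
    for mutate in mutating:
        request = mutate(request)

    # Stage 2: validating webhooks see the fully mutated request.
    for validate in validating:
        decision = validate(request)
        if not decision["allowed"]:
            # Assumption: the first denial short-circuits the remaining checks.
            return request, decision
    return request, {"allowed": True}
```

Running mutation before validation means that policy checks evaluate the request that would actually reach the MCP server, including any enrichment.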

Configuration Changes

RunConfig Configuration:

# Validating webhooks (multiple supported)
validating_webhooks:
  - name: external-policy
    url: https://policy.company.com/validate
    failure_policy: fail  # or "ignore"
    timeout: 5s
    signing_secret_ref:
      name: webhook-signing-secret
      key: hmac-key
    ca_bundle: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----

  - name: rate-limiter
    url: https://ratelimit.company.com/check
    failure_policy: ignore
    timeout: 2s

# Mutating webhooks (multiple supported)
mutating_webhooks:
  - name: hr-enrichment
    url: https://hr-api.company.com/enrich
    failure_policy: ignore
    timeout: 3s

  - name: cmdb-enrichment
    url: https://cmdb-api.company.com/enrich
    failure_policy: ignore
    timeout: 3s
    client_cert: /path/to/cert.pem
    client_key: /path/to/key.pem

CLI Configuration:

# Single webhook
thv run postgres-mcp \
  --webhook-config external-policy-webhook.yaml

# Multiple webhooks
thv run postgres-mcp \
  --webhook-config external-policy-webhook.yaml \
  --webhook-config hr-enrichment-webhook.yaml \
  --webhook-config rate-limiter-webhook.yaml

Kubernetes Configuration (MCPWebhookConfig CRD):

apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPWebhookConfig
metadata:
  name: company-policy-webhook
  namespace: mcp-servers
spec:
  type: validating  # or "mutating"
  url: https://policy.company.com/validate
  failure_policy: fail  # or "ignore"
  timeout: 5s
  signing_secret_ref:
    secret_name: webhook-signing-secret
    key: hmac-key
  client_cert_ref:
    secret_name: webhook-mtls-cert
    cert_key: tls.crt
    key_key: tls.key
  ca_bundle: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----

Security Considerations

Threat Model

Webhook requests contain sensitive information:

User identity (JWT claims: email, name, groups)
MCP request parameters (may contain sensitive data, secrets, queries)
System metadata (server names, namespaces, source IPs)

Threats:

Information disclosure to unauthorized webhook endpoints
Denial of service via slow or unavailable webhooks
Trust boundary violations if webhook servers are compromised

Authentication and Authorization

Webhook Server Authentication:

TLS/HTTPS: Required for all webhook endpoints
mTLS: Optional client certificates for mutual authentication
Bearer Token: ToolHive can send a bearer token in the Authorization header

Payload Signing (Recommended):

ToolHive can sign webhook payloads using HMAC-SHA256
Signature sent in X-ToolHive-Signature header
Webhooks should verify signatures to ensure payload integrity
Shared secret configured per-webhook via secret reference

Example headers:

X-ToolHive-Signature: sha256=abc123...
X-ToolHive-Timestamp: 1698057000

Note: While mTLS authenticates the connection, payload signing provides defense-in-depth by proving the payload originated from ToolHive and wasn't tampered with. This is especially important if webhook endpoints are reachable by other internal services.
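
A webhook server could verify these headers roughly as follows. This is a sketch under assumptions the proposal does not fix: the signed message is taken to be the timestamp, a literal `.`, and the raw request body, and requests older than five minutes are rejected to limit replay. The names `sign` and `verify` are hypothetical.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Compute the X-ToolHive-Signature value (assumed format: ts + '.' + body)."""
    mac = hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256)
    return "sha256=" + mac.hexdigest()


def verify(secret: bytes, headers: dict, body: bytes,
           now: float = None, max_skew_seconds: int = 300) -> bool:
    """Check signature and freshness of an incoming webhook request."""
    timestamp = headers.get("X-ToolHive-Timestamp", "")
    signature = headers.get("X-ToolHive-Signature", "")
    if not timestamp.isdigit():
        return False

    # Reject stale or future-dated requests (assumed replay window).
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > max_skew_seconds:
        return False

    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Binding the timestamp into the signed message prevents an attacker from replaying an old body with a fresh timestamp header.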

ToolHive Authorization:

Webhooks run after authentication middleware (principal is validated)
Webhooks access validated JWT claims for policy decisions
Webhook configuration requires appropriate privileges

Data Security

Mitigations:

TLS encryption required for all webhook traffic
Certificate validation for webhook server identity
Audit webhook access and responses
Limit webhook endpoints to trusted systems via network policies
Document security requirements for webhook server operators

Input Validation

Webhook responses validated against expected schema
HTTP response size limits (1MB max)
Configurable timeouts per webhook (default: 10s, max: 30s)
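
These limits can be sketched as two small helpers. Several details are assumptions rather than part of the proposal: 1MB is interpreted as 1024 * 1024 bytes, an over-limit timeout is clamped instead of rejected, and the response `uid` is required to match the request. The helper names are hypothetical.

```python
import json

MAX_RESPONSE_BYTES = 1024 * 1024  # assumed interpretation of "1MB max"
DEFAULT_TIMEOUT_SECONDS = 10.0
MAX_TIMEOUT_SECONDS = 30.0


def effective_timeout(configured: float = None) -> float:
    """Return the timeout to use, applying the default and the upper bound."""
    if configured is None:
        return DEFAULT_TIMEOUT_SECONDS
    if configured <= 0:
        raise ValueError("timeout must be positive")
    return min(configured, MAX_TIMEOUT_SECONDS)


def parse_response(raw: bytes, expected_uid: str) -> dict:
    """Validate size and minimal schema of a webhook response body."""
    if len(raw) > MAX_RESPONSE_BYTES:
        raise ValueError("response exceeds size limit")
    try:
        body = json.loads(raw)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        raise ValueError("response is not valid JSON") from exc
    if not isinstance(body, dict):
        raise ValueError("response must be a JSON object")
    if not isinstance(body.get("allowed"), bool):
        raise ValueError("response must contain a boolean 'allowed' field")
    if body.get("uid") != expected_uid:
        raise ValueError("response uid does not match request uid")
    return body
```

Any `ValueError` raised here corresponds to the "Invalid JSON response" row of the failure-mode table, so the configured failure policy decides what happens next.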

Secrets Management

Client certificates stored securely (file paths or Kubernetes secrets)
CA bundles for custom certificate authorities
Bearer tokens via environment variables or secret references

Audit and Logging

Log webhook invocations to ToolHive audit log:

{
  "type": "webhook_invocation",
  "logged_at": "2025-10-23T10:30:00Z",
  "outcome": "denied",
  "component": "toolhive-webhook",
  "webhook": {
    "name": "external-policy",
    "type": "validating",
    "url": "https://policy.company.com/validate",
    "duration_ms": 150,
    "status_code": 200
  },
  "request": {
    "uid": "unique-request-id",
    "principal": "user@example.com",
    "method": "tools/call",
    "resource_id": "database_query"
  },
  "response": {
    "allowed": false,
    "reason": "RequiresApproval"
  }
}

Mitigations

Connection pooling and reuse
Failure policies to handle unavailability
Defense in depth: use multiple layers (webhooks + Cedar authz + audit)
Monitor webhook behavior for anomalies

Alternatives Considered

Alternative 1: WASM Plugin System

Pros: No network latency, sandboxed execution
Cons: Complex runtime, limited language support, harder to debug
Why not chosen: Higher complexity, webhooks are more universally accessible

Alternative 2: gRPC-based Webhooks

Pros: Efficient binary protocol, strong typing
Cons: Requires protobuf tooling, less accessible
Why not chosen: HTTP/JSON is more universally accessible and debuggable

Alternative 3: Message Queue Integration

Pros: Async processing, better handling of slow consumers
Cons: Eventual consistency, complex infrastructure
Why not chosen: Synchronous validation required for request pipeline

Compatibility

Backward Compatibility

Webhooks are opt-in; existing deployments unaffected
No changes to existing middleware configuration
Gradual adoption possible

Forward Compatibility

Versioned protocol (version: v0.1.0) enables protocol evolution
UID field enables idempotency and future features
Extensible request/response structures

Implementation Plan

Phase 1: Core Implementation (MVP)

Implement webhook middleware interfaces (pkg/webhooks/)
Validating webhook support with fail-closed/fail-open
Mutating webhook support with JSONPatch
Multiple webhook instances support
Optional HMAC payload signing (X-ToolHive-Signature header)
CLI flags for webhook configuration
RunConfig webhook fields
Basic metrics and audit logging
Documentation and examples

Phase 2: Kubernetes Integration

Implement MCPWebhookConfig CRD
Add webhook_config_refs to MCPServer CRD
Operator controller updates
E2E tests for Kubernetes deployments

Phase 3: Advanced Features

Request field filtering (security)
Full request mutation support
Circuit breaker pattern
Enhanced metrics and dashboards
Webhook SDK/library for webhook authors

Phase 4: Ecosystem and Examples

Example webhook implementations (Go, Python, Node.js)
Integration guides (OPA, custom policy engines, SIEMs)
Performance benchmarking
Best practices documentation

Testing Strategy

Unit tests: Webhook middleware, HTTP client, response parsing
Integration tests: End-to-end webhook flows, failure scenarios
E2E tests: Full middleware chain with webhooks
Performance tests: Latency impact, connection pooling
Security tests: TLS validation, timeout handling

Documentation

User Documentation:
- Webhook concepts guide
- Configuration guide (CLI, RunConfig, Kubernetes)
- Writing webhook servers
- Integration guides (OPA, SIEMs)
- Troubleshooting

Developer Documentation:
- Webhook API specification
- Architecture documentation
- Example implementations

Open Questions

Should we support webhook response caching in the initial implementation?
What should be the default timeout values?
Should conditional webhooks (filter by request type) be in MVP or Phase 3?

References

RFC Lifecycle

Review History

Date	Reviewer	Decision	Notes
2025-10-23	-	Draft	Ported from toolhive PR #2300

Implementation Tracking

Repository	PR	Status
toolhive	-	Pending
