
[TESTING]: Plugin runtime management — global enable/disable, RateLimiterPlugin dynamic config, multi-instance validation #4230

@gandhipratik203

Description


Summary

The dynamic plugin configuration and per-tool bindings added in #4068 / #4143 are the supported way to manage plugins at runtime. This issue covers two pillars of plugin runtime management:

  1. Global plugin enable/disable — toggle any plugin on/off at runtime via the API, verify it takes effect across all gateway instances
  2. RateLimiterPlugin dynamic configuration — change rate limiter settings at runtime (limits, mode, scope), verify the new config takes effect across all instances

Currently there are no tests that verify either capability works correctly across multiple gateway instances. In a 3-pod deployment, a change made via the API on pod A must be visible to pods B and C — and we don't currently test or guarantee that.

This issue covers auditing the propagation mechanism, writing tests that expose gaps, and fixing any bugs found.

Context

  • The enable_plugins() function in mcpgateway/plugins/framework/__init__.py is a per-process in-memory toggle — it does not propagate across pods
  • The plugin config loader (mcpgateway/plugins/framework/loader/config.py) supports Jinja2 env var resolution but reads from a static YAML file at startup
  • Tool plugin bindings (/v1/tools/plugin_bindings) persist to Postgres — but it's unclear whether the plugin manager reads these per-request or caches them in-memory
  • Rate limiter uses Redis as shared state for rate counting (works cross-pod), but the plugin enable/disable and config state may not be shared
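The propagation gap described above can be shown with a toy model. This is a minimal stdlib sketch, not the real mcpgateway code: the `Pod` class and its methods are hypothetical stand-ins for a gateway process, contrasting a per-process toggle (what `enable_plugins()` does today) with a shared store (what Redis/DB-backed state would give):

```python
# Hypothetical illustration (not mcpgateway code): why a per-process
# in-memory toggle cannot satisfy the 3-pod scenario.

class Pod:
    """Simulates one gateway process with its own in-memory plugin state."""

    def __init__(self, shared_store=None):
        self._local = {}             # per-process memory (today's enable_plugins())
        self._shared = shared_store  # stand-in for Redis/DB shared state

    def enable_plugin(self, name: str) -> None:
        if self._shared is not None:
            self._shared[name] = True   # visible to every pod
        else:
            self._local[name] = True    # visible only to this process

    def is_enabled(self, name: str) -> bool:
        if self._shared is not None:
            return self._shared.get(name, False)
        return self._local.get(name, False)


# In-memory toggles: pod B never sees pod A's change.
pod_a, pod_b = Pod(), Pod()
pod_a.enable_plugin("RateLimiterPlugin")
print(pod_b.is_enabled("RateLimiterPlugin"))  # False — the propagation gap

# Shared state (e.g. Redis): both pods agree.
store = {}
pod_a, pod_b = Pod(store), Pod(store)
pod_a.enable_plugin("RateLimiterPlugin")
print(pod_b.is_enabled("RateLimiterPlugin"))  # True
```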

Related: #4222 (SQL Sanitizer e2e tests), PII Filter e2e tests — both depend on the plugin runtime management infrastructure being reliable across instances.

Approach

Test-driven hardening — write the tests first, let the failures expose the bugs, fix the bugs, and keep the passing tests as the proof.

Deliverables

1. Audit — map the state flow

Trace the plugin runtime management path end-to-end:

  • API call → DB write → plugin manager reload → request pipeline
  • Identify where state is per-process (in-memory) vs shared (DB/Redis)
  • Document the cache TTLs and invalidation mechanisms (if any)
  • Specifically: when a binding is created/updated via /v1/tools/plugin_bindings, how does each pod learn about it?
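One concrete audit artifact could be a small helper that diffs what the DB says against what a pod holds in memory, to spot stale per-process caches. The function and data shapes below are hypothetical; a sketch only:

```python
# Hypothetical audit helper: compare DB-persisted tool->plugin bindings
# against one pod's in-memory cache and report stale entries.

def find_stale_bindings(db_bindings: dict, cached_bindings: dict) -> dict:
    """Return bindings that are present in the DB but missing or
    different in a pod's cache (i.e. the pod never learned about them)."""
    stale = {}
    for tool, plugins in db_bindings.items():
        if cached_bindings.get(tool) != plugins:
            stale[tool] = {"db": plugins, "cache": cached_bindings.get(tool)}
    return stale


db = {"tool_a": ["RateLimiterPlugin"], "tool_b": ["PIIFilter"]}
cache = {"tool_a": ["RateLimiterPlugin"]}  # this pod never saw tool_b's binding
print(find_stale_bindings(db, cache))
# → {'tool_b': {'db': ['PIIFilter'], 'cache': None}}
```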

2. Integration tests (tests/integration/test_plugin_runtime_management.py)

Against a live gateway (not mocked):

Global enable/disable:

  • Enable a plugin via API, verify it takes effect on subsequent requests
  • Disable a plugin via API, verify it stops running
  • Per-tool binding: bind plugin to tool A only, call tool A (plugin runs), call tool B (plugin doesn't)
  • Binding deletion: remove binding, verify plugin stops running

RateLimiterPlugin dynamic configuration:

  • Change rate limit from 30/m to 100/m via API, verify the new limit is enforced
  • Switch mode from permissive to enforce via API, verify enforcement on subsequent requests
  • Change by_user vs by_tenant settings, verify the correct scope is applied
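The rate-limiter assertions can be factored into small helpers. The payload field names (`requests_per_minute`, `mode`, `scope`) are assumptions drawn from the issue's wording (limits, permissive/enforce, by_user/by_tenant), not the plugin's actual schema:

```python
# Helpers for the RateLimiterPlugin dynamic-config tests (assumed schema).

def rate_limit_payload(requests_per_minute: int, mode: str, scope: str) -> dict:
    """Build the (assumed) PATCH body for RateLimiterPlugin's dynamic config."""
    if mode not in {"permissive", "enforce"}:
        raise ValueError(f"unknown mode: {mode}")
    if scope not in {"by_user", "by_tenant"}:
        raise ValueError(f"unknown scope: {scope}")
    return {"config": {"requests_per_minute": requests_per_minute,
                       "mode": mode, "scope": scope}}


def assert_limit_enforced(status_codes: list, limit: int) -> None:
    """Given the HTTP statuses of limit+1 rapid calls, check enforcement:
    the first `limit` calls succeed and the next one is throttled (429)."""
    assert all(code == 200 for code in status_codes[:limit])
    assert status_codes[limit] == 429


# After switching 30/m -> 100/m, 100 calls should pass and the 101st should not:
assert_limit_enforced([200] * 100 + [429], limit=100)
```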

Multi-instance validation:

  • Make API call to one pod, verify behaviour on a different pod (the core test)
  • Pod restart survival: create binding + config change, delete a pod, verify both persist after recreate
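Because propagation may be eventually consistent (cache TTLs, pub/sub lag), the cross-pod assertion should poll with a deadline rather than check once. A small stdlib helper for that, with the pod clients in the usage comment being hypothetical:

```python
# Poll-with-deadline helper for eventually-consistent cross-pod assertions.
import time


def wait_until(predicate, timeout_s: float = 10.0, interval_s: float = 0.5) -> bool:
    """Poll `predicate` until it returns True or the deadline passes;
    returns the final result so callers can assert on it."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return predicate()


# Usage shape (pod_a/pod_b clients are hypothetical):
# pod_a.enable_plugin("RateLimiterPlugin")
# assert wait_until(lambda: pod_b.plugin_is_active("RateLimiterPlugin"))
```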

3. Load test (tests/loadtest/locustfile_plugin_runtime_management.py)

Locust file for multi-instance environments:

  • Admin user class: periodically enables/disables plugins and changes rate limits via the API
  • Normal user class (100+): continuously calls tools and verifies plugins are active/inactive and rate limits match the current config
  • Success criteria: 0% inconsistency across all pods under concurrent load
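A hedged skeleton of the Locust file. The admin endpoint, the `echo` tool, and the payloads are assumptions; the `try/except` stub only keeps the sketch importable where locust isn't installed:

```python
# Sketch of tests/loadtest/locustfile_plugin_runtime_management.py (assumed API).
try:
    from locust import HttpUser, task, between
except ImportError:  # minimal stand-ins so the sketch stays importable
    class HttpUser:
        pass
    def task(fn):
        return fn
    def between(low, high):
        return (low, high)


class AdminUser(HttpUser):
    """Periodically flips plugin state and rate limits via the admin API."""
    weight = 1
    wait_time = between(5, 15)

    @task
    def toggle_rate_limiter(self):
        # Endpoint path and payload are assumptions about the admin API.
        self.client.patch("/v1/plugins/RateLimiterPlugin", json={"enabled": True})


class ToolUser(HttpUser):
    """Hammers tools and flags any response that contradicts the most
    recently published config — inconsistency counts as failure."""
    weight = 100
    wait_time = between(0.1, 0.5)

    @task
    def call_tool(self):
        with self.client.post("/v1/tools/echo/invoke", json={},
                              catch_response=True) as resp:
            # Compare resp against the expected plugin state here and call
            # resp.failure(...) on any cross-pod inconsistency.
            if resp.status_code >= 500:
                resp.failure("server error")
```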

4. Bug fixes

When tests fail, fix the underlying code. Possible fixes:

  • Add DB/Redis read for dynamic bindings on each request (if currently cached in-memory only)
  • Add Redis pub/sub or cache invalidation for plugin state/config changes
  • Add cache TTL configuration for plugin manager state
  • Ensure dynamic config changes propagate to all Gunicorn workers within a pod (not just the one handling the API call)
  • Ensure fail_on_plugin_error behaves correctly with dynamically enabled plugins

Each fix ships with the test that proves it works.

5. E2E validation

Run the full test suite on a multi-instance deployment:

  • tests/integration/ via pytest against the gateway
  • tests/loadtest/ via Locust with multiple workers
  • Document results and any remaining limitations

Priority order

  1. Single-instance global enable/disable (does the API actually change runtime behaviour?)
  2. Single-instance RateLimiterPlugin dynamic config (does changing settings take effect?)
  3. Multi-instance propagation (does pod B see pod A's change?)
  4. Persistence across pod restarts
  5. Concurrency under load
  6. Security (can a non-admin user toggle plugins or change rate limits?)

Environments

  • Colima — single and multi-instance (docker-compose with replicas: 3) for integration test development
  • OCP — optional, for validation on a real cluster if available

Metadata

Labels

enhancement (New feature or request), plugins, testing (unit, e2e, manual, automated, etc), triage (Issues / Features awaiting triage)
