Proposed
2026-03-25
APME has 100+ rules spread across four built-in validators (Native, OPA, Ansible, Gitleaks) and an extensible set of third-party plugin containers (ADR-042). Today, rules are opaque — nobody outside the engine knows what rules exist until violations appear in scan output. There is no catalog, no ability to enable/disable rules, and no mechanism for severity overrides.
Two proposed features — rule severity ratings (REQ-005 / PR #88) and rule enable/disable with acknowledgment (REQ-007 / PR #90) — both require the same underlying infrastructure: a rule catalog that the Gateway knows about, and a mechanism to deliver overrides to the engine at scan time. Without this architecture, both specs jump straight to UI concerns without answering where the catalog lives, how overrides reach the engine, or what happens in multi-pod deployments.
The rule set is not static or uniform:
- Built-in rules are baked into the engine image. Different engine versions have different rules (e.g., v2.1 adds L073).
- Plugin rules are dynamic. ADR-042 (Third-Party Plugin Services) plugins register EXT-prefixed rules via their `Describe` RPC. A pod with the `secteam` plugin has rules that a pod without it does not.
- Multi-pod deployments (ADR-012, ADR-034) mean multiple Primaries, potentially with different engine versions or plugin sets. A management UI that shows rules from one pod may not reflect the reality of another.
Once a catalog exists, overrides (severity changes, enable/disable) must reach the engine at scan time. Two directions were considered:
- Engine pulls from Gateway — inverts the dependency. Today the Gateway depends on the engine, not the other way around. This would create a circular dependency.
- Gateway pushes with scan request — the Gateway already initiates scans. Including overrides in the request is stateless and keeps the dependency direction clean.
In a multi-pod deployment, which pod's rule set is canonical? If Pod A has EXT-secteam and Pod B doesn't, the catalog differs. Someone must be authoritative, or every pod is an island with no consistency guarantee.
We will establish the Gateway as the authoritative rule catalog and the Primary as the registration source, with overrides delivered at scan time via ScanRequest.
When a Primary starts, it discovers its validators and plugins (per ADR-005 and ADR-042), collects the full rule set from each, and registers the catalog with the Gateway.
Each rule registration includes:
| Field | Type | Description |
|---|---|---|
| `rule_id` | string | Rule identifier (e.g., L026, EXT-secteam-001) |
| `default_severity` | enum | Critical / High / Medium / Low / Info |
| `category` | string | lint / modernize / risk / policy / secrets / ext |
| `source` | string | Validator or plugin name (e.g., native, opa, secteam) |
| `description` | string | Human-readable rule description |
| `scope` | string | Rule scope (per ADR-026) |
| `enabled` | bool | Default enabled state |
The registration mechanism reuses the existing engine → Gateway push channel (ADR-020 / ADR-034). A new RegisterRules RPC on the Gateway's Reporting service (or a dedicated management RPC) accepts the full rule set from a Primary.
Built-in rules and plugin rules use the same registration path. No build-time injection, no separate mechanism. The Primary is the single source that reports everything it can execute.
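As a rough illustration of the registration record described above, the following Python sketch models one catalog entry and the payload a Primary might assemble. The names `Severity`, `RuleRegistration`, and `build_registration` are hypothetical and not part of this ADR, and the `scope` value used in any example call is a placeholder; the actual wire format would be defined in protobuf.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class Severity(Enum):
    """Severity levels listed in the registration table."""
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"


@dataclass(frozen=True)
class RuleRegistration:
    """One catalog entry, mirroring the registration fields (hypothetical type)."""
    rule_id: str
    default_severity: Severity
    category: str
    source: str
    description: str
    scope: str
    enabled: bool = True


def build_registration(rules):
    """Return the full rule set as plain dicts, sorted by rule_id.

    Built-in and plugin rules go through the same path; nothing here
    distinguishes them beyond the `source` field.
    """
    payload = []
    for rule in sorted(rules, key=lambda r: r.rule_id):
        entry = asdict(rule)
        entry["default_severity"] = rule.default_severity.value
        payload.append(entry)
    return payload
```

Sorting by `rule_id` is only a convenience for deterministic output; nothing in the design depends on ordering.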
The Gateway treats each registration as the complete current state of the registering Primary's rule set. On re-registration (Primary restart, engine upgrade, plugin added/removed):
- New rules (in registration but not in DB) → added to catalog
- Removed rules (in DB but not in registration) → removed from catalog; orphaned overrides flagged
- Unchanged rules → no-op
This makes rule lifecycle automatic. Engine upgrade adds L073? Primary restarts, re-registers, Gateway catalog updated. Plugin removed? Primary restarts without those EXT- rules, re-registers, Gateway drops them. No manual intervention.
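The reconciliation rules above can be sketched as a set difference between the stored catalog and the incoming registration. This is a minimal sketch under stated assumptions: the catalog and overrides are modeled as in-memory dicts keyed by rule ID rather than database tables, and the function name `reconcile` is hypothetical.

```python
def reconcile(catalog, registration, overrides):
    """Treat `registration` as the complete current rule set.

    catalog:      dict rule_id -> metadata currently stored by the Gateway
    registration: dict rule_id -> metadata sent by the Primary
    overrides:    dict rule_id -> override stored by the Gateway

    Returns (new_catalog, added, removed, orphaned_overrides).
    """
    # In the registration but not stored yet: add to the catalog.
    added = sorted(set(registration) - set(catalog))
    # Stored but no longer registered: drop from the catalog.
    removed = sorted(set(catalog) - set(registration))
    # Overrides that point at removed rules are flagged, not silently deleted.
    orphaned = sorted(rule_id for rule_id in removed if rule_id in overrides)
    # The registration is the complete state, so it becomes the new catalog.
    new_catalog = dict(registration)
    return new_catalog, added, removed, orphaned
```

For example, if the stored catalog holds L026 and EXT-secteam-001 and a restarted Primary registers L026 and L073, the result adds L073, removes EXT-secteam-001, and flags any override that was attached to EXT-secteam-001.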
Single-pod (default): the one Primary is the authority. No flag needed.
Multi-pod: one pod is designated as the rule authority via explicit configuration (e.g., APME_RULE_AUTHORITY=true env var). Exactly one Primary per deployment should have this flag set. That Primary's registration defines the canonical catalog. Other Primaries do not register rules — they are listeners that receive scan requests and execute.
- The Gateway only accepts rule registrations from a Primary that identifies itself as the authority. Registrations from non-authority Primaries are rejected (no-op, logged). There is no implicit "first to register wins" behavior.
- If the authority Primary goes down, other Primaries keep scanning. The catalog is in the Gateway's DB. The authority is a registration-time concept, not a runtime dependency.
- If the authority Primary comes back (or upgrades), it re-registers. The Gateway reconciles the catalog.
- For identical replicas (same image, same plugins), any Primary can be chosen as the authority since they all have the same rule set, but the choice is explicit via configuration.
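A minimal sketch of the authority check described in this list, assuming the Primary reads the `APME_RULE_AUTHORITY` variable mentioned above and sends the result as a flag with its registration. The function names, the `is_authority` flag, and the logger name are hypothetical; how a Primary actually identifies itself to the Gateway is not specified here.

```python
import logging

logger = logging.getLogger("gateway.rules")  # hypothetical logger name


def is_rule_authority(env):
    """Return True only when the authority flag is explicitly set to 'true'."""
    return env.get("APME_RULE_AUTHORITY", "").strip().lower() == "true"


def accept_registration(primary_id, is_authority, rules, catalog):
    """Apply a registration only if the sender declares itself the authority.

    Returns True when the catalog was replaced, False when the
    registration was rejected (logged, otherwise a no-op).
    """
    if not is_authority:
        logger.info(
            "Ignoring rule registration from non-authority Primary %s", primary_id
        )
        return False
    # The authority's registration defines the canonical catalog.
    catalog.clear()
    catalog.update(rules)
    return True
```

On the Primary side, `is_rule_authority(os.environ)` would yield the flag. There is deliberately no fallback such as "first to register wins": a missing or differently spelled value means the pod is not the authority.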
The Gateway sends the full resolved rule configuration — not just deltas — with every scan request. This keeps the engine stateless: it executes exactly what it's told, with no need to remember what it registered or cache previous overrides.
A new RuleConfig message carries the complete per-rule state, and ScanOptions includes the full set:
```protobuf
message RuleConfig {
  string rule_id = 1;
  Severity severity = 2;  // resolved severity (override > default)
  bool enabled = 3;       // false = skip this rule entirely
  bool enforced = 4;      // true = ignore inline # apme:ignore
}

message ScanOptions {
  bool include_scandata = 1;
  string ansible_core_version = 2;
  repeated string collection_specs = 3;
  string session_id = 4;
  repeated RuleConfig rule_configs = 5;  // NEW: full resolved rule set
}
```

At ~100 bytes per rule in protobuf, a 200-rule catalog is ~20 KB, negligible compared to the file payloads already in ScanRequest.
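To illustrate what "full resolved rule configuration" could mean on the Gateway side, here is a minimal Python sketch that merges stored overrides onto catalog defaults (override wins over default). The function name and the dict shapes are assumptions for illustration, not the Gateway's actual data model.

```python
def resolve_rule_configs(catalog, overrides):
    """Build the complete per-rule configuration sent with every scan.

    catalog:   dict rule_id -> {"default_severity": str, "enabled": bool}
    overrides: dict rule_id -> partial dict with any of
               "severity", "enabled", "enforced"
    """
    resolved = []
    for rule_id in sorted(catalog):
        entry = catalog[rule_id]
        override = overrides.get(rule_id, {})
        resolved.append({
            "rule_id": rule_id,
            # An override takes precedence; otherwise fall back to the default.
            "severity": override.get("severity", entry["default_severity"]),
            "enabled": override.get("enabled", entry["enabled"]),
            # Rules are not enforced unless an admin says so.
            "enforced": override.get("enforced", False),
        })
    return resolved
```

Every catalog rule appears in the output even when it has no override, which is what lets the engine stay stateless.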
The Primary applies the rule configuration before fanning out to validators:
- Rules with `enabled=false` are excluded from the validation fan-out
- Severity values are applied to violations before returning results
- Rules with `enforced=true` ignore inline `# apme:ignore` annotations, so the violation always counts regardless of code-level suppression. This is the compliance lever: an admin can mandate that certain rules (e.g., SEC, policy) cannot be suppressed by developers at the code level
- If the request includes a `rule_id` the Primary doesn't have → hard fail (see §5)
The CLI can also pass rule configs (from a local .apme/rules.yml or flags), enabling the same mechanism outside the Gateway.
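The way a Primary might apply these rule configs can be sketched in two steps: choosing which rules enter the fan-out, then adjusting the violations that come back. This is an illustrative sketch only; the function names, the dict shapes, and the `suppressed` flag (standing in for "an inline `# apme:ignore` matched this violation") are assumptions, and the rule IDs in any example call are placeholders.

```python
def rules_to_run(known_rule_ids, rule_configs):
    """Rule IDs to include in the validation fan-out (disabled rules dropped)."""
    disabled = {c["rule_id"] for c in rule_configs if not c["enabled"]}
    return [rule_id for rule_id in known_rule_ids if rule_id not in disabled]


def finalize_violations(violations, rule_configs):
    """Apply resolved severities and honor or ignore inline suppression.

    violations:   list of dicts with "rule_id", "severity", "suppressed"
    rule_configs: list of dicts with "rule_id", "severity", "enabled", "enforced"
    """
    configs = {c["rule_id"]: c for c in rule_configs}
    results = []
    for violation in violations:
        config = configs.get(violation["rule_id"])
        enforced = bool(config and config["enforced"])
        if violation.get("suppressed", False) and not enforced:
            continue  # inline ignore is honored for non-enforced rules
        # Without a config, the violation keeps its default severity.
        severity = config["severity"] if config else violation["severity"]
        results.append({**violation, "severity": severity})
    return results
```

With an empty `rule_configs` list both functions leave existing behavior unchanged: every known rule runs, suppressed violations are dropped, and severities stay at their defaults.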
If the Gateway sends a scan request that references a rule the Primary cannot execute (e.g., plugin not deployed, engine version skew), the Primary fails the scan with a descriptive error — not a silent skip, not a warning.
This is the consistency enforcement mechanism:
- Rolling upgrade with new Gateway catalog but old engine → scans fail → forces completion of the upgrade
- Multi-pod deployment where one pod is missing a plugin → scans fail → operational signal to deploy the plugin or deregister the rules
- No silent degradation — you always know what you're scanning for
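The hard-fail check can be sketched as a validation step that runs before any validator is invoked. The exception class and function name below are hypothetical; in a gRPC servicer the error would presumably be translated into the non-OK status described in the implementation notes rather than raised to the caller directly.

```python
class UnknownRuleError(Exception):
    """Raised when a scan request references rules this Primary cannot execute."""

    def __init__(self, unknown_rule_ids):
        self.unknown_rule_ids = sorted(unknown_rule_ids)
        super().__init__(
            "scan references rules this Primary cannot execute: "
            + ", ".join(self.unknown_rule_ids)
        )


def validate_rule_configs(known_rule_ids, rule_configs):
    """Fail the scan if any requested rule is unknown; never skip silently."""
    requested = {c["rule_id"] for c in rule_configs}
    unknown = requested - set(known_rule_ids)
    if unknown:
        raise UnknownRuleError(unknown)
```

Collecting every unknown ID before failing, instead of stopping at the first one, gives the operator the complete list of missing rules in a single error.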
When a Primary registers plugin rules (EXT-*) with the Gateway, those rules become part of the canonical catalog. The Gateway includes them in scan requests to all Primaries. Any Primary without the corresponding plugin cannot execute those rules and will fail the scan.
Deploying a plugin to one pod means deploying it to all pods. Removing a plugin means the authority Primary re-registers without those rules, and the Gateway drops them from the catalog.
Description: Extract the rule catalog from engine code at build time and embed it in the Gateway image. Engine and Gateway ship together with a shared manifest.
Pros:
- Gateway knows the catalog immediately at startup, no registration delay
- No runtime discovery needed for built-in rules
Cons:
- Requires a build-time extraction step and coupling between engine and Gateway builds
- Two mechanisms needed: build-time for built-in rules, runtime registration for plugins
- Rolling upgrades become more complex — Gateway manifest must match engine version exactly
Why not chosen: Runtime registration handles both built-in and plugin rules uniformly. The startup delay (Primary registers before scans can run) is acceptable since the Gateway already can't scan until an engine is up.
Description: Each Primary registers with its own Gateway independently. No authority model. Each pod manages its own catalog and overrides.
Pros:
- Simplest implementation — no coordination
- Each pod is fully self-contained
Cons:
- No policy consistency across pods. "Disable L026 everywhere" requires manual action on each pod.
- Different pods may silently enforce different rule sets
- No central management UI — each Gateway shows its own view
Why not chosen: Enterprise deployments need policy consistency. The authority model provides this with minimal coordination overhead.
Description: Rule overrides stored in an external system (etcd, ConfigMap, shared DB). All pods read from the same store.
Pros:
- Guaranteed consistency across pods
- Decoupled from Gateway lifecycle
Cons:
- Contradicts ADR-005 (no service discovery / external infrastructure)
- New operational dependency
- Over-engineered for current deployment model
Why not chosen: The Gateway DB is already the persistence layer. Adding external infrastructure contradicts the project's operational simplicity principles.
Description: The engine queries the Gateway for current rule configuration at startup or per-scan.
Pros:
- Engine always has latest overrides
- No proto change on `ScanRequest`
Cons:
- Inverts the dependency direction — engine would depend on Gateway
- Creates a circular dependency (Gateway → engine for scans, engine → Gateway for config)
- Engine can't scan if Gateway is down (for the config fetch)
Why not chosen: The current architecture has a clean dependency direction: Gateway depends on engine, not the other way. Overrides in ScanRequest preserve this.
- Single mechanism for both built-in and plugin rule registration — no special cases
- Self-healing catalog — re-registration on Primary restart handles additions, removals, and upgrades automatically
- Clean dependency direction: overrides flow Gateway → engine via `ScanRequest`, no circular dependency
- Resilient: authority Primary going down does not affect scanning; catalog is persisted in Gateway DB
- Consistency enforcement — hard fail on rule mismatch prevents silent degradation
- Enables REQ-005 and REQ-007 — severity management and rule enable/disable become Gateway UI + CRUD once this infrastructure exists
- Startup ordering dependency — Gateway cannot show the rule management UI or send overrides until the authority Primary has registered. Mitigated: the Gateway already can't scan until an engine is up.
- Single authority limitation — in early multi-pod deployments, accidental dual-authority (two Primaries both marked as authority with different rule sets) would cause catalog thrashing. Mitigated: operational discipline and eventual conflict detection.
- Proto change: adding `RuleConfig` to `ScanOptions` and a registration RPC requires proto regeneration and coordinated deployment.
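- Inline acknowledgment (`# apme:ignore`) parsing is unaffected: it remains scan-time annotation parsing in the engine, independent of the catalog. The only interaction is that rules sent with `enforced=true` count violations regardless of the annotation.
- Existing scan behavior is unchanged when no overrides are present (all rules enabled at default severity).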
- Plugin `Describe` RPC (ADR-042) provides the rule metadata that Primaries forward during registration.
- Define `RegisterRulesRequest` / `RegisterRulesResponse` messages in `reporting.proto` (or a new `management.proto`)
- Each validator exposes its rule metadata (ID, default severity, description, scope) via an internal interface
- Primary aggregates across validators and plugins, calls `RegisterRules` on Gateway at startup
- Gateway persists catalog in a `rules` table, reconciles on re-registration
- Add `RuleConfig` message and `repeated RuleConfig rule_configs` to `ScanOptions` in `primary.proto`
- Primary applies overrides: filters disabled rules, attaches severity overrides to violations
- Gateway sends overrides with each `ScanRequest` from its stored override config
- CLI reads overrides from `.apme/rules.yml` and passes them in `ScanOptions`
- Primary validates incoming rule configs against its known rule set
- If a config references a rule the Primary doesn't have → Primary aborts the `Scan` RPC with `FAILED_PRECONDITION` status, including the unknown rule IDs in the status detail
- Gateway interprets this non-OK gRPC status as a deployment/configuration issue and surfaces it in the UI
- CRUD endpoints for rule overrides (`GET/PUT /api/v1/rules/{id}/config`)
- Rule catalog browsing UI with severity and enable/disable toggles
- Audit trail for override changes
- Severity threshold configuration for CI gating
- ADR-005: No service discovery — still correct; registration uses existing push channel
- ADR-008: Rule ID conventions (L/M/R/P/SEC) — extended by ADR-042 with EXT- prefix
- ADR-012: Scale pods, not services — multi-pod authority model
- ADR-020: Reporting service and event delivery — reused for rule registration push
- ADR-026: Rule scope metadata — rule scope included in catalog registration
- ADR-042: Third-party plugin services (plugin `Describe` RPC provides registration data)
- ADR-034: Multi-pod health registration (similar registration pattern for health heartbeats)
- PR #88: REQ-005 Rule Rating & Severity (blocked on this ADR)
- PR #90: REQ-007 Rule Management & Issue Acknowledgment (blocked on this ADR)
- `proto/apme/v1/primary.proto`: `ScanOptions`, `ScanRequest`
- `proto/apme/v1/reporting.proto`: Reporting service (registration endpoint)
| Date | Author | Change |
|---|---|---|
| 2026-03-25 | Brad (cidrblock) | Initial proposal |