Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Synthetic RDS PostgreSQL Suite

This suite benchmarks RDS PostgreSQL root-cause analysis against bundled telemetry fixtures instead of live AWS infrastructure. Each scenario is a static evidence snapshot served through a FixtureGrafanaBackend, which drives the same agentic pipeline (plan → investigate → diagnose) used in production.

Scenario table

Axis 1 — Efficiency (marker: synthetic)

ID Name Difficulty True root cause Adversarial element Forbidden
000 healthy 1 healthy none resource_exhaustion
001 replication-lag 1 replication_lag none
002 connection-exhaustion 1 connection_exhaustion none
003 storage-full 1 storage_exhaustion none
004 cpu-saturation-bad-query 1 cpu_saturation none
005 failover 1 multi_az_failover_health_check none
006 replication-lag-cpu-redherring 2 replication_lag CPUUtilization elevated (analytics job) category: cpu_saturation
007 connection-pressure-noisy-healthy 2 healthy CPU/connections oscillating near-threshold category: connection_exhaustion
008 storage-full-missing-metric 3 storage_exhaustion FreeStorageSpace absent from fixture
009 dual-fault-connection-cpu 4 connection_exhaustion connections + CPU both failing, causally linked keywords: storage, replication
010 replication-lag-missing-metric 3 replication_lag ReplicaLag metric absent from fixture

Axis 2 — Reasoning (marker: axis2)

Axis 2 scenarios use SelectiveGrafanaBackend. The agent must ask for the right metrics and explicitly rule out alternative hypotheses. Each scenario has ruling_out_keywords that must appear in the agent's output.

ID Name Difficulty True root cause Red herring / adversarial element Must rule out
011 cpu-storage-compositional 4 dual_resource_exhaustion ReplicaLag elevation and connection spike as side effects connection_exhaustion, replication
012 replication-lag-misleading-events 3 replication_lag Three historical infra events in event stream (none are root cause) infrastructure (failover as root cause)
013 storage-recovery-false-alert 3 healthy FreeStorageSpace spike + WriteLatency brief elevation resource_exhaustion (already recovered)
014 checkpoint-storm-cpu-saturation 4 checkpoint_io_storm CPU at 92% — alert fires on CPU but cause is VACUUM checkpoint storm cpu_saturation (CPU is symptom, not root)

Difficulty levels

Level Description
1 Single dominant signal — all evidence consistent, root cause identifiable in one step
2 One confounder present — second evidence source needed to rule it out
3 Absent or indirect evidence — key metric missing, or misleading signals require timeline reasoning
4 Compositional fault — two failure modes active; agent must identify both and correctly characterise causally-linked side effects

MECE basis

Uniqueness is on (primary_signal × rate × corroborating_presence × event_presence), not on primary signal alone.

003 and 008 both map to storage_full but have distinct fingerprints:

  • 003: FreeStorageSpace present and trending to 0 with elevated WriteIOPS
  • 008: FreeStorageSpace absent from the fixture entirely — agent must infer from events + PI write latency

Scoring

Axis 1 (all scenarios — test_suite.py)

Each scenario passes when all of the following are true:

  1. The model returns a non-empty root cause
  2. The predicted ROOT_CAUSE_CATEGORY matches answer.yml
  3. Every required keyword from answer.yml:required_keywords appears in the output
  4. The actual category is not in answer.yml:forbidden_categories (level 2+ scenarios)
  5. No forbidden keyword from answer.yml:forbidden_keywords appears in the output (level 4 scenario)
  6. Every source listed in answer.yml:required_evidence_sources is non-empty in final_state["evidence"] — proves the agent consulted the right evidence, not just keyword-matched the alert title

Additionally, a TrajectoryScore is computed for each scenario with an optimal_trajectory:

  • sequencing_ok: all expected action types appear in the agent's executed action log (set membership; parallel execution order is non-deterministic)
  • calibration_ok: number of investigation loops ≤ max_investigation_loops
  • efficiency_score: mean(sequencing_ok, calibration_ok); 1.0 = full pass

Axis 2 (adversarial scenarios — test_suite_axis2.py, marker: axis2)

Axis 2 runs through SelectiveGrafanaBackend, which:

  • Returns only the metric series matching the agent's metric_name query (case-insensitive substring)
  • Records every metric name the agent requested in queried_metrics (audit log)

On top of all Axis 1 checks, Axis 2 asserts a ReasoningScore:

  • ruling_out_ok: every token in answer.yml:ruling_out_keywords appears in agent output (proves the agent addressed and dismissed alternative hypotheses)
  • queries_ok: every entry in answer.yml:required_queries was requested by the agent via query_timeseriesnot yet testable: the current action registry hardcodes metric_name="pipeline_runs_total" for all query_grafana_metrics calls; required_queries is reserved for when the agent supports per-metric querying
  • reasoning_score: mean(ruling_out_ok, queries_ok); 1.0 = full pass

Gap metric

The gap between Axis 1 and Axis 2 pass rates is the primary health indicator for adversarial robustness. A large gap means the agent can find answers when handed all data but cannot reason when it must choose what to look at. Track it per difficulty level:

python -m tests.synthetic.rds_postgres.run_suite --mock-grafana --axis2

Each scenario folder contains

  • scenario.yml: scenario metadata (engine, difficulty, adversarial_signals, depends_on)
  • alert.json: synthetic alert payload
  • aws_cloudwatch_metrics.json: CloudWatch metric evidence (may omit metrics to simulate collection gaps)
  • aws_rds_events.json: RDS event stream for the incident window
  • aws_performance_insights.json: top SQL and wait-event evidence
  • answer.yml: expected category, required keywords, optional forbidden constraints, required evidence sources

Running

Via the interactive CLI (recommended):

opensre tests synthetic

Run the whole Axis 1 suite directly:

python -m tests.synthetic.rds_postgres.run_suite --mock-grafana

Run Axis 2 adversarial scenarios via pytest:

pytest -m axis2 tests/synthetic/rds_postgres/test_suite_axis2.py -v

Run a single scenario:

python -m tests.synthetic.rds_postgres.run_suite --scenario 006-replication-lag-cpu-redherring --mock-grafana

Print JSON results:

python -m tests.synthetic.rds_postgres.run_suite --mock-grafana --json

CI tier strategy

  • Axis 1, Levels 1–2 (scenarios 000–007): run on every commit (@synthetic)
  • Axis 1, Levels 3–4 (scenarios 008–010): deferred to nightly — require indirect inference
  • Axis 2 (scenarios 011–013, @axis2): run nightly alongside Axis 1 levels 3–4; gap is reported each run

Known gaps

  • Temporal ordering: all scenarios deliver evidence as a static snapshot. Production delivers evidence incrementally (alert fires → query metrics → query events → …). Testing temporal ordering requires architectural changes to the fixture backend and is out of scope.
  • Level 4 coverage: two compositional fault scenarios (009, 011). A fuller curriculum would include 3–4 dual-fault combinations across different failure mode pairs.
  • Slack/markdown renderer for multi-fault: the renderer displays a single root_cause string. Compositional faults may eventually need a root_causes: list field in the schema.
  • Axis 2 required_queries not yet enforced: The agent action registry hardcodes metric_name="pipeline_runs_total" for all query_grafana_metrics calls. SelectiveGrafanaBackend records all queried metric names as an audit log but does not filter results. When the agent is updated to pass dynamic CloudWatch metric names (e.g. metric_name="CPUUtilization"), re-enable filtering in SelectiveGrafanaBackend.query_timeseries and set required_queries in answer.yml.

Dependency: healthy_rca_state

Scenario 007 depends on HEALTHY_SHORT_CIRCUIT=true (the default) and the healthy category being wired into the LLM prompt. If you run with HEALTHY_SHORT_CIRCUIT=false, scenario 007 will fall through to the LLM path, which should still classify as healthy — but the test is most deterministic with the short-circuit enabled.