fix: Deep audit — race conditions, stalled pipeline, installer, and doc accuracy

alokemajumder · commit f8a4f7566160 · 2026-03-02T14:01:56.000+05:30
- Replace case lock with async mutex queue to fix race condition with 3+ concurrent callers
- Fix stalled-pipeline detector: broken callback URLs for investigated/planned/approved statuses
- Wrap stalled-pipeline file writes in withCaseLock to prevent data corruption
- Fix MCP session init timeout leak with try/finally
- Fix dispatchToGateway timer and response body leaks
- Add createCase overwrite protection
- Guard false-positive feedback against closed/terminal cases
- Set isShuttingDown on uncaughtException
- Expand log sensitive field denylist (4 → 10 fields)
- Add Slack SAFE_ERROR_PATTERNS for executor and duplicate case errors
- Fix installer: dmPolicy open → allowlist, npm install → npm ci --omit=dev
- Generate AUTOPILOT_SERVICE_TOKEN in installer credentials
- Add missing env vars to env.template (enrichment, stalled pipeline, MCP advanced)
- Fix RUNTIME_API.md env var names and defaults
- Fix SECURITY.md supported versions
- Remove dead metrics and phantom OpenTelemetry section from observability docs
- Fix openclaw.json dmPolicy to match security docs
- Bump version to 2.4.3, update CHANGELOG with all fixes
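
The case-lock change in the first bullet can be illustrated with a minimal sketch. The `index.js` hunks are not shown in this view, so this is not the project's code: the `caseLocks` map, the `withCaseLock(caseId, fn)` signature, and the `demo` helper are assumptions chosen for illustration. The idea is that each caller synchronously appends itself to a per-case promise chain, so there is no gap between "lock released" and "lock re-acquired" that a third caller could slip through.

```javascript
// Hypothetical sketch of a per-key async mutex queue (names are assumptions).
const caseLocks = new Map(); // caseId -> tail promise of that case's queue

async function withCaseLock(caseId, fn) {
  // Whoever is currently last in line (or nobody).
  const previous = caseLocks.get(caseId) ?? Promise.resolve();

  // A promise we resolve ourselves when our critical section is done.
  let release;
  const current = new Promise((resolve) => { release = resolve; });

  // Publish the new tail before the first await, so any later caller
  // queues behind us instead of racing on a has/set/delete check.
  const tail = previous.then(() => current);
  caseLocks.set(caseId, tail);

  await previous; // wait for every earlier holder to finish
  try {
    return await fn();
  } finally {
    release();
    // Drop the map entry only if nobody queued behind us.
    if (caseLocks.get(caseId) === tail) caseLocks.delete(caseId);
  }
}

// Usage: three concurrent callers on the same case run strictly in order.
async function demo() {
  const order = [];
  const task = (name, ms) => withCaseLock('case-1', async () => {
    order.push(`${name}:start`);
    await new Promise((r) => setTimeout(r, ms));
    order.push(`${name}:end`);
  });
  await Promise.all([task('a', 30), task('b', 10), task('c', 1)]);
  return order;
}
```

In this sketch the queue is first-in, first-out, and a rejected `fn` still releases the lock because the release happens in `finally`.
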
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,24 +9,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - **Stalled-pipeline detector**: Automatically detects cases stuck in transient statuses (open, triaged, correlated, etc.) for longer than a configurable threshold and re-dispatches the webhook to give the agent another attempt. Configurable via `STALLED_PIPELINE_ENABLED`, `STALLED_PIPELINE_THRESHOLD_MINUTES` (default 30), and `STALLED_PIPELINE_CHECK_INTERVAL_MS` (default 300000). New metrics: `autopilot_stalled_pipeline_detected_total`, `autopilot_stalled_pipeline_redispatched_total`.
+- **Runtime enforcement for policy time_windows, rate_limits, and idempotency**: These three policy.yaml sections were previously declarative only. The runtime now enforces them:
+  - **Time windows**: `policyCheckTimeWindow()` blocks `createResponsePlan()` and `executePlan()` outside configured UTC day/time windows (respects `outside_window_action: allow|deny`)
+  - **Rate limits**: `policyCheckActionRateLimit()` enforces per-action and global hourly/daily rate limits inside the plan execution action loop. Counters auto-reset on window expiry.
+  - **Idempotency**: `policyCheckIdempotency()` blocks duplicate action+target pairs within the configured `window_minutes` (default 60min)
+  - Denied actions are skipped individually (rate limit / idempotency) or fail the entire plan (time window), with `policy_denies_total` metric incremented using reason labels `time_window_denied`, `action_rate_limited`, `global_rate_limited`, `duplicate_action`
+  - Cleanup intervals evict expired rate limit and dedup state every 5 minutes
+- **LLM deployment choice documentation**: README now presents three clear paths (Cloud APIs, Local Ollama, Hybrid) with the local/air-gapped path clearly stating current limitations
+- 296 tests across 10 files
 
 ### Fixed
+- **Stalled pipeline data corruption**: `checkStalledPipeline()` now acquires the case lock before writing `updated_at`, preventing data loss from concurrent `updateCase` calls
+- **Stalled pipeline broken callback URLs**: Fixed callback URLs for `investigated` (was sending empty actions array, always rejected), `planned` and `approved` (were using literal `{plan_id}` instead of actual plan ID from evidence pack)
+- **Case lock race condition**: Replaced simple `Map.has/set/delete` pattern with proper async mutex queue. The old pattern allowed a third concurrent caller to bypass the lock between delete and resolve.
 - **Ollama "fetch failed" 5-minute timeout** (fixes #12, #13): OpenClaw bundles its own undici copy, and pi-ai's `http-proxy.ts` resets the global dispatcher with a default 300-second `headersTimeout`. Combined with OpenClaw silently disabling streaming for Ollama tool-calling models, any generation >5 min causes "fetch failed" with 0 tokens. The preload script now writes directly to the shared `globalThis[Symbol.for('undici.globalDispatcher.1')]` and re-applies via `setTimeout` to survive pi-ai's async overwrite. See `docs/AIR_GAPPED_DEPLOYMENT.md` for the complete fix.
 - **Ollama `api` mode in air-gapped config** (fixes #11): Changed `openclaw-airgapped.json` from `"api": "openai-completions"` with `/v1` to `"api": "ollama"` with native endpoint. The OpenAI-compatible mode breaks tool calling — models output raw JSON instead of invoking tools.
 - **`reasoning: false` for Ollama models** (fixes #10): Set `"reasoning": false` explicitly for standard Ollama models (Llama, Mistral, Qwen, CodeLlama). Only models with native thinking support (deepseek-r1, qwq) should have `"reasoning": true`.
+- **MCP session init timeout leak**: `clearTimeout` now runs in a `finally` block for the notification step, preventing timer leaks on error paths
+- **`dispatchToGateway` timeout and response leaks**: Timer is now cleared before retries and on error paths. Response body is drained on final 5xx to release connection pool sockets.
+- **`createCase` silent overwrite**: Now checks if `evidence-pack.json` already exists and returns 409 Conflict instead of silently overwriting
+- **False positive feedback on closed cases**: Feedback is now always appended; status change to `false_positive` is only attempted if the case isn't in a terminal state
+- **Slack `SAFE_ERROR_PATTERNS` missing entries**: Added patterns for "Executor must be different" (separation of duties) and "Case already exists"
+- **Log sensitive field denylist too narrow**: Expanded from 4 fields (auth, token, password, secret) to 10 fields (added api_key, apiKey, authorization, credential, bearer, session_token)
+- **`uncaughtException` handler**: Now sets `isShuttingDown = true` to reject new requests during the 1-second drain window
+- **`dmPolicy: "open"` in `openclaw.json`**: Changed to `"allowlist"` to match the documented security posture
+- **`package.json` version**: Bumped from 2.3.0 to 2.4.3 to match the latest release
+- **RUNTIME_API.md wrong Slack env var names**: Changed `SLACK_CHANNEL_ALERTS`/`SLACK_CHANNEL_APPROVALS` to `SLACK_ALERTS_CHANNEL`/`SLACK_APPROVALS_CHANNEL` to match actual code
+- **RUNTIME_API.md `MCP_MAX_RETRIES` default**: Corrected from 3 to 2
+- **OBSERVABILITY_EXPORT.md dead metric**: Removed `autopilot_action_plans_proposed_total` (replaced by `plans_created_total` in v2.4.2)
+- **OBSERVABILITY_EXPORT.md phantom OpenTelemetry section**: Removed section describing OTEL support that was never implemented
+- **SECURITY.md supported versions**: Updated from 2.0.x-2.1.x to 2.2.x-2.4.x
 
 ### Changed
 - **Tested with OpenClaw v2026.3.1**: Verified compatibility with the latest OpenClaw release. The undici timeout preload script is still required — pi-ai@0.55.3 ships identical `http-proxy.ts`.
 
-### Added
-- **Runtime enforcement for policy time_windows, rate_limits, and idempotency**: These three policy.yaml sections were previously declarative only. The runtime now enforces them:
-  - **Time windows**: `policyCheckTimeWindow()` blocks `createResponsePlan()` and `executePlan()` outside configured UTC day/time windows (respects `outside_window_action: allow|deny`)
-  - **Rate limits**: `policyCheckActionRateLimit()` enforces per-action and global hourly/daily rate limits inside the plan execution action loop. Counters auto-reset on window expiry.
-  - **Idempotency**: `policyCheckIdempotency()` blocks duplicate action+target pairs within the configured `window_minutes` (default 60min)
-  - Denied actions are skipped individually (rate limit / idempotency) or fail the entire plan (time window), with `policy_denies_total` metric incremented using reason labels `time_window_denied`, `action_rate_limited`, `global_rate_limited`, `duplicate_action`
-  - Cleanup intervals evict expired rate limit and dedup state every 5 minutes
-- 21 new tests (276 total): Time window, rate limit, idempotency, and reset helper tests
-
 ## [2.4.3] - 2026-02-27
 
 ### Fixed
diff --git a/SECURITY.md b/SECURITY.md
@@ -4,9 +4,10 @@
 
 | Version | Supported          |
 | ------- | ------------------ |
-| 2.1.x   | :white_check_mark: |
-| 2.0.x   | :white_check_mark: |
-| < 2.0   | :x:                |
+| 2.4.x   | :white_check_mark: |
+| 2.3.x   | :white_check_mark: |
+| 2.2.x   | :white_check_mark: |
+| < 2.2   | :x:                |
 
 ## Security Model
 
diff --git a/docs/OBSERVABILITY_EXPORT.md b/docs/OBSERVABILITY_EXPORT.md
@@ -7,7 +7,7 @@ Wazuh Autopilot exports metrics and logs for integration with your existing obse
 - **No stack shipped** - Autopilot doesn't include Prometheus, Grafana, or Loki
 - **Standard formats** - Prometheus metrics, JSON structured logs
 - **Secure defaults** - Metrics bound to localhost only
-- **Optional OTEL** - OpenTelemetry support when configured
+- **Extensible** - Standard Prometheus format works with any monitoring stack
 
 ## Metrics
 
@@ -65,9 +65,6 @@ autopilot_mcp_tool_call_latency_seconds_count{tool="wazuh_get_alert"}
 #### Planning and Approval Metrics
 
 ```prometheus
-# Counter: Response plans proposed (by agents)
-autopilot_action_plans_proposed_total
-
 # Counter: Approval requests sent
 autopilot_approvals_requested_total
 
@@ -383,29 +380,6 @@ StandardOutput=append:/var/log/wazuh-autopilot/autopilot.log
 {job="wazuh-autopilot"} | json | component="policy" | msg=~".*deny.*"
 ```
 
-## OpenTelemetry (Optional)
-
-### Configuration
-
-Set OTEL environment variables to enable:
-
-```bash
-OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
-OTEL_SERVICE_NAME=wazuh-autopilot
-```
-
-### Exported Spans
-
-When OTEL is configured, Autopilot exports:
-
-- Workflow spans (triage, correlation, planning)
-- MCP call spans
-- Approval workflow spans
-
-### Trace Context
-
-Correlation IDs are propagated as trace context, allowing end-to-end tracing from alert to action.
-
 ## Alerting Examples
 
 ### Prometheus Alertmanager
diff --git a/docs/RUNTIME_API.md b/docs/RUNTIME_API.md
@@ -735,13 +735,18 @@ All responses include security headers:
 | `LOG_LEVEL` | info | Log level: debug, info, warn, error |
 | `LOG_FORMAT` | json | Log format: json or text |
 | `MAX_CONCURRENT_EXECUTIONS` | 5 | Max concurrent plan executions |
-| `MCP_MAX_RETRIES` | 3 | Max retries for MCP tool calls |
+| `MCP_MAX_RETRIES` | 2 | Max retries for MCP tool calls |
 | `AUTOPILOT_DATA_DIR` | /var/lib/wazuh-autopilot | Data directory for cases and plans |
 | `AUTOPILOT_CONFIG_DIR` | /etc/wazuh-autopilot | Config directory for policies |
 | `STALLED_PIPELINE_ENABLED` | true | Enable stalled-pipeline detection and re-dispatch |
 | `STALLED_PIPELINE_THRESHOLD_MINUTES` | 30 | Minutes before a case is considered stalled |
 | `STALLED_PIPELINE_CHECK_INTERVAL_MS` | 300000 | Interval between stall checks (5 min) |
 | `SLACK_APP_TOKEN` | (none) | Slack app-level token (xapp-...) for Socket Mode |
 | `SLACK_BOT_TOKEN` | (none) | Slack bot token (xoxb-...) for API calls |
-| `SLACK_CHANNEL_ALERTS` | (none) | Slack channel ID for alert notifications |
-| `SLACK_CHANNEL_APPROVALS` | (none) | Slack channel ID for approval requests |
+| `SLACK_ALERTS_CHANNEL` | (none) | Slack channel ID for alert notifications |
+| `SLACK_APPROVALS_CHANNEL` | (none) | Slack channel ID for approval requests |
+| `SLACK_REPORTS_CHANNEL` | (none) | Slack channel ID for report postings |
+| `MCP_RETRY_BASE_MS` | 1000 | Base delay for MCP retry backoff |
+| `ENRICHMENT_ERROR_CACHE_TTL_MS` | 300000 | Cache TTL for enrichment errors (5 min) |
+| `MCP_BOOTSTRAP_URL` | (none) | Fallback URL for MCP server (used if `MCP_URL` not set) |
+| `AUTOPILOT_MODE` | bootstrap | Runtime mode: `bootstrap` (fail-open) or `production` (fail-closed) |
diff --git a/install/env.template b/install/env.template
@@ -250,3 +250,53 @@ SHUTDOWN_TIMEOUT_MS=30000
 
 # Maximum concurrent agent operations
 # MAX_CONCURRENT_OPERATIONS=5
+
+# =============================================================================
+# IP ENRICHMENT (AbuseIPDB)
+# =============================================================================
+
+# Enable IP enrichment lookups during triage
+ENRICHMENT_ENABLED=false
+
+# AbuseIPDB API key (required if ENRICHMENT_ENABLED=true)
+# Get a free key at https://www.abuseipdb.com/account/api
+ABUSEIPDB_API_KEY=
+
+# Enrichment cache TTL (milliseconds, default 1 hour)
+# ENRICHMENT_CACHE_TTL_MS=3600000
+
+# Enrichment error cache TTL (milliseconds, default 5 minutes)
+# ENRICHMENT_ERROR_CACHE_TTL_MS=300000
+
+# =============================================================================
+# STALLED PIPELINE DETECTION
+# =============================================================================
+
+# Enable automatic detection and re-dispatch of stalled cases
+STALLED_PIPELINE_ENABLED=true
+
+# Minutes before a case is considered stalled (default 30)
+# STALLED_PIPELINE_THRESHOLD_MINUTES=30
+
+# Check interval in milliseconds (default 300000 = 5 minutes)
+# STALLED_PIPELINE_CHECK_INTERVAL_MS=300000
+
+# =============================================================================
+# MCP ADVANCED
+# =============================================================================
+
+# Base delay for MCP retry backoff (milliseconds)
+# MCP_RETRY_BASE_MS=1000
+
+# MCP authentication protocol mode
+# mcp-jsonrpc = standard MCP protocol (default)
+# legacy-rest = backwards compatibility
+# MCP_AUTH_MODE=mcp-jsonrpc
+
+# =============================================================================
+# OPENCLAW GATEWAY
+# =============================================================================
+
+# URL for dispatching webhooks to OpenClaw Gateway
+# Auto-configured by installer — override only if using a custom gateway address
+# OPENCLAW_GATEWAY_URL=http://127.0.0.1:18789
diff --git a/install/install.sh b/install/install.sh
@@ -606,7 +606,7 @@ setup_credentials() {
     echo ""
 
     # Idempotent: reuse existing secrets if present (prevents breaking running services on re-install)
-    local MCP_AUTH_TOKEN OPENCLAW_TOKEN PAIRING_SECRET APPROVAL_SECRET
+    local MCP_AUTH_TOKEN OPENCLAW_TOKEN PAIRING_SECRET APPROVAL_SECRET AUTOPILOT_SERVICE_TOKEN
     local _generated_new=false
 
     if [[ -s "$SECRETS_DIR/mcp_token" ]]; then
@@ -649,6 +649,14 @@ setup_credentials() {
         _generated_new=true
     fi
 
+    if [[ -s "$SECRETS_DIR/service_token" ]]; then
+        AUTOPILOT_SERVICE_TOKEN=$(cat "$SECRETS_DIR/service_token")
+        log_info "Reusing existing service token"
+    else
+        AUTOPILOT_SERVICE_TOKEN=$(generate_secret 32)
+        _generated_new=true
+    fi
+
     # Store secrets in isolated files with restrictive umask
     local _old_umask
     _old_umask=$(umask)
@@ -658,6 +666,7 @@ setup_credentials() {
     echo "$OPENCLAW_WEBHOOK_TOKEN" > "$SECRETS_DIR/openclaw_webhook_token"
     echo "$PAIRING_SECRET" > "$SECRETS_DIR/pairing_code"
     echo "$APPROVAL_SECRET" > "$SECRETS_DIR/approval_secret"
+    echo "$AUTOPILOT_SERVICE_TOKEN" > "$SECRETS_DIR/service_token"
     umask "$_old_umask"
 
     # Verify permissions
@@ -667,9 +676,10 @@ setup_credentials() {
     log_security "OpenClaw token stored: $SECRETS_DIR/openclaw_token (mode 600)"
     log_security "OpenClaw webhook token stored: $SECRETS_DIR/openclaw_webhook_token (mode 600)"
     log_security "Pairing code stored: $SECRETS_DIR/pairing_code (mode 600)"
+    log_security "Service token stored: $SECRETS_DIR/service_token (mode 600)"
 
     # Export for later use
-    export MCP_AUTH_TOKEN OPENCLAW_TOKEN OPENCLAW_WEBHOOK_TOKEN PAIRING_SECRET APPROVAL_SECRET
+    export MCP_AUTH_TOKEN OPENCLAW_TOKEN OPENCLAW_WEBHOOK_TOKEN PAIRING_SECRET APPROVAL_SECRET AUTOPILOT_SERVICE_TOKEN
 
     if [[ "$_generated_new" == "true" ]]; then
         log_success "Credentials generated and isolated"
@@ -949,7 +959,7 @@ deploy_agents() {
       "enabled": __SLACK_ENABLED__,
       "botToken": "__SLACK_BOT_TOKEN__",
       "appToken": "__SLACK_APP_TOKEN__",
-      "dmPolicy": "open",
+      "dmPolicy": "allowlist",
       "groupPolicy": "allowlist"
     }
   },
@@ -1072,8 +1082,8 @@ install_runtime_service() {
 
     cd "$RUNTIME_DST"
     log_info "Installing dependencies..."
-    if ! npm install --production 2>&1; then
-        log_warn "npm install --production failed, retrying without --production flag..."
+    if ! npm ci --omit=dev 2>&1; then
+        log_warn "npm ci --omit=dev failed, retrying with npm install..."
         if ! npm install 2>&1; then
             log_error "npm install failed — runtime service dependencies not installed"
             exit 1
@@ -1314,6 +1324,9 @@ APPROVAL_TOKEN_TTL_MINUTES=60
 AUTOPILOT_PAIRING_CODE=__PAIRING_SECRET__
 AUTOPILOT_PAIRING_ENABLED=true
 
+# INTERNAL SERVICE AUTHENTICATION
+AUTOPILOT_SERVICE_TOKEN=__AUTOPILOT_SERVICE_TOKEN__
+
 # RESPONDER AGENT - TWO-TIER HUMAN APPROVAL
 # SAFETY: Disabled by default. Every action requires human Approve + Execute.
 AUTOPILOT_RESPONDER_ENABLED=false
@@ -1380,6 +1393,7 @@ ENVEOF
     _safe_subst "__SLACK_BOT_TOKEN__" "SLACK_BOT_TOKEN" "$_envfile"
     _safe_subst "__SLACK_ALERTS_CHANNEL__" "SLACK_ALERTS_CHANNEL" "$_envfile"
     _safe_subst "__SLACK_APPROVALS_CHANNEL__" "SLACK_APPROVALS_CHANNEL" "$_envfile"
+    _safe_subst "__AUTOPILOT_SERVICE_TOKEN__" "AUTOPILOT_SERVICE_TOKEN" "$_envfile"
 
     # Set AUTOPILOT_MODE based on install mode (full → production, bootstrap/mcp-only → bootstrap)
     if [[ "$INSTALL_MODE" == "full" ]]; then
diff --git a/openclaw/openclaw.json b/openclaw/openclaw.json
@@ -206,7 +206,7 @@
       "enabled": true,
       "botToken": "${SLACK_BOT_TOKEN}",
       "appToken": "${SLACK_APP_TOKEN}",
-      "dmPolicy": "open",
+      "dmPolicy": "allowlist",
       "groupPolicy": "allowlist"
     }
   },
diff --git a/runtime/autopilot-service/index.js b/runtime/autopilot-service/index.js
diff --git a/runtime/autopilot-service/package.json b/runtime/autopilot-service/package.json
diff --git a/runtime/autopilot-service/slack.js b/runtime/autopilot-service/slack.js
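
The `index.js` hunks are not shown in this view. As a rough illustration of the timer and response-body fixes described in the commit message, the following sketch clears the timeout in a `finally` block and drains the body of a 5xx response. The function name `dispatchWithTimeout`, its options, and the injectable `fetchImpl` parameter are hypothetical; the real `dispatchToGateway` may differ, and this sketch drains on every 5xx rather than only the final one and omits retry backoff.

```javascript
// Hypothetical sketch (not the project's dispatchToGateway): POST with a
// per-attempt timeout, clearing the timer on every path.
// Assumes a global fetch (Node 18+) unless fetchImpl is supplied.
async function dispatchWithTimeout(
  url,
  body,
  { retries = 2, timeoutMs = 5000, fetchImpl = fetch } = {}
) {
  let lastStatus = null;
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetchImpl(url, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(body),
        signal: controller.signal,
      });
      if (res.status < 500) return res;
      lastStatus = res.status;
      // Drain the unread body so the socket can return to the pool.
      await res.arrayBuffer().catch(() => {});
    } catch (err) {
      // Network error or abort: rethrow only when no attempts remain.
      if (attempt === retries) throw err;
    } finally {
      // Runs on success, 5xx, and thrown errors, so no timer outlives its attempt.
      clearTimeout(timer);
    }
  }
  throw new Error(`gateway returned ${lastStatus} after ${retries + 1} attempts`);
}
```

Because the timer is created inside the loop and cleared in `finally`, a retry never starts while the previous attempt's timer is still pending.
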
