Commit f8a4f75

fix: Deep audit — race conditions, stalled pipeline, installer, and doc accuracy
- Replace case lock with async mutex queue to fix race condition with 3+ concurrent callers
- Fix stalled-pipeline detector: broken callback URLs for investigated/planned/approved statuses
- Wrap stalled-pipeline file writes in withCaseLock to prevent data corruption
- Fix MCP session init timeout leak with try/finally
- Fix dispatchToGateway timer and response body leaks
- Add createCase overwrite protection
- Guard false-positive feedback against closed/terminal cases
- Set isShuttingDown on uncaughtException
- Expand log sensitive field denylist (4 → 10 fields)
- Add Slack SAFE_ERROR_PATTERNS for executor and duplicate case errors
- Fix installer: dmPolicy open → allowlist, npm install → npm ci --omit=dev
- Generate AUTOPILOT_SERVICE_TOKEN in installer credentials
- Add missing env vars to env.template (enrichment, stalled pipeline, MCP advanced)
- Fix RUNTIME_API.md env var names and defaults
- Fix SECURITY.md supported versions
- Remove dead metrics and phantom OpenTelemetry section from observability docs
- Fix openclaw.json dmPolicy to match security docs
- Bump version to 2.4.3, update CHANGELOG with all fixes
1 parent 4e7e79c commit f8a4f75
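
The commit message says the case lock now uses an async mutex queue, but the runtime source is not part of this diff. The sketch below is one common way to build such a queue: a per-key promise chain in which each caller enqueues itself synchronously before awaiting. The `tails` map and the `withCaseLock` signature are assumptions for illustration, not the project's actual implementation.

```typescript
// Hypothetical sketch of a per-case async mutex queue (not the project's code).
// Each key maps to the tail of a promise chain; new callers append to the tail.
const tails = new Map<string, Promise<void>>();

async function withCaseLock<T>(caseId: string, fn: () => Promise<T>): Promise<T> {
  const previous = tails.get(caseId) ?? Promise.resolve();

  // The gate stays pending until this caller has finished its critical section.
  let release: () => void = () => {};
  const gate = new Promise<void>((resolve) => {
    release = resolve;
  });

  // Enqueue synchronously, before any await, so a third caller arriving
  // right now already sees this caller in the chain.
  const tail = previous.then(() => gate);
  tails.set(caseId, tail);

  await previous; // wait for every earlier caller on this case
  try {
    return await fn();
  } finally {
    release(); // let the next queued caller proceed
    if (tails.get(caseId) === tail) {
      tails.delete(caseId); // nobody queued behind us, so drop the entry
    }
  }
}
```

Because the map entry is replaced in the same synchronous step that reads it, there is no window between a delete and a resolve in which a later caller could find the key unlocked.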

10 files changed: +205 -88 lines changed

CHANGELOG.md

Lines changed: 25 additions & 9 deletions
@@ -9,24 +9,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added
 - **Stalled-pipeline detector**: Automatically detects cases stuck in transient statuses (open, triaged, correlated, etc.) for longer than a configurable threshold and re-dispatches the webhook to give the agent another attempt. Configurable via `STALLED_PIPELINE_ENABLED`, `STALLED_PIPELINE_THRESHOLD_MINUTES` (default 30), and `STALLED_PIPELINE_CHECK_INTERVAL_MS` (default 300000). New metrics: `autopilot_stalled_pipeline_detected_total`, `autopilot_stalled_pipeline_redispatched_total`.
+- **Runtime enforcement for policy time_windows, rate_limits, and idempotency**: These three policy.yaml sections were previously declarative only. The runtime now enforces them:
+- **Time windows**: `policyCheckTimeWindow()` blocks `createResponsePlan()` and `executePlan()` outside configured UTC day/time windows (respects `outside_window_action: allow|deny`)
+- **Rate limits**: `policyCheckActionRateLimit()` enforces per-action and global hourly/daily rate limits inside the plan execution action loop. Counters auto-reset on window expiry.
+- **Idempotency**: `policyCheckIdempotency()` blocks duplicate action+target pairs within the configured `window_minutes` (default 60min)
+- Denied actions are skipped individually (rate limit / idempotency) or fail the entire plan (time window), with `policy_denies_total` metric incremented using reason labels `time_window_denied`, `action_rate_limited`, `global_rate_limited`, `duplicate_action`
+- Cleanup intervals evict expired rate limit and dedup state every 5 minutes
+- **LLM deployment choice documentation**: README now presents three clear paths (Cloud APIs, Local Ollama, Hybrid) with the local/air-gapped path clearly stating current limitations
+- 296 tests across 10 files

 ### Fixed
+- **Stalled pipeline data corruption**: `checkStalledPipeline()` now acquires the case lock before writing `updated_at`, preventing data loss from concurrent `updateCase` calls
+- **Stalled pipeline broken callback URLs**: Fixed callback URLs for `investigated` (was sending empty actions array, always rejected), `planned` and `approved` (were using literal `{plan_id}` instead of actual plan ID from evidence pack)
+- **Case lock race condition**: Replaced simple `Map.has/set/delete` pattern with proper async mutex queue. The old pattern allowed a third concurrent caller to bypass the lock between delete and resolve.
 - **Ollama "fetch failed" 5-minute timeout** (fixes #12, #13): OpenClaw bundles its own undici copy, and pi-ai's `http-proxy.ts` resets the global dispatcher with a default 300-second `headersTimeout`. Combined with OpenClaw silently disabling streaming for Ollama tool-calling models, any generation >5 min causes "fetch failed" with 0 tokens. The preload script now writes directly to the shared `globalThis[Symbol.for('undici.globalDispatcher.1')]` and re-applies via `setTimeout` to survive pi-ai's async overwrite. See `docs/AIR_GAPPED_DEPLOYMENT.md` for the complete fix.
 - **Ollama `api` mode in air-gapped config** (fixes #11): Changed `openclaw-airgapped.json` from `"api": "openai-completions"` with `/v1` to `"api": "ollama"` with native endpoint. The OpenAI-compatible mode breaks tool calling — models output raw JSON instead of invoking tools.
 - **`reasoning: false` for Ollama models** (fixes #10): Set `"reasoning": false` explicitly for standard Ollama models (Llama, Mistral, Qwen, CodeLlama). Only models with native thinking support (deepseek-r1, qwq) should have `"reasoning": true`.
+- **MCP session init timeout leak**: `clearTimeout` now runs in a `finally` block for the notification step, preventing timer leaks on error paths
+- **`dispatchToGateway` timeout and response leaks**: Timer is now cleared before retries and on error paths. Response body is drained on final 5xx to release connection pool sockets.
+- **`createCase` silent overwrite**: Now checks if `evidence-pack.json` already exists and returns 409 Conflict instead of silently overwriting
+- **False positive feedback on closed cases**: Feedback is now always appended; status change to `false_positive` is only attempted if the case isn't in a terminal state
+- **Slack `SAFE_ERROR_PATTERNS` missing entries**: Added patterns for "Executor must be different" (separation of duties) and "Case already exists"
+- **Log sensitive field denylist too narrow**: Expanded from 4 fields (auth, token, password, secret) to 10 fields (added api_key, apiKey, authorization, credential, bearer, session_token)
+- **`uncaughtException` handler**: Now sets `isShuttingDown = true` to reject new requests during the 1-second drain window
+- **`dmPolicy: "open"` in `openclaw.json`**: Changed to `"allowlist"` to match the documented security posture
+- **`package.json` version**: Bumped from 2.3.0 to 2.4.3 to match the latest release
+- **RUNTIME_API.md wrong Slack env var names**: Changed `SLACK_CHANNEL_ALERTS`/`SLACK_CHANNEL_APPROVALS` to `SLACK_ALERTS_CHANNEL`/`SLACK_APPROVALS_CHANNEL` to match actual code
+- **RUNTIME_API.md `MCP_MAX_RETRIES` default**: Corrected from 3 to 2
+- **OBSERVABILITY_EXPORT.md dead metric**: Removed `autopilot_action_plans_proposed_total` (replaced by `plans_created_total` in v2.4.2)
+- **OBSERVABILITY_EXPORT.md phantom OpenTelemetry section**: Removed section describing OTEL support that was never implemented
+- **SECURITY.md supported versions**: Updated from 2.0.x-2.1.x to 2.2.x-2.4.x

 ### Changed
 - **Tested with OpenClaw v2026.3.1**: Verified compatibility with the latest OpenClaw release. The undici timeout preload script is still required — pi-ai@0.55.3 ships identical `http-proxy.ts`.

-### Added
-- **Runtime enforcement for policy time_windows, rate_limits, and idempotency**: These three policy.yaml sections were previously declarative only. The runtime now enforces them:
-- **Time windows**: `policyCheckTimeWindow()` blocks `createResponsePlan()` and `executePlan()` outside configured UTC day/time windows (respects `outside_window_action: allow|deny`)
-- **Rate limits**: `policyCheckActionRateLimit()` enforces per-action and global hourly/daily rate limits inside the plan execution action loop. Counters auto-reset on window expiry.
-- **Idempotency**: `policyCheckIdempotency()` blocks duplicate action+target pairs within the configured `window_minutes` (default 60min)
-- Denied actions are skipped individually (rate limit / idempotency) or fail the entire plan (time window), with `policy_denies_total` metric incremented using reason labels `time_window_denied`, `action_rate_limited`, `global_rate_limited`, `duplicate_action`
-- Cleanup intervals evict expired rate limit and dedup state every 5 minutes
-- 21 new tests (276 total): Time window, rate limit, idempotency, and reset helper tests
-
 ## [2.4.3] - 2026-02-27

 ### Fixed

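The changelog above lists the ten field names in the expanded log denylist but does not show how they are applied. A minimal sketch of one plausible approach follows, assuming case-insensitive substring matching on object keys and a `[REDACTED]` placeholder. Both choices, and the `redact` helper itself, are hypothetical and are not taken from the runtime.

```typescript
// Field names as listed in the changelog entry; the matching rule is an assumption.
const SENSITIVE_FIELDS = [
  "auth", "token", "password", "secret",
  "api_key", "apiKey", "authorization", "credential", "bearer", "session_token",
];

// Hypothetical helper: returns a deep copy with sensitive values masked.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      const lower = key.toLowerCase();
      // Assumed rule: a key is sensitive if it contains any denylisted name.
      const sensitive = SENSITIVE_FIELDS.some((f) => lower.includes(f.toLowerCase()));
      out[key] = sensitive ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Substring matching errs on the side of over-redaction (a key such as `author` would also be masked because it contains `auth`); exact-name matching would be the stricter alternative.
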
SECURITY.md

Lines changed: 4 additions & 3 deletions
@@ -4,9 +4,10 @@

 | Version | Supported |
 | ------- | ------------------ |
-| 2.1.x | :white_check_mark: |
-| 2.0.x | :white_check_mark: |
-| < 2.0 | :x: |
+| 2.4.x | :white_check_mark: |
+| 2.3.x | :white_check_mark: |
+| 2.2.x | :white_check_mark: |
+| < 2.2 | :x: |

 ## Security Model

docs/OBSERVABILITY_EXPORT.md

Lines changed: 1 addition & 27 deletions
@@ -7,7 +7,7 @@ Wazuh Autopilot exports metrics and logs for integration with your existing obse
 - **No stack shipped** - Autopilot doesn't include Prometheus, Grafana, or Loki
 - **Standard formats** - Prometheus metrics, JSON structured logs
 - **Secure defaults** - Metrics bound to localhost only
-- **Optional OTEL** - OpenTelemetry support when configured
+- **Extensible** - Standard Prometheus format works with any monitoring stack

 ## Metrics

@@ -65,9 +65,6 @@ autopilot_mcp_tool_call_latency_seconds_count{tool="wazuh_get_alert"}
 #### Planning and Approval Metrics

 ```prometheus
-# Counter: Response plans proposed (by agents)
-autopilot_action_plans_proposed_total
-
 # Counter: Approval requests sent
 autopilot_approvals_requested_total

@@ -383,29 +380,6 @@ StandardOutput=append:/var/log/wazuh-autopilot/autopilot.log
 {job="wazuh-autopilot"} | json | component="policy" | msg=~".*deny.*"
 ```

-## OpenTelemetry (Optional)
-
-### Configuration
-
-Set OTEL environment variables to enable:
-
-```bash
-OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
-OTEL_SERVICE_NAME=wazuh-autopilot
-```
-
-### Exported Spans
-
-When OTEL is configured, Autopilot exports:
-
-- Workflow spans (triage, correlation, planning)
-- MCP call spans
-- Approval workflow spans
-
-### Trace Context
-
-Correlation IDs are propagated as trace context, allowing end-to-end tracing from alert to action.
-
 ## Alerting Examples

 ### Prometheus Alertmanager

docs/RUNTIME_API.md

Lines changed: 8 additions & 3 deletions
@@ -735,13 +735,18 @@ All responses include security headers:
 | `LOG_LEVEL` | info | Log level: debug, info, warn, error |
 | `LOG_FORMAT` | json | Log format: json or text |
 | `MAX_CONCURRENT_EXECUTIONS` | 5 | Max concurrent plan executions |
-| `MCP_MAX_RETRIES` | 3 | Max retries for MCP tool calls |
+| `MCP_MAX_RETRIES` | 2 | Max retries for MCP tool calls |
 | `AUTOPILOT_DATA_DIR` | /var/lib/wazuh-autopilot | Data directory for cases and plans |
 | `AUTOPILOT_CONFIG_DIR` | /etc/wazuh-autopilot | Config directory for policies |
 | `STALLED_PIPELINE_ENABLED` | true | Enable stalled-pipeline detection and re-dispatch |
 | `STALLED_PIPELINE_THRESHOLD_MINUTES` | 30 | Minutes before a case is considered stalled |
 | `STALLED_PIPELINE_CHECK_INTERVAL_MS` | 300000 | Interval between stall checks (5 min) |
 | `SLACK_APP_TOKEN` | (none) | Slack app-level token (xapp-...) for Socket Mode |
 | `SLACK_BOT_TOKEN` | (none) | Slack bot token (xoxb-...) for API calls |
-| `SLACK_CHANNEL_ALERTS` | (none) | Slack channel ID for alert notifications |
-| `SLACK_CHANNEL_APPROVALS` | (none) | Slack channel ID for approval requests |
+| `SLACK_ALERTS_CHANNEL` | (none) | Slack channel ID for alert notifications |
+| `SLACK_APPROVALS_CHANNEL` | (none) | Slack channel ID for approval requests |
+| `SLACK_REPORTS_CHANNEL` | (none) | Slack channel ID for report postings |
+| `MCP_RETRY_BASE_MS` | 1000 | Base delay for MCP retry backoff |
+| `ENRICHMENT_ERROR_CACHE_TTL_MS` | 300000 | Cache TTL for enrichment errors (5 min) |
+| `MCP_BOOTSTRAP_URL` | (none) | Fallback URL for MCP server (used if `MCP_URL` not set) |
+| `AUTOPILOT_MODE` | bootstrap | Runtime mode: `bootstrap` (fail-open) or `production` (fail-closed) |

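The table documents `MCP_MAX_RETRIES` (default 2) and `MCP_RETRY_BASE_MS` (default 1000) but not how the two combine. The sketch below shows one conventional reading: exponential backoff that doubles the base delay on each retry. The doubling rule and the `readInt` and `retryDelays` helpers are assumptions for illustration; the runtime's actual schedule may differ.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: parse an integer env var, falling back to the documented default.
function readInt(env: Env, name: string, fallback: number): number {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Assumed schedule: base, 2*base, 4*base, ... with one delay per allowed retry.
function retryDelays(env: Env): number[] {
  const maxRetries = readInt(env, "MCP_MAX_RETRIES", 2);
  const baseMs = readInt(env, "MCP_RETRY_BASE_MS", 1000);
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseMs * Math.pow(2, attempt));
  }
  return delays;
}
```

Under these assumptions the defaults give two retries, after 1000 ms and 2000 ms, so a failing MCP call would be abandoned roughly three seconds after the first attempt plus request time.
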
install/env.template

Lines changed: 50 additions & 0 deletions
@@ -250,3 +250,53 @@ SHUTDOWN_TIMEOUT_MS=30000

 # Maximum concurrent agent operations
 # MAX_CONCURRENT_OPERATIONS=5
+
+# =============================================================================
+# IP ENRICHMENT (AbuseIPDB)
+# =============================================================================
+
+# Enable IP enrichment lookups during triage
+ENRICHMENT_ENABLED=false
+
+# AbuseIPDB API key (required if ENRICHMENT_ENABLED=true)
+# Get a free key at https://www.abuseipdb.com/account/api
+ABUSEIPDB_API_KEY=
+
+# Enrichment cache TTL (milliseconds, default 1 hour)
+# ENRICHMENT_CACHE_TTL_MS=3600000
+
+# Enrichment error cache TTL (milliseconds, default 5 minutes)
+# ENRICHMENT_ERROR_CACHE_TTL_MS=300000
+
+# =============================================================================
+# STALLED PIPELINE DETECTION
+# =============================================================================
+
+# Enable automatic detection and re-dispatch of stalled cases
+STALLED_PIPELINE_ENABLED=true
+
+# Minutes before a case is considered stalled (default 30)
+# STALLED_PIPELINE_THRESHOLD_MINUTES=30
+
+# Check interval in milliseconds (default 300000 = 5 minutes)
+# STALLED_PIPELINE_CHECK_INTERVAL_MS=300000
+
+# =============================================================================
+# MCP ADVANCED
+# =============================================================================
+
+# Base delay for MCP retry backoff (milliseconds)
+# MCP_RETRY_BASE_MS=1000
+
+# MCP authentication protocol mode
+# mcp-jsonrpc = standard MCP protocol (default)
+# legacy-rest = backwards compatibility
+# MCP_AUTH_MODE=mcp-jsonrpc
+
+# =============================================================================
+# OPENCLAW GATEWAY
+# =============================================================================
+
+# URL for dispatching webhooks to OpenClaw Gateway
+# Auto-configured by installer — override only if using a custom gateway address
+# OPENCLAW_GATEWAY_URL=http://127.0.0.1:18789

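The stalled-pipeline settings above come down to a simple rule: a case is stalled when it sits in a transient status and its `updated_at` is older than the threshold. A minimal sketch of that check follows. The `CaseSummary` shape, the `findStalled` helper, and the exact status set are assumptions: the changelog names open, triaged, and correlated "etc.", and the commit message mentions investigated, planned, and approved.

```typescript
// Assumed set of transient statuses; the real list may contain more entries.
const TRANSIENT_STATUSES = new Set([
  "open", "triaged", "correlated", "investigated", "planned", "approved",
]);

// Hypothetical minimal case shape; only `updated_at` is named in the changelog.
interface CaseSummary {
  case_id: string;
  status: string;
  updated_at: string; // ISO 8601 timestamp
}

// Returns cases that have been in a transient status for longer than the threshold.
function findStalled(cases: CaseSummary[], now: Date, thresholdMinutes: number = 30): CaseSummary[] {
  const cutoffMs = now.getTime() - thresholdMinutes * 60000;
  return cases.filter(
    (c) => TRANSIENT_STATUSES.has(c.status) && Date.parse(c.updated_at) < cutoffMs
  );
}
```

With the defaults, the check runs every 5 minutes against a 30-minute threshold, so in this sketch a stuck case would be flagged between 30 and 35 minutes after its last update.
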
install/install.sh

Lines changed: 19 additions & 5 deletions
@@ -606,7 +606,7 @@ setup_credentials() {
 echo ""

 # Idempotent: reuse existing secrets if present (prevents breaking running services on re-install)
-local MCP_AUTH_TOKEN OPENCLAW_TOKEN PAIRING_SECRET APPROVAL_SECRET
+local MCP_AUTH_TOKEN OPENCLAW_TOKEN PAIRING_SECRET APPROVAL_SECRET AUTOPILOT_SERVICE_TOKEN
 local _generated_new=false

 if [[ -s "$SECRETS_DIR/mcp_token" ]]; then
@@ -649,6 +649,14 @@ setup_credentials() {
 _generated_new=true
 fi

+if [[ -s "$SECRETS_DIR/service_token" ]]; then
+AUTOPILOT_SERVICE_TOKEN=$(cat "$SECRETS_DIR/service_token")
+log_info "Reusing existing service token"
+else
+AUTOPILOT_SERVICE_TOKEN=$(generate_secret 32)
+_generated_new=true
+fi
+
 # Store secrets in isolated files with restrictive umask
 local _old_umask
 _old_umask=$(umask)
@@ -658,6 +666,7 @@ setup_credentials() {
 echo "$OPENCLAW_WEBHOOK_TOKEN" > "$SECRETS_DIR/openclaw_webhook_token"
 echo "$PAIRING_SECRET" > "$SECRETS_DIR/pairing_code"
 echo "$APPROVAL_SECRET" > "$SECRETS_DIR/approval_secret"
+echo "$AUTOPILOT_SERVICE_TOKEN" > "$SECRETS_DIR/service_token"
 umask "$_old_umask"

 # Verify permissions
@@ -667,9 +676,10 @@ setup_credentials() {
 log_security "OpenClaw token stored: $SECRETS_DIR/openclaw_token (mode 600)"
 log_security "OpenClaw webhook token stored: $SECRETS_DIR/openclaw_webhook_token (mode 600)"
 log_security "Pairing code stored: $SECRETS_DIR/pairing_code (mode 600)"
+log_security "Service token stored: $SECRETS_DIR/service_token (mode 600)"

 # Export for later use
-export MCP_AUTH_TOKEN OPENCLAW_TOKEN OPENCLAW_WEBHOOK_TOKEN PAIRING_SECRET APPROVAL_SECRET
+export MCP_AUTH_TOKEN OPENCLAW_TOKEN OPENCLAW_WEBHOOK_TOKEN PAIRING_SECRET APPROVAL_SECRET AUTOPILOT_SERVICE_TOKEN

 if [[ "$_generated_new" == "true" ]]; then
 log_success "Credentials generated and isolated"
@@ -949,7 +959,7 @@ deploy_agents() {
 "enabled": __SLACK_ENABLED__,
 "botToken": "__SLACK_BOT_TOKEN__",
 "appToken": "__SLACK_APP_TOKEN__",
-"dmPolicy": "open",
+"dmPolicy": "allowlist",
 "groupPolicy": "allowlist"
 }
 },
@@ -1072,8 +1082,8 @@ install_runtime_service() {

 cd "$RUNTIME_DST"
 log_info "Installing dependencies..."
-if ! npm install --production 2>&1; then
-log_warn "npm install --production failed, retrying without --production flag..."
+if ! npm ci --omit=dev 2>&1; then
+log_warn "npm ci --omit=dev failed, retrying with npm install..."
 if ! npm install 2>&1; then
 log_error "npm install failed — runtime service dependencies not installed"
 exit 1
@@ -1314,6 +1324,9 @@ APPROVAL_TOKEN_TTL_MINUTES=60
 AUTOPILOT_PAIRING_CODE=__PAIRING_SECRET__
 AUTOPILOT_PAIRING_ENABLED=true

+# INTERNAL SERVICE AUTHENTICATION
+AUTOPILOT_SERVICE_TOKEN=__AUTOPILOT_SERVICE_TOKEN__
+
 # RESPONDER AGENT - TWO-TIER HUMAN APPROVAL
 # SAFETY: Disabled by default. Every action requires human Approve + Execute.
 AUTOPILOT_RESPONDER_ENABLED=false
@@ -1380,6 +1393,7 @@ ENVEOF
 _safe_subst "__SLACK_BOT_TOKEN__" "SLACK_BOT_TOKEN" "$_envfile"
 _safe_subst "__SLACK_ALERTS_CHANNEL__" "SLACK_ALERTS_CHANNEL" "$_envfile"
 _safe_subst "__SLACK_APPROVALS_CHANNEL__" "SLACK_APPROVALS_CHANNEL" "$_envfile"
+_safe_subst "__AUTOPILOT_SERVICE_TOKEN__" "AUTOPILOT_SERVICE_TOKEN" "$_envfile"

 # Set AUTOPILOT_MODE based on install mode (full → production, bootstrap/mcp-only → bootstrap)
 if [[ "$INSTALL_MODE" == "full" ]]; then

openclaw/openclaw.json

Lines changed: 1 addition & 1 deletion
@@ -206,7 +206,7 @@
 "enabled": true,
 "botToken": "${SLACK_BOT_TOKEN}",
 "appToken": "${SLACK_APP_TOKEN}",
-"dmPolicy": "open",
+"dmPolicy": "allowlist",
 "groupPolicy": "allowlist"
 }
 },
