Harden listener controls and wire backend failure telemetry #5
DsChauhan08 merged 2 commits into main from
Conversation
Pull request overview
This PR strengthens runtime/admin controls and observability by adding new admin control commands (rate/connection caps, emergency mode, audit chain verification) and wiring corresponding telemetry into Prometheus metrics and CLI/integration smoke coverage.
Changes:
- Added new admin control-plane commands: SETRATE, SETMAXCONNS, SETGLOBALMAXCONNS, EMERGENCY, and AUDITVERIFY, plus listener-side connection admission checks.
- Expanded Prometheus metrics for connection rejections and audit verification failures, and added backend TLS pin failure counting.
- Updated test harnesses and build/CI sanitizer targets to run smoke/CLI tests against a selectable SPF_BIN.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/integration_smoke.py | Allows overriding the SPF binary via SPF_BIN for test portability. |
| tests/cli_realworld.sh | Exercises new control commands and validates new metrics/audit verification output. |
| src/server.cpp | Implements new control commands and enforces new connection-admission checks/telemetry. |
| src/metrics.c | Adds new metrics definitions and exports new counters/gauges. |
| src/core.c | Adds control command classification, audit chain verification, and initializes new state fields. |
| src/common.h | Extends state and command enums to support new controls/metrics. |
| spf_audit.log | Adds a sample audit log file to the repository. |
| README.md | Documents new admin commands and expanded metrics list. |
| Makefile | Adds ASAN/UBSAN build+test targets and passes SPF_BIN into tests. |
| .github/workflows/sanitizers.yml | Runs the new sanitizer test targets in CI. |
```c
static void json_unescape_minimal(char* s) {
    if (!s) {
        return;
    }
    size_t r = 0, w = 0;
    while (s[r] != '\0') {
        if (s[r] == '\\' && s[r + 1] != '\0') {
            char n = s[r + 1];
            if (n == 'n') s[w++] = '\n';
            else if (n == 'r') s[w++] = '\r';
            else if (n == 't') s[w++] = '\t';
            else s[w++] = n;
            r += 2;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
}
```
spf_audit_verify_chain() calls json_unescape_minimal() on extracted JSON strings, but json_unescape_minimal() treats \uXXXX as a generic escape and strips the backslash (turning e.g. \u001b into u001b). Since audit_json_escape() can emit \u%04x for control chars, this makes the canonical string differ from what was originally hashed and will cause verification to fail for valid logs containing \u escapes. Consider either (a) not unescaping at all and reusing the already-escaped JSON field values when reconstructing canonical, or (b) implementing proper JSON unescape support for \uXXXX and then always re-escaping with audit_json_escape() before hashing.
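A minimal sketch of option (b)'s decoding half, assuming the writer only ever emits \u%04x for ASCII control characters. The function and helper names here are illustrative, not the project's actual code:

```c
#include <stddef.h>

/* Illustrative sketch: like json_unescape_minimal(), but decodes \uXXXX
 * for ASCII-range codepoints instead of stripping the backslash.
 * Escapes it cannot decode are left intact rather than corrupted. */
static int hex4(const char* p, unsigned int* out) {
    unsigned int v = 0;
    for (int i = 0; i < 4; i++) {
        char c = p[i];
        if (c >= '0' && c <= '9')      v = v * 16 + (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f') v = v * 16 + (unsigned)(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') v = v * 16 + (unsigned)(c - 'A' + 10);
        else return 0; /* also stops safely at '\0' */
    }
    *out = v;
    return 1;
}

static void json_unescape_with_u(char* s) {
    if (!s) return;
    size_t r = 0, w = 0;
    while (s[r] != '\0') {
        unsigned int cp;
        if (s[r] == '\\' && s[r + 1] != '\0') {
            char n = s[r + 1];
            if (n == 'n')      { s[w++] = '\n'; r += 2; }
            else if (n == 'r') { s[w++] = '\r'; r += 2; }
            else if (n == 't') { s[w++] = '\t'; r += 2; }
            else if (n == 'u' && hex4(s + r + 2, &cp) && cp < 0x80) {
                s[w++] = (char)cp; /* e.g. \u001b -> ESC */
                r += 6;
            }
            else if (n == 'u') { s[w++] = s[r++]; } /* keep undecodable escape */
            else               { s[w++] = n; r += 2; }
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
}
```

Even with this, re-escaping with audit_json_escape() before hashing (so the canonical bytes match what the writer hashed) is the safer of the two options.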
```c
int spf_audit_verify_chain(spf_state_t* state, uint32_t* entries_checked, uint32_t* failures) {
    if (!state || !state->config.admin.audit_log_path[0]) {
        return -1;
    }

    FILE* f = fopen(state->config.admin.audit_log_path, "r");
    if (!f) {
        return -1;
    }
```
spf_audit_verify_chain() reads the audit log file without taking state->audit_lock, while spf_audit_log() appends under that lock. This can race with concurrent writes (partial line read / inconsistent hash chain) and produce false verification failures. Consider holding state->audit_lock for the duration of verification (or copying the file/size under lock before reading) to ensure a consistent snapshot.
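One way to get a consistent snapshot, sketched with an illustrative wrapper and callback (the names and signature are assumptions, not the project's API):

```c
#include <pthread.h>
#include <stdio.h>

/* Illustrative sketch: hold the same mutex the audit writer holds while
 * appending, so the verifier never observes a partially written line. */
static int verify_chain_locked(pthread_mutex_t* audit_lock, const char* path,
                               int (*verify_stream)(FILE*)) {
    pthread_mutex_lock(audit_lock);
    FILE* f = fopen(path, "r");
    if (!f) {
        pthread_mutex_unlock(audit_lock);
        return -1;
    }
    int rc = verify_stream(f); /* walk the hash chain line by line */
    fclose(f);
    pthread_mutex_unlock(audit_lock);
    return rc;
}

/* Stand-in for the real chain walker: here it just counts complete lines,
 * which is enough to exercise the locking wrapper. */
static int count_lines(FILE* f) {
    int c = 0, ch;
    while ((ch = fgetc(f)) != EOF) {
        if (ch == '\n') c++;
    }
    return c;
}
```

If holding the lock for a full file scan stalls writers too long, the alternative mentioned above also works: record the file size under the lock, then verify only up to that offset without the lock.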
```c
else if (strncmp(buf, "SETRATE ", 8) == 0) {
    uint32_t id = 0;
    uint64_t bps = 0;
    if (sscanf(buf + 8, "%u %lu", &id, &bps) == 2) {
        spf_rule_t* r = spf_get_rule(&g_state, id);
        if (!r) {
            snprintf(resp, sizeof(resp), "ERR rule not found\n");
        } else if (bps == 0 || rule_set_rate_limit(r, bps) != 0) {
            snprintf(resp, sizeof(resp), "ERR invalid rate\n");
        } else {
            snprintf(resp, sizeof(resp), "OK rate set rule=%u bps=%lu\n", id, bps);
        }
```
The SETRATE parser stores into a uint64_t bps but uses %lu in sscanf/snprintf. This is not portable across platforms where uint64_t is not unsigned long (e.g., Windows LLP64) and can truncate or misformat values. Prefer using SCNu64/PRIu64 (or %llu with an unsigned long long temp) for both parsing and formatting.
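A sketch of the portable variant using the <inttypes.h> macros; the wrapper function is illustrative, not the project's code:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: parse "<id> <bps>" and format the reply with
 * SCNu64/PRIu64 so uint64_t round-trips on LP64 and LLP64 alike. */
static int parse_setrate(const char* args, char* resp, size_t resp_len) {
    uint32_t id = 0;
    uint64_t bps = 0;
    if (sscanf(args, "%" SCNu32 " %" SCNu64, &id, &bps) != 2 || bps == 0) {
        snprintf(resp, resp_len, "ERR invalid rate\n");
        return -1;
    }
    snprintf(resp, resp_len, "OK rate set rule=%" PRIu32 " bps=%" PRIu64 "\n",
             id, bps);
    return 0;
}
```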
```c
pthread_mutex_lock(&g_state.stats_lock);
if (g_state.emergency_mode) {
    g_state.conn_reject_emergency++;
    pthread_mutex_unlock(&g_state.stats_lock);
    return false;
}
if (g_state.global_max_conns > 0 && g_state.active_conns >= g_state.global_max_conns) {
    g_state.conn_reject_global_max++;
    pthread_mutex_unlock(&g_state.stats_lock);
    return false;
}
pthread_mutex_unlock(&g_state.stats_lock);
```
try_accept_new_conn() checks g_state.active_conns against g_state.global_max_conns under stats_lock, but does not reserve/increment the count while holding the lock. With concurrent listener threads, multiple accepts can pass the check and then later increment active_conns, exceeding the configured cap. To make the cap reliable, consider reserving a slot under stats_lock (and rolling it back on subsequent dial/TLS failure).
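A minimal sketch of the reservation pattern, with a standalone state struct so it compiles on its own (field names follow the excerpt above; the struct and helpers are illustrative):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: reserve a connection slot while stats_lock is held,
 * so concurrent accept threads cannot all pass the cap check before any of
 * them increments active_conns. */
typedef struct {
    pthread_mutex_t stats_lock;
    bool emergency_mode;
    uint32_t active_conns;
    uint32_t global_max_conns;
    uint64_t conn_reject_emergency;
    uint64_t conn_reject_global_max;
} conn_state_t;

static bool try_reserve_conn_slot(conn_state_t* st) {
    bool ok = false;
    pthread_mutex_lock(&st->stats_lock);
    if (st->emergency_mode) {
        st->conn_reject_emergency++;
    } else if (st->global_max_conns > 0 &&
               st->active_conns >= st->global_max_conns) {
        st->conn_reject_global_max++;
    } else {
        st->active_conns++; /* reserve while still holding the lock */
        ok = true;
    }
    pthread_mutex_unlock(&st->stats_lock);
    return ok;
}

/* Roll back a reservation if the backend dial or TLS handshake fails. */
static void release_conn_slot(conn_state_t* st) {
    pthread_mutex_lock(&st->stats_lock);
    if (st->active_conns > 0) st->active_conns--;
    pthread_mutex_unlock(&st->stats_lock);
}
```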
```c
if (rule->max_conns > 0) {
    uint32_t sum = 0;
    for (int i = 0; i < rule->backend_count; i++) {
        pthread_mutex_lock(&rule->backends[i].lock);
        sum += rule->backends[i].active_conns;
        pthread_mutex_unlock(&rule->backends[i].lock);
    }
    if (sum >= rule->max_conns) {
```
try_accept_new_conn() reads rule->max_conns and iterates rule->backend_count without holding rule->lock, while SETMAXCONNS updates rule->max_conns under that lock. This introduces an unsynchronized read/write data race. Consider taking rule->lock to snapshot max_conns/backend_count before the loop (or making these fields atomic / updated under a lock that readers also hold).
Suggested change:

```diff
-if (rule->max_conns > 0) {
-    uint32_t sum = 0;
-    for (int i = 0; i < rule->backend_count; i++) {
-        pthread_mutex_lock(&rule->backends[i].lock);
-        sum += rule->backends[i].active_conns;
-        pthread_mutex_unlock(&rule->backends[i].lock);
-    }
-    if (sum >= rule->max_conns) {
+uint32_t rule_max_conns = 0;
+int backend_count = 0;
+pthread_mutex_lock(&rule->lock);
+rule_max_conns = rule->max_conns;
+backend_count = rule->backend_count;
+pthread_mutex_unlock(&rule->lock);
+if (rule_max_conns > 0) {
+    uint32_t sum = 0;
+    for (int i = 0; i < backend_count; i++) {
+        pthread_mutex_lock(&rule->backends[i].lock);
+        sum += rule->backends[i].active_conns;
+        pthread_mutex_unlock(&rule->backends[i].lock);
+    }
+    if (sum >= rule_max_conns) {
```
```c
written += snprintf(buf + written, len - written,
    "spf_connections_active %u\n"
    "spf_connections_total %lu\n"
    "spf_bytes_in_total %lu\n"
    "spf_bytes_out_total %lu\n"
    "spf_blocked_total %lu\n"
    "spf_rules_active %u\n"
    "spf_uptime_seconds %lu\n"
    "spf_admin_auth_failures_total %lu\n"
    "spf_admin_lockouts_total %lu\n"
    "spf_admin_service_token_auth_success_total %lu\n"
    "spf_admin_service_token_auth_fail_total %lu\n"
    "spf_admin_cmd_rate_limited_total %lu\n"
    "spf_admin_temp_grants_created_total %lu\n"
    "spf_admin_failed_commands_total %lu\n"
    "spf_admin_unknown_commands_total %lu\n"
    "spf_admin_sensitive_commands_total %lu\n"
    "spf_backend_tls_handshake_failures_total %lu\n"
    "spf_backend_tls_pin_failures_total %lu\n"
    "spf_backend_connect_timeouts_total %lu\n"
    "spf_conn_reject_emergency_total %lu\n"
    "spf_conn_reject_rule_max_total %lu\n"
    "spf_conn_reject_global_max_total %lu\n"
    "spf_audit_verify_failures_total %lu\n"
    "spf_global_max_conns %u\n"
    "spf_emergency_mode %u\n",
    state->active_conns,
```
The metrics formatter appends new uint64_t counters using %lu (e.g., spf_conn_reject_*_total, spf_audit_verify_failures_total). On platforms where unsigned long is 32-bit, this will misreport or invoke UB. Prefer switching these new fields (and ideally the whole formatter) to PRIu64 for uint64_t values to ensure correct output across targets.
|
@copilot apply changes based on the comments in this thread
…tting
Agent-Logs-Url: https://github.com/DsChauhan08/tunl/sessions/c8942bdd-4e9a-4a47-86ee-51352c749efb
Co-authored-by: DsChauhan08 <101635962+DsChauhan08@users.noreply.github.com>
Implemented the review-thread fixes. I addressed the audit verification escape/locking issues and the portable integer formatting.
Summary