Skip to content

Harden listener controls and wire backend failure telemetry#5

Merged
DsChauhan08 merged 2 commits intomainfrom
security-hardening-enterprise-controls
Apr 1, 2026
Merged

Harden listener controls and wire backend failure telemetry#5
DsChauhan08 merged 2 commits intomainfrom
security-hardening-enterprise-controls

Conversation

@DsChauhan08
Copy link
Copy Markdown
Owner

Summary

  • strengthen listener/runtime control flow and TLS cache handling to reduce race conditions and tighten lifecycle safety
  • wire backend failure telemetry into metrics and audit paths for better operational visibility during incidents
  • update sanitizers/build scripts, docs, and integration smoke coverage to reflect and verify the hardening changes

Copilot AI review requested due to automatic review settings April 1, 2026 12:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens runtime/admin controls and observability by adding new admin control commands (rate/connection caps, emergency mode, audit chain verification) and wiring corresponding telemetry into Prometheus metrics and CLI/integration smoke coverage.

Changes:

  • Added new admin control-plane commands: SETRATE, SETMAXCONNS, SETGLOBALMAXCONNS, EMERGENCY, and AUDITVERIFY, plus listener-side connection admission checks.
  • Expanded Prometheus metrics for connection rejections and audit verification failures, and added backend TLS pin failure counting.
  • Updated test harnesses and build/CI sanitizer targets to run smoke/CLI tests against a selectable SPF_BIN.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/integration_smoke.py Allows overriding the SPF binary via SPF_BIN for test portability.
tests/cli_realworld.sh Exercises new control commands and validates new metrics/audit verification output.
src/server.cpp Implements new control commands and enforces new connection-admission checks/telemetry.
src/metrics.c Adds new metrics definitions and exports new counters/gauges.
src/core.c Adds control command classification, audit chain verification, and initializes new state fields.
src/common.h Extends state and command enums to support new controls/metrics.
spf_audit.log Adds a sample audit log file to the repository.
README.md Documents new admin commands and expanded metrics list.
Makefile Adds ASAN/UBSAN build+test targets and passes SPF_BIN into tests.
.github/workflows/sanitizers.yml Runs the new sanitizer test targets in CI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/core.c Outdated
Comment on lines +319 to +337
static void json_unescape_minimal(char* s) {
if (!s) {
return;
}
size_t r = 0, w = 0;
while (s[r] != '\0') {
if (s[r] == '\\' && s[r + 1] != '\0') {
char n = s[r + 1];
if (n == 'n') s[w++] = '\n';
else if (n == 'r') s[w++] = '\r';
else if (n == 't') s[w++] = '\t';
else s[w++] = n;
r += 2;
} else {
s[w++] = s[r++];
}
}
s[w] = '\0';
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spf_audit_verify_chain() calls json_unescape_minimal() on extracted JSON strings, but json_unescape_minimal() treats \uXXXX as a generic escape and strips the backslash (turning e.g. \u001b into u001b). Since audit_json_escape() can emit \u%04x for control chars, this makes the canonical string differ from what was originally hashed and will cause verification to fail for valid logs containing \u escapes. Consider either (a) not unescaping at all and reusing the already-escaped JSON field values when reconstructing canonical, or (b) implementing proper JSON unescape support for \uXXXX and then always re-escaping with audit_json_escape() before hashing.

Copilot uses AI. Check for mistakes.
Comment thread src/core.c
Comment on lines +339 to +347
int spf_audit_verify_chain(spf_state_t* state, uint32_t* entries_checked, uint32_t* failures) {
if (!state || !state->config.admin.audit_log_path[0]) {
return -1;
}

FILE* f = fopen(state->config.admin.audit_log_path, "r");
if (!f) {
return -1;
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spf_audit_verify_chain() reads the audit log file without taking state->audit_lock, while spf_audit_log() appends under that lock. This can race with concurrent writes (partial line read / inconsistent hash chain) and produce false verification failures. Consider holding state->audit_lock for the duration of verification (or copying the file/size under lock before reading) to ensure a consistent snapshot.

Copilot uses AI. Check for mistakes.
Comment thread src/server.cpp
Comment on lines +2145 to +2156
else if (strncmp(buf, "SETRATE ", 8) == 0) {
uint32_t id = 0;
uint64_t bps = 0;
if (sscanf(buf + 8, "%u %lu", &id, &bps) == 2) {
spf_rule_t* r = spf_get_rule(&g_state, id);
if (!r) {
snprintf(resp, sizeof(resp), "ERR rule not found\n");
} else if (bps == 0 || rule_set_rate_limit(r, bps) != 0) {
snprintf(resp, sizeof(resp), "ERR invalid rate\n");
} else {
snprintf(resp, sizeof(resp), "OK rate set rule=%u bps=%lu\n", id, bps);
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SETRATE parser stores into a uint64_t bps but uses %lu in sscanf/snprintf. This is not portable across platforms where uint64_t is not unsigned long (e.g., Windows LLP64) and can truncate or misformat values. Prefer using SCNu64/PRIu64 (or %llu with an unsigned long long temp) for both parsing and formatting.

Copilot uses AI. Check for mistakes.
Comment thread src/server.cpp
Comment on lines +1114 to +1125
pthread_mutex_lock(&g_state.stats_lock);
if (g_state.emergency_mode) {
g_state.conn_reject_emergency++;
pthread_mutex_unlock(&g_state.stats_lock);
return false;
}
if (g_state.global_max_conns > 0 && g_state.active_conns >= g_state.global_max_conns) {
g_state.conn_reject_global_max++;
pthread_mutex_unlock(&g_state.stats_lock);
return false;
}
pthread_mutex_unlock(&g_state.stats_lock);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_accept_new_conn() checks g_state.active_conns against g_state.global_max_conns under stats_lock, but does not reserve/increment the count while holding the lock. With concurrent listener threads, multiple accepts can pass the check and then later increment active_conns, exceeding the configured cap. To make the cap reliable, consider reserving a slot under stats_lock (and rolling it back on subsequent dial/TLS failure).

Copilot uses AI. Check for mistakes.
Comment thread src/server.cpp Outdated
Comment on lines +1127 to +1134
if (rule->max_conns > 0) {
uint32_t sum = 0;
for (int i = 0; i < rule->backend_count; i++) {
pthread_mutex_lock(&rule->backends[i].lock);
sum += rule->backends[i].active_conns;
pthread_mutex_unlock(&rule->backends[i].lock);
}
if (sum >= rule->max_conns) {
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_accept_new_conn() reads rule->max_conns and iterates rule->backend_count without holding rule->lock, while SETMAXCONNS updates rule->max_conns under that lock. This introduces an unsynchronized read/write data race. Consider taking rule->lock to snapshot max_conns/backend_count before the loop (or making these fields atomic / updated under a lock that readers also hold).

Suggested change
if (rule->max_conns > 0) {
uint32_t sum = 0;
for (int i = 0; i < rule->backend_count; i++) {
pthread_mutex_lock(&rule->backends[i].lock);
sum += rule->backends[i].active_conns;
pthread_mutex_unlock(&rule->backends[i].lock);
}
if (sum >= rule->max_conns) {
uint32_t rule_max_conns = 0;
int backend_count = 0;
pthread_mutex_lock(&rule->lock);
rule_max_conns = rule->max_conns;
backend_count = rule->backend_count;
pthread_mutex_unlock(&rule->lock);
if (rule_max_conns > 0) {
uint32_t sum = 0;
for (int i = 0; i < backend_count; i++) {
pthread_mutex_lock(&rule->backends[i].lock);
sum += rule->backends[i].active_conns;
pthread_mutex_unlock(&rule->backends[i].lock);
}
if (sum >= rule_max_conns) {

Copilot uses AI. Check for mistakes.
Comment thread src/metrics.c
Comment on lines 75 to 101
written += snprintf(buf + written, len - written,
"spf_connections_active %u\n"
"spf_connections_total %lu\n"
"spf_bytes_in_total %lu\n"
"spf_bytes_out_total %lu\n"
"spf_blocked_total %lu\n"
"spf_rules_active %u\n"
"spf_uptime_seconds %lu\n"
"spf_admin_auth_failures_total %lu\n"
"spf_admin_lockouts_total %lu\n"
"spf_admin_service_token_auth_success_total %lu\n"
"spf_admin_service_token_auth_fail_total %lu\n"
"spf_admin_cmd_rate_limited_total %lu\n"
"spf_admin_temp_grants_created_total %lu\n"
"spf_admin_failed_commands_total %lu\n"
"spf_admin_unknown_commands_total %lu\n"
"spf_admin_sensitive_commands_total %lu\n"
"spf_backend_tls_handshake_failures_total %lu\n"
"spf_backend_tls_pin_failures_total %lu\n"
"spf_backend_connect_timeouts_total %lu\n",
"spf_backend_connect_timeouts_total %lu\n"
"spf_conn_reject_emergency_total %lu\n"
"spf_conn_reject_rule_max_total %lu\n"
"spf_conn_reject_global_max_total %lu\n"
"spf_audit_verify_failures_total %lu\n"
"spf_global_max_conns %u\n"
"spf_emergency_mode %u\n",
state->active_conns,
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics formatter appends new uint64_t counters using %lu (e.g., spf_conn_reject_*_total, spf_audit_verify_failures_total). On platforms where unsigned long is 32-bit, this will misreport or invoke UB. Prefer switching these new fields (and ideally the whole formatter) to PRIu64 for uint64_t values to ensure correct output across targets.

Copilot uses AI. Check for mistakes.
@DsChauhan08
Copy link
Copy Markdown
Owner Author

@copilot apply changes based on the comments in this thread

…tting

Agent-Logs-Url: https://github.com/DsChauhan08/tunl/sessions/c8942bdd-4e9a-4a47-86ee-51352c749efb

Co-authored-by: DsChauhan08 <101635962+DsChauhan08@users.noreply.github.com>
@DsChauhan08 DsChauhan08 merged commit 9411fb7 into main Apr 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 1, 2026

@copilot apply changes based on the comments in this thread

Implemented the review-thread fixes in commit 4f2a189.

I addressed the audit verification escape/locking issues, portable uint64_t formatting/parsing in SETRATE and metrics, and the listener connection-cap race/data-race concerns with reservation + rollback handling. I also validated with make -j$(nproc), make test-smoke, and make test-cli (with a fresh /tmp/spf-cli-audit.log).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants