production-ready dispatcher with CLI flags, W^X double-buffering, CPI-based thermal policy, XMM-only payloads, and demo options

SaridakisStamatisChristos/Thermal_SIMD_Dispatcher

Thermal‑Aware Self‑Patching SIMD Dispatcher

Production‑grade and Linux x86‑64 only. At runtime the dispatcher chooses between SSE4.1, AVX2, and AVX‑512 code paths (all XMM‑only) and self‑patches a tiny trampoline under strict W^X using a double buffer. Thermal adaptation is driven by time‑scaled CPI (cycles per instruction) read through perf_event_open, with hysteresis, cooldown, and a minimum dwell time. A small shim handles the scalar↔SIMD boundary and avoids AVX/SSE transition penalties.
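The README does not spell out what "time‑scaled" means. One plausible reading, assumed here, is the usual perf multiplexing correction: a counter opened with PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING is scaled by time_enabled / time_running before the cycles-to-instructions ratio is taken. The sketch below shows only that arithmetic; the CounterReading type and function names are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterReading:
    """One read of a perf counter (hypothetical container, not project code)."""
    value: int         # raw event count reported by the kernel
    time_enabled: int  # ns the event was enabled
    time_running: int  # ns the event was actually scheduled on the PMU


def scaled_count(reading: CounterReading) -> float:
    """Estimate the full-interval count when the counter was multiplexed."""
    if reading.time_running == 0:
        return 0.0  # the event never ran, so there is nothing to scale
    return reading.value * reading.time_enabled / reading.time_running


def cpi(cycles: CounterReading, instructions: CounterReading) -> Optional[float]:
    """Cycles per instruction from two scaled counters, or None if undefined."""
    insns = scaled_count(instructions)
    if insns == 0:
        return None
    return scaled_count(cycles) / insns
```

With this reading, a cycles counter that was on the PMU for only half of the interval has its raw value doubled before the ratio is computed.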

Build

Prerequisites

  • Linux 5.9+ with CAP_PERFMON available to the dispatcher user.
  • /dev/cpu/*/msr readable by the runtime (systemd unit grants CAP_SYS_ADMIN).
  • Optional metrics TLS materials (certificate + key) if exposing /metrics off-host.
  • Attestation bundle (patcher_measurement.json, attestor_pub.pem) staged under /etc/tsd/.
  • Config overrides for telemetry/predictive controller can be provided through TSD_TELEMETRY_* and TSD_PREDICTIVE_* env vars.

Make

make

CMake

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

Run

Requires CAP_PERFMON; alternatively, relax the kernel's perf restrictions with sudo sysctl kernel.perf_event_paranoid=0.

./thermal_simd --help
./thermal_simd --no-avx512 --interval=100 --down-ratio=1.3 --duration-sec=5

Flags

  • --config=FILE load overrides from a JSON file (see configuration docs).
  • --interval=MS check interval (default 50).
  • --down-count=N throttles before downgrade (default 3).
  • --up-count=N stable intervals before upgrade (default 5).
  • --down-ratio=R throttle threshold as CPI multiple (default 1.5).
  • --cooldown-down=MS cooldown after downgrade (default 1000).
  • --cooldown-up=MS cooldown after upgrade (default 2000).
  • --min-dwell=MS minimum time per SIMD width (default 200).
  • --no-avx512 disable AVX‑512 usage.
  • --duration-sec=S runtime duration for demo (default 10).
  • --work-iters=N inner work iterations per tick (default 10,000,000).
  • --degraded-timeout-sec=S fail closed if hardware counters remain unavailable for S seconds (default 120).
  • --log-level=LEVEL set log verbosity (error, warn, info, debug; default info).
  • --health-check run diagnostics (perf counters, telemetry, trampolines) and exit with status.
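How the policy flags above might interact can be sketched as a small state machine. This is a minimal illustration using the documented defaults, not the dispatcher's actual logic. It assumes that a tick counts as throttled when CPI exceeds down-ratio times a caller-supplied baseline, that no transition happens during a cooldown window, and that min-dwell is measured from the last transition. The ThermalPolicy class and the way the baseline is obtained are hypothetical.

```python
from dataclasses import dataclass

LEVELS = ("sse4.1", "avx2", "avx512")  # narrowest to widest code path


@dataclass
class ThermalPolicy:
    """Hysteresis sketch driven by the CLI defaults (assumed semantics)."""
    down_count: int = 3
    up_count: int = 5
    down_ratio: float = 1.5
    cooldown_down_ms: int = 1000
    cooldown_up_ms: int = 2000
    min_dwell_ms: int = 200
    level: int = len(LEVELS) - 1
    _down_streak: int = 0
    _up_streak: int = 0
    _last_change_ms: int = 0
    _cooldown_until_ms: int = 0

    def tick(self, now_ms: int, cpi: float, baseline_cpi: float) -> str:
        """Feed one interval's CPI and return the selected level name."""
        if cpi > self.down_ratio * baseline_cpi:
            self._down_streak += 1
            self._up_streak = 0
        else:
            self._up_streak += 1
            self._down_streak = 0

        # A transition needs both the dwell time and the cooldown to have passed.
        may_change = (now_ms - self._last_change_ms >= self.min_dwell_ms
                      and now_ms >= self._cooldown_until_ms)
        if may_change and self._down_streak >= self.down_count and self.level > 0:
            self._transition(now_ms, self.level - 1, self.cooldown_down_ms)
        elif (may_change and self._up_streak >= self.up_count
              and self.level < len(LEVELS) - 1):
            self._transition(now_ms, self.level + 1, self.cooldown_up_ms)
        return LEVELS[self.level]

    def _transition(self, now_ms: int, new_level: int, cooldown_ms: int) -> None:
        self.level = new_level
        self._last_change_ms = now_ms
        self._cooldown_until_ms = now_ms + cooldown_ms
        self._down_streak = 0
        self._up_streak = 0
```

Under these assumptions, three throttled intervals are necessary but not sufficient for a downgrade: the dwell and cooldown gates can delay it further, which is what keeps the dispatcher from oscillating between widths.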

Predictive controller

  • --temp-ceiling=°C predictive controller ceiling (default 92).
  • --safety-margin=°C guard band below the ceiling for upgrades (default 4).
  • --emergency-margin=°C additional buffer that triggers scalar fallback (default 10).
  • --predictive-alpha=A CPI EWMA alpha in the predictive path (default 0.25).
  • --coeff-path=PATH ARX coefficient bundle (default config/controller_coeffs.json).
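Two of the predictive flags map onto simple arithmetic, sketched here with the documented defaults. The EWMA form is the standard one; seeding it with the first sample and using a strict comparison for the upgrade guard are assumptions. The emergency margin and the ARX model are left out because the README does not define them precisely. Function names are hypothetical.

```python
from typing import Optional


def ewma_update(prev: Optional[float], sample: float, alpha: float = 0.25) -> float:
    """Exponentially weighted moving average; the first sample seeds the filter."""
    if prev is None:
        return sample
    return alpha * sample + (1.0 - alpha) * prev


def upgrade_allowed(predicted_temp_c: float,
                    ceiling_c: float = 92.0,
                    safety_margin_c: float = 4.0) -> bool:
    """Allow an upgrade only while the prediction stays below the guard band."""
    return predicted_temp_c < ceiling_c - safety_margin_c
```

With the defaults, the guard band places the upgrade threshold at 88 °C, and each new CPI sample contributes a quarter of its value to the smoothed estimate.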

Telemetry fusion

  • --telemetry-interval=MS collector interval (default 50).
  • --telemetry-max-skew=MS allowable skew between collectors (default 150).
  • --telemetry-ewma=A telemetry CPI EWMA alpha (default 0.25).
  • --telemetry-profile=PATH optional telemetry profile manifest.
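A skew limit between collectors could be enforced along the following lines. This is a sketch under the assumption that each collector reports a timestamped sample and that a fused reading is rejected when the newest and oldest timestamps differ by more than the limit; the function name and the sample layout are hypothetical.

```python
from typing import Dict, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp_ms, value)


def fuse_if_fresh(samples: Dict[str, Sample],
                  max_skew_ms: int = 150) -> Optional[Dict[str, float]]:
    """Return the sample values if all collectors agree in time, else None."""
    if not samples:
        return None
    stamps = [ts for ts, _ in samples.values()]
    if max(stamps) - min(stamps) > max_skew_ms:
        return None  # collectors are too far apart to combine safely
    return {name: value for name, (_, value) in samples.items()}
```

For example, a temperature sample at 1000 ms and a frequency sample at 1100 ms would be fused under the default limit, while a 200 ms gap would be rejected.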

Metrics & observability

  • --metrics-port=PORT Prometheus endpoint port (default 9464, 0 disables).
  • --metrics-bind=ADDR bind address (default 127.0.0.1).
  • --metrics-cert=PATH / --metrics-key=PATH enable TLS for the metrics endpoint.
  • --metrics-ca=PATH optional client CA bundle when using mutual TLS.
  • --metrics-require-client-auth enforce mutual TLS for /metrics and /healthz.
  • --metrics-basic-auth=user:pass enable HTTP basic authentication.
  • --statsd-host=HOST emit StatsD metrics to the given host (disabled by default).
  • --statsd-port=PORT StatsD UDP port (default 8125).

Environment overrides:

  • TSD_LOG_LEVEL mirrors --log-level for non-interactive deployments.
  • TSD_TELEMETRY_*, TSD_PREDICTIVE_*, and TSD_METRICS_* mirror respective CLI flags.

Health Check

The dispatcher exposes a one-shot diagnostic mode that validates hardware counters, telemetry probes, and trampoline integrity before workloads start:

./thermal_simd --health-check

The command exits non-zero when the dispatcher would operate in degraded mode (e.g. missing perf_event_open permissions or inaccessible MSRs) and increments the health_check_failures metric.
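Because the health check reports its result through the exit status, a deployment script can gate workload start-up on it. The wrapper below is a minimal sketch: it relies only on the documented zero/non-zero exit behaviour, while the function name, the 30-second timeout, and treating a missing binary as a failure are assumptions.

```python
import subprocess
from typing import Sequence


def health_gate(cmd: Sequence[str] = ("./thermal_simd", "--health-check"),
                timeout_s: float = 30.0) -> bool:
    """Return True only if the health check ran and exited with status 0."""
    try:
        result = subprocess.run(list(cmd), timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: treat as unhealthy.
        return False
    return result.returncode == 0
```

A caller would typically refuse to start workloads when health_gate() returns False, mirroring the dispatcher's own fail-closed stance.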

Metrics & Observability

Structured log lines (key=value) and in-process counters provide hooks for Prometheus/StatsD scraping. The following counters are tracked in runtime_metrics.c and exposed via log snapshots:

  • perf_fallbacks / perf_recoveries
  • telemetry_temp_*, telemetry_freq_*, telemetry_msr_*
  • patch_transitions / patch_failures
  • software_timeout_escalations
  • health_check_failures
  • attestation_verifications
  • attestation_failure
  • metrics_flush_duration_ms

Sensor dropouts automatically trigger exponential back-off retries and emit logs such as event=telemetry_sensor state=degraded sensor=temp to simplify alert wiring.
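For alert wiring, the two pieces mentioned above can be illustrated as follows: a capped exponential back-off schedule and a parser for key=value log lines. The base delay, growth factor, and cap are hypothetical, since the README does not state the dispatcher's actual values. The parser assumes space-separated tokens whose values contain no spaces.

```python
from typing import Dict, List


def backoff_delays(base_ms: float = 100, factor: float = 2.0,
                   cap_ms: float = 5000, attempts: int = 6) -> List[float]:
    """Retry delays that grow geometrically and saturate at cap_ms."""
    delays = []
    delay = float(base_ms)
    for _ in range(attempts):
        delays.append(min(delay, cap_ms))
        delay *= factor
    return delays


def parse_kv(line: str) -> Dict[str, str]:
    """Split a structured log line into a dict, ignoring tokens without '='."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def is_degraded_sensor(line: str) -> bool:
    """True for lines such as 'event=telemetry_sensor state=degraded sensor=temp'."""
    fields = parse_kv(line)
    return (fields.get("event") == "telemetry_sensor"
            and fields.get("state") == "degraded")
```

An alert rule could then call is_degraded_sensor on each log line and read the sensor field to identify which probe dropped out.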

See the dedicated docs for subsystem details.

Tests

Refer to the Validation Matrix for a subsystem → coverage breakdown.

Run smoke tests (build + basic run):

tests/compile.sh && tests/smoke.sh

A hardware-backed nightly can re-use the new helper script:

ci/hw-smoke.sh

CI expectations:

  • .github/workflows/ci.yml runs the public GitHub Actions pipeline (configure, build, unit and integration tests).
  • ci/pipeline.yml orchestrates build, hardware-smoke, stress-suite, and thermal-soak hardware stages described in the Validation Matrix.
  • ci/hw-smoke.sh executes on bare metal to verify MSR/perf integration and metrics TLS (see docs/ci-hil.md for provisioning guidance).

Infrastructure requirement: Hardware-in-the-loop stages are pinned to runners tagged hil and avx512. Ensure this fleet is online before expecting counter/MSR regressions to surface automatically.

Note: Security attestation and sandbox fuzzing now run via ci/security.yml and ci/sandbox.yml. These jobs require dedicated credentials/runners and currently fail open, so release reviews must still confirm the checklists documented in docs/testing-matrix.md before promotion.

Packaging

  • packaging/Dockerfile builds a minimal container with the dispatcher defaulting to health checks on startup.
  • packaging/systemd/thermal-simd.service is a hardened unit file that runs the binary with the required capabilities.
  • packaging/kubernetes/daemonset.yaml demonstrates a daemonset with MSR/perf mounts and capability grants.

Notes

  • Requires SSE4.1 (fails fast otherwise)
  • Uses perf_event_open; in containers, add --cap-add=SYS_ADMIN or run privileged
  • XMM‑only payloads to minimize downclocks and power
  • Patch failures restore trampoline page protections before retrying so the runtime fails closed

License

This project is distributed under a proprietary commercial license. See LICENSE for full terms.
