16 changes: 16 additions & 0 deletions CMakeLists.txt
@@ -35,6 +35,7 @@ set(THERMAL_SIMD_CORE_SOURCES
src/thermal_signals.c
src/policy/policy_config.c
src/policy/dispatcher_policy.cpp
src/policy/arx_model.cpp
src/policy/mpc_controller.cpp
)

@@ -47,6 +48,10 @@ target_include_directories(thermal_simd_core
PRIVATE
${CMAKE_CURRENT_SOURCE_DIR}/src
)
target_compile_definitions(thermal_simd_core
PRIVATE
TSD_DEFAULT_COEFF_PATH="${CMAKE_CURRENT_SOURCE_DIR}/config/controller_coeffs.json"
)

add_executable(thermal_simd src/main.cpp src/thermal_simd.c)
target_compile_options(thermal_simd PRIVATE -O2 -pthread -fPIC -mno-avx ${THERMAL_SIMD_DISPATCHER_CPU_FLAGS})
@@ -76,6 +81,10 @@ if(BUILD_TESTING)
${CMAKE_CURRENT_SOURCE_DIR}/src
)
target_compile_definitions(thermal_simd_core_tests PRIVATE TSD_ENABLE_TESTS)
target_compile_definitions(thermal_simd_core_tests
PRIVATE
TSD_DEFAULT_COEFF_PATH="${CMAKE_CURRENT_SOURCE_DIR}/config/controller_coeffs.json"
)

add_executable(test_config_parser tests/test_config_parser.c)
target_link_libraries(test_config_parser PRIVATE thermal_simd_core_tests)
@@ -106,6 +115,13 @@ if(BUILD_TESTING)
target_compile_options(test_policy_controller PRIVATE -Wall -Wextra)
add_test(NAME policy_controller COMMAND test_policy_controller)

add_executable(test_arx_model tests/policy/test_arx_model.cpp)
target_link_libraries(test_arx_model PRIVATE thermal_simd_core_tests)
target_compile_options(test_arx_model PRIVATE -Wall -Wextra)
target_include_directories(test_arx_model PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_compile_definitions(test_arx_model PRIVATE TSD_ENABLE_TESTS)
add_test(NAME policy_arx_model COMMAND test_arx_model)

add_executable(test_healthcheck_runtime_flags tests/healthcheck/runtime_flags.cpp)
target_link_libraries(test_healthcheck_runtime_flags PRIVATE thermal_simd_core_tests)
target_compile_options(test_healthcheck_runtime_flags PRIVATE -Wall -Wextra)
1 change: 1 addition & 0 deletions README.md
@@ -87,6 +87,7 @@ Sensor dropouts automatically trigger exponential back-off retries and emit logs

See dedicated docs for subsystem details:
- [Predictive Controller](docs/predictive-controller.md)
- [Controller Coefficient Format](docs/controller_coeffs.md)
- [Telemetry Fusion](docs/telemetry-fusion.md)
- [Metrics Endpoints](docs/metrics-endpoints.md)
- [Sandbox Workflow](docs/sandbox-workflow.md)
9 changes: 9 additions & 0 deletions config/controller_coeffs.json
@@ -0,0 +1,9 @@
{
"bias": 1200.0,
"ar_temperature": [0.85],
"ratio": [-0.35],
"severity": [0.05],
"trimmed_ratio": [0.0],
"ma": 0.25,
"staleness_window_ms": 750
}
40 changes: 40 additions & 0 deletions docs/controller_coeffs.md
@@ -0,0 +1,40 @@
# Controller Coefficient File

The predictive controller ingests coefficients from `config/controller_coeffs.json` (or the path provided via `--coeff-path`). The file is a JSON object with the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `bias` | number | Yes | Constant term applied to the forecast (millicelsius). |
| `ar_temperature` | array<number> | Yes | Auto-regressive coefficients applied to historical package temperatures. The array length determines the minimum history window. |
| `ratio` | array<number> | No | Coefficients applied to historical SIMD ratio measurements (milli-units). |
| `trimmed_ratio` | array<number> | No | Coefficients applied to the trimmed ratio (if available). |
| `severity` | array<number> | No | Coefficients applied to the severity metric reported in telemetry (milli-units). |
| `ma` | number | No | Moving-average gain applied to the most recent residual (`actual - forecast`). |
| `staleness_window_ms` | number | No | Maximum age (in milliseconds) of telemetry used for prediction. Defaults to 500 ms. |

Example:

```json
{
"bias": 1200.0,
"ar_temperature": [0.85, 0.05],
"ratio": [-0.30],
"severity": [0.04],
"ma": 0.25,
"staleness_window_ms": 750
}
```
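
The field table above can be mirrored by a small in-memory structure with a validation pass. The following is a minimal sketch, not the parser in `src/policy/arx_model.cpp`: the names `ControllerCoeffs`, `validate`, and `min_history_window` are hypothetical, JSON parsing is left out, and treating an absent `ma` as 0 and rejecting a non-positive staleness window are assumptions.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory form of the coefficient file.
struct ControllerCoeffs {
    std::optional<double> bias;          // required
    std::vector<double> ar_temperature;  // required, must be non-empty
    std::vector<double> ratio;           // optional
    std::vector<double> trimmed_ratio;   // optional
    std::vector<double> severity;        // optional
    double ma = 0.0;                     // optional; 0 when absent (assumption)
    long staleness_window_ms = 500;      // documented default
};

// Returns an empty string when the coefficients are usable,
// otherwise a short description of the first problem found.
std::string validate(const ControllerCoeffs& c) {
    if (!c.bias.has_value()) return "missing required field: bias";
    if (c.ar_temperature.empty()) return "missing required field: ar_temperature";
    if (c.staleness_window_ms <= 0) return "staleness_window_ms must be positive";
    return "";
}

// The length of ar_temperature sets the minimum history window.
std::size_t min_history_window(const ControllerCoeffs& c) {
    return c.ar_temperature.size();
}
```

With the example file above, `min_history_window` would return 2 because `ar_temperature` carries two entries.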

## Hot Reload Workflow

1. Update the JSON file on disk (e.g., write a new revision into the ConfigMap or local path).
2. Send `SIGHUP` to the dispatcher process. The controller marks a reload for the next control tick.
3. On the following recommendation cycle, the controller attempts to parse the file:
- Success increments `predictive_coeff_reload_total` and logs an INFO entry with the new history window and staleness guard.
- Failure increments `predictive_coeff_reload_errors_total` and logs an ERROR entry; the controller then falls back to the previous coefficients or to the averaging forecast.
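
The deferred-reload pattern described in the steps above can be sketched as follows. This is an illustration under assumptions, not the dispatcher's code: `install_reload_handler`, `consume_reload_request`, `control_tick`, and `ReloadCounters` are hypothetical names, the counters stand in for the two metrics, and `std::signal` is used for brevity where production code on POSIX would more likely use `sigaction`.

```cpp
#include <csignal>

// Set from the signal handler, read on the control tick.
static volatile std::sig_atomic_t g_reload_requested = 0;

// The handler only records the request; parsing happens later,
// outside signal context.
static void on_sighup(int) { g_reload_requested = 1; }

void install_reload_handler() { std::signal(SIGHUP, on_sighup); }

// Returns true once per pending request and clears the flag.
bool consume_reload_request() {
    if (!g_reload_requested) return false;
    g_reload_requested = 0;
    return true;
}

// Stand-ins for predictive_coeff_reload_total and
// predictive_coeff_reload_errors_total.
struct ReloadCounters {
    unsigned long reload_total = 0;
    unsigned long reload_errors_total = 0;
};

// Called once per control tick. try_load is any callable returning
// true on a successful parse; on failure the caller keeps whatever
// coefficients it already had.
template <typename Loader>
void control_tick(Loader&& try_load, ReloadCounters& counters) {
    if (!consume_reload_request()) return;
    if (try_load()) {
        ++counters.reload_total;
    } else {
        ++counters.reload_errors_total;
    }
}
```

Keeping the handler down to a single flag write avoids file I/O and allocation in signal context, which is why the parse is deferred to the next tick.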

## Validation Tips

- Use `tests/policy/test_arx_model.cpp` as a reference for crafting deterministic coefficients during development.
- Monitor `predictive_abs_error_millic_total` to evaluate how well the updated coefficients track observed temperatures.
- Pair coefficient adjustments with updates to the [Predictive Controller](predictive-controller.md) documentation to keep operational guidance in sync.
6 changes: 5 additions & 1 deletion docs/metrics-endpoints.md
@@ -21,7 +21,11 @@ The dispatcher exports metrics and health data via a multi-channel strategy tail
| Metric | Type | Description |
| --- | --- | --- |
| `predictive_forecasts_total` | Counter | Forecast cycles executed by the predictive controller. |
| `predictive_downgrades_total` | Counter | Controller-driven SIMD downgrades. |
| `predictive_decisions_total` | Counter | Control loop iterations that issued a predictive decision. |
| `predictive_abs_error_millic_total` | Counter | Accumulated absolute error between forecast and observed temperature. |
| `predictive_stale_samples_total` | Counter | Telemetry snapshots rejected because they exceeded the staleness window. |
| `predictive_coeff_reload_total` | Counter | Successful coefficient reloads (startup and SIGHUP). |
| `predictive_coeff_reload_errors_total` | Counter | Failed attempts to reload the coefficient file. |
| `telemetry_snapshots_total` | Counter | Telemetry fusion snapshots published. |
| `telemetry_degraded_total` | Counter | Snapshots flagged as degraded due to missing signals. |
| `patch_transitions_total` | Counter | Successful SIMD trampoline swaps. |
30 changes: 19 additions & 11 deletions docs/predictive-controller.md
@@ -18,17 +18,24 @@ The predictive controller combines reactive thermal throttling with a short-hori
Each signal is tagged with a monotonic timestamp. Stale signals (>2 intervals) are discarded and treated as unavailable.

## Forecast Model
The controller uses a single-step ARX model:
The controller uses a single-step ARX/ARMAX model implemented in `src/policy/arx_model.cpp` and driven by coefficients stored in `config/controller_coeffs.json` (see [Controller Coefficients](controller_coeffs.md)). The model consumes a sliding window of recent telemetry samples and projects the next package temperature in millicelsius:

```
T[t+1] = a0 + a1 * T[t] + a2 * CPI[t] + a3 * Freq[t] + a4 * Power[t]
T[t+1] = bias
+ Σ φᵢ · T[t-i]
+ Σ θᵢ · Ratio[t-i]
+ Σ γᵢ · Severity[t-i]
+ ψ · ε[t]
```

- Coefficients `a1..a4` are calibrated offline using lab traces and stored in `config/controller_coeffs.json`.
- The bias `a0` compensates for ambient temperature.
- Missing inputs zero out their coefficients and raise the `predictive_input_gaps` metric.
- `φᵢ`, `θᵢ`, and `γᵢ` are configurable auto-regressive and exogenous coefficients.
- `ψ` is an optional moving-average gain applied to the most recent residual `ε[t] = T[t] - T̂[t]`.
- Missing temperature samples disable the prediction path and fall back to a simple moving average.
- Coefficient files support hot-reload: the controller listens for `SIGHUP` and re-reads `config/controller_coeffs.json` on the next control tick. Successful reloads and failures are logged and exported via metrics.

The forecast produces a projected temperature and CPI value under the current SIMD width. The controller evaluates transitions (`SSE4.1`, `AVX2`, `AVX-512`) and selects the highest width whose projected temperature remains below `temp_ceiling_c - safety_margin_c` and whose CPI ratio is under `up_ratio`.
Telemetry freshness is enforced prior to forecasting. If the latest sample exceeds the configured `staleness_window_ms`, the controller skips predictive evaluation, logs a warning, and records `predictive_stale_samples_total`.
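
A freshness gate of this kind might look like the sketch below. It assumes samples carry a `std::chrono::steady_clock` timestamp and reads "exceeds" as a strict comparison; `is_stale`, `FreshnessGate`, and `stale_samples_total` are hypothetical names, with the counter standing in for `predictive_stale_samples_total`.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// True when the sample is older than the configured window.
bool is_stale(Clock::time_point sample_time,
              Clock::time_point now,
              std::chrono::milliseconds window) {
    return now - sample_time > window;
}

// Gate that counts rejected samples before forecasting.
struct FreshnessGate {
    std::chrono::milliseconds window{500};  // documented default
    unsigned long stale_samples_total = 0;

    // Returns true when the sample may be used for prediction.
    bool admit(Clock::time_point sample_time, Clock::time_point now) {
        if (is_stale(sample_time, now, window)) {
            ++stale_samples_total;
            return false;
        }
        return true;
    }
};
```

A monotonic clock is used here so that wall-clock adjustments cannot make a fresh sample look stale.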

The forecast produces a projected temperature under the current SIMD width. The controller evaluates transitions (`SSE4.1`, `AVX2`, `AVX-512`) and selects the highest width whose projected temperature remains below `temp_ceiling_c - safety_margin_c` and whose CPI ratio is under `up_ratio`.

## Decision Pipeline
1. **Acquire Inputs:** Pull the latest telemetry fusion snapshot (all `TelemetrySnapshot` values share a generation number).
@@ -52,11 +59,12 @@ The forecast produces a projected temperature and CPI value under the current SI
| `--predictive-alpha` | EWMA alpha applied to CPI history. | 0.25 |

## Telemetry & Metrics
- `predictive_forecasts_total`: incremented each control tick.
- `predictive_downgrades_total`: decision to reduce SIMD width due to forecast.
- `predictive_input_gaps_total`: missing telemetry inputs for a tick.
- `predictive_emergency_transitions_total`: emergency scalar fallbacks.
- `predictive_coeff_reload_errors_total`: failure to read coefficients on reload.
- `predictive_forecasts_total`: ARX/ARMAX forecasts executed with valid telemetry.
- `predictive_decisions_total`: control decisions driven by the predictive controller.
- `predictive_abs_error_millic_total`: accumulated absolute prediction error in millicelsius.
- `predictive_stale_samples_total`: telemetry snapshots rejected due to staleness.
- `predictive_coeff_reload_total`: successful coefficient reloads (including on startup).
- `predictive_coeff_reload_errors_total`: failures to read or parse the coefficient file.

Metrics are exposed through the metrics subsystem documented in [Metrics Endpoints](metrics-endpoints.md).

2 changes: 1 addition & 1 deletion docs/sandbox-workflow.md
@@ -52,7 +52,7 @@ This workflow describes how to exercise the dispatcher in a non-production sandb

## Exit Criteria
- Dispatcher exits 0.
- `artifacts/*/metrics.ndjson` contains expected counters (`predictive_downgrades_total > 0` during spike scenario).
- `artifacts/*/metrics.ndjson` contains expected counters (`predictive_decisions_total > 0` during spike scenario).
- No `state=emergency` logs during nominal runs.

## Automation
6 changes: 6 additions & 0 deletions include/thermal/simd/metrics.h
@@ -22,6 +22,12 @@ typedef enum {
TSD_METRIC_PATCH_FAILURES,
TSD_METRIC_HEALTH_CHECK_FAILURES,
TSD_METRIC_SOFTWARE_TIMEOUT_ESCALATIONS,
TSD_METRIC_PREDICTIVE_FORECASTS,
TSD_METRIC_PREDICTIVE_STALE_SAMPLES,
TSD_METRIC_PREDICTIVE_RELOADS,
TSD_METRIC_PREDICTIVE_RELOAD_ERRORS,
TSD_METRIC_PREDICTIVE_ABS_ERROR_MILLIC,
TSD_METRIC_PREDICTIVE_DECISIONS,
TSD_METRIC_COUNT
} tsd_metric_counter_t;
