Commit 9807692

Merge pull request #36 from SaridakisStamatisChristos/codex/add-arx/armax-forecasting-module
Add ARX forecasting module with telemetry staleness and metrics
2 parents 740e588 + 0ecfd47 commit 9807692

14 files changed, with 880 additions and 20 deletions

CMakeLists.txt

Lines changed: 16 additions & 0 deletions
@@ -35,6 +35,7 @@ set(THERMAL_SIMD_CORE_SOURCES
 src/thermal_signals.c
 src/policy/policy_config.c
 src/policy/dispatcher_policy.cpp
+src/policy/arx_model.cpp
 src/policy/mpc_controller.cpp
 )
 
@@ -47,6 +48,10 @@ target_include_directories(thermal_simd_core
 PRIVATE
 ${CMAKE_CURRENT_SOURCE_DIR}/src
 )
+target_compile_definitions(thermal_simd_core
+PRIVATE
+TSD_DEFAULT_COEFF_PATH="${CMAKE_CURRENT_SOURCE_DIR}/config/controller_coeffs.json"
+)
 
 add_executable(thermal_simd src/main.cpp src/thermal_simd.c)
 target_compile_options(thermal_simd PRIVATE -O2 -pthread -fPIC -mno-avx ${THERMAL_SIMD_DISPATCHER_CPU_FLAGS})
@@ -76,6 +81,10 @@ if(BUILD_TESTING)
 ${CMAKE_CURRENT_SOURCE_DIR}/src
 )
 target_compile_definitions(thermal_simd_core_tests PRIVATE TSD_ENABLE_TESTS)
+target_compile_definitions(thermal_simd_core_tests
+PRIVATE
+TSD_DEFAULT_COEFF_PATH="${CMAKE_CURRENT_SOURCE_DIR}/config/controller_coeffs.json"
+)
 
 add_executable(test_config_parser tests/test_config_parser.c)
 target_link_libraries(test_config_parser PRIVATE thermal_simd_core_tests)
@@ -106,6 +115,13 @@ if(BUILD_TESTING)
 target_compile_options(test_policy_controller PRIVATE -Wall -Wextra)
 add_test(NAME policy_controller COMMAND test_policy_controller)
 
+add_executable(test_arx_model tests/policy/test_arx_model.cpp)
+target_link_libraries(test_arx_model PRIVATE thermal_simd_core_tests)
+target_compile_options(test_arx_model PRIVATE -Wall -Wextra)
+target_include_directories(test_arx_model PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
+target_compile_definitions(test_arx_model PRIVATE TSD_ENABLE_TESTS)
+add_test(NAME policy_arx_model COMMAND test_arx_model)
+
 add_executable(test_healthcheck_runtime_flags tests/healthcheck/runtime_flags.cpp)
 target_link_libraries(test_healthcheck_runtime_flags PRIVATE thermal_simd_core_tests)
 target_compile_options(test_healthcheck_runtime_flags PRIVATE -Wall -Wextra)

README.md

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ Sensor dropouts automatically trigger exponential back-off retries and emit logs
 
 See dedicated docs for subsystem details:
 - [Predictive Controller](docs/predictive-controller.md)
+- [Controller Coefficient Format](docs/controller_coeffs.md)
 - [Telemetry Fusion](docs/telemetry-fusion.md)
 - [Metrics Endpoints](docs/metrics-endpoints.md)
 - [Sandbox Workflow](docs/sandbox-workflow.md)

config/controller_coeffs.json

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+{
+  "bias": 1200.0,
+  "ar_temperature": [0.85],
+  "ratio": [-0.35],
+  "severity": [0.05],
+  "trimmed_ratio": [0.0],
+  "ma": 0.25,
+  "staleness_window_ms": 750
+}

docs/controller_coeffs.md

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# Controller Coefficient File
+
+The predictive controller ingests coefficients from `config/controller_coeffs.json` (or the path provided via `--coeff-path`). The file is a JSON object with the following fields:
+
+| Field | Type | Required | Description |
+| --- | --- | --- | --- |
+| `bias` | number | Yes | Constant term applied to the forecast (millicelsius). |
+| `ar_temperature` | array<number> | Yes | Auto-regressive coefficients applied to historical package temperatures. The array length determines the minimum history window. |
+| `ratio` | array<number> | No | Coefficients applied to historical SIMD ratio measurements (milli-units). |
+| `trimmed_ratio` | array<number> | No | Coefficients applied to the trimmed ratio (if available). |
+| `severity` | array<number> | No | Coefficients applied to the severity metric reported in telemetry (milli-units). |
+| `ma` | number | No | Moving-average gain applied to the most recent residual (`actual - forecast`). |
+| `staleness_window_ms` | number | No | Maximum age (in milliseconds) of telemetry used for prediction. Defaults to 500 ms. |
+
+Example:
+
+```json
+{
+  "bias": 1200.0,
+  "ar_temperature": [0.85, 0.05],
+  "ratio": [-0.30],
+  "severity": [0.04],
+  "ma": 0.25,
+  "staleness_window_ms": 750
+}
+```
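
As a rough illustration of the field table, the sketch below mirrors the coefficient file in a plain C++ struct and checks the constraints the table states: `bias` and `ar_temperature` are required, the length of `ar_temperature` sets the minimum history window, and `staleness_window_ms` falls back to 500 ms when absent. The names `CoeffSet`, `validate`, and `min_history` are hypothetical and are not taken from `src/policy/arx_model.cpp`. Treating an absent `ma` as 0.0 and rejecting a non-positive staleness window are assumptions, and JSON parsing is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of config/controller_coeffs.json.
struct CoeffSet {
    std::optional<double> bias;          // required, millicelsius
    std::vector<double> ar_temperature;  // required, at least one entry
    std::vector<double> ratio;           // optional
    std::vector<double> trimmed_ratio;   // optional
    std::vector<double> severity;        // optional
    double ma = 0.0;                     // optional; 0.0 when absent (assumption)
    double staleness_window_ms = 500.0;  // documented default
};

// Returns an empty string when the set is usable, otherwise a reason.
inline std::string validate(const CoeffSet& c) {
    if (!c.bias.has_value()) return "missing required field: bias";
    if (c.ar_temperature.empty()) return "missing required field: ar_temperature";
    // Assumption: a zero or negative window would reject every sample.
    if (c.staleness_window_ms <= 0.0) return "staleness_window_ms must be positive";
    return "";
}

// Minimum number of past temperature samples the model needs.
inline std::size_t min_history(const CoeffSet& c) {
    return c.ar_temperature.size();
}
```

With the example file above, `min_history` would report 2, because `ar_temperature` carries two coefficients.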
+
+## Hot Reload Workflow
+
+1. Update the JSON file on disk (e.g., write a new revision into the ConfigMap or local path).
+2. Send `SIGHUP` to the dispatcher process. The controller marks a reload for the next control tick.
+3. On the following recommendation cycle, the controller attempts to parse the file:
+   - Success increments `predictive_coeff_reload_total` and logs an INFO entry with the new history window and staleness guard.
+   - Failure increments `predictive_coeff_reload_errors_total`, logs an ERROR entry, and falls back to the previous coefficients or averaging forecast.
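
The deferred-reload pattern in the steps above can be sketched as follows. This is a minimal illustration, not the dispatcher's actual code: the handler only sets a flag, and the control loop consumes it on its next tick. `on_sighup`, `ReloadCounters`, and `service_reload` are hypothetical names, the two counter fields merely stand in for the documented metrics, and logging is omitted.

```cpp
#include <cassert>
#include <csignal>

// Set by the signal handler, consumed by the control loop.
static volatile std::sig_atomic_t g_reload_requested = 0;

// Async-signal-safe: the handler only stores to a sig_atomic_t flag.
extern "C" void on_sighup(int) { g_reload_requested = 1; }

struct ReloadCounters {
    unsigned long reload_total = 0;         // stands in for predictive_coeff_reload_total
    unsigned long reload_errors_total = 0;  // stands in for predictive_coeff_reload_errors_total
};

// Called once per control tick. `try_load` is any callable that returns true
// on a successful parse; on failure the caller keeps its previous coefficients.
// Returns true only when a pending reload was serviced successfully.
template <typename Loader>
bool service_reload(ReloadCounters& counters, Loader try_load) {
    if (!g_reload_requested) return false;  // nothing pending
    g_reload_requested = 0;
    if (try_load()) {
        ++counters.reload_total;
        return true;
    }
    ++counters.reload_errors_total;
    return false;
}
```

A caller would register the handler once with `std::signal(SIGHUP, on_sighup)` and invoke `service_reload` at the top of each tick. Keeping file I/O and parsing out of the handler is the point of the design, since very little is safe to call from signal context.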
+
+## Validation Tips
+
+- Use `tests/policy/test_arx_model.cpp` as a reference for crafting deterministic coefficients during development.
+- Monitor `predictive_abs_error_millic_total` to evaluate how well the updated coefficients track observed temperatures.
+- Pair coefficient adjustments with updates to [Predictive Controller](predictive-controller.md) documentation to keep operational guidance in sync.
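
Because `predictive_abs_error_millic_total` is an accumulating counter, its raw value says little on its own. One way to read it, sketched below, is to divide its increase by the increase in `predictive_forecasts_total` over the same scrape interval, which gives a mean absolute error per forecast. This assumes each forecast contributes exactly one error sample to the counter, which the documentation does not confirm. `CounterSample` and `mean_abs_error_millic` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Snapshot of the two counters at one scrape (hypothetical struct).
struct CounterSample {
    std::uint64_t abs_error_millic_total;  // predictive_abs_error_millic_total
    std::uint64_t forecasts_total;         // predictive_forecasts_total
};

// Mean absolute error in millicelsius between two scrapes.
// Returns -1.0 when no forecasts ran in the interval or a counter went
// backwards (for example after a process restart).
inline double mean_abs_error_millic(const CounterSample& before,
                                    const CounterSample& after) {
    if (after.forecasts_total <= before.forecasts_total ||
        after.abs_error_millic_total < before.abs_error_millic_total) {
        return -1.0;
    }
    const double err = static_cast<double>(after.abs_error_millic_total -
                                           before.abs_error_millic_total);
    const double n = static_cast<double>(after.forecasts_total -
                                         before.forecasts_total);
    return err / n;
}
```

For example, an increase of 6000 in the error counter across 20 forecasts corresponds to a mean absolute error of 300 millicelsius, or 0.3 °C.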

docs/metrics-endpoints.md

Lines changed: 5 additions & 1 deletion
@@ -21,7 +21,11 @@ The dispatcher exports metrics and health data via a multi-channel strategy tail
 | Metric | Type | Description |
 | --- | --- | --- |
 | `predictive_forecasts_total` | Counter | Forecast cycles executed by the predictive controller. |
-| `predictive_downgrades_total` | Counter | Controller-driven SIMD downgrades. |
+| `predictive_decisions_total` | Counter | Control loop iterations that issued a predictive decision. |
+| `predictive_abs_error_millic_total` | Counter | Accumulated absolute error between forecast and observed temperature. |
+| `predictive_stale_samples_total` | Counter | Telemetry snapshots rejected because they exceeded the staleness window. |
+| `predictive_coeff_reload_total` | Counter | Successful coefficient reloads (startup and SIGHUP). |
+| `predictive_coeff_reload_errors_total` | Counter | Failed attempts to reload the coefficient file. |
 | `telemetry_snapshots_total` | Counter | Telemetry fusion snapshots published. |
 | `telemetry_degraded_total` | Counter | Snapshots flagged as degraded due to missing signals. |
 | `patch_transitions_total` | Counter | Successful SIMD trampoline swaps. |

docs/predictive-controller.md

Lines changed: 19 additions & 11 deletions
@@ -18,17 +18,24 @@ The predictive controller combines reactive thermal throttling with a short-hori
 Each signal is tagged with a monotonic timestamp. Stale signals (>2 intervals) are discarded and treated as unavailable.
 
 ## Forecast Model
-The controller uses a single-step ARX model:
+The controller uses a single-step ARX/ARMAX model implemented in `src/policy/arx_model.cpp` and driven by coefficients stored in `config/controller_coeffs.json` (see [Controller Coefficients](controller_coeffs.md)). The model consumes a sliding window of recent telemetry samples and projects the next package temperature in millicelsius:
 
 ```
-T[t+1] = a0 + a1 * T[t] + a2 * CPI[t] + a3 * Freq[t] + a4 * Power[t]
+T[t+1] = bias
+       + Σ φᵢ · T[t-i]
+       + Σ θᵢ · Ratio[t-i]
+       + Σ γᵢ · Severity[t-i]
+       + ψ · ε[t]
 ```
 
-- Coefficients `a1..a4` are calibrated offline using lab traces and stored in `config/controller_coeffs.json`.
-- The bias `a0` compensates for ambient temperature.
-- Missing inputs zero out their coefficients and raise the `predictive_input_gaps` metric.
+- `φᵢ`, `θᵢ`, and `γᵢ` are configurable auto-regressive and exogenous coefficients.
+- `ψ` is an optional moving-average gain applied to the most recent residual `ε[t] = T[t] - T̂[t]`.
+- Missing temperature samples disable the prediction path and fall back to a simple moving average.
+- Coefficient files support hot-reload: the controller listens for `SIGHUP` and re-reads `config/controller_coeffs.json` on the next control tick. Successful reloads and failures are logged and exported via metrics.
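
To make the added equation concrete, here is a minimal sketch of a single forecast step. It is not the code in `src/policy/arx_model.cpp`. `ArxCoeffs`, `lagged_sum`, and `forecast_next` are hypothetical names, and two assumptions are made that the text does not state: histories are ordered newest-first so that index 0 is the sample at time `t`, and exogenous lags with no matching sample simply contribute nothing. When the temperature history is shorter than the AR order the function reports failure, leaving the moving-average fallback to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical coefficient bundle; field names follow controller_coeffs.json.
struct ArxCoeffs {
    double bias = 0.0;
    std::vector<double> ar_temperature;  // phi_i
    std::vector<double> ratio;           // theta_i
    std::vector<double> severity;        // gamma_i
    double ma = 0.0;                     // psi
};

// Dot product of coefficients with a newest-first history
// (hist[0] is the sample at time t, hist[1] at t-1, and so on).
inline double lagged_sum(const std::vector<double>& coeffs,
                         const std::vector<double>& hist) {
    double acc = 0.0;
    for (std::size_t i = 0; i < coeffs.size() && i < hist.size(); ++i) {
        acc += coeffs[i] * hist[i];
    }
    return acc;
}

// One-step forecast of T[t+1] in millicelsius.
// `last_residual` is the most recent error, actual minus forecast.
// Returns false when there are fewer temperature samples than AR coefficients.
inline bool forecast_next(const ArxCoeffs& c,
                          const std::vector<double>& temps,
                          const std::vector<double>& ratios,
                          const std::vector<double>& severities,
                          double last_residual,
                          double& out) {
    if (c.ar_temperature.empty() || temps.size() < c.ar_temperature.size()) {
        return false;  // caller falls back to a simple moving average
    }
    out = c.bias
        + lagged_sum(c.ar_temperature, temps)
        + lagged_sum(c.ratio, ratios)
        + lagged_sum(c.severity, severities)
        + c.ma * last_residual;
    return true;
}
```

With the shipped coefficients (`bias` 1200, `ar_temperature` [0.85], `ratio` [-0.35], `severity` [0.05], `ma` 0.25), a current temperature of 60000 millicelsius, a ratio of 500, a severity of 200, and a residual of 400, the terms are 1200 + 51000 - 175 + 10 + 100, for a forecast of about 52135 millicelsius.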
 
-The forecast produces a projected temperature and CPI value under the current SIMD width. The controller evaluates transitions (`SSE4.1`, `AVX2`, `AVX-512`) and selects the highest width whose projected temperature remains below `temp_ceiling_c - safety_margin_c` and whose CPI ratio is under `up_ratio`.
+Telemetry freshness is enforced prior to forecasting. If the latest sample exceeds the configured `staleness_window_ms`, the controller skips predictive evaluation, logs a warning, and records `predictive_stale_samples_total`.
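
The freshness gate described in the added paragraph could look roughly like this. The sketch assumes both timestamps are milliseconds from the same monotonic clock and reads "exceeds" as strictly greater than the window. `is_stale` and `StaleGate` are hypothetical names, the counter field only stands in for `predictive_stale_samples_total`, and the warning log is omitted.

```cpp
#include <cassert>
#include <cstdint>

// True when the sample is older than the window.
inline bool is_stale(std::uint64_t now_ms, std::uint64_t sample_ms,
                     std::uint64_t window_ms) {
    // Assumption: a sample stamped in the future is treated as fresh
    // rather than letting the unsigned subtraction wrap around.
    if (sample_ms > now_ms) return false;
    return (now_ms - sample_ms) > window_ms;
}

struct StaleGate {
    std::uint64_t window_ms = 500;          // default documented in controller_coeffs.md
    std::uint64_t stale_samples_total = 0;  // stands in for predictive_stale_samples_total

    // Returns true when forecasting may proceed; counts rejected samples.
    bool admit(std::uint64_t now_ms, std::uint64_t sample_ms) {
        if (is_stale(now_ms, sample_ms, window_ms)) {
            ++stale_samples_total;
            return false;
        }
        return true;
    }
};
```

With a 750 ms window, a sample that is exactly 750 ms old is still admitted, while one that is 751 ms old is rejected and counted.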
+
+The forecast produces a projected temperature under the current SIMD width. The controller evaluates transitions (`SSE4.1`, `AVX2`, `AVX-512`) and selects the highest width whose projected temperature remains below `temp_ceiling_c - safety_margin_c` and whose CPI ratio is under `up_ratio`.
 
 ## Decision Pipeline
 1. **Acquire Inputs:** Pull the latest telemetry fusion snapshot (all `TelemetrySnapshot` values share a generation number).
@@ -52,11 +59,12 @@ The forecast produces a projected temperature and CPI value under the current SI
 | `--predictive-alpha` | EWMA alpha applied to CPI history. | 0.25 |
 
 ## Telemetry & Metrics
-- `predictive_forecasts_total`: incremented each control tick.
-- `predictive_downgrades_total`: decision to reduce SIMD width due to forecast.
-- `predictive_input_gaps_total`: missing telemetry inputs for a tick.
-- `predictive_emergency_transitions_total`: emergency scalar fallbacks.
-- `predictive_coeff_reload_errors_total`: failure to read coefficients on reload.
+- `predictive_forecasts_total`: ARX/ARMAX forecasts executed with valid telemetry.
+- `predictive_decisions_total`: control decisions driven by the predictive controller.
+- `predictive_abs_error_millic_total`: accumulated absolute prediction error in millicelsius.
+- `predictive_stale_samples_total`: telemetry snapshots rejected due to staleness.
+- `predictive_coeff_reload_total`: successful coefficient reloads (including on startup).
+- `predictive_coeff_reload_errors_total`: failures to read or parse the coefficient file.
 
 Metrics are exposed through the metrics subsystem documented in [Metrics Endpoints](metrics-endpoints.md).
 
docs/sandbox-workflow.md

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ This workflow describes how to exercise the dispatcher in a non-production sandb
 
 ## Exit Criteria
 - Dispatcher exits 0.
-- `artifacts/*/metrics.ndjson` contains expected counters (`predictive_downgrades_total > 0` during spike scenario).
+- `artifacts/*/metrics.ndjson` contains expected counters (`predictive_decisions_total > 0` during spike scenario).
 - No `state=emergency` logs during nominal runs.
 
 ## Automation

include/thermal/simd/metrics.h

Lines changed: 6 additions & 0 deletions
@@ -22,6 +22,12 @@ typedef enum {
     TSD_METRIC_PATCH_FAILURES,
     TSD_METRIC_HEALTH_CHECK_FAILURES,
     TSD_METRIC_SOFTWARE_TIMEOUT_ESCALATIONS,
+    TSD_METRIC_PREDICTIVE_FORECASTS,
+    TSD_METRIC_PREDICTIVE_STALE_SAMPLES,
+    TSD_METRIC_PREDICTIVE_RELOADS,
+    TSD_METRIC_PREDICTIVE_RELOAD_ERRORS,
+    TSD_METRIC_PREDICTIVE_ABS_ERROR_MILLIC,
+    TSD_METRIC_PREDICTIVE_DECISIONS,
     TSD_METRIC_COUNT
 } tsd_metric_counter_t;
 