[Iris] Replace txn_log audit table with structured log lines#5082
[Iris] Replace txn_log audit table with structured log lines#5082
Conversation
Drop the SQLite txn_log/txn_actions tables and their RPC/dashboard surface. Every state-changing transition now emits a single logger.info line of the form event=<action> entity=<id> trigger=<parent> k=v..., which the log server already captures. Cascades carry trigger= naming the parent event (e.g. task->job terminations). Also instruments api_key create/revoke, which previously had no audit trail. Migration 0037 drops the tables. Fixes #4895 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Claude finished @rjpower's task in 8m 12s —— View job Code review
Two validated issues, both posted inline:
No other bugs or CLAUDE.md/AGENTS.md violations found. The table/RPC/dashboard deletions, migration 0037, and auth instrumentation are consistent; deleted symbols have no other lingering references in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 171c7fcf21
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| return list(reversed(actions)) | ||
|
|
||
|
|
||
| def _live_user_stats(db: ControllerDB) -> list[UserStats]: |
There was a problem hiding this comment.
Remove stale _transaction_actions benchmark dependency
This change deletes the _transaction_actions helper from service.py, but lib/iris/scripts/benchmark_db_queries.py in the same commit still imports and benchmarks _transaction_actions (import at line 70 and call at line 442), so running the benchmark script now raises an import-time failure instead of executing. Either keep a compatibility helper here or remove/replace the benchmark path in the script in the same change.
Useful? React with 👍 / 👎.
| -- What the controller has been doing | ||
| SELECT kind, count(*) FROM txn_log GROUP BY kind ORDER BY count(*) DESC LIMIT 10; | ||
| ```bash | ||
| iris process logs --since 24h | grep 'action=worker_heartbeat_failed' |
There was a problem hiding this comment.
Grep audit events using the
event= key
The new audit logging format emitted by controller transitions uses event=<action> (for example event=worker_heartbeat_failed), but this OPS example filters on action=worker_heartbeat_failed; operators following it will miss the intended audit entries during debugging. Update the filter/example to match the emitted field name.
Useful? React with 👍 / 👎.
| Controller audit events (`event=<kind> action=<action> entity=<id> ...`) are | ||
| emitted as structured `logger.info` lines — query them through | ||
| `iris process logs` with a substring filter, not via SQL. Example: | ||
|
|
||
| -- What the controller has been doing | ||
| SELECT kind, count(*) FROM txn_log GROUP BY kind ORDER BY count(*) DESC LIMIT 10; | ||
| ```bash | ||
| iris process logs --since 24h | grep 'action=worker_heartbeat_failed' |
There was a problem hiding this comment.
The grammar description and grep example don't match what log_event actually emits. log_event in transitions.py emits event=<action> entity=<entity_id> trigger=<trigger> k=v ... — there is no separate action= token. The documented grep 'action=worker_heartbeat_failed' will match nothing. Either the grammar line should drop the phantom action=<action> and the example should be grep 'event=worker_heartbeat_failed'.
| Controller audit events (`event=<kind> action=<action> entity=<id> ...`) are | |
| emitted as structured `logger.info` lines — query them through | |
| `iris process logs` with a substring filter, not via SQL. Example: | |
| -- What the controller has been doing | |
| SELECT kind, count(*) FROM txn_log GROUP BY kind ORDER BY count(*) DESC LIMIT 10; | |
| ```bash | |
| iris process logs --since 24h | grep 'action=worker_heartbeat_failed' | |
| Controller audit events (`event=<action> entity=<id> trigger=<parent> k=v ...`) | |
| are emitted as structured `logger.info` lines — query them through | |
| `iris process logs` with a substring filter, not via SQL. Example: | |
| ```bash | |
| iris process logs --since 24h | grep 'event=worker_heartbeat_failed' |
| "reservation_claims", | ||
| "txn_log", | ||
| "txn_actions", | ||
| "meta", |
There was a problem hiding this comment.
This file has two more dangling references to the deleted _transaction_actions helper that this PR missed:
- line 70 — still imported in the
from iris.cluster.controller.service import (...)block - line 442 — still called via
bench("_transaction_actions", lambda: _transaction_actions(db))
Since _transaction_actions was removed from service.py, running the script now raises ImportError: cannot import name '_transaction_actions' from 'iris.cluster.controller.service' at module load. Drop both the import entry and the bench(...) call.
Wire log_event into scheduling pass completion, StartTasks RPC dispatch failure, worker-rejected task acks, worker termination (with cause trigger), sibling worker failover, reservation claim mutations, and the provider-sync health summary. Enriches worker_heartbeat_failed with address/age/rpc_action context and drops the unstructured duplicate warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the SQLite txn_log/txn_actions tables and their RPC/dashboard surface. Every state-changing transition now emits a single logger.info line of the form event= entity= trigger= k=v..., which the log server already captures. Cascades carry trigger= naming the parent event. Also instruments api_key create/revoke, which previously had no audit trail. Migration 0037 drops the tables.
Fixes #4895