[iris] benchmark_controller: production-mix scenario over real RPC#5775
Conversation
Refactor lib/iris/scripts/benchmark_controller.py so every benchmark exercises the controller through its real Connect/HTTP surface rather than calling service/transitions helpers directly. The previous in-process call path missed serialization, ASGI dispatch, the @on_loop vs threadpool decision in AsyncServiceAdapter, and the GIL-shared cost of running clients alongside the server -- all of which dominate the production tail.

Major changes:

* New RpcHarness boots a real Controller(dry_run=True) backed by the cloned snapshot. The scheduler / polling / ping / heartbeat loops tick at their natural production cadence; only destructive writes are gated by dry_run.
* New SubprocessRpcHarness spawns the controller in a separate Python process via a new `serve` subcommand on the same script. The benchmark process owns the clients; the controller owns the event loop. This mirrors production, where ~200 task processes each push their own SetTaskStatusText from independent interpreters.
* New scenario abstraction (RpcLoad / Scenario / ScenarioRunner) that drives a *mix* of RPCs at independently configurable target rates. Each RpcLoad also carries an n_clients_min so we can match the production fleet's connection count, not just its throughput. (A sketch of this abstraction follows this description.)
* PRODUCTION_MIX_RPS preset derived from the controller stats page (SetTaskStatusText dominates at 316 rps; every other RPC runs at < 8 rps). Run via `... run --only scenario --scale 1.0 --scenario-duration 60`. --scale multiplies all rates uniformly for overload runs.
* 17 load_* factories cover every RPC in the production distribution. benchmark_rpcs now drives the per-call microbenches through the same factories, so request construction lives in one place.
* New benchmark_set_task_status_text group probes the @on_loop vs threadpool dispatch decision head-to-head and reports the blast radius a 1000 rps SetTaskStatusText storm imposes on a victim UpdateTaskStatus.

CLI is now a click.group: `... run [...]` for benchmarks, `... serve --db-path X --state-dir Y` for the controller subprocess.

Also bumps the _wait_until_job_drained backoff (0.05 -> 1.0 initial, 1.0 -> 10.0 max, 1.5 -> 2 factor) to reduce poll pressure on the heartbeat path during long replacement drains.
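To make the new abstraction concrete, here is a minimal sketch assuming plain dataclasses; `RpcLoad`, `Scenario`, `ScenarioRunner`, the per-RPC target rate, and `n_clients_min` come from the description above, while every other field name is illustrative:

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RpcLoad:
    name: str                        # RPC name, e.g. "SetTaskStatusText"
    make_request: Callable[[], Any]  # one of the load_* factories
    target_rps: float                # independently configurable target rate
    n_clients_min: int = 1           # floor on concurrent client connections


@dataclass
class Scenario:
    loads: list[RpcLoad]
    duration_s: float = 60.0

    def scaled(self, scale: float) -> "Scenario":
        # --scale multiplies every load's rate uniformly for overload runs.
        return Scenario(
            loads=[
                RpcLoad(l.name, l.make_request, l.target_rps * scale, l.n_clients_min)
                for l in self.loads
            ],
            duration_s=self.duration_s,
        )
```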
Claude finished @rjpower's task in 4m 26s (code review).
Posted 5 inline comments on high-signal issues.
Items considered but not flagged (low signal or borderline).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: da3d35b9fd
```python
line = self._proc.stdout.readline() if self._proc.stdout else ""
if not line:
```
Make subprocess startup timeout actually enforceable
SubprocessRpcHarness.__init__ uses self._proc.stdout.readline() inside the timeout loop, but readline() blocks until a newline/EOF. If the child process hangs before emitting a line, this call blocks indefinitely and the startup_timeout_s deadline is never checked, so benchmark_scenario can hang forever instead of failing fast.
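One way to make the deadline enforceable, sketched here rather than taken from the branch: pump the child's stdout from a daemon thread and wait on a queue with a real timeout, so a silent child fails fast instead of parking the caller forever in `readline()`. The helper name and ready-line protocol are assumptions.

```python
import queue
import subprocess
import threading
import time


def wait_for_ready_line(proc: subprocess.Popen, startup_timeout_s: float) -> str:
    lines: queue.Queue[str] = queue.Queue()

    def _pump() -> None:
        assert proc.stdout is not None
        for line in proc.stdout:  # blocks in the helper thread, not the caller
            lines.put(line)

    threading.Thread(target=_pump, daemon=True).start()
    deadline = time.monotonic() + startup_timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            proc.kill()
            raise TimeoutError(f"controller subprocess silent for {startup_timeout_s}s")
        try:
            return lines.get(timeout=remaining)
        except queue.Empty:
            continue  # re-check the deadline; the next pass raises
```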
```python
# Restore the function attribute to whatever the source file declares so
# subsequent groups in the same run aren't observably mutated.
_toggle_set_task_status_text_on_loop(False)
```
Restore SetTaskStatusText dispatch mode to source default
This cleanup call always forces set_task_status_text out of @on_loop mode for the rest of the process, but the source method is currently decorated with @on_loop. In the default run ordering, set_task_status_text runs before scenario, so the production-mix scenario is measured with the wrong dispatch mode and reports misleading latency/throughput numbers.
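A cleanup that restores the source default instead of hard-coding `False` could snapshot whatever the function declares before the group toggles it. The context-manager shape and raw attribute handling below are assumptions about how the `@on_loop` marker is stored:

```python
import contextlib

_UNSET = object()


@contextlib.contextmanager
def _on_loop_override(func, attr_name: str, enabled: bool):
    # Snapshot the source-declared value (or its absence) before mutating.
    original = getattr(func, attr_name, _UNSET)
    setattr(func, attr_name, enabled)
    try:
        yield
    finally:
        # Put back exactly what the source file declared, not a guess.
        if original is _UNSET:
            delattr(func, attr_name)
        else:
            setattr(func, attr_name, original)
```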
```python
tp_baseline, tp_under = _measure_update_task_status_under_storm(
    db, on_loop=False, storm_rps_target=storm_rps, storm_threads=storm_threads, victim_duration_s=victim_seconds
)
```
Run contention probe against a cloned controller DB
benchmark_set_task_status_text passes the shared db into _measure_update_task_status_under_storm, which repeatedly invokes update_task_status (a write path). Unlike the other write-heavy benchmark groups that clone the DB first, this mutates the same checkpoint backing the rest of the run, contaminating later groups (especially scenario) and the cached local benchmark DB state.
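The clone-first pattern the other write-heavy groups follow might look like this sketch; `_clone_checkpoint_db` is a hypothetical stand-in for whatever helper the script already uses to copy the checkpoint:

```python
import shutil
import tempfile
from pathlib import Path


def _clone_checkpoint_db(db_path: Path) -> Path:
    # Copy the checkpoint into a throwaway dir so the storm's
    # update_task_status writes never touch the shared benchmark DB.
    tmp_dir = Path(tempfile.mkdtemp(prefix="bench-db-clone-"))
    clone = tmp_dir / db_path.name
    shutil.copy2(db_path, clone)
    return clone
```

The probe would then open the clone and pass that handle into `_measure_update_task_status_under_storm`, leaving the cached checkpoint intact for later groups.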
```python
# Branch removed Tx.fetchall/fetchone; restore for this benchmark script.
if not hasattr(_Tx, "fetchall"):
    _Tx.fetchall = lambda self, stmt, params=None: self.execute(stmt, params).all()
    _Tx.fetchone = lambda self, stmt, params=None: self.execute(stmt, params).first()
from iris.cluster.controller.projections.endpoints import EndpointQuery, EndpointRow, EndpointsProjection
```
CLAUDE.md violation: ad-hoc compatibility hack via hasattr.
AGENTS.md says: "No ad-hoc compatibility hacks (hasattr(m, 'old_attr')); update code consistently." The leading comment ("Branch removed Tx.fetchall/fetchone; restore for this benchmark script.") makes this an exact instance of the banned pattern.
Two follow-on problems:
- The import on line 62 is displaced after the conditional block, breaking "All imports at the top of the file".
- Monkey-patching `_Tx` globally mutates the class for any code that imports it after this script runs (the `serve` subcommand is in the same module).
Fix: update this script's call sites to use `tx.execute(stmt, params).all()` / `.first()` directly (matching the rest of the codebase), then drop the patch and move the displaced import back to the top block.
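Sketched call-site shape for that fix; the SQL and table names below are placeholders, not the script's actual queries:

```python
def endpoint_rows(tx, job_id: str):
    # was: tx.fetchall(stmt, params)
    return tx.execute(
        "SELECT * FROM endpoints WHERE job_id = ?", (job_id,)
    ).all()


def endpoint_row(tx, name: str):
    # was: tx.fetchone(stmt, params)
    return tx.execute(
        "SELECT * FROM endpoints WHERE name = ?", (name,)
    ).first()
```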
```python
_SCENARIO_SCALE: float = 1.0
_SCENARIO_DURATION: float = 60.0
```
CLAUDE.md violation: mutable module globals used as an implicit parameter channel.
These look like top-level constants (ALL_CAPS) but are mutated by run_cmd via global at lines 2560-2562 and then read by benchmark_scenario at line 2325.
AGENTS.md requires: "Force explicit specification of critical parameters (no silent defaults). Centralize defaults in one canonical location. Prefer explicit constructor/config parameters over env vars." A global mutated by a CLI handler and read by a sibling benchmark function is the same anti-pattern this rule targets — the value flows invisibly between functions.
Suggested fix: thread scale / duration_s through benchmark_scenario(db, *, scale, duration_s) and the groups dispatch loop. The other group benchmarks don't need these params, so either accept **kwargs or branch on the group name to pass them only to benchmark_scenario.
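A sketch of threading the values explicitly; the signatures are assumptions about the script's current group-dispatch shape:

```python
def benchmark_scenario(db, *, scale: float, duration_s: float) -> None:
    # scale / duration_s are now explicit parameters, not module globals.
    ...


def run_groups(db, groups: dict, *, scale: float, duration_s: float) -> None:
    for name, bench in groups.items():
        if name == "scenario":
            bench(db, scale=scale, duration_s=duration_s)
        else:
            bench(db)  # the other groups don't consume these knobs
```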
| """ | ||
| from iris.cluster.controller.service import ControllerServiceImpl | ||
| from iris.rpc.async_adapter import _ON_LOOP_ATTR |
CLAUDE.md violation: local imports without a circular-dependency / optional-dep justification.
AGENTS.md requires: "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps."
Neither symbol here is on a circular-import path:
- `ControllerServiceImpl` lives in `iris.cluster.controller.service`, which is already imported at lines 74-78 (for `USER_JOB_STATES` et al.).
- `_ON_LOOP_ATTR` is from `iris.rpc.async_adapter`, which the script does not yet import; adding it at the top is fine.
A similar local import sits in `_FakeProvider.reconcile_workers` at line 154 (`from iris.cluster.controller.worker_provider import WorkerReconcileResult`) — same pattern, please move that one too. There are also several `from iris.cluster.controller import db as db_mod` lines inside closures (around lines 1283, 1309, 1440, 1553) that should be hoisted.
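Hoisted, the top-of-file import block would gain these lines (paths taken from this review):

```python
from iris.cluster.controller import db as db_mod
from iris.cluster.controller.service import ControllerServiceImpl
from iris.cluster.controller.worker_provider import WorkerReconcileResult
from iris.rpc.async_adapter import _ON_LOOP_ATTR
```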
```python
@click.group()
def main() -> None:
    """Benchmark the Iris controller against a local checkpoint."""


@main.command("run")
```
The click.group() conversion makes the module docstring's Usage: examples stale.
The module docstring at lines 12-22 still shows:
```
uv run python lib/iris/scripts/benchmark_controller.py
uv run python lib/iris/scripts/benchmark_controller.py --db ./controller.sqlite3
uv run python lib/iris/scripts/benchmark_controller.py --only polling
```
After this change all three fail with `Error: No such option: --db` / `--only` — the flags now live on the `run` subcommand. Please update the docstring to use `... run --db ...` / `... run --only ...` (and consider listing the new `set_task_status_text` / `scenario` groups while you're there).
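A possible refresh of the docstring, assuming the flag spellings from the PR description; the `serve` arguments here are illustrative:

```python
"""Benchmark the Iris controller against a local checkpoint.

Usage:
    uv run python lib/iris/scripts/benchmark_controller.py run
    uv run python lib/iris/scripts/benchmark_controller.py run --db ./controller.sqlite3
    uv run python lib/iris/scripts/benchmark_controller.py run --only scenario --scale 1.0 --scenario-duration 60
    uv run python lib/iris/scripts/benchmark_controller.py run --only set_task_status_text
    uv run python lib/iris/scripts/benchmark_controller.py serve --db-path ./controller.sqlite3 --state-dir /tmp/iris-state
"""
```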
yonromai left a comment:
🤖 Requesting changes because auto-merge is enabled and there are clear correctness issues to resolve before merge. I did not repost the existing inline feedback, but I agree that the current branch can hang forever waiting for `serve` startup output, and that `benchmark_set_task_status_text` runs UpdateTaskStatus against the shared checkpoint DB instead of a clone, contaminating the cached/input benchmark state. I also verified the stale-usage comment: the script's documented pre-`run` `--db` / `--only` invocations are rejected by the new click group shape.
Generated with Codex
…cess

Every bench group now spawns the dry-run controller as a subprocess (the production shape: each client is its own interpreter). The in-process ``RpcHarness`` was kept only for ``benchmark_rpcs``'s endpoint-cache rehydrate trick, which we now drop — ``load_register_endpoint`` already mints unique endpoint names per call, so the table can grow during the bench without skewing measurements.

- Delete the in-process ``RpcHarness`` (and ``_find_free_port``), the ``_reset_endpoint`` callback, and the ``transitions``/``health``/``endpoints``/``worker_attrs`` projection exposure that only worked in-process.
- Rename ``SubprocessRpcHarness`` → ``RpcHarness``.
- Wire ``--on-loop`` through ``serve_cmd`` so the SetTaskStatusText ``@on_loop`` toggle now happens in the child before the controller boots (the previous in-parent patch can't reach the child interpreter). ``_run_set_status_text_variant`` and ``_measure_update_task_status_under_storm`` pass ``on_loop`` through the harness constructor.

Verification: ``scenario --scale 0.1 --scenario-duration 8``, the full ``rpcs`` group, and the full ``set_task_status_text`` group (both ``on_loop`` modes) all run end-to-end against the new subprocess harness.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
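The child-side toggle this commit describes might look like the following sketch. `_toggle_set_task_status_text_on_loop` is named earlier in this thread and the `--on-loop` flag spelling comes from the commit message, while `_boot_controller` is a hypothetical stand-in for the serve path's startup code:

```python
import click


@click.group()
def main() -> None:
    """Benchmark the Iris controller against a local checkpoint."""


@main.command("serve")
@click.option("--db-path", required=True)
@click.option("--state-dir", required=True)
@click.option(
    "--on-loop/--no-on-loop",
    default=True,
    help="Dispatch SetTaskStatusText via @on_loop (vs the threadpool).",
)
def serve_cmd(db_path: str, state_dir: str, on_loop: bool) -> None:
    # Flip the dispatch mode in the child interpreter, before the controller
    # boots; an in-parent patch can't reach this process.
    _toggle_set_task_status_text_on_loop(on_loop)
    _boot_controller(db_path=db_path, state_dir=state_dir)  # hypothetical helper
```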
yonromai left a comment:
🤖 Still requesting changes on f383e13dd8bf5402200d201a32e03d42c5896254. The follow-up commit fixes the shared-checkpoint mutation concern by cloning before the SetTaskStatusText probes, and it moves the `@on_loop` toggle into the subprocess path. The blocking startup issue is still present, though: `RpcHarness.__init__` still calls blocking `stdout.readline()` inside the timeout loop, so a child that hangs before writing a newline can still hang the benchmark forever. I also rechecked the CLI usage issue; the module doc still advertises `benchmark_controller.py --db ...` / `--only ...`, and those options are still rejected unless they are placed after `run`.
Generated with Codex
Summary
- Refactors `lib/iris/scripts/benchmark_controller.py` so every benchmark exercises the controller through its real Connect/HTTP surface (serialization, ASGI dispatch, `@on_loop` vs threadpool, real network) rather than calling service/transitions helpers directly in-process.
- New `scenario` abstraction (`RpcLoad` / `Scenario` / `ScenarioRunner`) that drives a mix of RPCs at independently-configurable target rates, with a `PRODUCTION_MIX_RPS` preset derived from the controller stats page (SetTaskStatusText dominates at 316 rps, every other RPC < 8 rps). `--scale` multiplies all rates for overload runs.
- Bumps `_wait_until_job_drained`'s backoff in `service.py` so long replacement drains don't hammer the heartbeat path.

What's new
- `RpcHarness`: boots `Controller(dry_run=True)` in-process; production-cadence loops, destructive writes gated.
- `SubprocessRpcHarness`: spawns the controller via the `serve` subcommand. Parent owns clients, child owns the event loop.
- `RpcLoad` / `Scenario` / `ScenarioRunner`: `n_clients_min` knob so e.g. SetTaskStatusText runs from 200 connection threads, matching the production fleet's connection count.
- `load_*` factories: `benchmark_rpcs` now drives its per-call microbenches through the same factories.
- `benchmark_scenario` group: `... run --only scenario --scale 1.0 --scenario-duration 60`.
- `benchmark_set_task_status_text` group: `@on_loop` vs threadpool comparison + blast-radius probe (1000 RPS SetTaskStatusText storm vs victim UpdateTaskStatus).

CLI is now a click.group:
- `... run [--only X] [--scale Y] [--scenario-duration Z]`: benchmark groups
- `... serve --db-path X --state-dir Y`: controller subprocess used by `SubprocessRpcHarness`
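On the parent side, spawning that subprocess might look like this sketch (command shape from the summary above; the `sys.executable` wiring and function name are assumptions):

```python
import subprocess
import sys


def spawn_controller(db_path: str, state_dir: str) -> subprocess.Popen:
    # Parent owns the clients; the child owns the controller's event loop.
    return subprocess.Popen(
        [
            sys.executable,
            "lib/iris/scripts/benchmark_controller.py",
            "serve",
            "--db-path", db_path,
            "--state-dir", state_dir,
        ],
        stdout=subprocess.PIPE,
        text=True,  # line-oriented ready/heartbeat output
    )
```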