[iris] benchmark_controller: production-mix scenario over real RPC #5775

Merged
rjpower merged 2 commits into main from iris-bench-scenario
May 15, 2026

Conversation

@rjpower (Collaborator) commented May 15, 2026

Summary

  • Reworks lib/iris/scripts/benchmark_controller.py so every benchmark exercises the controller through its real Connect/HTTP surface (serialization, ASGI dispatch, @on_loop vs threadpool, real network) rather than calling service/transitions helpers directly in-process.
  • Adds an out-of-process controller harness so the benchmark process's client threads don't share a GIL or asyncio loop with the server.
  • Adds a scenario abstraction (RpcLoad / Scenario / ScenarioRunner) that drives a mix of RPCs at independently-configurable target rates, with a PRODUCTION_MIX_RPS preset derived from the controller stats page (SetTaskStatusText dominates at 316 rps, every other RPC < 8 rps). --scale multiplies all rates for overload runs.
  • Also bumps _wait_until_job_drained's backoff in service.py so long replacement drains don't hammer the heartbeat path.

What's new

  • RpcHarness: Real Controller(dry_run=True) in-process; production-cadence loops, destructive writes gated.
  • SubprocessRpcHarness: Spawns the controller in a child process via the new serve subcommand. Parent owns clients, child owns the event loop.
  • RpcLoad / Scenario / ScenarioRunner: Multi-RPC mix runner. n_clients_min knob so e.g. SetTaskStatusText runs from 200 connection threads, matching the production fleet's connection count.
  • 17 load_* factories: One per RPC in the production distribution. benchmark_rpcs now drives its per-call microbenches through the same factories.
  • benchmark_scenario group: Production-shape mix run via ... run --only scenario --scale 1.0 --scenario-duration 60.
  • benchmark_set_task_status_text group: Head-to-head @on_loop vs threadpool comparison + blast-radius probe (1000 RPS SetTaskStatusText storm vs victim UpdateTaskStatus).

CLI is now a click.group:

  • ... run [--only X] [--scale Y] [--scenario-duration Z] — benchmark groups
  • ... serve --db-path X --state-dir Y — controller subprocess used by SubprocessRpcHarness

Refactor lib/iris/scripts/benchmark_controller.py so every benchmark
exercises the controller through its real Connect/HTTP surface rather
than calling service/transitions helpers directly. The previous
in-process call path missed serialization, ASGI dispatch, the @on_loop
vs threadpool decision in AsyncServiceAdapter, and the GIL-shared cost
of running clients alongside the server -- all of which dominate the
production tail.

Major changes:

* New RpcHarness boots a real Controller(dry_run=True) backed by the
  cloned snapshot. The scheduler / polling / ping / heartbeat loops
  tick at their natural production cadence; only destructive writes
  are gated by dry_run.

* New SubprocessRpcHarness spawns the controller in a separate Python
  process via a new `serve` subcommand on the same script. The
  benchmark process owns the clients; the controller owns the event
  loop. Mirrors production where ~200 task processes each push their
  own SetTaskStatusText from independent interpreters.

* New scenario abstraction (RpcLoad / Scenario / ScenarioRunner) that
  drives a *mix* of RPCs at independently-configurable target rates.
  Each RpcLoad also carries an n_clients_min so we can match the
  production fleet's connection count, not just throughput.
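A minimal sketch of this abstraction: the class and field names (RpcLoad, Scenario, n_clients_min) come from the PR, but the client-count arithmetic and the assumed ~2 rps per client thread are illustrative, not the actual implementation.

```python
# Illustrative sketch of RpcLoad / Scenario, assuming each client thread
# can sustain roughly per_client_rps requests per second.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RpcLoad:
    name: str
    make_call: Callable[[], None]  # issues one RPC against the harness
    target_rps: float
    n_clients_min: int = 1  # floor on connection-thread count

    def n_clients(self, per_client_rps: float = 2.0) -> int:
        # Enough client threads to hit the target rate, but never fewer
        # than the production fleet's connection count.
        return max(self.n_clients_min, int(self.target_rps / per_client_rps) + 1)


@dataclass
class Scenario:
    loads: list[RpcLoad] = field(default_factory=list)

    def scaled(self, scale: float) -> "Scenario":
        # --scale multiplies every rate uniformly for overload runs.
        return Scenario([
            RpcLoad(l.name, l.make_call, l.target_rps * scale, l.n_clients_min)
            for l in self.loads
        ])
```

At the production rate (316 rps for SetTaskStatusText) the n_clients_min floor of 200 dominates; an overload scale pushes the thread count above it.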

* PRODUCTION_MIX_RPS preset derived from the controller stats page
  (SetTaskStatusText dominates at 316 rps, every other RPC < 8 rps).
  Run via `... run --only scenario --scale 1.0 --scenario-duration 60`.
  --scale multiplies all rates uniformly for overload runs.

* 17 load_* factories cover every RPC in the production distribution.
  benchmark_rpcs now drives the per-call microbenches through the same
  factories so request-construction lives in one place.

* New benchmark_set_task_status_text group probes the @on_loop vs
  threadpool dispatch decision head-to-head and reports the blast
  radius a 1000 rps SetTaskStatusText storm imposes on a victim
  UpdateTaskStatus.

CLI is now a click.group: `... run [...]` for benchmarks, `... serve
--db-path X --state-dir Y` for the controller subprocess.

Also bumps the _wait_until_job_drained backoff (0.05->1.0 initial,
1.0->10.0 max, 1.5->2 factor) to reduce poll pressure on the heartbeat
path during long replacement drains.
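The effect of that bump can be illustrated with a stand-in class (this is not the actual iris ExponentialBackoff, just the same three parameters):

```python
# Stand-in for an exponential backoff with the old and new parameters
# from the commit message (initial, max, factor).
class ExponentialBackoff:
    def __init__(self, initial: float, maximum: float, factor: float) -> None:
        self.initial, self.maximum, self.factor = initial, maximum, factor

    def delays(self, n: int) -> list[float]:
        out, d = [], self.initial
        for _ in range(n):
            out.append(d)
            d = min(d * self.factor, self.maximum)
        return out


old = ExponentialBackoff(0.05, 1.0, 1.5)  # many sub-second probes early on
new = ExponentialBackoff(1.0, 10.0, 2.0)  # reaches its 10 s ceiling quickly
```

The old schedule fires a burst of sub-second polls at the start of every drain; the new one starts at 1 s and caps at 10 s within a few iterations, which is what reduces pressure on the heartbeat path.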

@claude Bot (Contributor) commented May 15, 2026

Claude finished @rjpower's task in 4m 26s


Code review

  • Check PR eligibility for review
  • Identify relevant CLAUDE.md / AGENTS.md files
  • Summarize PR changes
  • Run parallel review agents (2x CLAUDE.md compliance, 2x bug scan)
  • Validate flagged issues
  • Post inline review comments

Posted 5 inline comments on high-signal issues:

  1. hasattr(_Tx, "fetchall") compat hack (L58-62) — explicit AGENTS.md violation ("No ad-hoc compatibility hacks (hasattr); update code consistently").
  2. Stale Usage: docstring after click.group() conversion (L2525-2530) — the three example invocations in the module docstring fail with No such option: --db / --only because those flags moved onto the run subcommand.
  3. Mutable module globals _SCENARIO_SCALE / _SCENARIO_DURATION (L356-357) — set via global in run_cmd, read inside benchmark_scenario; AGENTS.md requires explicit constructor/config parameters over implicit channels.
  4. Local imports inside _toggle_set_task_status_text_on_loop (L1968-1969) — no circular-dep / optional-dep justification; same pattern in _FakeProvider.reconcile_workers and several closures around db_mod.
  5. _toggle_set_task_status_text_on_loop(False) at L2418 doesn't restore source state — source has @on_loop, so forcing False is the opposite of "restore"; benign for the default group ordering but a landmine for any future in-process group after set_task_status_text.

Items considered but not flagged (low signal or borderline):

  • service.py ExponentialBackoff change — explicitly mentioned in the PR description, intentional behavior change, not a bug.
  • getattr(t, "current_worker_id", None) at L96 — defensive but lacks the explicit "old_attr restore" character of the hasattr rule.
  • SubprocessRpcHarness.__init__ blocking readline() ignoring startup_timeout_s — real but uvicorn's bootup logging keeps it from blocking forever in practice.
  • Concurrent ANALYZE between parent sample_db and the subprocess controller — speculative without a repro.
  • scenarios raw-dict list in _run_apply_under_contention (~L1919) — style nit, not a clear correctness issue.

@rjpower rjpower added the agent-generated Created by automation/agent label May 15, 2026

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da3d35b9fd


Comment on lines +304 to +305
line = self._proc.stdout.readline() if self._proc.stdout else ""
if not line:

P1 Badge Make subprocess startup timeout actually enforceable

SubprocessRpcHarness.__init__ uses self._proc.stdout.readline() inside the timeout loop, but readline() blocks until a newline/EOF. If the child process hangs before emitting a line, this call blocks indefinitely and the startup_timeout_s deadline is never checked, so benchmark_scenario can hang forever instead of failing fast.
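One way to make the deadline enforceable is to poll the pipe before reading, sketched here with a POSIX select() loop; this is an illustration of the fix, not the PR's code, and the function name is hypothetical.

```python
# Hypothetical fix sketch: honor the startup deadline even if the child
# never writes a newline. Assumes a POSIX pipe (select on pipe fds).
import os
import select
import subprocess
import time


def wait_for_ready_line(proc: subprocess.Popen, timeout_s: float) -> str:
    """Read the child's first stdout line, enforcing the startup deadline."""
    assert proc.stdout is not None
    fd = proc.stdout.fileno()
    deadline = time.monotonic() + timeout_s
    buf = b""
    while time.monotonic() < deadline:
        # Wake every 100 ms so the deadline check actually runs.
        ready, _, _ = select.select([fd], [], [], 0.1)
        if not ready:
            continue
        chunk = os.read(fd, 4096)
        if not chunk:  # EOF: child exited before reporting ready
            raise RuntimeError("controller subprocess exited before readiness line")
        buf += chunk
        if b"\n" in buf:
            return buf.split(b"\n", 1)[0].decode()
    raise TimeoutError(f"no readiness line within {timeout_s}s")
```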


Comment on lines +2416 to +2418
# Restore the function attribute to whatever the source file declares so
# subsequent groups in the same run aren't observably mutated.
_toggle_set_task_status_text_on_loop(False)

P1 Badge Restore SetTaskStatusText dispatch mode to source default

This cleanup call always forces set_task_status_text out of @on_loop mode for the rest of the process, but the source method is currently decorated with @on_loop. In the default run ordering, set_task_status_text runs before scenario, so the production-mix scenario is measured with the wrong dispatch mode and reports misleading latency/throughput numbers.


Comment on lines +2383 to +2385
tp_baseline, tp_under = _measure_update_task_status_under_storm(
db, on_loop=False, storm_rps_target=storm_rps, storm_threads=storm_threads, victim_duration_s=victim_seconds
)

P2 Badge Run contention probe against a cloned controller DB

benchmark_set_task_status_text passes the shared db into _measure_update_task_status_under_storm, which repeatedly invokes update_task_status (a write path). Unlike the other write-heavy benchmark groups that clone the DB first, this mutates the same checkpoint backing the rest of the run, contaminating later groups (especially scenario) and the cached local benchmark DB state.


@rjpower rjpower requested a review from yonromai May 15, 2026 19:44
@rjpower rjpower enabled auto-merge (squash) May 15, 2026 19:44
Comment on lines 58 to 62
# Branch removed Tx.fetchall/fetchone; restore for this benchmark script.
if not hasattr(_Tx, "fetchall"):
_Tx.fetchall = lambda self, stmt, params=None: self.execute(stmt, params).all()
_Tx.fetchone = lambda self, stmt, params=None: self.execute(stmt, params).first()
from iris.cluster.controller.projections.endpoints import EndpointQuery, EndpointRow, EndpointsProjection
Contributor


CLAUDE.md violation: ad-hoc compatibility hack via hasattr.

AGENTS.md says: "No ad-hoc compatibility hacks (hasattr(m, 'old_attr')); update code consistently." The leading comment ("Branch removed Tx.fetchall/fetchone; restore for this benchmark script.") makes this an exact instance of the banned pattern.

Two follow-on problems:

  • The import on line 62 is displaced after the conditional block, breaking "All imports at the top of the file".
  • Monkey-patching _Tx globally mutates the class for any code that imports it after this script runs (the serve subcommand is in the same module).

Fix: update this script's call sites to use tx.execute(stmt, params).all() / .first() directly (matching the rest of the codebase), then drop the patch and move the displaced import back to the top block.
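The suggested rewrite can be sketched against a minimal stand-in for the transaction wrapper; Tx and Result here are illustrative stand-ins, not the real iris classes.

```python
# Illustrative stand-in for the codebase's Tx API, showing why call sites
# can use execute(...).all() / .first() directly with no hasattr patch.
import sqlite3


class Result:
    def __init__(self, cur: sqlite3.Cursor) -> None:
        self._cur = cur

    def all(self) -> list:
        return self._cur.fetchall()

    def first(self):
        return self._cur.fetchone()


class Tx:
    """Minimal stand-in for the codebase's transaction wrapper."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def execute(self, stmt: str, params: tuple = ()) -> Result:
        return Result(self._conn.execute(stmt, params))


# Call sites use the native API directly, so no monkey-patch is needed:
tx = Tx(sqlite3.connect(":memory:"))
rows = tx.execute("SELECT 1 AS x").all()
```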

Comment on lines +356 to +357
_SCENARIO_SCALE: float = 1.0
_SCENARIO_DURATION: float = 60.0
Contributor


CLAUDE.md violation: mutable module globals used as an implicit parameter channel.

These look like top-level constants (ALL_CAPS) but are mutated by run_cmd via global at lines 2560-2562 and then read by benchmark_scenario at line 2325.

AGENTS.md requires: "Force explicit specification of critical parameters (no silent defaults). Centralize defaults in one canonical location. Prefer explicit constructor/config parameters over env vars." A global mutated by a CLI handler and read by a sibling benchmark function is the same anti-pattern this rule targets — the value flows invisibly between functions.

Suggested fix: thread scale / duration_s through benchmark_scenario(db, *, scale, duration_s) and the groups dispatch loop. The other group benchmarks don't need these params, so either accept **kwargs or branch on the group name to pass them only to benchmark_scenario.
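A sketch of that threading, with simplified stand-in group functions (only the knob plumbing matters here; the real benchmark bodies are elided):

```python
# Hypothetical sketch: scenario knobs flow as explicit keyword parameters
# instead of mutated module globals. Group functions are stand-ins.
def benchmark_scenario(db, *, scale: float = 1.0, duration_s: float = 60.0) -> dict:
    # Stand-in for the real group: echoes the knobs it received.
    return {"scale": scale, "duration_s": duration_s}


def benchmark_rpcs(db) -> dict:
    return {"group": "rpcs"}


GROUPS = {"rpcs": benchmark_rpcs, "scenario": benchmark_scenario}


def run_groups(db, only=None, *, scale: float = 1.0, duration_s: float = 60.0) -> dict:
    results = {}
    for name, fn in GROUPS.items():
        if only and name != only:
            continue
        # Only the scenario group takes the scenario knobs; the rest
        # are called with just the db.
        kwargs = {"scale": scale, "duration_s": duration_s} if name == "scenario" else {}
        results[name] = fn(db, **kwargs)
    return results
```

The CLI handler then passes --scale / --scenario-duration straight into run_groups, and nothing reads module state.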

Comment on lines +1967 to +1969
"""
from iris.cluster.controller.service import ControllerServiceImpl
from iris.rpc.async_adapter import _ON_LOOP_ATTR
Contributor


CLAUDE.md violation: local imports without a circular-dependency / optional-dep justification.

AGENTS.md requires: "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps."

Neither symbol here is on a circular-import path:

  • ControllerServiceImpl lives in iris.cluster.controller.service, which is already imported at lines 74-78 (for USER_JOB_STATES et al.).
  • _ON_LOOP_ATTR is from iris.rpc.async_adapter, which the script does not yet import — adding it at the top is fine.

A similar local import sits in _FakeProvider.reconcile_workers at line 154 (from iris.cluster.controller.worker_provider import WorkerReconcileResult) — same pattern, please move that one too. There are also several from iris.cluster.controller import db as db_mod lines inside closures (around lines 1283, 1309, 1440, 1553) that should be hoisted.

Comment thread on lib/iris/scripts/benchmark_controller.py (outdated)
Comment on lines +2525 to +2530
@click.group()
def main() -> None:
"""Benchmark the Iris controller against a local checkpoint."""


@main.command("run")
Contributor


The click.group() conversion makes the module docstring's Usage: examples stale.

The module docstring at lines 12-22 still shows:

uv run python lib/iris/scripts/benchmark_controller.py
uv run python lib/iris/scripts/benchmark_controller.py --db ./controller.sqlite3
uv run python lib/iris/scripts/benchmark_controller.py --only polling

After this change all three fail with Error: No such option: --db / --only — the flags now live on the run subcommand. Please update the docstring to use ... run --db ... / ... run --only ... (and consider listing the new set_task_status_text / scenario groups while you're there).

@yonromai (Contributor) left a comment

🤖 Requesting changes because auto-merge is enabled and there are clear correctness issues to resolve before merge. I did not repost the existing inline feedback, but I agree that the current branch can hang forever waiting for serve startup output, and that benchmark_set_task_status_text runs UpdateTaskStatus against the shared checkpoint DB instead of a clone, contaminating the cached benchmark input state. I also verified the stale-usage comment: the script's documented --db / --only invocations are rejected by the new click group shape.

Generated with Codex

…cess

Every bench group now spawns the dry-run controller as a subprocess (the
production shape: each client is its own interpreter). The in-process
``RpcHarness`` was kept only for ``benchmark_rpcs``'s endpoint-cache
rehydrate trick, which we now drop — ``load_register_endpoint`` already
mints unique endpoint names per call, so the table can grow during the
bench without skewing measurements.

- Delete the in-process ``RpcHarness`` (and ``_find_free_port``) + the
  ``_reset_endpoint`` callback + the ``transitions``/``health``/
  ``endpoints``/``worker_attrs`` projection exposure that only worked
  in-process.
- Rename ``SubprocessRpcHarness`` → ``RpcHarness``.
- Wire ``--on-loop`` through ``serve_cmd`` so the SetTaskStatusText
  ``@on_loop`` toggle now happens in the child before the controller
  boots (the previous in-parent patch can't reach the child interpreter).
  ``_run_set_status_text_variant`` and
  ``_measure_update_task_status_under_storm`` pass ``on_loop`` through
  the harness constructor.

Verification: scenario --scale 0.1 --scenario-duration 8, the full rpcs
group, and the full set_task_status_text group (both on_loop modes)
all run end-to-end against the new subprocess harness.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@yonromai (Contributor) left a comment

🤖 Still requesting changes on f383e13dd8bf5402200d201a32e03d42c5896254. The follow-up commit fixes the shared-checkpoint mutation concern by cloning before the SetTaskStatusText probes, and it moves the @on_loop toggle into the subprocess path. The blocking startup issue is still present, though: RpcHarness.__init__ still calls blocking stdout.readline() inside the timeout loop, so a child that hangs before writing a newline can still hang the benchmark forever. I also rechecked the CLI usage issue; the module doc still advertises benchmark_controller.py --db ... / --only ..., and those options are still rejected unless they are placed after run.

Generated with Codex

@rjpower rjpower merged commit 0c0a3ae into main May 15, 2026
36 checks passed
@rjpower rjpower deleted the iris-bench-scenario branch May 15, 2026 20:11