[iris] Lift log store and log server into new lib/finelog package by rjpower · Pull Request #5212 · marin-community/marin

rjpower · 2026-04-27T19:46:14Z

Project Idea

Lift the logging service out of the iris controller into a standalone lib/finelog package per issue #5210's 1-pager process. Reduces the controller's footprint, lets the log subsystem evolve independently, and unblocks future work on a sibling stats service. Also a forcing function for figuring out our service-extraction template (proxy management, proto sharing, deploy artifacts) on a small target before tackling stats.

Challenges

Cross-project proto dependencies. Iris's controller.proto and job.proto embed LogEntry directly in their RPC schemas, so finelog's proto can't replace iris's without a cross-package proto import (currently disallowed). Resolution: iris keeps a wire-compatible duplicate at iris_logging.proto and transcodes via SerializeToString round-trip at the boundary in service.py. Iris's time.proto was renamed to iris_time.proto for descriptor-pool dedup; iris.rpc.{logging_pb2, time_pb2} are now shim re-exports.

Proxy / discovery. Workers already resolve the log server via controller.list_endpoints("/system/log_server"), so the data plane lifts cleanly. The new question is how operators tell iris where the external server lives. Solved with a generic cluster_config.endpoints map (logical name → EndpointSpec{uri, metadata}) and a resolve_endpoint_uri scheme dispatch (http:// works today; gcp:// and k8s:// are explicit NotImplementedError stubs to be filled when we deploy that way).
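A minimal sketch of the scheme dispatch described above, assuming a plain dataclass stands in for the EndpointSpec config message. The hostname finelog.internal is made up for illustration, and the real resolve_endpoint_uri signature may differ.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass(frozen=True)
class EndpointSpec:
    """Stand-in for the EndpointSpec{uri, metadata} config message."""

    uri: str
    metadata: dict = field(default_factory=dict)


def resolve_endpoint_uri(spec: EndpointSpec) -> str:
    """Map a configured URI to an address the controller can register."""
    scheme = urlsplit(spec.uri).scheme
    if scheme == "http":
        return spec.uri  # identity: the URI is already dialable
    if scheme in ("gcp", "k8s"):
        # Explicit stub until a deployment actually needs this scheme.
        raise NotImplementedError(f"{scheme}:// resolution is not implemented yet")
    raise ValueError(f"unknown endpoint scheme in {spec.uri!r}")


# Hypothetical cluster_config.endpoints map: logical name -> EndpointSpec.
endpoints = {
    "/system/log_server": EndpointSpec(uri="http://finelog.internal:10001"),
}
resolved = {name: resolve_endpoint_uri(spec) for name, spec in endpoints.items()}
```

Raising on unknown schemes at startup, rather than registering the raw string, surfaces config typos before any worker tries to dial the endpoint.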

Costs / Risks

Iris startup gains a new soft dependency. To keep "iris on a laptop" working, the controller hosts a bundled in-process MemStore-backed log server (capped at 200k rows, FIFO eviction) when /system/log_server is absent from the endpoints config — pure dev convenience, lossy on restart. Production is expected to declare an external finelog endpoint.
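For illustration, the row cap with FIFO eviction can be sketched with a bounded deque. This is not the actual MemStore; the class name, the (key, line) row shape, and the fetch method are assumptions made for the example.

```python
from collections import deque


class CappedMemStore:
    """In-memory log rows with a hard row cap and FIFO eviction."""

    def __init__(self, max_rows: int = 200_000) -> None:
        # deque(maxlen=N) discards the oldest item once N rows are held.
        self._rows = deque(maxlen=max_rows)

    def append(self, key: str, line: str) -> None:
        self._rows.append((key, line))

    def fetch(self, key: str) -> list:
        # Linear scan; good enough for a dev-only fallback.
        return [line for k, line in self._rows if k == key]

    def __len__(self) -> int:
        return len(self._rows)


store = CappedMemStore(max_rows=3)
for i in range(5):
    store.append("job/task-0", f"line {i}")
# The two oldest rows ("line 0", "line 1") have been evicted.
```

Because the cap is global rather than per key, one noisy task can evict another task's rows, which is acceptable for a lossy dev convenience but is one more reason to run an external server in production.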

Churn with no immediate user-visible improvement. The benefit shows up later: independent iteration on logging, easier stats service spike, and a worked example for the next service lift.

Auth churn. The deleted log-path JWT plumbing means the new server is unauthenticated; deployments must restrict access at the network layer (k8s NetworkPolicy / GCP firewall / VPC). The current iris-on-coreweave and iris-on-gcp paths satisfy this; flagged for any future deploy that doesn't.

Design

lib/finelog/ is a new top-level package owning proto/{logging,time}.proto (package finelog.logging), generated rpc/, store/ (MemStore + DuckDBLogStore), server/ (LogServiceImpl + minimal CLI launcher; no auth, no stats), client/ (LogPusher + RemoteLogHandler + LogServiceProxy). Iris keeps iris/cluster/log_store_helpers.py for the JobName/TaskAttempt-aware key builders that wrap finelog opaque-string primitives. Deploy artifacts (Dockerfile, k8s YAML, Cloud Run YAML) ship under lib/finelog/deploy/. Iris's _start_log_server subprocess path is gone; the bundled in-process fallback lives in Controller._start_local_log_server with a TODO(#5215) pointer at the rigging-sink follow-up that would replace it with a NullSink/StderrSink pair.

Testing

Tested against the iris dev controller's unit + integration suite. After cutover: 1975 iris tests pass / 1 skipped (excl. slow/docker/e2e); 24 finelog tests pass standalone (mem store, duckdb store, server concurrency cap + RPC round-trip). Pre-commit and pyrefly clean.

The two rollout cases worth checking on dev cluster:

New controller + old workers: workers still resolve /system/log_server from the controller's endpoint registry — same wire format, same client. Expect no rollout coordination.
Full restart with new workers + bundled fallback: confirms the dev-mode path works without an external finelog deployment.

Open Questions

How gcp:// and k8s:// resolution should actually be implemented — currently stubs that raise. Likely a follow-up once the first non-http:// deployment shows up.

Whether the iris-side LogEntry duplicate is worth collapsing later (would require either lifting controller/job protos out of iris or relaxing the no-cross-proto-import rule for finelog specifically).

Whether dashboard log fetch should grow alternate read sources (direct GCS, GCP Logging) — orthogonal to this PR but would reuse the same endpoint-config mechanism.

rjpower · 2026-04-27T19:46:42Z

Specification

Problem
The log subsystem (iris/cluster/log_store/, iris/log_server/) was wired into iris-specific types (JobName, TaskAttempt), the iris RPC plumbing (auth, interceptors, stats), and the controller's subprocess launcher. It is logically standalone — a key/value structured-log store + push/fetch RPCs — and was a candidate for extraction so it can evolve independently and be deployed separately.

Approach
New top-level package lib/finelog owns everything log-related: types.py (str_to_log_level, parse_attempt_id, is_retryable_error, LogReadResult), proto (logging.proto, time.proto, package finelog.logging), generated rpc/, store/ (MemStore + DuckDBLogStore + factory), server/ (LogServiceImpl, build_log_server_asgi, minimal main.py CLI, ConcurrencyLimitInterceptor), client/ (LogPusher, RemoteLogHandler, LogServiceProxy). Iris keeps log_store_helpers.py for the JobName/TaskAttempt-aware key builders. Controller startup iterates cluster_config.endpoints (new EndpointSpec map at field 90) and registers each via a scheme-dispatching resolve_endpoint_uri (http identity, gcp/k8s stubbed). _start_log_server and JWT env vars deleted.

Key code

lib/iris/src/iris/cluster/endpoints.py — scheme dispatch, fail-fast on unknown.
lib/iris/src/iris/cluster/controller/controller.py — endpoints loop, requires /system/log_server, registers /system/log-server alias.
lib/iris/src/iris/cluster/controller/service.py — _to_iris_log_entries transcoder bridges finelog.logging.LogEntry to iris.logging.LogEntry across the in-iris RPC boundary (proto cross-imports forbidden, so iris keeps a wire-compatible duplicate at iris_logging.proto).

Trade-off — duplicated LogEntry
finelog.logging.LogEntry and iris.logging.LogEntry are wire-compatible duplicates. iris embeds LogEntry in controller.proto / job.proto and the no-cross-proto-imports rule prevents referencing finelog.logging from iris .proto files. Transcoding via SerializeToString round-trip at the boundary keeps both schemas independent. iris/rpc/time.proto was renamed to iris_time.proto to avoid descriptor-pool collision with finelog/proto/time.proto; iris.rpc.{logging_pb2,time_pb2} are now shim re-exports.

Breaking change — endpoints config required
Every iris deployment must add an endpoints map declaring /system/log_server before upgrading. There is no in-process or subprocess fallback. Operators run finelog-server themselves (lib/finelog/deploy/ has Dockerfile, k8s manifests, Cloud Run YAML).

Tests

lib/finelog/tests/ — 21 tests (mem store, duckdb store, server concurrency cap + RPC round-trip).
lib/iris/tests/ — 1972 pass, 1 skipped after the cutover (excl. slow/docker/e2e). All call-site rewrites verified.
Pre-commit and pyrefly clean.

Out of scope

Real gcp:// / k8s:// resolver implementations (currently raise NotImplementedError; http:// works).
Auth on the finelog server.
Removing the /system/log-server hyphen alias.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3be728c1af

rjpower · 2026-04-27T20:34:38Z

Update: A1 (bundled in-process MemStore fallback) is now in. /system/log_server in the endpoints config is optional — Controller starts a 200k-row capped MemStore log server in-process when it's absent. Logs lost on restart, capped in memory; production deployments should still run finelog-server externally and declare it in the endpoints config. Long-term cleanup tracked in #5215.

yonromai

🤖 Approved with three actionable deployment-path findings. The Python unit and targeted Iris tests are green, and GitHub checks are currently green; the issues are in paths not exercised by PR CI (finelog-image Docker build and k8s deploy rendering).

Generated with Codex.

yonromai · 2026-04-28T02:27:56Z

+
+# rigging is a workspace-local dep not on PyPI.
+COPY lib/rigging/pyproject.toml ./lib/rigging/pyproject.toml
+COPY lib/finelog/pyproject.toml ./lib/finelog/pyproject.toml


🤖 P1 - Copy lib/finelog/config/ before syncing the finelog image

lib/finelog/pyproject.toml force-includes config/, but this Dockerfile only copies pyproject.toml before the first uv sync --package marin-finelog --no-install-project, and never copies lib/finelog/config/ before installing the project. The Iris Dockerfile already had to copy that directory ahead of uv sync for the same reason. As written, the scheduled/manual finelog-image build can fail during metadata validation, and even if the first sync were bypassed, the final package install would still lack the force-included config directory. Please copy lib/finelog/config/ into ./lib/finelog/config/ before the first sync and keep it available for the final project install.

Generated with Codex.

yonromai · 2026-04-28T02:27:56Z

+        f"storageClassName: {k8s.storage_class}" if k8s.storage_class else "# storageClassName: <cluster default>"
+    )
+    template = template_path.read_text()
+    return render_template(


🤖 P1 - Render k8s templates with only the variables each template uses

render_template() intentionally raises on unused variables, but _render_manifest() passes the full variable set to every template. That makes the k8s backend unusable before it reaches kubectl: 01-pvc.yaml.tmpl rejects image/port/remote_log_dir, 02-deployment.yaml.tmpl rejects storage_class_block/storage_gb, and 03-service.yaml.tmpl rejects the deployment/PVC-only variables. Please either pass a per-template variable map or relax the unused-variable check for this caller, and add a small render test so finelog deploy up covers all three manifests.

Generated with Codex.
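One way to do the per-template variable map suggested above is to filter the full set down to the placeholders each template references. The sketch below uses `${name}` placeholders and a stand-in render_template built on string.Template; the real template syntax and function signatures in the PR are not shown in this thread and may differ.

```python
import re
from string import Template

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_template(template: str, **variables: str) -> str:
    """Stand-in renderer that is strict about unused variables."""
    used = set(_PLACEHOLDER.findall(template))
    unused = set(variables) - used
    if unused:
        raise ValueError(f"unused template variables: {sorted(unused)}")
    return Template(template).substitute(variables)


def render_manifest(template: str, all_variables: dict) -> str:
    """Pass each template only the variables it actually references."""
    used = set(_PLACEHOLDER.findall(template))
    subset = {k: v for k, v in all_variables.items() if k in used}
    return render_template(template, **subset)


# One shared variable set, three templates with disjoint needs.
variables = {"image": "finelog:dev", "port": "10001", "storage_gb": "50"}
pvc = render_manifest("storage: ${storage_gb}Gi", variables)
svc = render_manifest("port: ${port}", variables)
```

Filtering at the call site keeps the strict unused-variable check in place for other callers, while a render test over every manifest catches a placeholder that has no matching variable.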

yonromai · 2026-04-28T02:27:56Z

+# prod (gs://bucket/path → archive segments to GCS).
+ENV FINELOG_REMOTE_DIR=""
+
+CMD ["sh", "-c", "exec python -m finelog.server.main --port 10001 --log-dir /var/cache/finelog --remote-log-dir \"${FINELOG_REMOTE_DIR}\""]


🤖 P2 - Keep FinelogConfig.port and the container command in sync

The deploy config exposes port, and both GCP bootstrap health checks and k8s manifests use cfg.port, but the image itself always starts finelog.server.main --port 10001. Any config that sets a different port will advertise/probe that port while the process listens on 10001, so GCP startup and k8s readiness will fail. Please either make the port fixed in the schema/configs or thread it into the container command, for example with a FINELOG_PORT env var or explicit deployment args.

Generated with Codex.
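A sketch of the FINELOG_PORT option, using stdlib argparse as a stand-in for the server's actual CLI. The parse_args helper and its env parameter are hypothetical; the default of 10001 comes from the CMD line above.

```python
import argparse
from typing import Mapping, Sequence


def parse_args(argv: Sequence[str], env: Mapping[str, str]) -> argparse.Namespace:
    """Resolve the listen port: explicit flag, then FINELOG_PORT, then 10001."""
    parser = argparse.ArgumentParser(prog="finelog-server")
    parser.add_argument(
        "--port",
        type=int,
        default=int(env.get("FINELOG_PORT", "10001")),
    )
    return parser.parse_args(list(argv))


# In a real entry point, pass sys.argv[1:] and os.environ instead.
from_env = parse_args([], {"FINELOG_PORT": "8080"}).port
from_flag = parse_args(["--port", "9000"], {"FINELOG_PORT": "8080"}).port
fallback = parse_args([], {}).port
```

With the port read from the environment, the image's CMD no longer needs a hardcoded --port, and the deployment can set FINELOG_PORT from the same config value that the probes use.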

Extracts the structured-log subsystem out of iris into a standalone lib/finelog. Iris loses its in-process log server entirely: the log server is now an external service operators run via the deploy artifacts under lib/finelog/deploy/ (Dockerfile, k8s, Cloud Run) and iris reaches it via a new endpoints config (cluster_config.endpoints map keyed by logical name -> EndpointSpec{uri, metadata}). Controller startup now requires /system/log_server in the endpoints config and fails fast if absent; this is a deliberate breaking change for every deployment, replacing the old subprocess launch path. Iris retains a wire-compatible LogEntry under iris_logging.proto purely because controller/job RPCs embed it and proto cross-imports between projects are forbidden; service.py transcodes at the boundary. The auth/ JWT plumbing on the log path is gone — finelog is unauthenticated and secured at the network layer.

…mits /system/log_server
Drops the fail-fast in _resolve_cluster_endpoints. When /system/log_server is absent from cluster_config.endpoints, the Controller starts a bundled MemStore-backed log server in-process (capped at 200k rows, FIFO eviction) so workers and the dashboard work out of the box without an external finelog deployment. Logs are lost on controller restart and capped in memory; production deployments still want an external finelog-server. Adds max_rows cap to MemStore + tests. Code comment in controller.py points at #5215 (rigging-mediated log sinks) for the longer-term cleanup that replaces the bundled-server fallback with a NullSink + StderrSink pair.

Folds finelog.time.Timestamp into finelog.logging as a nested message and deletes finelog/proto/time.proto. With no time.proto in finelog there is no descriptor-pool basename collision, so iris reverts iris_time.proto and iris_time_pb2 back to time.proto / time_pb2 (and the .pyi shim goes away). The iris_logging.proto rename stays — that collision is on the logging.proto basename and we still keep iris's wire-compatible LogEntry duplicate for the controller/job RPCs that embed it. API unchanged on both sides: entry.timestamp.epoch_ms still works.

The lift agent deleted iris/tests/test_remote_log_handler.py claiming finelog had equivalent coverage; that was wrong — finelog only had store/server tests. Lift the deleted test into finelog/tests/test_client.py, pruned to 7 tests covering: handler emit, no-deadlock on push failure, flush blocks until shipped, flush timeout returns False, batch_size trigger, close drains, max_buffer_size overflow.

Standardize the log server endpoint name on /system/log-server everywhere (worker resolver, controller registration, config docs, tests). The canonical+alias-with-deprecation-warning split was unnecessary churn — keep the original hyphen name and remove the underscore variant entirely.

Replace the if/elif dispatcher with a small registry; add concrete handlers for the two schemes that were previously stubs. gcp:// resolves a GCE instance to its internal IP via `gcloud compute instances describe`; k8s:// templates an in-cluster Service DNS name. Project/zone/port/namespace flow from EndpointSpec.metadata. No new dependencies.
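A rough sketch of what a registry-based dispatcher with a k8s:// handler could look like. The `k8s://<service>` URI layout, the metadata defaults, and the cluster.local domain are assumptions, not the PR's actual code. The gcp:// handler is left out because it shells out to gcloud.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

Resolver = Callable[[str, dict], str]
_RESOLVERS: dict = {}


def register(scheme: str) -> Callable[[Resolver], Resolver]:
    """Decorator that adds a resolver to the scheme registry."""

    def decorator(fn: Resolver) -> Resolver:
        _RESOLVERS[scheme] = fn
        return fn

    return decorator


@register("http")
def _resolve_http(uri: str, metadata: dict) -> str:
    return uri  # already dialable


@register("k8s")
def _resolve_k8s(uri: str, metadata: dict) -> str:
    # Assumed layout: k8s://<service>, namespace and port from metadata.
    service = urlsplit(uri).netloc
    namespace = metadata.get("namespace", "default")
    port = metadata.get("port", "10001")
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


def resolve_endpoint_uri(uri: str, metadata: Optional[dict] = None) -> str:
    scheme = urlsplit(uri).scheme
    if scheme not in _RESOLVERS:
        raise ValueError(f"unknown endpoint scheme in {uri!r}")
    return _RESOLVERS[scheme](uri, metadata or {})


in_cluster = resolve_endpoint_uri("k8s://finelog", {"namespace": "iris"})
```

Adding a scheme then means adding one decorated function, and the unknown-scheme failure stays in a single place.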

Replace the stale Cloud Run yaml and deploy READMEs with a click-based CLI that creates and restarts a GCE VM running the finelog Docker image. Mirrors iris's pattern: subprocess gcloud, idempotent bootstrap re-run over SSH for restart, no separate persistent disk (boot disk + GCS offload via FINELOG_REMOTE_DIR). Pins the image by digest at the CLI layer before passing to the bootstrap script.

The deploy bootstrap polls /health to confirm the container is up; the ASGI app didn't expose it. Add a Starlette /health route returning "ok". Hardcode --boot-disk-type=pd-ssd in the deploy CLI for consistent log-write IOPS. Add a finelog matrix entry to docker-images.yaml so the image is rebuilt weekly alongside iris.

Empty FINELOG_REMOTE_DIR disables GCS offload (the Dockerfile already handles it). Required-by-default forced operators to invent a bucket just to smoke-test a VM. Verified end-to-end on GCE in hai-gcp-models / us-central1-a: create, status, restart, delete all green.

Replace the GCE-specific create/restart/delete/status/logs commands with config-driven up/down/restart/status/logs that read a finelog config file and dispatch to a gcp or k8s backend. The config picks the deployment platform; the CLI is platform-agnostic, mirroring how `iris cluster start` decides backend from cluster yaml. New: load_finelog_config / derive_endpoint_uri helpers consumed by iris's controller for /system/log-server auto-injection. Adds k8s deployment via templated manifests + kubectl. Bundles canonical `marin.yaml` and `marin-dev.yaml` configs.

Add IrisClusterConfig.log_server_config (a finelog config name). When set, the controller auto-derives /system/log-server from the finelog config and `iris cluster log-server {up,down,restart,status,logs}` forwards to the matching finelog deploy commands. Mutually exclusive with explicit endpoints[/system/log-server]. The bundled-MemStore fallback (no config at all) is unchanged. Updates marin-dev.yaml to demonstrate the field.
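The three cases (explicit endpoint, log_server_config, neither) can be sketched as a single selection function. The function name, the plain-dict endpoints, and the `derive` callback are hypothetical stand-ins; the real controller code and the derive_endpoint_uri signature may look different.

```python
from typing import Callable, Optional

LOG_SERVER_NAME = "/system/log-server"


def select_log_server(
    endpoints: dict,
    log_server_config: Optional[str],
    derive: Callable[[str], str],
) -> Optional[str]:
    """Return the log server URI, or None to request the bundled fallback."""
    explicit = endpoints.get(LOG_SERVER_NAME)
    if explicit is not None and log_server_config is not None:
        raise ValueError(
            f"log_server_config and endpoints[{LOG_SERVER_NAME!r}] are mutually exclusive"
        )
    if explicit is not None:
        return explicit
    if log_server_config is not None:
        # Derive the URI from the named finelog config.
        return derive(log_server_config)
    return None  # caller starts the in-process MemStore server


def fake_derive(name: str) -> str:
    # Illustrative only; the real helper reads a finelog config file.
    return f"http://{name}-finelog:10001"


derived = select_log_server({}, "marin-dev", fake_derive)
bundled = select_log_server({}, None, fake_derive)
```

Rejecting the both-set case outright avoids a silent precedence rule that operators would have to remember.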

Two fixes that surfaced when running `iris cluster log-server up --cluster marin-dev` end-to-end:
1. _wait_health used `gcloud compute ssh ... curl /health` to verify the container, which requires the operator to have OS Login on the VM. When the VM runs as iris-controller@... that's not the operator's user. Switch to polling the VM's serial console for the bootstrap script's own healthy/FAILED markers; no SSH is needed. The bootstrap already validates /health from inside the VM, so this is the authoritative signal.
2. SSH-using commands (restart, logs, status container probe) now pass --impersonate-service-account=<sa> when the config sets one, mirroring iris.cluster.providers.remote_exec. Without this, the operator's gcloud client can't push an SSH key via OS Login when the SA owns the VM.
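A sketch of marker-based polling along the lines of the first fix. The marker strings, the function name, and the injected reader are all assumptions; in practice the reader might wrap `gcloud compute instances get-serial-port-output`, which is not exercised here.

```python
import time
from typing import Callable

HEALTHY_MARKER = "finelog bootstrap: healthy"  # hypothetical marker text
FAILED_MARKER = "finelog bootstrap: FAILED"  # hypothetical marker text


def wait_for_bootstrap(
    read_serial_output: Callable[[], str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll console text until a terminal marker appears or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        output = read_serial_output()
        if FAILED_MARKER in output:
            return False
        if HEALTHY_MARKER in output:
            return True
        if clock() >= deadline:
            raise TimeoutError("no bootstrap marker seen on the serial console")
        sleep(poll_interval_s)


# Fake reader: the marker shows up on the second poll.
snapshots = iter(["booting...", "booting...\nfinelog bootstrap: healthy"])
ok = wait_for_bootstrap(lambda: next(snapshots), sleep=lambda _s: None)
```

Injecting the reader, sleep, and clock keeps the loop testable without a VM and without real delays.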

Three failures surfaced on CI that the local pytest run didn't catch:
1. lib/iris/Dockerfile didn't list lib/finelog as a workspace member when synthesizing the build-stage workspace pyproject. After iris added marin-finelog as a runtime dep, uv sync inside the controller image build failed to resolve. Add lib/finelog to members + sources and COPY its pyproject + src + config alongside rigging.
2. iris e2e and integration tests imported FetchLogsRequest / FetchLogsResponse from iris.rpc.logging_pb2, but those types moved to finelog during the lift. Switch the imports.
3. Dashboard's _proxy_log_rpc still posted upstream at the old /iris.logging.LogService/ path, returning 404 against the renamed /finelog.logging.LogService/ service.

Two related fixes for `iris cluster start --fresh` (CW CI):
1. lib/iris/Dockerfile.dockerignore allowlists exactly the libs that the Dockerfile COPYs. Add lib/finelog/{pyproject.toml,src,config}. Without this, COPY lib/finelog/* fails with "not found" because buildx applies the per-Dockerfile dockerignore over the marin-root build context.
2. The first deps-stage uv sync runs --no-install-project, but uv still validates each workspace member's [tool.hatch.build.targets.wheel.force-include] paths. finelog's pyproject force-includes the config/ dir, which must exist even at the metadata-only sync step. Move COPY lib/finelog/config/ ahead of the first uv sync.

The integration test still pulled LogServiceImpl from the deleted iris.log_server package. Switch to finelog.server.service.

iris-e2e-smoke: dashboard frontend (useRpc.ts, rsbuild.config.ts) and the in-process controller's proxy route were still using the old /iris.logging.LogService path. The log service moved to finelog.logging during the lift, so /iris.logging.LogService/FetchLogs returned 404. Rename frontend client paths and the dashboard route to /finelog.logging.LogService — the controller's WSGI mount already serves finelog's path. marin-tests: marin.mcp.babysitter still imported build_log_source from the deleted iris.cluster.log_store package. Switch to iris.cluster.log_store_helpers.

Two more iris.rpc.logging_* references in marin.mcp.babysitter that broke after the lift: LogServiceClientSync (logging_connect) and the LogEntry/FetchLogsRequest types (logging_pb2). Switch both to finelog.

…py config/
- main.py: argparse → click with FINELOG_{PORT,LOG_DIR,REMOTE_DIR,LOG_LEVEL} envvars so the image picks up FinelogConfig.port without overriding CMD.
- Dockerfile: drop hardcoded --port 10001; copy lib/finelog/config/ before the project install so non-editable wheel builds satisfy hatchling's force-include directive.
- k8s 02-deployment.yaml.tmpl + bootstrap.py: pass FINELOG_PORT to the container so probes/health checks line up on non-default ports.
- _k8s.py: filter the variable set to what each template references. The three manifests use disjoint subsets and render_template raises on unused vars, which made `finelog deploy up` unusable before reaching kubectl.
- New test_deploy_k8s.py covers all three manifests + port threading.

rjpower added the agent-generated label (Created by automation/agent) Apr 27, 2026

chatgpt-codex-connector Bot reviewed Apr 27, 2026

Comment thread lib/iris/src/iris/cluster/controller/controller.py

rjpower mentioned this pull request Apr 27, 2026

[iris] Route log destination through rigging.log_setup sinks #5215

Open

rjpower force-pushed the lift-finelog-from-iris branch from 6b1af68 to 39f4c12 Compare April 27, 2026 20:37

rjpower requested a review from yonromai April 27, 2026 21:10

rjpower force-pushed the lift-finelog-from-iris branch from 46d4d63 to e7a4b1e Compare April 28, 2026 00:37

yonromai approved these changes Apr 28, 2026

rjpower added 20 commits April 28, 2026 10:14

[finelog] Add license header to tests/__init__.py

c54f508

[lift-finelog] test_iris_kind: import LogServiceImpl from finelog

8cf8d44

The integration test still pulled LogServiceImpl from the deleted iris.log_server package. Switch to finelog.server.service.

[lift-finelog] babysitter: import logging_{pb2,connect} from finelog

f0cf72f

Two more iris.rpc.logging_* references in marin.mcp.babysitter that broke after the lift: LogServiceClientSync (logging_connect) and the LogEntry/FetchLogsRequest types (logging_pb2). Switch both to finelog.

Fix log-forwarding.

d156e48

rjpower force-pushed the lift-finelog-from-iris branch from e7a4b1e to da1e609 Compare April 28, 2026 17:26

rjpower enabled auto-merge (squash) April 28, 2026 17:26

rjpower merged commit b212f00 into main Apr 28, 2026
40 checks passed

rjpower deleted the lift-finelog-from-iris branch April 28, 2026 17:37
