Skip to content

[iris] Lift log store and log server into new lib/finelog package#5212

Merged
rjpower merged 20 commits intomainfrom
lift-finelog-from-iris
Apr 28, 2026
Merged

[iris] Lift log store and log server into new lib/finelog package#5212
rjpower merged 20 commits intomainfrom
lift-finelog-from-iris

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 27, 2026

Project Idea

Lift the logging service out of the iris controller into a standalone lib/finelog package per issue #5210's 1-pager process. Reduces the controller's footprint, lets the log subsystem evolve independently, and unblocks future work on a sibling stats service. Also a forcing function for figuring out our service-extraction template (proxy management, proto sharing, deploy artifacts) on a small target before tackling stats.

Challenges

Cross-project proto dependencies. Iris's controller.proto and job.proto embed LogEntry directly in their RPC schemas, so finelog's proto can't replace iris's without a cross-package proto import (currently disallowed). Resolution: iris keeps a wire-compatible duplicate at iris_logging.proto and transcodes via SerializeToString round-trip at the boundary in service.py. Iris's time.proto was renamed to iris_time.proto for descriptor-pool dedup; iris.rpc.{logging_pb2, time_pb2} are now shim re-exports.

Proxy / discovery. Workers already resolve the log server via controller.list_endpoints("/system/log_server"), so the data plane lifts cleanly. The new question is how operators tell iris where the external server lives. Solved with a generic cluster_config.endpoints map (logical name → EndpointSpec{uri, metadata}) and a resolve_endpoint_uri scheme dispatch (http:// works today; gcp:// and k8s:// are explicit NotImplementedError stubs to be filled when we deploy that way).

Costs / Risks

Iris startup gains a new soft dependency. To keep "iris on a laptop" working, the controller hosts a bundled in-process MemStore-backed log server (capped at 200k rows, FIFO eviction) when /system/log_server is absent from the endpoints config — pure dev convenience, lossy on restart. Production is expected to declare an external finelog endpoint.

Churn with no immediate user-visible improvement. The benefit shows up later: independent iteration on logging, easier stats service spike, and a worked example for the next service lift.

Auth churn. The deleted log-path JWT plumbing means the new server is unauthenticated; deployments must restrict access at the network layer (k8s NetworkPolicy / GCP firewall / VPC). The current iris-on-coreweave and iris-on-gcp paths satisfy this; flagged for any future deploy that doesn't.

Design

lib/finelog/ is a new top-level package owning proto/{logging,time}.proto (package finelog.logging), generated rpc/, store/ (MemStore + DuckDBLogStore), server/ (LogServiceImpl + minimal CLI launcher; no auth, no stats), client/ (LogPusher + RemoteLogHandler + LogServiceProxy). Iris keeps iris/cluster/log_store_helpers.py for the JobName/TaskAttempt-aware key builders that wrap finelog opaque-string primitives. Deploy artifacts (Dockerfile, k8s YAML, Cloud Run YAML) ship under lib/finelog/deploy/. Iris's _start_log_server subprocess path is gone; the bundled in-process fallback lives in Controller._start_local_log_server with a TODO(#5215) pointer at the rigging-sink follow-up that would replace it with a NullSink/StderrSink pair.

Testing

Tested against the iris dev controller's unit + integration suite. After cutover: 1975 iris tests pass / 1 skipped (excl. slow/docker/e2e); 24 finelog tests pass standalone (mem store, duckdb store, server concurrency cap + RPC round-trip). Pre-commit and pyrefly clean.

The two rollout cases worth checking on dev cluster:

  • New controller + old workers: workers still resolve /system/log_server from the controller's endpoint registry — same wire format, same client. Expect no rollout coordination.
  • Full restart with new workers + bundled fallback: confirms the dev-mode path works without an external finelog deployment.

Open Questions

How gcp:// and k8s:// resolution should actually be implemented — currently stubs that raise. Likely a follow-up once the first non-http:// deployment shows up.

Whether the iris-side LogEntry duplicate is worth collapsing later (would require either lifting controller/job protos out of iris or relaxing the no-cross-proto-import rule for finelog specifically).

Whether dashboard log fetch should grow alternate read sources (direct GCS, GCP Logging) — orthogonal to this PR but would reuse the same endpoint-config mechanism.

@rjpower rjpower added the agent-generated Created by automation/agent label Apr 27, 2026
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 27, 2026

Specification

Problem
The log subsystem (iris/cluster/log_store/, iris/log_server/) was wired into iris-specific types (JobName, TaskAttempt), the iris RPC plumbing (auth, interceptors, stats), and the controller's subprocess launcher. It is logically standalone — a key/value structured-log store + push/fetch RPCs — and was a candidate for extraction so it can evolve independently and be deployed separately.

Approach
New top-level package lib/finelog owns everything log-related: types.py (str_to_log_level, parse_attempt_id, is_retryable_error, LogReadResult), proto (logging.proto, time.proto, package finelog.logging), generated rpc/, store/ (MemStore + DuckDBLogStore + factory), server/ (LogServiceImpl, build_log_server_asgi, minimal main.py CLI, ConcurrencyLimitInterceptor), client/ (LogPusher, RemoteLogHandler, LogServiceProxy). Iris keeps log_store_helpers.py for the JobName/TaskAttempt-aware key builders. Controller startup iterates cluster_config.endpoints (new EndpointSpec map at field 90) and registers each via a scheme-dispatching resolve_endpoint_uri (http identity, gcp/k8s stubbed). _start_log_server and JWT env vars deleted.

Key code

  • lib/iris/src/iris/cluster/endpoints.py — scheme dispatch, fail-fast on unknown.
  • lib/iris/src/iris/cluster/controller/controller.py — endpoints loop, requires /system/log_server, registers /system/log-server alias.
  • lib/iris/src/iris/cluster/controller/service.py — _to_iris_log_entries transcoder bridges finelog.logging.LogEntry to iris.logging.LogEntry across the in-iris RPC boundary (proto cross-imports forbidden, so iris keeps a wire-compatible duplicate at iris_logging.proto).

Trade-off — duplicated LogEntry
finelog.logging.LogEntry and iris.logging.LogEntry are wire-compatible duplicates. iris embeds LogEntry in controller.proto / job.proto and the no-cross-proto-imports rule prevents referencing finelog.logging from iris .proto files. Transcoding via SerializeToString round-trip at the boundary keeps both schemas independent. iris/rpc/time.proto was renamed to iris_time.proto to avoid descriptor-pool collision with finelog/proto/time.proto; iris.rpc.{logging_pb2,time_pb2} are now shim re-exports.

Breaking change — endpoints config required
Every iris deployment must add an endpoints map declaring /system/log_server before upgrading. There is no in-process subprocess fallback. Operators run finelog-server themselves (lib/finelog/deploy/ has Dockerfile, k8s manifests, Cloud Run YAML).

Tests

  • lib/finelog/tests/ — 21 tests (mem store, duckdb store, server concurrency cap + RPC round-trip).
  • lib/iris/tests/ — 1972 pass, 1 skipped after the cutover (excl. slow/docker/e2e). All call-site rewrites verified.
  • Pre-commit and pyrefly clean.

Out of scope

  • Real gcp:// / k8s:// resolver implementations (currently raise NotImplementedError; http:// works).
  • Auth on the finelog server.
  • Removing the /system/log-server hyphen alias.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3be728c1af

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/controller/controller.py
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 27, 2026

Update: A1 (bundled in-process MemStore fallback) is now in. /system/log_server in the endpoints config is optional — Controller starts a 200k-row capped MemStore log server in-process when it's absent. Logs lost on restart, capped in memory; production deployments should still run finelog-server externally and declare it in the endpoints config. Long-term cleanup tracked in #5215.

@rjpower rjpower force-pushed the lift-finelog-from-iris branch from 6b1af68 to 39f4c12 Compare April 27, 2026 20:37
@rjpower rjpower requested a review from yonromai April 27, 2026 21:10
@rjpower rjpower force-pushed the lift-finelog-from-iris branch from 46d4d63 to e7a4b1e Compare April 28, 2026 00:37
Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Approved with three actionable deployment-path findings. The Python unit and targeted Iris tests are green, and GitHub checks are currently green; the issues are in paths not exercised by PR CI (finelog-image Docker build and k8s deploy rendering).

Generated with Codex.


# rigging is a workspace-local dep not on PyPI.
COPY lib/rigging/pyproject.toml ./lib/rigging/pyproject.toml
COPY lib/finelog/pyproject.toml ./lib/finelog/pyproject.toml
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 P1 - Copy lib/finelog/config/ before syncing the finelog image

lib/finelog/pyproject.toml force-includes config/, but this Dockerfile only copies pyproject.toml before the first uv sync --package marin-finelog --no-install-project, and never copies lib/finelog/config/ before installing the project. The Iris Dockerfile already had to copy that directory ahead of uv sync for the same reason. As written, the scheduled/manual finelog-image build can fail during metadata validation, and even if the first sync were bypassed the final package install still lacks the forced-included config directory. Please copy lib/finelog/config/ into ./lib/finelog/config/ before the first sync and keep it available for the final project install.

Generated with Codex.

Comment thread lib/finelog/src/finelog/deploy/_k8s.py Outdated
f"storageClassName: {k8s.storage_class}" if k8s.storage_class else "# storageClassName: <cluster default>"
)
template = template_path.read_text()
return render_template(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 P1 - Render k8s templates with only the variables each template uses

render_template() intentionally raises on unused variables, but _render_manifest() passes the full variable set to every template. That makes the k8s backend unusable before it reaches kubectl: 01-pvc.yaml.tmpl rejects image/port/remote_log_dir, 02-deployment.yaml.tmpl rejects storage_class_block/storage_gb, and 03-service.yaml.tmpl rejects the deployment/PVC-only variables. Please either pass a per-template variable map or relax the unused-variable check for this caller, and add a small render test so finelog deploy up covers all three manifests.

Generated with Codex.

Comment thread lib/finelog/deploy/Dockerfile Outdated
# prod (gs://bucket/path → archive segments to GCS).
ENV FINELOG_REMOTE_DIR=""

CMD ["sh", "-c", "exec python -m finelog.server.main --port 10001 --log-dir /var/cache/finelog --remote-log-dir \"${FINELOG_REMOTE_DIR}\""]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 P2 - Keep FinelogConfig.port and the container command in sync

The deploy config exposes port, and both GCP bootstrap health checks and k8s manifests use cfg.port, but the image itself always starts finelog.server.main --port 10001. Any config that sets a different port will advertise/probe that port while the process listens on 10001, so GCP startup and k8s readiness will fail. Please either make the port fixed in the schema/configs or thread it into the container command, for example with a FINELOG_PORT env var or explicit deployment args.

Generated with Codex.

rjpower added 20 commits April 28, 2026 10:14
Extracts the structured-log subsystem out of iris into a standalone
lib/finelog. Iris loses its in-process log server entirely: the log
server is now an external service operators run via the deploy artifacts
under lib/finelog/deploy/ (Dockerfile, k8s, Cloud Run) and iris reaches
it via a new endpoints config (cluster_config.endpoints map keyed by
logical name -> EndpointSpec{uri, metadata}). Controller startup now
requires /system/log_server in the endpoints config and fails fast if
absent; this is a deliberate breaking change for every deployment,
replacing the old subprocess launch path.

Iris retains a wire-compatible LogEntry under iris_logging.proto purely
because controller/job RPCs embed it and proto cross-imports between
projects are forbidden; service.py transcodes at the boundary. The auth/
JWT plumbing on the log path is gone — finelog is unauthenticated and
secured at the network layer.
…mits /system/log_server

Drops the fail-fast in _resolve_cluster_endpoints. When /system/log_server
is absent from cluster_config.endpoints, the Controller starts a bundled
MemStore-backed log server in-process (capped at 200k rows, FIFO
eviction) so workers and the dashboard work out of the box without an
external finelog deployment. Logs are lost on controller restart and
capped in memory; production deployments still want an external
finelog-server. Adds max_rows cap to MemStore + tests.

Code comment in controller.py points at #5215 (rigging-mediated log
sinks) for the longer-term cleanup that replaces the bundled-server
fallback with a NullSink + StderrSink pair.
Folds finelog.time.Timestamp into finelog.logging as a nested message and
deletes finelog/proto/time.proto. With no time.proto in finelog there is
no descriptor-pool basename collision, so iris reverts iris_time.proto
and iris_time_pb2 back to time.proto / time_pb2 (and the .pyi shim
goes away). The iris_logging.proto rename stays — that collision is on
the logging.proto basename and we still keep iris's wire-compatible
LogEntry duplicate for the controller/job RPCs that embed it.

API unchanged on both sides: entry.timestamp.epoch_ms still works.
The lift agent deleted iris/tests/test_remote_log_handler.py claiming
finelog had equivalent coverage; that was wrong — finelog only had
store/server tests. Lift the deleted test into finelog/tests/test_client.py,
pruned to 7 tests covering: handler emit, no-deadlock on push failure,
flush blocks until shipped, flush timeout returns False, batch_size
trigger, close drains, max_buffer_size overflow.
Standardize the log server endpoint name on /system/log-server everywhere
(worker resolver, controller registration, config docs, tests). The
canonical+alias-with-deprecation-warning split was unnecessary churn — keep
the original hyphen name and remove the underscore variant entirely.
Replace the if/elif dispatcher with a small registry; add concrete
handlers for the two schemes that were previously stubs. gcp:// resolves
a GCE instance to its internal IP via `gcloud compute describe`; k8s://
templates an in-cluster Service DNS name. Project/zone/port/namespace
flow from EndpointSpec.metadata. No new dependencies.
Replace the stale Cloud Run yaml and deploy READMEs with a click-based
CLI that creates and restarts a GCE VM running the finelog Docker image.
Mirrors iris's pattern: subprocess gcloud, idempotent bootstrap re-run
over SSH for restart, no separate persistent disk (boot disk + GCS
offload via FINELOG_REMOTE_DIR). Pins the image by digest at the CLI
layer before passing to the bootstrap script.
The deploy bootstrap polls /health to confirm the container is up; the
ASGI app didn't expose it. Add a Starlette /health route returning "ok".
Hardcode --boot-disk-type=pd-ssd in the deploy CLI for consistent
log-write IOPS. Add a finelog matrix entry to docker-images.yaml so the
image is rebuilt weekly alongside iris.
Empty FINELOG_REMOTE_DIR disables GCS offload (the Dockerfile already
handles it). Required-by-default forced operators to invent a bucket
just to smoke-test a VM. Verified end-to-end on GCE in hai-gcp-models /
us-central1-a: create, status, restart, delete all green.
Replace the GCE-specific create/restart/delete/status/logs commands
with config-driven up/down/restart/status/logs that read a finelog
config file and dispatch to a gcp or k8s backend. The config picks
the deployment platform; the CLI is platform-agnostic, mirroring how
`iris cluster start` decides backend from cluster yaml.

New: load_finelog_config / derive_endpoint_uri helpers consumed by
iris's controller for /system/log-server auto-injection.

Adds k8s deployment via templated manifests + kubectl. Bundles
canonical `marin.yaml` and `marin-dev.yaml` configs.
Add IrisClusterConfig.log_server_config (a finelog config name).
When set, the controller auto-derives /system/log-server from the
finelog config and `iris cluster log-server {up,down,restart,
status,logs}` forwards to the matching finelog deploy commands.
Mutually exclusive with explicit endpoints[/system/log-server].
The bundled-MemStore fallback (no config at all) is unchanged.

Updates marin-dev.yaml to demonstrate the field.
Two fixes that surfaced when running `iris cluster log-server up
--cluster marin-dev` end-to-end:

1. _wait_health used `gcloud compute ssh ... curl /health` to
   verify the container, which requires the operator to have OS
   Login on the VM. When the VM runs as iris-controller@... that's
   not the operator's user. Switch to polling the VM's serial
   console for the bootstrap-script's own healthy/FAILED markers —
   no SSH needed. The bootstrap already validates /health from
   inside the VM, so this is the authoritative signal.

2. SSH-using commands (restart, logs, status container probe) now
   pass --impersonate-service-account=<sa> when the config sets
   one, mirroring iris.cluster.providers.remote_exec. Without this,
   the operator's gcloud client can't push an SSH key via OS Login
   when the SA owns the VM.
Three failures surfaced on CI that the local pytest run didn't catch:

1. lib/iris/Dockerfile didn't list lib/finelog as a workspace member
   when synthesizing the build-stage workspace pyproject. After iris
   added marin-finelog as a runtime dep, uv sync inside the controller
   image build failed to resolve. Add lib/finelog to members + sources
   and COPY its pyproject + src + config alongside rigging.

2. iris e2e and integration tests imported FetchLogsRequest /
   FetchLogsResponse from iris.rpc.logging_pb2, but those types moved
   to finelog during the lift. Switch the imports.

3. Dashboard's _proxy_log_rpc still posted upstream at the old
   /iris.logging.LogService/ path, returning 404 against the renamed
   /finelog.logging.LogService/ service.
Two related fixes for `iris cluster start --fresh` (CW CI):

1. lib/iris/Dockerfile.dockerignore allowlists exactly the libs that
   the Dockerfile COPYs. Add lib/finelog/{pyproject.toml,src,config}.
   Without this, COPY lib/finelog/* fails with "not found" because
   buildx applies the per-Dockerfile dockerignore over the marin-root
   build context.

2. The first deps-stage uv sync runs --no-install-project, but uv
   still validates each workspace member's [tool.hatch.build.targets.
   wheel.force-include] paths. finelog's pyproject force-includes the
   config/ dir, which must exist even at the metadata-only sync step.
   Move COPY lib/finelog/config/ ahead of the first uv sync.
The integration test still pulled LogServiceImpl from the deleted
iris.log_server package. Switch to finelog.server.service.
iris-e2e-smoke: dashboard frontend (useRpc.ts, rsbuild.config.ts) and
the in-process controller's proxy route were still using the old
/iris.logging.LogService path. The log service moved to
finelog.logging during the lift, so /iris.logging.LogService/FetchLogs
returned 404. Rename frontend client paths and the dashboard route to
/finelog.logging.LogService — the controller's WSGI mount already
serves finelog's path.

marin-tests: marin.mcp.babysitter still imported build_log_source
from the deleted iris.cluster.log_store package. Switch to
iris.cluster.log_store_helpers.
Two more iris.rpc.logging_* references in marin.mcp.babysitter that
broke after the lift: LogServiceClientSync (logging_connect) and the
LogEntry/FetchLogsRequest types (logging_pb2). Switch both to finelog.
…py config/

- main.py: argparse → click with FINELOG_{PORT,LOG_DIR,REMOTE_DIR,LOG_LEVEL}
  envvars so the image picks up FinelogConfig.port without overriding CMD.
- Dockerfile: drop hardcoded --port 10001; copy lib/finelog/config/ before
  the project install so non-editable wheel builds satisfy hatchling's
  force-include directive.
- k8s 02-deployment.yaml.tmpl + bootstrap.py: pass FINELOG_PORT to the
  container so probes/health checks line up on non-default ports.
- _k8s.py: filter the variable set to what each template references — the
  three manifests use disjoint subsets and render_template raises on unused
  vars, which made `finelog deploy up` unusable before reaching kubectl.
- New test_deploy_k8s.py covers all three manifests + port threading.
@rjpower rjpower force-pushed the lift-finelog-from-iris branch from e7a4b1e to da1e609 Compare April 28, 2026 17:26
@rjpower rjpower enabled auto-merge (squash) April 28, 2026 17:26
@rjpower rjpower merged commit b212f00 into main Apr 28, 2026
40 checks passed
@rjpower rjpower deleted the lift-finelog-from-iris branch April 28, 2026 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants