[iris] Lift log store and log server into new lib/finelog package#5212
[iris] Lift log store and log server into new lib/finelog package#5212
Conversation
|
Specification Problem Approach Key code
Trade-off — duplicated LogEntry Breaking change — endpoints config required Tests
Out of scope
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3be728c1af
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
Update: A1 (bundled in-process MemStore fallback) is now in. /system/log_server in the endpoints config is optional — Controller starts a 200k-row capped MemStore log server in-process when it's absent. Logs lost on restart, capped in memory; production deployments should still run finelog-server externally and declare it in the endpoints config. Long-term cleanup tracked in #5215. |
6b1af68 to
39f4c12
Compare
46d4d63 to
e7a4b1e
Compare
yonromai
left a comment
There was a problem hiding this comment.
🤖 Approved with three actionable deployment-path findings. The Python unit and targeted Iris tests are green, and GitHub checks are currently green; the issues are in paths not exercised by PR CI (finelog-image Docker build and k8s deploy rendering).
Generated with Codex.
|
|
||
| # rigging is a workspace-local dep not on PyPI. | ||
| COPY lib/rigging/pyproject.toml ./lib/rigging/pyproject.toml | ||
| COPY lib/finelog/pyproject.toml ./lib/finelog/pyproject.toml |
There was a problem hiding this comment.
🤖 P1 - Copy lib/finelog/config/ before syncing the finelog image
lib/finelog/pyproject.toml force-includes config/, but this Dockerfile only copies pyproject.toml before the first uv sync --package marin-finelog --no-install-project, and never copies lib/finelog/config/ before installing the project. The Iris Dockerfile already had to copy that directory ahead of uv sync for the same reason. As written, the scheduled/manual finelog-image build can fail during metadata validation, and even if the first sync were bypassed the final package install still lacks the forced-included config directory. Please copy lib/finelog/config/ into ./lib/finelog/config/ before the first sync and keep it available for the final project install.
Generated with Codex.
| f"storageClassName: {k8s.storage_class}" if k8s.storage_class else "# storageClassName: <cluster default>" | ||
| ) | ||
| template = template_path.read_text() | ||
| return render_template( |
There was a problem hiding this comment.
🤖 P1 - Render k8s templates with only the variables each template uses
render_template() intentionally raises on unused variables, but _render_manifest() passes the full variable set to every template. That makes the k8s backend unusable before it reaches kubectl: 01-pvc.yaml.tmpl rejects image/port/remote_log_dir, 02-deployment.yaml.tmpl rejects storage_class_block/storage_gb, and 03-service.yaml.tmpl rejects the deployment/PVC-only variables. Please either pass a per-template variable map or relax the unused-variable check for this caller, and add a small render test so finelog deploy up covers all three manifests.
Generated with Codex.
| # prod (gs://bucket/path → archive segments to GCS). | ||
| ENV FINELOG_REMOTE_DIR="" | ||
|
|
||
| CMD ["sh", "-c", "exec python -m finelog.server.main --port 10001 --log-dir /var/cache/finelog --remote-log-dir \"${FINELOG_REMOTE_DIR}\""] |
There was a problem hiding this comment.
🤖 P2 - Keep FinelogConfig.port and the container command in sync
The deploy config exposes port, and both GCP bootstrap health checks and k8s manifests use cfg.port, but the image itself always starts finelog.server.main --port 10001. Any config that sets a different port will advertise/probe that port while the process listens on 10001, so GCP startup and k8s readiness will fail. Please either make the port fixed in the schema/configs or thread it into the container command, for example with a FINELOG_PORT env var or explicit deployment args.
Generated with Codex.
Extracts the structured-log subsystem out of iris into a standalone
lib/finelog. Iris loses its in-process log server entirely: the log
server is now an external service operators run via the deploy artifacts
under lib/finelog/deploy/ (Dockerfile, k8s, Cloud Run) and iris reaches
it via a new endpoints config (cluster_config.endpoints map keyed by
logical name -> EndpointSpec{uri, metadata}). Controller startup now
requires /system/log_server in the endpoints config and fails fast if
absent; this is a deliberate breaking change for every deployment,
replacing the old subprocess launch path.
Iris retains a wire-compatible LogEntry under iris_logging.proto purely
because controller/job RPCs embed it and proto cross-imports between
projects are forbidden; service.py transcodes at the boundary. The auth/
JWT plumbing on the log path is gone — finelog is unauthenticated and
secured at the network layer.
…mits /system/log_server Drops the fail-fast in _resolve_cluster_endpoints. When /system/log_server is absent from cluster_config.endpoints, the Controller starts a bundled MemStore-backed log server in-process (capped at 200k rows, FIFO eviction) so workers and the dashboard work out of the box without an external finelog deployment. Logs are lost on controller restart and capped in memory; production deployments still want an external finelog-server. Adds max_rows cap to MemStore + tests. Code comment in controller.py points at #5215 (rigging-mediated log sinks) for the longer-term cleanup that replaces the bundled-server fallback with a NullSink + StderrSink pair.
Folds finelog.time.Timestamp into finelog.logging as a nested message and deletes finelog/proto/time.proto. With no time.proto in finelog there is no descriptor-pool basename collision, so iris reverts iris_time.proto and iris_time_pb2 back to time.proto / time_pb2 (and the .pyi shim goes away). The iris_logging.proto rename stays — that collision is on the logging.proto basename and we still keep iris's wire-compatible LogEntry duplicate for the controller/job RPCs that embed it. API unchanged on both sides: entry.timestamp.epoch_ms still works.
The lift agent deleted iris/tests/test_remote_log_handler.py claiming finelog had equivalent coverage; that was wrong — finelog only had store/server tests. Lift the deleted test into finelog/tests/test_client.py, pruned to 7 tests covering: handler emit, no-deadlock on push failure, flush blocks until shipped, flush timeout returns False, batch_size trigger, close drains, max_buffer_size overflow.
Standardize the log server endpoint name on /system/log-server everywhere (worker resolver, controller registration, config docs, tests). The canonical+alias-with-deprecation-warning split was unnecessary churn — keep the original hyphen name and remove the underscore variant entirely.
Replace the if/elif dispatcher with a small registry; add concrete handlers for the two schemes that were previously stubs. gcp:// resolves a GCE instance to its internal IP via `gcloud compute describe`; k8s:// templates an in-cluster Service DNS name. Project/zone/port/namespace flow from EndpointSpec.metadata. No new dependencies.
Replace the stale Cloud Run yaml and deploy READMEs with a click-based CLI that creates and restarts a GCE VM running the finelog Docker image. Mirrors iris's pattern: subprocess gcloud, idempotent bootstrap re-run over SSH for restart, no separate persistent disk (boot disk + GCS offload via FINELOG_REMOTE_DIR). Pins the image by digest at the CLI layer before passing to the bootstrap script.
The deploy bootstrap polls /health to confirm the container is up; the ASGI app didn't expose it. Add a Starlette /health route returning "ok". Hardcode --boot-disk-type=pd-ssd in the deploy CLI for consistent log-write IOPS. Add a finelog matrix entry to docker-images.yaml so the image is rebuilt weekly alongside iris.
Empty FINELOG_REMOTE_DIR disables GCS offload (the Dockerfile already handles it). Required-by-default forced operators to invent a bucket just to smoke-test a VM. Verified end-to-end on GCE in hai-gcp-models / us-central1-a: create, status, restart, delete all green.
Replace the GCE-specific create/restart/delete/status/logs commands with config-driven up/down/restart/status/logs that read a finelog config file and dispatch to a gcp or k8s backend. The config picks the deployment platform; the CLI is platform-agnostic, mirroring how `iris cluster start` decides backend from cluster yaml. New: load_finelog_config / derive_endpoint_uri helpers consumed by iris's controller for /system/log-server auto-injection. Adds k8s deployment via templated manifests + kubectl. Bundles canonical `marin.yaml` and `marin-dev.yaml` configs.
Add IrisClusterConfig.log_server_config (a finelog config name).
When set, the controller auto-derives /system/log-server from the
finelog config and `iris cluster log-server {up,down,restart,
status,logs}` forwards to the matching finelog deploy commands.
Mutually exclusive with explicit endpoints[/system/log-server].
The bundled-MemStore fallback (no config at all) is unchanged.
Updates marin-dev.yaml to demonstrate the field.
Two fixes that surfaced when running `iris cluster log-server up --cluster marin-dev` end-to-end: 1. _wait_health used `gcloud compute ssh ... curl /health` to verify the container, which requires the operator to have OS Login on the VM. When the VM runs as iris-controller@... that's not the operator's user. Switch to polling the VM's serial console for the bootstrap-script's own healthy/FAILED markers — no SSH needed. The bootstrap already validates /health from inside the VM, so this is the authoritative signal. 2. SSH-using commands (restart, logs, status container probe) now pass --impersonate-service-account=<sa> when the config sets one, mirroring iris.cluster.providers.remote_exec. Without this, the operator's gcloud client can't push an SSH key via OS Login when the SA owns the VM.
Three failures surfaced on CI that the local pytest run didn't catch: 1. lib/iris/Dockerfile didn't list lib/finelog as a workspace member when synthesizing the build-stage workspace pyproject. After iris added marin-finelog as a runtime dep, uv sync inside the controller image build failed to resolve. Add lib/finelog to members + sources and COPY its pyproject + src + config alongside rigging. 2. iris e2e and integration tests imported FetchLogsRequest / FetchLogsResponse from iris.rpc.logging_pb2, but those types moved to finelog during the lift. Switch the imports. 3. Dashboard's _proxy_log_rpc still posted upstream at the old /iris.logging.LogService/ path, returning 404 against the renamed /finelog.logging.LogService/ service.
Two related fixes for `iris cluster start --fresh` (CW CI):
1. lib/iris/Dockerfile.dockerignore allowlists exactly the libs that
the Dockerfile COPYs. Add lib/finelog/{pyproject.toml,src,config}.
Without this, COPY lib/finelog/* fails with "not found" because
buildx applies the per-Dockerfile dockerignore over the marin-root
build context.
2. The first deps-stage uv sync runs --no-install-project, but uv
still validates each workspace member's [tool.hatch.build.targets.
wheel.force-include] paths. finelog's pyproject force-includes the
config/ dir, which must exist even at the metadata-only sync step.
Move COPY lib/finelog/config/ ahead of the first uv sync.
The integration test still pulled LogServiceImpl from the deleted iris.log_server package. Switch to finelog.server.service.
iris-e2e-smoke: dashboard frontend (useRpc.ts, rsbuild.config.ts) and the in-process controller's proxy route were still using the old /iris.logging.LogService path. The log service moved to finelog.logging during the lift, so /iris.logging.LogService/FetchLogs returned 404. Rename frontend client paths and the dashboard route to /finelog.logging.LogService — the controller's WSGI mount already serves finelog's path. marin-tests: marin.mcp.babysitter still imported build_log_source from the deleted iris.cluster.log_store package. Switch to iris.cluster.log_store_helpers.
Two more iris.rpc.logging_* references in marin.mcp.babysitter that broke after the lift: LogServiceClientSync (logging_connect) and the LogEntry/FetchLogsRequest types (logging_pb2). Switch both to finelog.
…py config/
- main.py: argparse → click with FINELOG_{PORT,LOG_DIR,REMOTE_DIR,LOG_LEVEL}
envvars so the image picks up FinelogConfig.port without overriding CMD.
- Dockerfile: drop hardcoded --port 10001; copy lib/finelog/config/ before
the project install so non-editable wheel builds satisfy hatchling's
force-include directive.
- k8s 02-deployment.yaml.tmpl + bootstrap.py: pass FINELOG_PORT to the
container so probes/health checks line up on non-default ports.
- _k8s.py: filter the variable set to what each template references — the
three manifests use disjoint subsets and render_template raises on unused
vars, which made `finelog deploy up` unusable before reaching kubectl.
- New test_deploy_k8s.py covers all three manifests + port threading.
e7a4b1e to
da1e609
Compare
Project Idea
Lift the logging service out of the iris controller into a standalone
lib/finelogpackage per issue #5210's 1-pager process. Reduces the controller's footprint, lets the log subsystem evolve independently, and unblocks future work on a sibling stats service. Also a forcing function for figuring out our service-extraction template (proxy management, proto sharing, deploy artifacts) on a small target before tackling stats.Challenges
Cross-project proto dependencies. Iris's
controller.protoandjob.protoembedLogEntrydirectly in their RPC schemas, so finelog's proto can't replace iris's without a cross-package proto import (currently disallowed). Resolution: iris keeps a wire-compatible duplicate atiris_logging.protoand transcodes viaSerializeToStringround-trip at the boundary inservice.py. Iris'stime.protowas renamed toiris_time.protofor descriptor-pool dedup;iris.rpc.{logging_pb2, time_pb2}are now shim re-exports.Proxy / discovery. Workers already resolve the log server via
controller.list_endpoints("/system/log_server"), so the data plane lifts cleanly. The new question is how operators tell iris where the external server lives. Solved with a genericcluster_config.endpointsmap (logical name →EndpointSpec{uri, metadata}) and aresolve_endpoint_urischeme dispatch (http://works today;gcp://andk8s://are explicitNotImplementedErrorstubs to be filled when we deploy that way).Costs / Risks
Iris startup gains a new soft dependency. To keep "iris on a laptop" working, the controller hosts a bundled in-process MemStore-backed log server (capped at 200k rows, FIFO eviction) when
/system/log_serveris absent from the endpoints config — pure dev convenience, lossy on restart. Production is expected to declare an external finelog endpoint.Churn with no immediate user-visible improvement. The benefit shows up later: independent iteration on logging, easier stats service spike, and a worked example for the next service lift.
Auth churn. The deleted log-path JWT plumbing means the new server is unauthenticated; deployments must restrict access at the network layer (k8s NetworkPolicy / GCP firewall / VPC). The current
iris-on-coreweaveandiris-on-gcppaths satisfy this; flagged for any future deploy that doesn't.Design
lib/finelog/is a new top-level package owningproto/{logging,time}.proto(packagefinelog.logging), generatedrpc/,store/(MemStore + DuckDBLogStore),server/(LogServiceImpl + minimal CLI launcher; no auth, no stats),client/(LogPusher + RemoteLogHandler + LogServiceProxy). Iris keepsiris/cluster/log_store_helpers.pyfor the JobName/TaskAttempt-aware key builders that wrap finelog opaque-string primitives. Deploy artifacts (Dockerfile, k8s YAML, Cloud Run YAML) ship underlib/finelog/deploy/. Iris's_start_log_serversubprocess path is gone; the bundled in-process fallback lives inController._start_local_log_serverwith aTODO(#5215)pointer at the rigging-sink follow-up that would replace it with a NullSink/StderrSink pair.Testing
Tested against the iris dev controller's unit + integration suite. After cutover: 1975 iris tests pass / 1 skipped (excl. slow/docker/e2e); 24 finelog tests pass standalone (mem store, duckdb store, server concurrency cap + RPC round-trip). Pre-commit and pyrefly clean.
The two rollout cases worth checking on dev cluster:
/system/log_serverfrom the controller's endpoint registry — same wire format, same client. Expect no rollout coordination.Open Questions
How
gcp://andk8s://resolution should actually be implemented — currently stubs that raise. Likely a follow-up once the first non-http://deployment shows up.Whether the iris-side
LogEntryduplicate is worth collapsing later (would require either lifting controller/job protos out of iris or relaxing the no-cross-proto-import rule for finelog specifically).Whether dashboard log fetch should grow alternate read sources (direct GCS, GCP Logging) — orthogonal to this PR but would reuse the same endpoint-config mechanism.