observability: implement structured logger and EVENTS registry #106

Merged: AbdelStark merged 1 commit into main from observability/logger-and-events on May 20, 2026

@AbdelStark

Owner

Problem

RFC-0013 / docs/spec/05-observability.md define a JSONL structured-log contract that every other subsystem references (training metrics, eval book-ends, cache hit/miss, attestation verify, error events). Nothing downstream can take a hard dependency on event names until the EVENTS registry and the per-run JSONL sink land.

Solution

  • geno_lewm/observability.py

    • EVENTS: tuple of EventSpec(name, severity, summary) covering all 22 canonical v0.1 events. Renaming an event is a MAJOR change. Duplicate names are detected at import time.
    • LogRecord: dataclass carrying every spec-required field; standardized optional fields (step, epoch, phase, duration_ms, trace_id, span_id, error_code) are emitted only when non-None.
    • get_logger(component, run_id?, log_dir?, level?, pretty?): cached factory; identical args return the same instance, so independent subsystems share a single ordered stream per run.
    • JSONL sink: line-buffered, append-only, lock-serialized. Default path ${GENO_LEWM_LOG_DIR}/{run_id}.jsonl, falling back to ~/.geno-lewm/logs/. Safe for concurrent writes from multiple threads.
    • Pretty stderr: auto-on when stderr is a TTY (or GENO_LEWM_LOG_FORMAT=pretty); jsonl otherwise.
    • Trace context: contextvar pair + set_trace_context() block; IDs attach only when set.
    • logged_run(): context manager that flushes the sink on any exception. For GenoLeWMError subclasses, emits a final event="error" record carrying the typed error_code before re-raising (INV-OBS-6).
  • tests/unit/test_observability.py (20 cases):

    • Canonical event coverage (locks the spec list).
    • Required-field record shape; ISO-8601-ms Z timestamp; severity threshold; standardized-field promotion; trace context present/absent; error_code only when supplied; factory caching; env-var log-dir resolution; unique default run_id; JSONL path; book-end events; crash survival (typed + untyped exceptions); set_level rejects invalid levels with InputError; thread-safety smoke (4 threads × 50 writes); JSON isolation of data.

Validation

$ python -m pytest tests/ -q
164 passed in 0.14s

$ python -m tools.lint.check_error_codes
$ echo $?
0

The error linter from #22 immediately caught two untyped raises (RuntimeError / ValueError) in this module's first draft. Both were replaced with InvariantViolation / InputError (the public surface intentionally follows the typed-error contract; see RFC-0012).
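
As a rough illustration of how the typed-error contract pairs with logged_run(), the sketch below writes a final `error` record for typed errors and flushes the sink in every case. The `JsonlSink` class, the `E_INPUT` / `E_UNKNOWN` code strings, and the `logged_run(sink, run_id)` signature are hypothetical stand-ins, not the module's actual API.

```python
import json
import threading
from contextlib import contextmanager
from pathlib import Path


class GenoLeWMError(Exception):
    """Stand-in for the project's typed-error base class."""
    error_code = "E_UNKNOWN"  # hypothetical code string


class InputError(GenoLeWMError):
    error_code = "E_INPUT"  # hypothetical code string


class JsonlSink:
    """Append-only, line-buffered JSONL writer guarded by a lock."""

    def __init__(self, path):
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        self._lock = threading.Lock()
        # buffering=1 selects line buffering for text-mode files.
        self._fh = open(path, "a", buffering=1, encoding="utf-8")

    def write(self, record: dict) -> None:
        line = json.dumps(record, sort_keys=True)
        with self._lock:  # one whole line per critical section
            self._fh.write(line + "\n")

    def flush(self) -> None:
        with self._lock:
            self._fh.flush()

    def close(self) -> None:
        with self._lock:
            self._fh.close()


@contextmanager
def logged_run(sink: JsonlSink, run_id: str):
    """Flush on exit; emit a final error record for typed errors only."""
    try:
        yield sink
    except GenoLeWMError as exc:
        sink.write({
            "event": "error",
            "severity": "error",
            "run_id": run_id,
            "error_code": exc.error_code,
        })
        raise  # the caller still sees the original exception
    finally:
        # Runs on success, typed errors, and untyped errors alike,
        # so records written before a crash reach the file.
        sink.flush()
```

In this sketch an untyped exception such as KeyError propagates unchanged and adds no extra record; only the flush happens.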

Caveats / out of scope (deferred to follow-ups)

Closes #23

Add geno_lewm/observability.py implementing the JSONL structured-log
contract from docs/spec/05-observability.md and RFC-0013:

- EVENTS: immutable registry of EventSpec(name, severity, summary)
  rows covering every event named in the canonical v0.1 table.
- LogRecord: dataclass carrying the spec-required fields (ts,
  severity, event, run_id, component, data) plus the standardized
  optional fields (step, epoch, phase, duration_ms, trace_id,
  span_id, error_code). to_dict() emits the stable wire shape; the
  optional fields are omitted when None.
- get_logger(component, run_id, log_dir, level, pretty): cached
  factory returning a thread-safe GenoLeWMLogger bound to a shared
  per-run sink. Defaults respect GENO_LEWM_LOG_DIR /
  GENO_LEWM_LOG_LEVEL / GENO_LEWM_LOG_FORMAT, and fall back to
  ~/.geno-lewm/logs.
- JSONL sink: line-buffered, append-only, mkdir -p the run dir.
  Concurrent components writing under the same run_id share one
  ordered stream guarded by a lock.
- Pretty stderr formatter: auto-enabled when stderr is a TTY (or
  GENO_LEWM_LOG_FORMAT=pretty); jsonl otherwise.
- Trace context: contextvar pair plus set_trace_context() block.
  trace_id / span_id are attached to records iff they are set.
- logged_run(): context manager that flushes the sink on any
  exception (records survive a crash) and, for GenoLeWMError
  subclasses, emits a final ``error`` record carrying the typed
  error_code before re-raising.

Tests in tests/unit/test_observability.py cover record shape, the
canonical event coverage, the ISO-8601-ms timestamp format, severity
threshold filtering, standardized-field promotion, trace context
attach/absent, error_code propagation, factory caching, default
log-dir resolution, run-id uniqueness, crash survival (both typed
and untyped exceptions), set_level validation (raises InputError),
thread-safety smoke (4 threads × 50 writes), and JSON-isolation of
the data field.

Closes #23