Bounded teardown drain: always ingest pending writeback drafts (retire the --flush-outbox-once / --push-local-once split) #305

@willwashburn

Problem

Writeback drafts are ingested into the durable outbox by the running daemon's sync cycle (pushLocal). A draft written after the daemon's last cycle and just before shutdown (for example, a one-shot sandbox writing a final fire-and-forget reply right before teardown) is on disk but not yet in the outbox. The teardown cleanup runs --flush-outbox-once (Syncer.FlushOutboxOnce), which flushes the outbox only and does no local scan (the "flush-124 cure"), so that draft is never ingested and is silently dropped.

Interim fix (shipped)

  • relayfile #304 ("feat(mount): --push-local-once teardown drain (stop dropping last-moment writeback drafts)") adds --push-local-once / Syncer.PushLocalAndFlushOnce: one pushLocal pass (scans the on-disk mirror in the fresh cleanup process, ingesting drafts the daemon missed) followed by an outbox flush, skipping pullRemote/digest/websocket.
  • cloud #2268 makes the mount cleanup shell feature-detect --push-local-once and use it only when pending local writes are detected (find -newer $RELAYFILE_MOUNT_FLUSH_MARKER), keeping --flush-outbox-once as the no-pending-writes fast path.

This works, but it is a two-mode design with the "is there pending work?" decision pushed up into the cloud cleanup shell. The split exists because pushLocal calls scanLocalFiles (a full local-tree walk), which is the exact cost --flush-outbox-once exists to avoid (flush-124 timeouts on large mounts). So today the drain cannot simply run unconditionally.

The right fix

Make the teardown drain always correct AND always cheap, so there's a single mode and no cloud-side conditional:

  • Give the drain bounded local-write detection: scan only files changed since a marker (mtime-since-baseline, or a persisted dirty-set / journal), not the whole tree via scanLocalFiles.
  • Then a single teardown call ("ingest pending drafts, then flush outbox") is safe to run unconditionally: O(pending) instead of O(tree), so no flush-124 regression even on large mounts.

Outcome

  • Retire the --flush-outbox-once vs --push-local-once split (or fold both into one always-on drain).
  • Drop the cloud cleanup-shell find -newer mode-selection (cloud #2268) — the daemon becomes correct on its own.
  • Fire-and-forget writebacks become reliable for every agent with no per-call timeouts and no caller having to reason about which flush mode to use.

Pointers

  • internal/mountsync/syncer.go: pushLocal → scanLocalFiles (full walk); FlushOutboxOnce (outbox-only); PushLocalAndFlushOnce (relayfile #304, full pushLocal + flush); HandleLocalChange (the event-driven ingest path, which already knows dirty paths in the running daemon; the cleanup runs as a fresh process without that in-memory state, hence the scan).
  • The watcher/coalescer already tracks per-path changes in the running daemon; a persisted dirty-set (or .relay-tracked changed-since-marker list) would let the fresh cleanup process drain only pending paths.

Motivation

Surfaced while fixing intermittently-dropped threaded Slack replies from scheduled scan agents (AgentWorkforce/cloud#2261 threading + #2268 / relayfile #304 drain). The flag-based fix resolves it; this issue tracks the cleaner single-mode design.
