Add lifecycle event log and listEvents API #509
Conversation
Add a first pass at sandbox lifecycle events with durable storage and replay support. This gives callers a simple way to inspect sandbox, process, and port state changes without building their own polling log. The new listEvents API is intentionally small and synchronous. It is a safe base for later webhook and streaming work because it defines the event schema, sequence model, and retention behavior in one place.
🦋 Changeset detected. Latest commit: 1e5dc07. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
🐳 Docker Images Published
Usage: `FROM cloudflare/sandbox:0.0.0-pr-509-1e5dc07`
📦 Standalone Binary, for arbitrary Dockerfiles:
```
COPY --from=cloudflare/sandbox:0.0.0-pr-509-1e5dc07 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]
```
Download via GitHub CLI: `gh run download 23413619720 -n sandbox-binary`
Extract from Docker: `docker run --rm cloudflare/sandbox:0.0.0-pr-509-1e5dc07 cat /container-server/sandbox > sandbox && chmod +x sandbox`
All call sites bypassed enqueueLifecycleEvent and called recordLifecycleEvent directly, defeating the write queue's serialization guarantee. Concurrent calls could read the same seq value and overwrite each other's events in storage. Route every write through enqueueLifecycleEvent and return the Promise so callers that need to await the result can do so.
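The serialization this comment asks for can be sketched as a promise-chain write queue. This is an illustrative sketch, not the PR's actual code: the class name, the in-memory array standing in for durable storage, and the internals of `recordLifecycleEvent` are assumptions; only the `enqueueLifecycleEvent`/`recordLifecycleEvent` split mirrors the comment.

```typescript
// Sketch: every write is chained onto a single tail promise, so seq
// allocation and the storage write are serialized across callers.
type LifecycleEvent = { seq: number; type: string; timestamp: number };

class EventLog {
  private tail: Promise<unknown> = Promise.resolve();
  private nextSeq = 1;
  private events: LifecycleEvent[] = []; // stands in for durable storage

  // All call sites go through enqueue; none call recordLifecycleEvent
  // directly, so concurrent writers can never read the same seq.
  enqueueLifecycleEvent(type: string): Promise<LifecycleEvent> {
    const result = this.tail.then(() => this.recordLifecycleEvent(type));
    // Keep the chain alive even if one write fails.
    this.tail = result.catch(() => undefined);
    // Return the Promise so callers that need to await the result can.
    return result;
  }

  private async recordLifecycleEvent(type: string): Promise<LifecycleEvent> {
    const event = { seq: this.nextSeq++, type, timestamp: Date.now() };
    this.events.push(event);
    return event;
  }

  list(): LifecycleEvent[] {
    return this.events;
  }
}
```

With this shape, two synchronous `enqueueLifecycleEvent` calls are guaranteed distinct, increasing `seq` values even though neither awaits the other.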
A storage failure in any enqueueLifecycleEvent call should never prevent the primary operation from completing. The constructor fix also moves the lifecycleEventsInitialized flag write to after the event attempt, so a failed write does not suppress future retries.
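A minimal sketch of that pattern, assuming hypothetical names (`LifecycleRecorder` and `ensureCreatedEvent` are illustrative, not the PR's actual identifiers):

```typescript
// Sketch: the primary operation never fails because of event storage,
// and the initialized flag is only set after a successful write.
class LifecycleRecorder {
  lifecycleEventsInitialized = false;

  constructor(private write: (type: string) => Promise<void>) {}

  async ensureCreatedEvent(): Promise<void> {
    if (this.lifecycleEventsInitialized) return;
    try {
      await this.write("sandbox.created");
      // Set only after the attempt succeeds, so a failed write
      // does not suppress future retries.
      this.lifecycleEventsInitialized = true;
    } catch {
      // Swallow: a storage failure must never block the caller.
    }
  }
}
```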
I went through this change and have some high-level feedback. Largely, I think the event schema is good and roughly similar to your earlier changes under #456. But the fact that we've now landed #456 is super interesting and powerful in the context of this PR, because the canonical logs carry more context per event (e.g., duration, outcome, origin) and broader coverage (file ops, git ops, command exec, backups).
If I were to hash out a few use cases users could have around this change and propose alternate approaches:
Audit trails / dashboards
This one is more straightforward. With observability.enabled, all canonical events flow into Workers Logs and are queryable in the dashboard, filterable by any field, retained for 7 days, and, importantly, cross-sandbox. For longer retention, Logpush ships workers_trace_events to many different destinations. Both of these are zero-code-change and more capable than our in-SDK implementation here, which offers a 1000-event, DO-scoped log.
Orchestration / reacting to events
A Tail Worker can be set up rather easily to receive every canonical event in real time. It can filter by event name and forward to a Queue, HTTP endpoint, or any other destination.
```js
export default {
  async tail(events, env) {
    for (const event of events) {
      for (const log of event.logs) {
        if (log.message?.[0]?.event === 'process.exit') {
          await env.MY_QUEUE.send(log.message[0]);
        }
      }
    }
  }
}
```
This does not cover replay-on-restart, but that can still theoretically be done with a Tail Worker writing to D1 or KV with a sequence number, giving the same `listEvents({ afterSeq })` semantics but with unlimited retention and cross-sandbox queries, both of which we can't offer within the SDK.
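The ingestion half of that idea could look roughly like this. The `TraceItem` shape is simplified, and the `Map` stands in for a real D1 or KV write; everything here is a sketch, not a working Tail Worker.

```typescript
// Sketch: a tail handler assigns its own sequence numbers and persists
// each canonical event, enabling listEvents({ afterSeq })-style reads.
type TraceItem = { logs: { message: unknown[] }[] };

const store = new Map<number, unknown>(); // stands in for D1/KV
let nextSeq = 1;

function ingest(events: TraceItem[]): void {
  for (const event of events) {
    for (const log of event.logs) {
      const first = log.message?.[0] as { event?: string } | undefined;
      if (first?.event) {
        store.set(nextSeq++, first); // durable write in the real version
      }
    }
  }
}
```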
If there are any other use cases that emerge beyond these, we should discuss them and see how we can work this out at the overall Workers platform level rather than solving for it within the DO/container layers. Or, if it's just these, maybe we can more simply write an example and/or docs to clearly illustrate how to achieve the same setup with just the existing primitives.
Summary
This is PR 1 for lifecycle events.
It adds the first durable event log and replay API for sandboxes so we can
build webhook and streaming delivery on top of a stable internal model.
What changed
- Shared lifecycle event types in `@repo/shared`
- `sandbox.listEvents()` added to the public SDK surface
- Per-sandbox monotonic `seq` for ordering and replay
- Event types: `sandbox.created`, `sandbox.started`, `sandbox.destroyed`, `process.started`, `process.exited`, `port.exposed`, `port.unexposed`

Why this shape
I kept this PR intentionally narrow.
The goal here is to establish the canonical event schema and a replayable
storage model before adding webhook delivery or streaming subscriptions.
That gives us one source of truth for ordering, filtering, and retention.
API
Each event includes:
- `id`
- `seq`
- `sandboxId`
- `timestamp`
- `type`
- type-specific fields such as `processId`, `exitCode`, `port`, or `url`

Reviewer notes
Please review this PR primarily for the lifecycle event model rather than for the final product surface.
The main things worth pressure-testing are:
- Whether `listEvents({ afterSeq, limit, types })` is the right replay API.

A few intentional constraints in this PR:
Those should be easier to add once the event schema and ordering semantics are
settled.
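To make the replay semantics concrete, here is a minimal sketch of how `listEvents({ afterSeq, limit, types })` could behave. The event field types (e.g. `timestamp` as epoch milliseconds) and the in-memory implementation are assumptions for illustration, not the PR's actual code.

```typescript
// Assumed event shape, based on the fields listed under "API" above.
interface LifecycleEvent {
  id: string;
  seq: number;        // per-sandbox monotonic sequence number
  sandboxId: string;
  timestamp: number;  // assumption: epoch milliseconds
  type: string;       // e.g. "process.exited", "port.exposed"
  processId?: string;
  exitCode?: number;
  port?: number;
  url?: string;
}

// Sketch of the replay semantics: afterSeq for resuming, types for
// filtering, limit for paging, results strictly in seq order.
function listEvents(
  all: LifecycleEvent[],
  opts: { afterSeq?: number; limit?: number; types?: string[] } = {},
): LifecycleEvent[] {
  let out = all
    .filter((e) => e.seq > (opts.afterSeq ?? 0))
    .sort((a, b) => a.seq - b.seq);
  if (opts.types) {
    out = out.filter((e) => opts.types!.includes(e.type));
  }
  return opts.limit !== undefined ? out.slice(0, opts.limit) : out;
}
```

Under these semantics, a caller resuming after a restart would persist the last `seq` it processed and pass it back as `afterSeq`.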
Testing
Ran:
- `npm run check -w @cloudflare/sandbox`
- `npm run typecheck -w @repo/shared`
- `npm test -w @cloudflare/sandbox -- sandbox.test.ts get-sandbox.test.ts`

Follow-up
I plan to stack PR 2 on top of this branch to continue lifecycle event work.