Add lifecycle event log and listEvents API#509

Closed
whoiskatrin wants to merge 9 commits into main from feat/lifecycle-events-pr1

@whoiskatrin
Collaborator

@whoiskatrin whoiskatrin commented Mar 22, 2026

Summary

This is PR 1 for lifecycle events.

It adds the first durable event log and replay API for sandboxes so we can
build webhook and streaming delivery on top of a stable internal model.

What changed

  • adds shared lifecycle event types in @repo/shared
  • adds sandbox.listEvents() to the public SDK surface
  • stores lifecycle events in Durable Object storage with monotonic seq
  • adds bounded retention for the per-sandbox event log
  • records the first event set for:
    • sandbox.created
    • sandbox.started
    • sandbox.destroyed
    • process.started
    • process.exited
    • port.exposed
    • port.unexposed
  • adds unit coverage for replay and event-type filtering
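
For reviewers who want a concrete picture of the storage model, here is a minimal in-memory sketch of an append-only log with a monotonic seq and a bounded retention cap. It is an illustration only: the class name `BoundedEventLog`, the `StoredEvent` shape, and the trimming strategy are assumptions, and the PR itself persists events in Durable Object storage rather than in an array.

```typescript
// Hypothetical event shape; the real types live in @repo/shared.
interface StoredEvent {
  seq: number;
  type: string;
  timestamp: string;
}

// In-memory stand-in for the per-sandbox log (assumption: the real log is
// backed by Durable Object storage).
class BoundedEventLog {
  private events: StoredEvent[] = [];
  private nextSeq = 1;
  private readonly maxEvents: number;

  constructor(maxEvents = 1000) {
    this.maxEvents = maxEvents;
  }

  // Assigns the next seq, appends, then drops the oldest events past the cap.
  append(type: string, timestamp: string = new Date().toISOString()): StoredEvent {
    const event: StoredEvent = { seq: this.nextSeq++, type, timestamp };
    this.events.push(event);
    if (this.events.length > this.maxEvents) {
      this.events.splice(0, this.events.length - this.maxEvents);
    }
    return event;
  }

  // Replays events with seq strictly greater than afterSeq, oldest first.
  list(options: { afterSeq?: number; limit?: number; types?: string[] } = {}): StoredEvent[] {
    const { afterSeq = 0, limit = 100, types } = options;
    return this.events
      .filter((e) => e.seq > afterSeq)
      .filter((e) => !types || types.includes(e.type))
      .slice(0, limit);
  }
}
```

Note that seq keeps increasing after trimming, so a caller holding an old afterSeq cursor still gets a consistent ordering, just without the events that fell out of retention.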

Why this shape

I kept this PR intentionally narrow.

The goal here is to establish the canonical event schema and a replayable
storage model before adding webhook delivery or streaming subscriptions.
That gives us one source of truth for ordering, filtering, and retention.

API

const events = await sandbox.listEvents({
  afterSeq: 10,
  limit: 100,
  types: ['process.exited', 'port.exposed']
});

Each event includes:

  • id
  • seq
  • sandboxId
  • timestamp
  • type
  • event-specific fields such as processId, exitCode, port, or url
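
To make the replay semantics easier to evaluate, here is a rough sketch of the event shape and a paging loop built on listEvents. The type names, the optional-field layout, the string timestamp, and the `replayAll` helper are all assumptions for illustration; the sketch also assumes events come back in ascending seq order and that afterSeq is exclusive.

```typescript
// Hypothetical types inferred from the field list above; the real
// definitions in @repo/shared may differ.
type LifecycleEventType =
  | 'sandbox.created'
  | 'sandbox.started'
  | 'sandbox.destroyed'
  | 'process.started'
  | 'process.exited'
  | 'port.exposed'
  | 'port.unexposed';

interface LifecycleEvent {
  id: string;
  seq: number;
  sandboxId: string;
  timestamp: string; // assumption: ISO 8601 string
  type: LifecycleEventType;
  // Event-specific fields, modeled here as optional for simplicity.
  processId?: string;
  exitCode?: number;
  port?: number;
  url?: string;
}

interface ListEventsOptions {
  afterSeq?: number;
  limit?: number;
  types?: LifecycleEventType[];
}

// Anything exposing listEvents, e.g. a sandbox stub.
interface LifecycleEventSource {
  listEvents(options: ListEventsOptions): Promise<LifecycleEvent[]>;
}

// Pages through the log from a cursor until a short or empty page is returned.
async function replayAll(
  source: LifecycleEventSource,
  afterSeq: number,
  pageSize = 100,
  types?: LifecycleEventType[]
): Promise<LifecycleEvent[]> {
  const all: LifecycleEvent[] = [];
  let cursor = afterSeq;
  for (;;) {
    const page = await source.listEvents({ afterSeq: cursor, limit: pageSize, types });
    if (page.length === 0) break;
    all.push(...page);
    cursor = page[page.length - 1].seq; // advance past the last event seen
    if (page.length < pageSize) break;
  }
  return all;
}
```

A caller would persist the last seq it processed and pass it back as afterSeq on the next run.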

Reviewer notes

Please review this PR primarily for the lifecycle event model rather than for
final product surface.

The main things worth pressure-testing are:

  1. Whether the v1 event set is the right minimal base for later webhook work.
  2. Whether listEvents({ afterSeq, limit, types }) is the right replay API.
  3. Whether Durable Object storage is the right source of truth for ordering.
  4. Whether the retention cap of 1000 events feels reasonable for this phase.

A few intentional constraints in this PR:

  • no webhook delivery yet
  • no SSE/WebSocket event stream yet
  • no command-level events yet
  • no cross-sandbox aggregation yet

Those should be easier to add once the event schema and ordering semantics are
settled.

Testing

Ran:

  • npm run check -w @cloudflare/sandbox
  • npm run typecheck -w @repo/shared
  • npm test -w @cloudflare/sandbox -- sandbox.test.ts get-sandbox.test.ts
  • full pre-push typecheck hook

Follow-up

I plan to stack PR 2 on top of this branch to continue lifecycle event work.


Add a first pass at sandbox lifecycle events with durable storage and
replay support. This gives callers a simple way to inspect sandbox,
process, and port state changes without building their own polling log.

The new listEvents API is intentionally small and synchronous. It is a
safe base for later webhook and streaming work because it defines the
event schema, sequence model, and retention behavior in one place.
@changeset-bot

changeset-bot bot commented Mar 22, 2026

🦋 Changeset detected

Latest commit: 1e5dc07

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name                 Type
@cloudflare/sandbox  Patch

devin-ai-integration[bot]

This comment was marked as resolved.

@pkg-pr-new

pkg-pr-new bot commented Mar 22, 2026

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@509

commit: 1e5dc07

@github-actions
Contributor

github-actions bot commented Mar 22, 2026

🐳 Docker Images Published

Variant   Image
Default   cloudflare/sandbox:0.0.0-pr-509-1e5dc07
Python    cloudflare/sandbox:0.0.0-pr-509-1e5dc07-python
OpenCode  cloudflare/sandbox:0.0.0-pr-509-1e5dc07-opencode
Musl      cloudflare/sandbox:0.0.0-pr-509-1e5dc07-musl
Desktop   cloudflare/sandbox:0.0.0-pr-509-1e5dc07-desktop

Usage:

FROM cloudflare/sandbox:0.0.0-pr-509-1e5dc07

Version: 0.0.0-pr-509-1e5dc07


📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-509-1e5dc07 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 23413619720 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-509-1e5dc07 cat /container-server/sandbox > sandbox && chmod +x sandbox

devin-ai-integration[bot]

This comment was marked as resolved.

All call sites bypassed enqueueLifecycleEvent and called
recordLifecycleEvent directly, defeating the write queue's
serialization guarantee. Concurrent calls could read the same
seq value and overwrite each other's events in storage.

Route every write through enqueueLifecycleEvent and return the
Promise so callers that need to await the result can do so.
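
The race described in this commit message can be reproduced with a small sketch. Everything here is hypothetical (the `SerialWriteQueue` class, the `recordEventUnsafe` function, and the Map standing in for storage); it only illustrates why a read-then-write of seq needs to go through a serializing queue, and is not the code in this PR.

```typescript
// Stand-in for Durable Object storage (assumption for illustration).
const storage = new Map<string, number>();

// Reads seq, yields once to simulate async storage latency, then writes.
// Concurrent callers can all read the same seq before any of them writes.
async function recordEventUnsafe(): Promise<number> {
  const seq = (storage.get('seq') ?? 0) + 1;
  await Promise.resolve();
  storage.set('seq', seq);
  return seq;
}

// Chains every task onto the previous one so writes run one at a time.
class SerialWriteQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Swallow rejections on the chain so one failed write does not block
    // later ones; the caller still sees the rejection through `result`.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

Calling `recordEventUnsafe()` three times concurrently hands out the same seq to every caller, while routing the same calls through `queue.enqueue(recordEventUnsafe)` yields distinct, increasing values.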
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

A storage failure in any enqueueLifecycleEvent call should never
prevent the primary operation from completing. The constructor fix
also moves the lifecycleEventsInitialized flag write to after the
event attempt, so a failed write does not suppress future retries.
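
A rough sketch of the two behaviors this commit message describes: event writes are best-effort, and the initialized flag is only set once the write has gone through. The helper names (`recordBestEffort`, `withBestEffortEvent`, `ensureCreatedEvent`) and the `state` object are hypothetical, and reading "after the event attempt" as "only on success" is an assumption.

```typescript
// Runs an event write and reports success without ever throwing.
async function recordBestEffort(
  write: () => Promise<void>,
  onError: (error: unknown) => void
): Promise<boolean> {
  try {
    await write();
    return true;
  } catch (error) {
    onError(error); // log and continue; the event is lost, the operation is not
    return false;
  }
}

// Runs the primary operation, then records the event without letting an
// event-storage failure affect the primary result.
async function withBestEffortEvent<T>(
  primary: () => Promise<T>,
  write: () => Promise<void>,
  onError: (error: unknown) => void
): Promise<T> {
  const result = await primary();
  await recordBestEffort(write, onError);
  return result;
}

// Sets the flag only after a successful write, so a failed attempt is
// retried the next time this runs.
async function ensureCreatedEvent(
  state: { initialized: boolean },
  write: () => Promise<void>,
  onError: (error: unknown) => void
): Promise<void> {
  if (state.initialized) return;
  const ok = await recordBestEffort(write, onError);
  if (ok) state.initialized = true;
}
```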
Member

@ghostwriternr ghostwriternr left a comment

I went through this change and have some high-level feedback. Largely, I think the event schema is good and is roughly similar to your earlier changes under #456. But the fact that we've now landed #456 is super interesting and powerful in the context of this PR, because the canonical logs carry more context per event (e.g., duration, outcome, origin) and have broader coverage (file ops, git ops, command exec, backups).

If I were to hash out a few use-cases users could have around this change and propose alternate approaches:
Audit trails / dashboards
This one is more straightforward. With observability.enabled, all canonical events flow into Workers Logs, where they are queryable in the dashboard, filterable by any field, retained for 7 days, and, importantly, cross-sandbox. For longer retention, Logpush ships workers_trace_events to many different destinations. Both of these are zero-code-change and more capable than our in-SDK implementation here, which covers a 1000-event, DO-scoped log.

Orchestration / reacting to events
A Tail Worker can be set up rather easily to receive every canonical event in real time. It can filter by event name and forward to a Queue, HTTP endpoint, or any other destination.

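export default {
  // Tail handler: receives batches of trace items from the producer Worker.
  async tail(events, env) {
    for (const event of events) {
      for (const log of event.logs) {
        // Assumes the canonical event is the first argument of the log call.
        if (log.message?.[0]?.event === 'process.exit') {
          // Forward the matching event to a Queue binding.
          await env.MY_QUEUE.send(log.message[0]);
        }
      }
    }
  }
};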

This does not cover replay-on-restart, but that could still theoretically be done with a Tail Worker writing to D1 or KV with a sequence number, giving the same listEvents({ afterSeq }) semantics but with unlimited retention and cross-sandbox queries, both of which we can't offer within the SDK.
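
As a starting point for that discussion, here is a hedged sketch of the Tail Worker plus D1 idea. The table name `sandbox_events`, its AUTOINCREMENT `seq` column, the `DB` binding name, and the `CanonicalEvent` shape are all assumptions; the minimal `Database` and `Statement` interfaces only mirror the `prepare().bind().run()` calls of the D1 client so the sketch stays self-contained.

```typescript
// Minimal structural types. In a real Worker these would come from
// @cloudflare/workers-types (D1Database, TraceItem).
interface Statement {
  bind(...values: unknown[]): Statement;
  run(): Promise<unknown>;
}
interface Database {
  prepare(sql: string): Statement;
}
interface TailLog {
  message: unknown;
}
interface TailItem {
  logs: TailLog[];
}

// Hypothetical canonical event shape: an object with an `event` name.
type CanonicalEvent = { event: string; [key: string]: unknown };

function isCanonicalEvent(value: unknown): value is CanonicalEvent {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as { event?: unknown }).event === 'string'
  );
}

// Picks the first console argument of each log line when it is a wanted event.
function extractCanonicalEvents(items: TailItem[], wanted: Set<string>): CanonicalEvent[] {
  const found: CanonicalEvent[] = [];
  for (const item of items) {
    for (const log of item.logs) {
      const first = Array.isArray(log.message) ? log.message[0] : undefined;
      if (isCanonicalEvent(first) && wanted.has(first.event)) found.push(first);
    }
  }
  return found;
}

// Assumed schema: sandbox_events(seq INTEGER PRIMARY KEY AUTOINCREMENT,
// event TEXT, payload TEXT). The database assigns seq on insert.
async function persistEvents(db: Database, events: CanonicalEvent[]): Promise<void> {
  for (const e of events) {
    await db
      .prepare('INSERT INTO sandbox_events (event, payload) VALUES (?, ?)')
      .bind(e.event, JSON.stringify(e))
      .run();
  }
}

// A real Tail Worker would `export default handler`.
const handler = {
  async tail(items: TailItem[], env: { DB: Database }): Promise<void> {
    await persistEvents(env.DB, extractCanonicalEvents(items, new Set(['process.exit'])));
  }
};
```

Replay would then be a query along the lines of `SELECT seq, payload FROM sandbox_events WHERE seq > ? ORDER BY seq LIMIT ?`, which gives the same cursor model as listEvents({ afterSeq, limit }).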

If any other use-cases emerge beyond these, we should discuss them and see how we can work this out at the overall Workers platform level rather than solving for it within the DO/container layers. Or if it's just these, maybe we can more simply write an example and/or docs that clearly illustrate how to achieve the same setup with just the existing primitives.
