Skip to content

Improve structured logging across SDK and container#456

Merged
ghostwriternr merged 39 commits intomainfrom
improve-logging
Mar 26, 2026
Merged

Improve structured logging across SDK and container#456
ghostwriternr merged 39 commits intomainfrom
improve-logging

Conversation

@whoiskatrin
Copy link
Copy Markdown
Collaborator

@whoiskatrin whoiskatrin commented Mar 5, 2026

What

Replaces scattered start/end log pairs with a single structured event per operation, following the canonical log lines pattern. Each operation emits exactly one log event at completion with outcome, durationMs, and relevant context fields.

Changes

Container services — canonical log events added to:

  • file.read, file.write, file.delete, file.mkdir (with sizeBytes, path)
  • process.start (with pid), process.exit (with exitCode, durationMs)
  • port.expose, port.unexpose
  • git.clone, git.checkout
  • session.create, session.destroy
  • command.exec, command.stream
  • backup.create, backup.restore, bucket.mount, bucket.unmount, sandbox.destroy

SDK / Durable Object — debug-level logs for port.expose, port.unexpose; X-Sandbox-Id header propagated to container so every log event includes sandboxId

Logger — pretty printer now shows HH:MM:SS.mmm timestamp prefix; serviceVersion and instanceId auto-injected into base context

InfrastructureBaseHttpClient supports defaultHeaders merged into every outgoing request

Event naming

All events follow domain.operation: file.read, process.exit, port.expose, git.clone, session.create, command.exec, bucket.mount, etc.

Testing

  • Typecheck: 18/18 packages pass
  • SDK unit tests: all pass
  • Container unit tests: 561 pass, 4 pre-existing PTY failures (local Bun.Terminal unavailable on macOS, unrelated to this PR)

Open with Devin

whoiskatrin and others added 24 commits March 3, 2026 10:48
Scattered start/end log pairs and breadcrumb debug logs made it hard
to query operations as single events. This replaces ~40 scattered
logs with single wide events emitted at completion, following the
canonical log lines pattern from Stripe.

Environment characteristics (serviceVersion, instanceId) are now
injected into the base logger so every log line carries them
automatically without callers needing to pass them.
Cloudflare log viewers split multi-line pretty output into separate
entries, which made request traces hard to read. Emit each pretty log
as one line with inline context so events stay grouped and scannable.
The container's stdout/stderr goes through Cloudflare's log pipeline,
which cannot render ANSI color codes and splits multi-line output into
separate log entries. The container defaults to JSON format when the
env var is absent, which is correct for production.
The container has two wide event layers that capture all request outcomes:
1. LoggingMiddleware — logs every HTTP request with method, path, status, duration
2. session.ts exec/execStream — logs every command execution with full context

Service layers (file, git, port, process, interpreter, session-manager) were
duplicating these by logging errors in every catch block, producing 3-5 log
lines per failed request instead of 1-2. This violates the wide events
principle: one context-rich event per service hop, not scattered fragments.

Changes:
- Remove duplicate error logs from all service catch blocks
- Fold PID pipe fallback reason into the execStream wide event as pidFallback
- Remove unused logger parameter from GitService and PortService
- Update container.ts constructor calls and test files accordingly
Pretty-print mode embedded raw newlines from stack traces and error
messages, causing Cloudflare's log pipeline to fragment single events
into multiple entries. Escape \n and \r in all pretty-print string
values. Also add a Bun.serve() error handler to catch unhandled
fetch errors before Bun's default handler writes raw stack traces.
Accept main's removal of streamProcessLogs from ProcessService,
which was deleted in PR #447.
HTTP middleware now uses "POST /api/contexts" instead of "HTTP request"
so the dashboard list is scannable. Container-side wide events adopt the
same noun.verb convention already used on the DO side.
The dashboard header now reads "POST /api/contexts 501" so you can see
method, path, and outcome without expanding. Removed startedAt, method,
pathname, userAgent, contentLength, and outcome from the wide event
context since they were either in the message, derivable from level,
or noise for internal SDK calls.
Pass captured Error objects to logger.error() in finally blocks
instead of undefined, preserving stack traces for debugging.
Standardize timing fields to durationMs across all wide events.
Add truncation indicator to normalizeLogLine so developers know
when log context was lost. Add missing changeset.
Co-authored-by: whoiskatrin <whoiskatrin@users.noreply.github.com>
Resolve conflicts in session-manager.ts by keeping the
classifyCommandError logic from main (session-destroyed detection)
while removing explicit service-level logging per the canonical
log lines approach.
Emit a single structured event per operation (domain.operation naming)
with outcome, durationMs, and relevant context fields in every major
service so developers can debug sandboxes from logs alone.

- FileService: file.read/write/delete/mkdir with sizeBytes and path
- ProcessService: process.start (with pid) and process.exit (with exitCode)
- PortService: port.expose/unexpose with logger injection via constructor
- GitService: git.clone/checkout with logger injection via constructor
- SessionManager: session.create/destroy with cwd and sessionId
- LoggingMiddleware: include sandboxId from X-Sandbox-Id request header
- BaseHttpClient: support defaultHeaders merged into every outgoing request
- Sandbox DO: pass X-Sandbox-Id header to container; debug logs for port ops
- Logger pretty-printer: show HH:MM:SS.mmm timestamp prefix for readability
- Tests: new LoggingMiddleware tests; defaultHeaders tests; fix constructor
  signatures for PortService and GitService in existing test files
@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Mar 5, 2026

🦋 Changeset detected

Latest commit: 320da09

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

github-actions[bot]

This comment was marked as outdated.

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new bot commented Mar 5, 2026

Open in StackBlitz

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@456

commit: 320da09

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 5, 2026

🐳 Docker Images Published

Variant Image
Default cloudflare/sandbox:0.0.0-pr-456-8cc7935
Python cloudflare/sandbox:0.0.0-pr-456-8cc7935-python
OpenCode cloudflare/sandbox:0.0.0-pr-456-8cc7935-opencode
Musl cloudflare/sandbox:0.0.0-pr-456-8cc7935-musl
Desktop cloudflare/sandbox:0.0.0-pr-456-8cc7935-desktop

Usage:

FROM cloudflare/sandbox:0.0.0-pr-456-8cc7935

Version: 0.0.0-pr-456-8cc7935


📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-456-8cc7935 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 22732170705 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-456-8cc7935 cat /container-server/sandbox > sandbox && chmod +x sandbox

Log background stream errors instead of silently swallowing them in
ProcessService. Rename snapshot.unmount event to backup.unmount for
consistency with the domain.operation naming convention. Add method
and pathname as discrete fields in the HTTP logging middleware for
JSON log querying. Preserve original error cause in InterpreterService
deleteContext. Remove unrelated skills-lock.json.
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Mar 5, 2026

All fixes are committed. Here's a summary of what was addressed:

Fixes applied

  1. Silent stream error swallowing (process-service.ts:245): The .catch(() => {}) on continueStreaming now logs the error at debug level with process.streamComplete event name, processId, outcome: 'error', and the error message. This ensures stream failures are diagnosable without cluttering info-level logs.

  2. Inconsistent event naming (backup-service.ts:538-541): Renamed snapshot.unmount to backup.unmount to follow the domain.operation convention used by backup.create and backup.restore.

  3. Removed skills-lock.json: Unrelated tooling config that leaked into the PR.

  4. Missing discrete fields in HTTP middleware (middleware/logging.ts:33-34): Added method and pathname as top-level fields in the wide event object, enabling structured JSON log querying without parsing the message string. Updated the test to assert these new fields.

  5. Lost stack trace in InterpreterService (interpreter-service.ts:230): The catch block now captures the error variable and passes it as { cause: error } to the new Error, preserving the original stack trace chain.

github run

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Mar 9, 2026

Conflicts resolved. Here's a summary of what was merged and how each conflict was handled:

Conflicts resolved (3 files)

1. packages/sandbox-container/src/handlers/pty-ws-handler.ts

  • Main added shell to the destructured ws.data fields (needed for the PTY shell option feature)
  • This branch removed the debug log in favor of the canonical log pattern in onClose
  • Resolution: Keep shell in destructuring from main + keep the canonical log comment from this branch

2. packages/sandbox-container/src/services/git-service.ts

  • Main added redactCredentials import and special-cased exit code 124 (timeout) with credential sanitization in error messages
  • This branch refactored to use canonical log lines pattern with try/catch/finally
  • Resolution: Keep canonical logging pattern + restore the exit code 124 timeout handling and redactCredentials usage for error messages. This fixes 2 tests that verify timeout messages contain "timed out" and that credentials are sanitized.

3. packages/sandbox/src/sandbox.ts

  • Main added a 4th error category: "Unrecognized errors treated as transient (503)" for safer retries
  • This branch used canonical log format (container.startup event name with structured fields)
  • Resolution: Keep main's transient 503 behavior for unrecognized errors + use canonical log style (this.logger.warn('container.startup', { outcome: 'unrecognized_error', ... }))

Verification

  • Typecheck: 17/17 packages pass
  • Lint: clean (1 pre-existing warning)
  • Unit tests: 569 container tests pass, all SDK + shared tests pass

github run

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

🐳 Docker Images Published

Variant Image
Default cloudflare/sandbox:0.0.0-pr-456-320da09
Python cloudflare/sandbox:0.0.0-pr-456-320da09-python
OpenCode cloudflare/sandbox:0.0.0-pr-456-320da09-opencode
Musl cloudflare/sandbox:0.0.0-pr-456-320da09-musl
Desktop cloudflare/sandbox:0.0.0-pr-456-320da09-desktop

Usage:

FROM cloudflare/sandbox:0.0.0-pr-456-320da09

Version: 0.0.0-pr-456-320da09


📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-456-320da09 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 23595389767 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-456-320da09 cat /container-server/sandbox > sandbox && chmod +x sandbox

scuffi
scuffi previously approved these changes Mar 10, 2026
{ port, name }
);

caughtError = error instanceof Error ? error : new Error(String(error));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Not for this PR, but I noticed we have loads of error instanceof Error ? error : new Error(String(error)); and const errorMessage = error instanceof Error ? error.message : 'Unknown error';. Could be nice to have a helper func like asError() which handles all this just for deduplication? (I've also implemented this a couple times in my PR)

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
devin-ai-integration[bot]

This comment was marked as resolved.

whoiskatrin and others added 7 commits March 23, 2026 12:05
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* Add sanitize helpers, relocate redactCredentials

redactCredentials() is a general-purpose string sanitizer with no
git-specific logic. Moved to logger/sanitize.ts alongside new
helpers for presigned URL params and command truncation.

* Anchor sensitive param regex to query strings

Require [?&] prefix so path segments like /api/token/ are not
matched. Stop value matching at URL delimiters to avoid consuming
trailing command arguments. Clamp truncateForLog for tiny maxLen.

* Add OutputMode to replace boolean pretty flag

The logger now supports three output modes: 'structured' emits raw
objects for Workers Logs auto-indexing, 'json-line' emits JSON strings
for container stdout, and 'pretty' emits ANSI-formatted strings for
local dev. Container component always uses json-line regardless of
SANDBOX_LOG_FORMAT. Renames msg to message for Cloudflare dashboard
compatibility, and removes deprecated duration/operation fields from
LogContext.

* Add canonical event helpers for structured logging

buildMessage() constructs scannable one-line messages from event
payloads. logCanonicalEvent() selects log level from outcome and
sanitizes commands before emitting structured context.

* Remove logSuccess/logError from all DO clients

These methods produced unqueryable string blobs in Workers
Observability. The container already logs all operations with
properly structured, queryable canonical events. DO-specific
canonical events (sandbox.exec, backup.*, etc.) are added in
the next task.

* Migrate DO canonical events to logCanonicalEvent()

All DO-side events now emit structured objects with scannable
messages, consistent field names, and automatic credential redaction.

* Fix command outcome semantics: non-zero exitCode is error

The outcome field was hardcoded to 'success' even when commands exited
with non-zero codes (e.g., exitCode 127 for command not found). This
made outcome unreliable for filtering failed commands in logs. Both
exec() and execStream() now derive outcome from the exit code. Log
level selection uses caughtError to distinguish exceptions from normal
non-zero exits.

* Migrate container canonical events to logCanonicalEvent()

Migrates command.exec, command.stream, session.create, session.destroy,
file.read, file.write, file.delete, file.mkdir, port.expose, port.unexpose,
git.clone, and git.checkout to the shared canonical event helper.

Fixes error-at-info-level pattern across all services. Adds stderrPreview
redaction to close credential leak path in failed backup commands.

* Demote HTTP middleware logs to debug, migrate process events

Successful middleware requests (2xx) now log at debug since domain
events already capture every operation at info. Client errors (4xx)
log at warn, server errors (5xx) remain at error. Process start and
exit events now use logCanonicalEvent() for consistent structure.

* Fix double-redaction and improve CanonicalEventPayload type safety

Sanitize command once in logCanonicalEvent, pass to buildMessage.
Add common domain fields (commandId, processId, stdoutLen, etc.) to
the typed interface for autocomplete and typo detection. Document the
priority ordering in buildMessage's if/else chain.

* Auto-derive errorMessage, extend buildMessage context

errorMessage is now resolved from error.message before buildMessage
so error reasons appear in dashboard messages without requiring
every call site to set errorMessage explicitly.

buildMessage extended for backup, git.checkout, and bucket mount
events via backupId, repoPath, and mountPath fields.

* Clean up canonical event fields and add error coverage

Remove r2Key from backup.create (leaks internal R2 layout), replace
mountResults array with scalar counts on sandbox.destroy, and strip
internal PID detection debug fields from command.stream. Wrap
port.expose, port.unexpose, backup.create, backup.restore, and
sandbox.destroy in try/finally so the canonical event fires on both
success and error paths.

* Migrate bucket.mount/unmount to logCanonicalEvent

Replace this.logger.info() calls with logCanonicalEvent() and wrap
both operations in try/finally for error path coverage.

* Type exec state accumulator, migrate backup events, add errorMessage tracking

Eliminates 12 type casts in session.ts by typing the exec state
accumulator. Migrates container backup.create/restore/unmount to
logCanonicalEvent. Adds errorMessage tracking on non-throwing
error paths across container services.

* Fix credential leak in errorMessage and logging gaps

Sanitize errorMessage with redactSensitiveParams before logging to
prevent presigned URL credentials from appearing in log output. Fix
const/let errorMessage shadowing in catch blocks across five container
services so finally blocks see the correct value. Populate stderrPreview
in the exec path. Move DO validation inside try/finally for exposePort,
unexposePort, mountBucketFuse, and unmountBucket so validation errors
get canonical logging. Add path context to backup.create events. Replace
raw logger call with canonical event for process stream errors.

* Fix credential leak in error objects, outcome semantics, and coverage gaps

Sanitizes Error objects (message + stack) before logging to prevent
presigned URLs from appearing in log output. Uses redactCommand for
errorMessage and stderrPreview instead of redactSensitiveParams.
Fixes sandbox.exec outcome to check exitCode for non-throwing
failures. Moves backup validation inside try/finally at both DO
and container layers. Adds errorMessage on non-throwing error
paths in file and git services. Adds canonical event for local
bucket mounts. Removes duplicate backup info log. Includes branch
in git.checkout buildMessage with repoPath.

* Fix git token leak, demote middleware duplicates, clean up stale logs

Apply redactCommand to git URLs in security-service, git-handler, and
sanitizeGitData to cover query-style tokens (?token=...) that
redactCredentials alone missed. Demote all middleware HTTP logs to
debug since canonical events already cover operational failures.
Remove leftover info logs for backup git-check and bucket unmount.
Add mountsProcessed/mountFailures to sandbox.destroy message.

* Close remaining git token leak and duplicate log

Use redactCommand instead of redactCredentials in git clone
error construction so query-param tokens are also stripped.
Remove duplicate handler-level error log for failed clones.
Remove redundant dir field from container backup.create.

* Log command completions at info, not error

Non-zero exit codes from normal shell commands (grep no match, test -d,
diff) are expected behavior, not infrastructure errors. The exitCode
field is queryable for anyone who needs to filter by it. Only actual
exceptions (container unreachable, timeout) produce error-level logs.

* Respect SANDBOX_LOG_FORMAT across all components, forward to container

All components (DO, container, executor) use pretty mode when
SANDBOX_LOG_FORMAT=pretty is set, giving readable terminal output
during local wrangler dev. Without the var, DOs default to structured
and containers/executors default to json-line for their respective
observability pipelines.

Forwards SANDBOX_LOG_FORMAT to the container environment so the
setting propagates. Removes the var from the E2E wrangler template
so CI deployments default to production behavior.

* Fix backup cleanup session ID and version.check log level

Backup catch blocks used a derived session ID instead of the
hardcoded '__sandbox_backup__' from ensureBackupSession(), making
archive cleanup silently fail on backup errors.

Version mismatches during rolling deployments are expected and
should not produce error-level logs. The versionOutcome field
carries the granular status for filtering.

* Demote background process output to debug, fix session noise

Desktop helper stdout/stderr (VNC, X11, xfwm4) demoted from
info/warn to debug — diagnostic output, not operational events.

User CMD process stdout/stderr demoted from info/warn to debug —
user-spawned server access logs shouldn't pollute container logs.

Session "already exists" outcome changed to success — an existing
session is usable, not an error. The 409 API response is preserved.

* Reduce log noise with origin-based severity and centralized policy (#521)

* Use unique backup session IDs to prevent collision

The fixed __sandbox_backup__ name allowed user-created sessions to
collide with internal backup commands, poisoning cwd/env/PATH.
Unique per-operation IDs with cleanup in finally prevent leaks.

* Centralize severity policy in resolveLogLevel

Errors always stay at error. Internal commands demote to debug.
Session lifecycle and file operations demote to debug on success.
successLevel option for special cases like version.check.

* Add origin metadata to execute paths on DO side

Internal commands (backup, bucket mount, env setup) are marked with
origin: 'internal' so resolveLogLevel can demote them to debug.
User commands default to origin: 'user'.

* Propagate origin through container execute paths

Refactors executeInSession to options object. execute-handler reads
origin from request and forwards through SessionManager to Session.
Canonical events include origin for automatic debug demotion.

* Add version.check severity matrix

compatible→debug, unknown→info, mismatch→warn, failed→warn.
Non-compatible cases include versionOutcome in the message.

* Mark container shell-outs as internal origin

File, backup, git, session-manager, and process-service shell-outs
all pass origin: 'internal' so resolveLogLevel demotes them to
debug automatically.

* Add origin to process-client startProcess for completeness

* Test logCanonicalEvent debug/warn dispatch and precedence

* Keep 5xx middleware logs at warn, fix process-client pattern

5xx responses need visibility at the HTTP summary level even when
canonical domain events capture the operation failure — they answer
different questions. 2xx/3xx/4xx stay at debug.

Fix origin field in process-client to use conditional spread
pattern consistent with all other optional fields.

* Fix middleware test expectations for 5xx at warn level

* Remove SANDBOX_LOG_FORMAT from perf wrangler template
@ghostwriternr ghostwriternr enabled auto-merge (squash) March 26, 2026 12:55
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 23 additional findings in Devin Review.

Open in Devin Review

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 HTTP streaming GET requests don't receive defaultHeaders when transport bypasses doFetch

In doStreamFetch HTTP mode for GET requests, the method calls this.doFetch() which correctly merges defaultHeaders. However, the streamHeaders variable (which correctly includes defaultHeaders for the WebSocket path) is computed but never used in the HTTP path. This works correctly today because doFetch applies defaultHeaders automatically.

However, the inconsistency means the HTTP streaming path for GET requests sends a Content-Type: application/json header (hardcoded at base-client.ts:234) even though there's no body. For WebSocket GET streaming, streamHeaders is just this.options.defaultHeaders without Content-Type. While functionally harmless now, the HTTP path for GET streaming should not force Content-Type: application/json when there's no body — it should match the WebSocket path's behavior. This is pre-existing behavior not introduced by this PR, so I'm noting it's non-severe.

(Refers to lines 232-236)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

} finally {
const statusCode = response?.status ?? 500;
const durationMs = Date.now() - startTime;
const isError = statusCode >= 500 || Boolean(requestError);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Unused isError variable in LoggingMiddleware

The variable isError at line 31 is computed (statusCode >= 500 || Boolean(requestError)) but never referenced again. The subsequent log-level decision at line 44 redundantly recomputes statusCode >= 500. This dead code makes the logic harder to reason about — a reader might expect isError to govern the log level selection, but it doesn't.

Suggested change
const isError = statusCode >= 500 || Boolean(requestError);
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@ghostwriternr ghostwriternr merged commit 734f16d into main Mar 26, 2026
23 checks passed
@ghostwriternr ghostwriternr deleted the improve-logging branch March 26, 2026 13:03
@sandy-bonk sandy-bonk bot mentioned this pull request Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants