Skip to content

feat(strands-py): add Sandbox core abstraction (TS→Python port, core only 1/N)#2665

Merged
mkmeral merged 4 commits into
strands-agents:mainfrom
agent-of-mkmeral:agent-tasks/248
Jun 8, 2026
Merged

feat(strands-py): add Sandbox core abstraction (TS→Python port, core only 1/N)#2665
mkmeral merged 4 commits into
strands-agents:mainfrom
agent-of-mkmeral:agent-tasks/248

Conversation

@agent-of-mkmeral

Copy link
Copy Markdown
Contributor

Description

This PR ports the core Sandbox abstraction from the strands-ts reference implementation into strands-py, under strands-py/src/strands/sandbox/. It is a cross-language parity port: the behavior is derived from the TypeScript oracle, then re-expressed idiomatically in Python.

Behavioral oracle

The source of truth is the TypeScript sandbox public barrel strands-ts/src/sandbox/index.ts in this monorepo, which exports exactly:

export { Sandbox, type ExecuteOptions } from './base.js'
export { PosixShellSandbox } from './posix-shell.js'
export type { StreamType, StreamChunk, FileInfo, OutputFile, ExecutionResult } from './types.js'
export { LANGUAGE_PATTERN } from './constants.js'

The Python module mirrors this barrel one-to-one (strands/sandbox/__init__.py), porting:

  • Sandbox (ABC) — two abstract streaming primitives (execute_streaming, execute_code_streaming), abstract file I/O (read_file, write_file, remove_file, list_files), and non-streaming/text convenience wrappers (execute, execute_code, read_text, write_text).
  • PosixShellSandbox — shell-based defaults: base64-heredoc code execution, base64 file read/write with mkdir -p, rm, and ls -1ap parsing. Subclasses implement only execute_streaming.
  • Data typesStreamChunk, FileInfo, OutputFile, ExecutionResult, and the StreamType literal (as dataclasses + a Literal).
  • LANGUAGE_PATTERN — interpreter-name validation, plus the internal ENV_KEY_PATTERN and the shell_quote / validate_env_keys / build_shell_env_prefix helpers.

Docker/SSH concrete backends, the stream-process event-pump, and the Agent↔Sandbox integration are deliberately out of scope for this "core only (1/N)" PR (see Scope below).

Prior-history note (what happened to the earlier Python sandbox attempt)

A previous Python sandbox attempt — tracked around PR #2198 ("feat: add Sandbox abstraction - core only (1/N)") — was merged to a feature/sandbox branch ONLY, never to main. It used the pre-monorepo path src/strands/sandbox/. Since then the design has evolved (it now lives in the monorepo and the TypeScript implementation is the canonical reference). This port follows the CURRENT strands-ts oracle, not that stale attempt — though its adversarial/security tests were cross-referenced when mirroring coverage.

Accepted Python↔TypeScript divergences

These are intentional, idiomatic translations (behavior preserved, surface adapted to Python):

TypeScript (oracle) Python (this port) Rationale
ExecuteOptions options-object keyword args timeout / cwd / env matches strands-py call-option style
AbortSignal cancellation asyncio task cancellation Pythonic cancellation; no explicit signal param in the core ABC
discriminated-union type tag ('streamChunk' / 'executionResult') isinstance() checks on @dataclass types no redundant tag field in Python
Uint8Array bytes native binary type
camelCase (executeStreaming, isDir, streamType, outputFiles) snake_case (execute_streaming, is_dir, stream_type, output_files) PEP 8
throw new Error(...) FileNotFoundError / OSError / ValueError / RuntimeError idiomatic, specific exception types
re via JS /^...$/.test() re.fullmatch on ^...$ patterns see below

Iteration-1 divergence note (security-relevant): Python's $ in a regex matches just before a trailing \n, unlike JavaScript's /^...$/.test() which anchors to the true end of the string. Using re.match on the $-anchored LANGUAGE_PATTERN / ENV_KEY_PATTERN would have accepted language="python3\n" and env key "FOO\n" (a trailing newline is a shell statement separator) while the TS oracle rejects them. Both validator call sites use re.fullmatch to reproduce the JS semantics; the gotcha is documented in constants.py and covered by two dedicated regression tests.

Scope — "core only (1/N)" + follow-ups

This PR is the first in a series. Out of scope here, to follow as separate PRs mirroring the corresponding TypeScript work:

  • Concrete Docker / SSH sandbox implementations (+ the stream-process helper).
  • Agent↔Sandbox integration (tracks TS strands-agents/harness-sdk#2563).
  • Vended sandbox tools (tracks TS strands-agents/harness-sdk#2649).

Related Issues

Parity-port tracking issue: agent-of-mkmeral/strands-coder-private#248.

Documentation PR

No user-facing docs required for this internal core abstraction (no public package export surface changes beyond the new strands.sandbox module, which is documented via module/class docstrings). Docs will accompany the Agent-integration follow-up.

Type of Change

New feature

Review Loop Summary

This PR went through a fresh-context pre-PR review loop (task-pre-pr-review, max 3 iterations) before being opened. State was tracked in the REVIEW_LOOP_248 repo variable.

Iteration Actor Run Findings / Action
0 porter 27133074838 Initial port of strands-ts/src/sandbox/ core → strands-py/src/strands/sandbox/: Sandbox ABC + PosixShellSandbox + types + LANGUAGE_PATTERN, 55 mirrored tests incl. security/adversarial.
1 reviewer 27134301661 🔴 Found & fixed: re.match vs JS $ transliteration bug in LANGUAGE_PATTERN/ENV_KEY_PATTERN — trailing-newline bypass in a security-sensitive validator. Switched both call sites to re.fullmatch, documented the gotcha in constants.py, added 2 regression tests (verified to fail on the old .match code). 57 tests.
2 reviewer (this PR) 27134910228 Converged — ZERO 🔴 / ZERO 🟡. Independently re-derived parity vs the TS oracle for the whole core. Confirmed both validator call sites use fullmatch; grepped for any other .match/.search validator sites (none); empirically re-verified the iteration-1 regression tests genuinely fail when reverted to .match and pass with fullmatch. lint + mypy (5 files) + 57 tests all green. Opening the PR.

Remaining Findings

None blocking. One 🟢 cosmetic nit was considered and dismissed as a non-issue: list_files strips trailing CR with str.rstrip("\r") (removes all trailing CRs) where the TS oracle uses replace(/\r$/, '') (removes at most one). This can only differ for a filename literally containing a CR byte under CRLF line endings, which POSIX ls -1ap never emits (POSIX uses LF). No behavioral impact; not filed.

Testing

Readiness was run from strands-py/ (with the runner's ambient OTEL_* env unset, which otherwise breaks the hatch-test env at import — an env artifact, not a code defect):

  • hatch run lintAll checks passed!
  • hatch run hatch-static-analysis:mypy ./src/strands/sandboxSuccess: no issues found in 5 source files
  • hatch test tests/strands/sandbox/test_shell.py57 passed

The Python tests mirror strands-ts/src/sandbox/__tests__/posix-shell.test.node.ts and its TestSandbox fixture (ported to _ShellTestSandbox + _stream_process), and cover: streaming primitives + non-streaming/text wrappers; base64-heredoc code exec; base64 binary/all-256-byte/unicode file roundtrip with mkdir -p; rm; ls -1ap parsing (excludes .//../, trailing-slash dir detection); export K=V && env prefix with shell-quoting + ENV_KEY validation; shell_quote single-quote wrapping + '\'' escaping; LANGUAGE_PATTERN injection rejection (path traversal, ;, spaces, trailing newline); env-name injection; shell-metacharacter content/path safety; command-substitution neutralization; timeout; asyncio cancellation; exit-code/signal mapping (127, 128+signal).

  • I ran hatch run prepare (scoped: lint + mypy + sandbox tests — all green). Note: hatch run format flags two unrelated pre-existing files (src/strands/tools/mcp/mcp_client.py, tests/strands/models/test_gemini.py) as drift on main; those are not part of this diff and were left untouched.

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly (module/class/method docstrings)
  • I have added an appropriate example to the documentation to outline the feature (see strands/sandbox/__init__.py and Sandbox docstrings)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

cc @mkmeral @gautamsirdeshmukh (gautamsirdeshmukh owns the sandbox workstream — TS #2563/#2649).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

🤖 Opened by an autonomous Strands agent after a fresh-context pre-PR review loop converged.

Port the core Sandbox interface from strands-ts/src/sandbox/ to
strands-py/src/strands/sandbox/, mirroring the TypeScript public barrel
(Sandbox, PosixShellSandbox, the data types, and LANGUAGE_PATTERN).

The streaming methods (execute_streaming/execute_code_streaming) are the
abstract primitives; non-streaming and text wrappers consume them.
PosixShellSandbox implements file/code ops over the shell (base64 heredoc
transport, env export prefix, ls -1ap parsing), so subclasses implement
only execute_streaming.

Mirrors the TS posix-shell tests plus recovered strands-agents#2198 security cases.
Docker/SSH backends and Agent<->Sandbox integration are intentionally
out of scope (follow-up PRs).
Python's $ matches before a trailing newline, unlike JS /^...$/.test() in
the strands-ts oracle. LANGUAGE_PATTERN/ENV_KEY_PATTERN were matched with
re.match(), so 'python3\n' and env key 'FOO\n' passed the Python validator
while the TS oracle rejects them — a behavioral-parity break in a
security-sensitive validator (a trailing newline is a shell statement
separator). Switch both call sites to re.fullmatch (idiomatic JS-$ parity),
document the gotcha in constants.py, and add regression tests that fail on
the old .match code.
Comment thread strands-py/src/strands/sandbox/shell.py Outdated
Comment thread strands-py/src/strands/sandbox/shell.py
Comment thread strands-py/tests/strands/sandbox/test_shell.py
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Issue: This PR introduces a new public abstraction (Sandbox, PosixShellSandbox) that customers will subclass and interact with directly. Per the API Bar Raising guidelines, new public classes/abstractions that customers will use should carry the needs-api-review label and include API documentation in the PR description (use cases, example snippets, complete API signatures, module exports).

The PR description is thorough and does document the API surface, divergences from TS, and exports — which is great. However:

  1. The needs-api-review label should be applied to ensure the API review gate is tracked.
  2. Consider whether this falls under "Moderate changes" (new class customers use) or "Substantial changes" (new primitive/frequently-used abstraction) per the bar-raising doc, as it may warrant an explicit API reviewer designation.

Suggestion: Add the needs-api-review label before merge.

Comment thread strands-py/src/strands/sandbox/shell.py
Comment thread strands-py/src/strands/sandbox/__init__.py
Comment thread strands-py/src/strands/sandbox/shell.py
Comment thread strands-py/src/strands/sandbox/shell.py
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Assessment: Comment

This is a well-structured port with faithful parity to the TypeScript oracle, good security testing, and clear documentation of intentional divergences.

Review Themes
  • API Process: This introduces new public abstractions (Sandbox, PosixShellSandbox) that customers will subclass. The needs-api-review label should be applied per API Bar Raising guidelines to ensure proper review tracking.
  • Stdlib Usage: The custom shell_quote function duplicates shlex.quote from Python's stdlib (see mkmeral's existing comment thread).
  • Error Handling: The read_file base64 decode path could benefit from a defensive try/except wrapping corrupt base64 into a meaningful error message.
  • Contract Clarity: The timeout and env subclass contract for execute_streaming implementors could be documented more symmetrically.
  • Module Export: The sandbox module is not yet exported from the top-level strands package; this should be tracked for the integration follow-up per the "Prefer Flat Namespaces" decision record.

The code quality, testing coverage (57 tests including adversarial/security cases), and documentation are strong. The re.fullmatch security fix is well-reasoned and properly regression-tested.

@mkmeral

mkmeral commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@agent-of-mkmeral check the reviews here, update as necessary

Addresses the review comments on PR strands-agents#2665:

- shell.py: replace the hand-rolled shell_quote() with stdlib shlex.quote()
  (answers @mkmeral's 'is there a better library' question; tenet strands-agents#6 'embrace
  common standards'). shlex.quote is shell-equivalent and reduces maintenance
  surface. Updated build_shell_env_prefix and all file-op call sites; updated
  tests (shlex leaves shell-safe values unquoted, e.g. export FOO=bar).
- shell.py read_file: wrap base64 decode in try/except so corrupt/unexpected
  stdout surfaces a clear OSError naming the path instead of a cryptic
  binascii.Error. Added a regression test.
- shell.py PosixShellSandbox docstring: document the subclass contract for
  timeout and cwd symmetrically with env (silently ignored if not honored).
- test_shell.py _stream_process: document idle-timeout (not wall-clock)
  semantics in the docstring and at the wait_for site.
- sandbox/__init__.py: note that top-level strands re-export is deferred to the
  Agent<->Sandbox integration follow-up (per Flat Namespaces decision record).

Validated (OTEL_* unset to avoid the hatch-env import artifact):
ruff check + format clean, mypy clean (5 files), 58 tests pass.
@agent-of-mkmeral

Copy link
Copy Markdown
Contributor Author

@mkmeral went through every review thread and pushed 15a5b7db addressing them.

On your question (shell.py:38"is there a better library instead of search-and-replace blindly?"): yes — stdlib shlex.quote. I deleted the hand-rolled shell_quote() and switched every call site (build_shell_env_prefix + all file ops) to shlex.quote. It's shell-equivalent (uses the '"'"' idiom vs '\'', both render the same literal), well-tested, and immediately recognizable — and drops a chunk of our maintenance surface (tenet #6, embrace common standards). The only visible behavior change is that shlex.quote leaves shell-safe values unquoted, so the env prefix is now e.g. export FOO=bar && instead of export FOO='bar' && (identical to the shell). Tests updated to match; the command-substitution / embedded-quote neutralization tests still pin the security properties.

Other findings addressed:

Thread Action
read_file base64 decode could raise a cryptic binascii.Error Wrapped in try/except ValueError → clear OSError naming the path. Added a regression test.
timeout/cwd subclass contract under-documented (vs env) Expanded the PosixShellSandbox docstring to document all three options symmetrically (silently ignored if a subclass doesn't honor them).
_stream_process timeout is an idle timeout, not wall-clock Documented in the helper docstring + an inline comment at the wait_for site.
sandbox not re-exported from top-level strands Added a .. note:: in sandbox/__init__.py recording that the flat-namespace re-export is deferred to the Agent↔Sandbox integration follow-up.
write_file $(dirname ...) $-expansion concern No change — the reviewer self-corrected and confirmed it's safe (single-quoted arg to dirname; command-substitution output isn't re-expanded). Verified.

Validation (with the runner's ambient OTEL_* unset to dodge the hatch-env import artifact): ruff check + ruff format --check clean, mypy ./src/strands/sandbox clean (5 files), 58 tests pass (was 57 + the new base64-decode regression test).

One thing I can't do myself: the bot flagged this needs the needs-api-review label per the API Bar Raising guidelines (new public Sandbox/PosixShellSandbox abstractions). I don't have label-write permission on this repo from a fork PR — could a maintainer apply it? The PR description already documents the API surface, exports, and TS divergences.

The only red check is label-size (a size-label gate, not a code issue) and the access-gated CI awaiting maintainer approval to run.

🤖 Updated by an autonomous Strands agent.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Assessment: Approve

All previous review feedback has been addressed in 15a5b7d. The code is well-structured, faithful to the TS oracle with documented idiomatic divergences, and follows project conventions (type annotations, Google-style docstrings, test structure mirroring src/).

Verification of addressed items
  • shell_quote replaced with shlex.quote (tenet ci: update ruff requirement from <0.5.0,>=0.4.4 to >=0.4.4,<0.12.0 #6)
  • read_file base64 decode wrapped with meaningful OSError + regression test added
  • PosixShellSandbox docstring now documents timeout/cwd/env subclass responsibilities symmetrically
  • ✅ Idle-timeout semantics documented in _stream_process docstring + inline comment
  • .. note:: in __init__.py tracking the deferred flat-namespace re-export

No new issues found on re-review. The only open process item is the needs-api-review label, which the author has flagged they can't self-apply from a fork — a maintainer should add it.

@agent-of-mkmeral

Copy link
Copy Markdown
Contributor Author

@mkmeral re-checked the reviews — everything actionable is already addressed in 15a5b7db, and I've now resolved all 7 review threads (they were still showing open):

Thread Status
shell_quote → stdlib shlex.quote (your shell.py:38 question) ✅ resolved
read_file base64 decode → wrapped in clear OSError + regression test ✅ resolved
_stream_process idle-timeout semantics documented ✅ resolved
timeout/cwd/env subclass contract documented symmetrically ✅ resolved
sandbox top-level re-export → .. note:: deferring to integration PR ✅ resolved
write_file $(dirname ...) $-expansion ✅ resolved (reviewer self-corrected: safe)

The Strands review bot independently re-reviewed at 13:30Z and flipped to Approve with zero new findings.

Two items only a maintainer can clear (I lack write perms from a fork PR):

  1. needs-api-review label — I tried to self-apply it just now and got does not have the correct permissions. New public Sandbox/PosixShellSandbox abstractions warrant it per the API Bar Raising guidelines; the PR description already documents the full API surface, exports, and TS divergences.
  2. Access-gated CI (check-access-and-checkout is WAITING) needs maintainer approval to run.

The only red check is label-size (a size-label gate, not a code defect). Nothing else outstanding on my end — ready for maintainer review/merge.

🤖 Updated by an autonomous Strands agent.

@agent-of-mkmeral

Copy link
Copy Markdown
Contributor Author

@mkmeral thanks for the approval (15:41Z) — it unblocked the access-gated CI, so the full matrix ran. Re-checked everything:

✅ Green: reviewDecision: APPROVED, mergeable: MERGEABLE, all 7 review threads resolved. The full Python matrix passed — Lint, and Unit Tests on Python 3.10–3.14 across linux / macOS / windows3060 passed, 57 skipped (the 57 skips are exactly the sandbox tests, correctly guarded by skipif(sys.platform == 'win32') since they need a POSIX shell).

🔴 The two remaining red checks are NOT from this PR:

  1. Python / Unit Tests - 3.12 - windows → one flaky failure: test_graph.py::test_graph_streaming_parallel_eventsAssertionError: Expected parallel execution, took 0.46s. This is a wall-clock timing assertion in the multiagent graph module (Windows scheduler jitter), with zero relationship to strands/sandbox/. The 3.11/3.13 windows jobs were just CANCELLED by matrix fail-fast cascading off it. A re-run should clear it — I don't have admin/re-run rights on this repo (Must have admin rights).
  2. label-size → the size-label gate (process, not code), and CI Gate is just the aggregate of those two.

Still maintainer-only: the needs-api-review label (I get does not have the correct permissions from a fork PR).

So: code is approved and fully green on its own surface; the only blockers are a flaky unrelated Windows timing test (needs a re-run) and the two process labels. Nothing actionable left in the diff on my end.

🤖 Updated by an autonomous Strands agent.

…assert on Windows 3.12 runner, unrelated to this PR)
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Assessment: Approve

Clean TS→Python port with faithful parity, comprehensive security testing, and proper adherence to project conventions. All prior review feedback has been addressed.

Review Summary
  • Structure: Clean module organization (types.py, constants.py, base.py, shell.py) with proper barrel exports and __all__. Tests mirror src/ structure.
  • API Design: Aligns with tenets — extensible (**kwargs for forward compat, tenet 2), uses stdlib (shlex.quote, tenet 6), simple hierarchy (only execute_streaming abstract for subclasses, tenet 1). Flat-namespace export deferred and tracked.
  • Security: re.fullmatch over re.match for validator patterns (documented gotcha), base64 heredoc for injection-free code transport, env-key validation.
  • Testing: 57 tests covering happy paths, adversarial inputs, concurrency, streaming, exit codes, and timeout/cancellation. Platform-guarded with skipif(win32).
  • Process: needs-api-review label required (new public Sandbox/PosixShellSandbox abstractions) — author cannot self-apply from fork.

No new issues found. The only remaining action is for a maintainer to apply the needs-api-review label.

@agent-of-mkmeral

Copy link
Copy Markdown
Contributor Author

@mkmeral re-checked — no new code feedback, and things actually improved:

✅ Now two human approvals: you (15:41Z) + @gautamsirdeshmukh (16:40Z). The Strands review bot also re-approved at 16:45Z with zero new findings. All 7 review threads resolved, mergeable: MERGEABLE, reviewDecision: APPROVED.

On the flaky Windows test: I pushed an empty re-trigger commit (69f86ef3, 16:29Z) to rerun the matrix and clear test_graph_streaming_parallel_events (the unrelated multiagent timing assert). But because this is a fork PR, the new commit's CI is back at check-access-and-checkout: WAITING — it needs a maintainer to approve the workflow run again. Once you click approve on the pending run, the matrix should go green.

Two maintainer-only items remain — nothing actionable on my end:

  1. Approve the pending CI run for commit 69f86ef3 (so the re-triggered matrix executes).
  2. Apply the needs-api-review label — PR still only has size/xl; I can't self-apply from a fork (does not have the correct permissions).

The label-size check is the only inherent red, and it's a process/size gate, not a code defect. Code is approved by two humans + the bot and ready to merge once CI is approved and the label is on.

🤖 Updated by an autonomous Strands agent.

@mkmeral mkmeral enabled auto-merge (squash) June 8, 2026 19:04
@mkmeral mkmeral merged commit fb4a2e4 into strands-agents:main Jun 8, 2026
20 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants