feat(strands-py): add Sandbox core abstraction (TS→Python port, core only 1/N) by agent-of-mkmeral · Pull Request #2665 · strands-agents/harness-sdk

agent-of-mkmeral · 2026-06-08T11:46:35Z

Description

This PR ports the core Sandbox abstraction from the strands-ts reference implementation into strands-py, under strands-py/src/strands/sandbox/. It is a cross-language parity port: the behavior is derived from the TypeScript oracle, then re-expressed idiomatically in Python.

Behavioral oracle

The source of truth is the TypeScript sandbox public barrel strands-ts/src/sandbox/index.ts in this monorepo, which exports exactly:

export { Sandbox, type ExecuteOptions } from './base.js'
export { PosixShellSandbox } from './posix-shell.js'
export type { StreamType, StreamChunk, FileInfo, OutputFile, ExecutionResult } from './types.js'
export { LANGUAGE_PATTERN } from './constants.js'

The Python module mirrors this barrel one-to-one (strands/sandbox/__init__.py), porting:

Sandbox (ABC) — two abstract streaming primitives (execute_streaming, execute_code_streaming), abstract file I/O (read_file, write_file, remove_file, list_files), and non-streaming/text convenience wrappers (execute, execute_code, read_text, write_text).
PosixShellSandbox — shell-based defaults: base64-heredoc code execution, base64 file read/write with mkdir -p, rm, and ls -1ap parsing. Subclasses implement only execute_streaming.
Data types — StreamChunk, FileInfo, OutputFile, ExecutionResult, and the StreamType literal (as dataclasses + a Literal).
LANGUAGE_PATTERN — interpreter-name validation, plus the internal ENV_KEY_PATTERN and the shell_quote / validate_env_keys / build_shell_env_prefix helpers.

Docker/SSH concrete backends, the stream-process event-pump, and the Agent↔Sandbox integration are deliberately out of scope for this "core only (1/N)" PR (see Scope below).

Prior-history note (what happened to the earlier Python sandbox attempt)

A previous Python sandbox attempt — tracked around PR #2198 ("feat: add Sandbox abstraction - core only (1/N)") — was merged to a feature/sandbox branch ONLY, never to main. It used the pre-monorepo path src/strands/sandbox/. Since then the design has evolved (it now lives in the monorepo and the TypeScript implementation is the canonical reference). This port follows the CURRENT strands-ts oracle, not that stale attempt — though its adversarial/security tests were cross-referenced when mirroring coverage.

Accepted Python↔TypeScript divergences

These are intentional, idiomatic translations (behavior preserved, surface adapted to Python):

TypeScript (oracle)	Python (this port)	Rationale
`ExecuteOptions` options-object	keyword args `timeout` / `cwd` / `env`	matches `strands-py` call-option style
`AbortSignal` cancellation	`asyncio` task cancellation	Pythonic cancellation; no explicit `signal` param in the core ABC
discriminated-union `type` tag (`'streamChunk'` / `'executionResult'`)	`isinstance()` checks on `@dataclass` types	no redundant tag field in Python
`Uint8Array`	`bytes`	native binary type
camelCase (`executeStreaming`, `isDir`, `streamType`, `outputFiles`)	snake_case (`execute_streaming`, `is_dir`, `stream_type`, `output_files`)	PEP 8
`throw new Error(...)`	`FileNotFoundError` / `OSError` / `ValueError` / `RuntimeError`	idiomatic, specific exception types
`re` via JS `/^...$/.test()`	`re.fullmatch` on `^...$` patterns	see below

Iteration-1 divergence note (security-relevant): Python's $ in a regex matches just before a trailing \n, unlike JavaScript's /^...$/.test() which anchors to the true end of the string. Using re.match on the $-anchored LANGUAGE_PATTERN / ENV_KEY_PATTERN would have accepted language="python3\n" and env key "FOO\n" (a trailing newline is a shell statement separator) while the TS oracle rejects them. Both validator call sites use re.fullmatch to reproduce the JS semantics; the gotcha is documented in constants.py and covered by two dedicated regression tests.

Scope — "core only (1/N)" + follow-ups

This PR is the first in a series. Out of scope here, to follow as separate PRs mirroring the corresponding TypeScript work:

Concrete Docker / SSH sandbox implementations (+ the stream-process helper).
Agent↔Sandbox integration (tracks TS strands-agents/harness-sdk#2563).
Vended sandbox tools (tracks TS strands-agents/harness-sdk#2649).

Related Issues

Parity-port tracking issue: agent-of-mkmeral/strands-coder-private#248.

Documentation PR

No user-facing docs required for this internal core abstraction (no public package export surface changes beyond the new strands.sandbox module, which is documented via module/class docstrings). Docs will accompany the Agent-integration follow-up.

Type of Change

New feature

Review Loop Summary

This PR went through a fresh-context pre-PR review loop (task-pre-pr-review, max 3 iterations) before being opened. State was tracked in the REVIEW_LOOP_248 repo variable.

Iteration	Actor	Run	Findings / Action
0	porter	27133074838	Initial port of `strands-ts/src/sandbox/` core → `strands-py/src/strands/sandbox/`: `Sandbox` ABC + `PosixShellSandbox` + types + `LANGUAGE_PATTERN`, 55 mirrored tests incl. security/adversarial.
1	reviewer	27134301661	🔴 Found & fixed: `re.match` vs JS `$` transliteration bug in `LANGUAGE_PATTERN`/`ENV_KEY_PATTERN` — trailing-newline bypass in a security-sensitive validator. Switched both call sites to `re.fullmatch`, documented the gotcha in `constants.py`, added 2 regression tests (verified to fail on the old `.match` code). 57 tests.
2	reviewer (this PR)	27134910228	Converged — ZERO 🔴 / ZERO 🟡. Independently re-derived parity vs the TS oracle for the whole core. Confirmed both validator call sites use `fullmatch`; grepped for any other `.match`/`.search` validator sites (none); empirically re-verified the iteration-1 regression tests genuinely fail when reverted to `.match` and pass with `fullmatch`. lint + mypy (5 files) + 57 tests all green. Opening the PR.

Remaining Findings

None blocking. One 🟢 cosmetic nit was considered and dismissed as a non-issue: list_files strips trailing CR with str.rstrip("\r") (removes all trailing CRs) where the TS oracle uses replace(/\r$/, '') (removes at most one). This can only differ for a filename literally containing a CR byte under CRLF line endings, which POSIX ls -1ap never emits (POSIX uses LF). No behavioral impact; not filed.

Testing

Readiness was run from strands-py/ (with the runner's ambient OTEL_* env unset, which otherwise breaks the hatch-test env at import — an env artifact, not a code defect):

hatch run lint → All checks passed!
hatch run hatch-static-analysis:mypy ./src/strands/sandbox → Success: no issues found in 5 source files
hatch test tests/strands/sandbox/test_shell.py → 57 passed

The Python tests mirror strands-ts/src/sandbox/__tests__/posix-shell.test.node.ts and its TestSandbox fixture (ported to _ShellTestSandbox + _stream_process), and cover: streaming primitives + non-streaming/text wrappers; base64-heredoc code exec; base64 binary/all-256-byte/unicode file roundtrip with mkdir -p; rm; ls -1ap parsing (excludes .//../, trailing-slash dir detection); export K=V && env prefix with shell-quoting + ENV_KEY validation; shell_quote single-quote wrapping + '\'' escaping; LANGUAGE_PATTERN injection rejection (path traversal, ;, spaces, trailing newline); env-name injection; shell-metacharacter content/path safety; command-substitution neutralization; timeout; asyncio cancellation; exit-code/signal mapping (127, 128+signal).

I ran hatch run prepare (scoped: lint + mypy + sandbox tests — all green). Note: hatch run format flags two unrelated pre-existing files (src/strands/tools/mcp/mcp_client.py, tests/strands/models/test_gemini.py) as drift on main; those are not part of this diff and were left untouched.

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly (module/class/method docstrings)
I have added an appropriate example to the documentation to outline the feature (see strands/sandbox/__init__.py and Sandbox docstrings)
My changes generate no new warnings
Any dependent changes have been merged and published

cc @mkmeral @gautamsirdeshmukh (gautamsirdeshmukh owns the sandbox workstream — TS #2563/#2649).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

🤖 Opened by an autonomous Strands agent after a fresh-context pre-PR review loop converged.

Port the core Sandbox interface from strands-ts/src/sandbox/ to strands-py/src/strands/sandbox/, mirroring the TypeScript public barrel (Sandbox, PosixShellSandbox, the data types, and LANGUAGE_PATTERN). The streaming methods (execute_streaming/execute_code_streaming) are the abstract primitives; non-streaming and text wrappers consume them. PosixShellSandbox implements file/code ops over the shell (base64 heredoc transport, env export prefix, ls -1ap parsing), so subclasses implement only execute_streaming. Mirrors the TS posix-shell tests plus recovered strands-agents#2198 security cases. Docker/SSH backends and Agent<->Sandbox integration are intentionally out of scope (follow-up PRs).

Python's $ matches before a trailing newline, unlike JS /^...$/.test() in the strands-ts oracle. LANGUAGE_PATTERN/ENV_KEY_PATTERN were matched with re.match(), so 'python3\n' and env key 'FOO\n' passed the Python validator while the TS oracle rejects them — a behavioral-parity break in a security-sensitive validator (a trailing newline is a shell statement separator). Switch both call sites to re.fullmatch (idiomatic JS-$ parity), document the gotcha in constants.py, and add regression tests that fail on the old .match code.

github-actions · 2026-06-08T13:09:53Z

Issue: This PR introduces a new public abstraction (Sandbox, PosixShellSandbox) that customers will subclass and interact with directly. Per the API Bar Raising guidelines, new public classes/abstractions that customers will use should carry the needs-api-review label and include API documentation in the PR description (use cases, example snippets, complete API signatures, module exports).

The PR description is thorough and does document the API surface, divergences from TS, and exports — which is great. However:

The needs-api-review label should be applied to ensure the API review gate is tracked.
Consider whether this falls under "Moderate changes" (new class customers use) or "Substantial changes" (new primitive/frequently-used abstraction) per the bar-raising doc, as it may warrant an explicit API reviewer designation.

Suggestion: Add the needs-api-review label before merge.

github-actions · 2026-06-08T13:10:02Z

Assessment: Comment

This is a well-structured port with faithful parity to the TypeScript oracle, good security testing, and clear documentation of intentional divergences.

Review Themes

API Process: This introduces new public abstractions (Sandbox, PosixShellSandbox) that customers will subclass. The needs-api-review label should be applied per API Bar Raising guidelines to ensure proper review tracking.
Stdlib Usage: The custom shell_quote function duplicates shlex.quote from Python's stdlib (see mkmeral's existing comment thread).
Error Handling: The read_file base64 decode path could benefit from a defensive try/except wrapping corrupt base64 into a meaningful error message.
Contract Clarity: The timeout and env subclass contract for execute_streaming implementors could be documented more symmetrically.
Module Export: The sandbox module is not yet exported from the top-level strands package; this should be tracked for the integration follow-up per the "Prefer Flat Namespaces" decision record.

The code quality, testing coverage (57 tests including adversarial/security cases), and documentation are strong. The re.fullmatch security fix is well-reasoned and properly regression-tested.

mkmeral · 2026-06-08T13:11:04Z

@agent-of-mkmeral check the reviews here, update as necessary

@mkmeral

Addresses the review comments on PR strands-agents#2665: - shell.py: replace the hand-rolled shell_quote() with stdlib shlex.quote() (answers @mkmeral's 'is there a better library' question; tenet strands-agents#6 'embrace common standards'). shlex.quote is shell-equivalent and reduces maintenance surface. Updated build_shell_env_prefix and all file-op call sites; updated tests (shlex leaves shell-safe values unquoted, e.g. export FOO=bar). - shell.py read_file: wrap base64 decode in try/except so corrupt/unexpected stdout surfaces a clear OSError naming the path instead of a cryptic binascii.Error. Added a regression test. - shell.py PosixShellSandbox docstring: document the subclass contract for timeout and cwd symmetrically with env (silently ignored if not honored). - test_shell.py _stream_process: document idle-timeout (not wall-clock) semantics in the docstring and at the wait_for site. - sandbox/__init__.py: note that top-level strands re-export is deferred to the Agent<->Sandbox integration follow-up (per Flat Namespaces decision record). Validated (OTEL_* unset to avoid the hatch-env import artifact): ruff check + format clean, mypy clean (5 files), 58 tests pass.

agent-of-mkmeral · 2026-06-08T13:21:04Z

@mkmeral went through every review thread and pushed 15a5b7db addressing them.

On your question (shell.py:38 — "is there a better library instead of search-and-replace blindly?"): yes — stdlib shlex.quote. I deleted the hand-rolled shell_quote() and switched every call site (build_shell_env_prefix + all file ops) to shlex.quote. It's shell-equivalent (uses the '"'"' idiom vs '\'', both render the same literal), well-tested, and immediately recognizable — and drops a chunk of our maintenance surface (tenet #6, embrace common standards). The only visible behavior change is that shlex.quote leaves shell-safe values unquoted, so the env prefix is now e.g. export FOO=bar && instead of export FOO='bar' && (identical to the shell). Tests updated to match; the command-substitution / embedded-quote neutralization tests still pin the security properties.

Other findings addressed:

Thread	Action
`read_file` base64 decode could raise a cryptic `binascii.Error`	Wrapped in `try/except ValueError` → clear `OSError` naming the path. Added a regression test.
`timeout`/`cwd` subclass contract under-documented (vs `env`)	Expanded the `PosixShellSandbox` docstring to document all three options symmetrically (silently ignored if a subclass doesn't honor them).
`_stream_process` timeout is an idle timeout, not wall-clock	Documented in the helper docstring + an inline comment at the `wait_for` site.
`sandbox` not re-exported from top-level `strands`	Added a `.. note::` in `sandbox/__init__.py` recording that the flat-namespace re-export is deferred to the Agent↔Sandbox integration follow-up.
`write_file` `$(dirname ...)` `$`-expansion concern	No change — the reviewer self-corrected and confirmed it's safe (single-quoted arg to `dirname`; command-substitution output isn't re-expanded). Verified.

Validation (with the runner's ambient OTEL_* unset to dodge the hatch-env import artifact): ruff check + ruff format --check clean, mypy ./src/strands/sandbox clean (5 files), 58 tests pass (was 57 + the new base64-decode regression test).

One thing I can't do myself: the bot flagged this needs the needs-api-review label per the API Bar Raising guidelines (new public Sandbox/PosixShellSandbox abstractions). I don't have label-write permission on this repo from a fork PR — could a maintainer apply it? The PR description already documents the API surface, exports, and TS divergences.

The only red check is label-size (a size-label gate, not a code issue) and the access-gated CI awaiting maintainer approval to run.

🤖 Updated by an autonomous Strands agent.

github-actions · 2026-06-08T13:30:38Z

Assessment: Approve

All previous review feedback has been addressed in 15a5b7d. The code is well-structured, faithful to the TS oracle with documented idiomatic divergences, and follows project conventions (type annotations, Google-style docstrings, test structure mirroring src/).

Verification of addressed items

✅ shell_quote replaced with shlex.quote (tenet ci: update ruff requirement from <0.5.0,>=0.4.4 to >=0.4.4,<0.12.0 #6)
✅ read_file base64 decode wrapped with meaningful OSError + regression test added
✅ PosixShellSandbox docstring now documents timeout/cwd/env subclass responsibilities symmetrically
✅ Idle-timeout semantics documented in _stream_process docstring + inline comment
✅ .. note:: in __init__.py tracking the deferred flat-namespace re-export

No new issues found on re-review. The only open process item is the needs-api-review label, which the author has flagged they can't self-apply from a fork — a maintainer should add it.

agent-of-mkmeral · 2026-06-08T14:10:32Z

@mkmeral re-checked the reviews — everything actionable is already addressed in 15a5b7db, and I've now resolved all 7 review threads (they were still showing open):

Thread	Status
`shell_quote` → stdlib `shlex.quote` (your `shell.py:38` question)	✅ resolved
`read_file` base64 decode → wrapped in clear `OSError` + regression test	✅ resolved
`_stream_process` idle-timeout semantics documented	✅ resolved
`timeout`/`cwd`/`env` subclass contract documented symmetrically	✅ resolved
`sandbox` top-level re-export → `.. note::` deferring to integration PR	✅ resolved
`write_file` `$(dirname ...)` `$`-expansion	✅ resolved (reviewer self-corrected: safe)

The Strands review bot independently re-reviewed at 13:30Z and flipped to Approve with zero new findings.

Two items only a maintainer can clear (I lack write perms from a fork PR):

needs-api-review label — I tried to self-apply it just now and got does not have the correct permissions. New public Sandbox/PosixShellSandbox abstractions warrant it per the API Bar Raising guidelines; the PR description already documents the full API surface, exports, and TS divergences.
Access-gated CI (check-access-and-checkout is WAITING) needs maintainer approval to run.

The only red check is label-size (a size-label gate, not a code defect). Nothing else outstanding on my end — ready for maintainer review/merge.

🤖 Updated by an autonomous Strands agent.

agent-of-mkmeral · 2026-06-08T16:29:34Z

@mkmeral thanks for the approval (15:41Z) — it unblocked the access-gated CI, so the full matrix ran. Re-checked everything:

✅ Green: reviewDecision: APPROVED, mergeable: MERGEABLE, all 7 review threads resolved. The full Python matrix passed — Lint, and Unit Tests on Python 3.10–3.14 across linux / macOS / windows — 3060 passed, 57 skipped (the 57 skips are exactly the sandbox tests, correctly guarded by skipif(sys.platform == 'win32') since they need a POSIX shell).

🔴 The two remaining red checks are NOT from this PR:

Python / Unit Tests - 3.12 - windows → one flaky failure: test_graph.py::test_graph_streaming_parallel_events — AssertionError: Expected parallel execution, took 0.46s. This is a wall-clock timing assertion in the multiagent graph module (Windows scheduler jitter), with zero relationship to strands/sandbox/. The 3.11/3.13 windows jobs were just CANCELLED by matrix fail-fast cascading off it. A re-run should clear it — I don't have admin/re-run rights on this repo (Must have admin rights).
label-size → the size-label gate (process, not code), and CI Gate is just the aggregate of those two.

Still maintainer-only: the needs-api-review label (I get does not have the correct permissions from a fork PR).

So: code is approved and fully green on its own surface; the only blockers are a flaky unrelated Windows timing test (needs a re-run) and the two process labels. Nothing actionable left in the diff on my end.

🤖 Updated by an autonomous Strands agent.

…assert on Windows 3.12 runner, unrelated to this PR)

github-actions · 2026-06-08T16:45:19Z

Assessment: Approve

Clean TS→Python port with faithful parity, comprehensive security testing, and proper adherence to project conventions. All prior review feedback has been addressed.

Review Summary

Structure: Clean module organization (types.py, constants.py, base.py, shell.py) with proper barrel exports and __all__. Tests mirror src/ structure.
API Design: Aligns with tenets — extensible (**kwargs for forward compat, tenet 2), uses stdlib (shlex.quote, tenet 6), simple hierarchy (only execute_streaming abstract for subclasses, tenet 1). Flat-namespace export deferred and tracked.
Security: re.fullmatch over re.match for validator patterns (documented gotcha), base64 heredoc for injection-free code transport, env-key validation.
Testing: 57 tests covering happy paths, adversarial inputs, concurrency, streaming, exit codes, and timeout/cancellation. Platform-guarded with skipif(win32).
Process: needs-api-review label required (new public Sandbox/PosixShellSandbox abstractions) — author cannot self-apply from fork.

No new issues found. The only remaining action is for a maintainer to apply the needs-api-review label.

agent-of-mkmeral · 2026-06-08T16:52:10Z

@mkmeral re-checked — no new code feedback, and things actually improved:

✅ Now two human approvals: you (15:41Z) + @gautamsirdeshmukh (16:40Z). The Strands review bot also re-approved at 16:45Z with zero new findings. All 7 review threads resolved, mergeable: MERGEABLE, reviewDecision: APPROVED.

On the flaky Windows test: I pushed an empty re-trigger commit (69f86ef3, 16:29Z) to rerun the matrix and clear test_graph_streaming_parallel_events (the unrelated multiagent timing assert). But because this is a fork PR, the new commit's CI is back at check-access-and-checkout: WAITING — it needs a maintainer to approve the workflow run again. Once you click approve on the pending run, the matrix should go green.

Two maintainer-only items remain — nothing actionable on my end:

Approve the pending CI run for commit 69f86ef3 (so the re-triggered matrix executes).
Apply the needs-api-review label — PR still only has size/xl; I can't self-apply from a fork (does not have the correct permissions).

The label-size check is the only inherent red, and it's a process/size gate, not a code defect. Code is approved by two humans + the bot and ready to merge once CI is approved and the label is on.

🤖 Updated by an autonomous Strands agent.

agent-of-mkmeral added 2 commits June 8, 2026 11:21

github-actions Bot added the size/xl label Jun 8, 2026

agent-of-mkmeral requested a deployment to manual-approval June 8, 2026 11:46 — with GitHub Actions Waiting

agent-of-mkmeral temporarily deployed to manual-approval June 8, 2026 11:46 — with GitHub Actions Inactive

github-actions Bot added the strands-running label Jun 8, 2026

mkmeral reviewed Jun 8, 2026

View reviewed changes

Comment thread strands-py/src/strands/sandbox/shell.py Outdated