refactor(workspace): new workspace v3 container architecture #244
Merged
Migrate MCP containers to use UDS-based bridge communication instead of TCP gRPC. Containers now mount runtime binaries and Unix domain sockets from the host, eliminating the need for a dedicated MCP Docker image.
- Remove Dockerfile.mcp and entrypoint.sh in favor of standard base images
- Add a toolkit Dockerfile for building the MCP binary separately
- Containers use bind mounts for /opt/memoh (runtime) and /run/memoh (UDS)
- Update all config files with new runtime_path and socket_dir settings
- Support custom base images per bot (debian, alpine, ubuntu, etc.)
- Legacy container detection and TCP fallback for pre-bridge containers
- Frontend: add a base image selector to the container creation UI
Add real-time progress feedback during container image pull and creation using Server-Sent Events, without breaking the existing synchronous JSON API (content negotiation via the Accept header).
Backend:
- Add PullProgress/LayerStatus types and an OnProgress callback to PullImageOptions (containerd service layer)
- DefaultService.PullImage polls ContentStore.ListStatuses every 500ms when OnProgress is set; AppleService ignores it
- The CreateContainer handler checks for Accept: text/event-stream and switches to an SSE branch: pulling → pull_progress → creating → complete/error
Frontend:
- handleCreateContainer/handleRecreateContainer use fetch + SSE instead of the SDK's synchronous postBotsByBotIdContainer
- The progress bar shows layer-level pull progress (offset/total) during the pulling phase and an indeterminate animation during the creating phase
- i18n keys added for pullingImage and creatingContainer (en/zh)
- Fix unused-receiver lint: rename `s` to `_` on stub methods in manager_legacy_test.go
- Fix sloglint: use slog.DiscardHandler instead of slog.NewTextHandler(io.Discard, nil)
- Handle missing arm64 musl Node.js builds: unofficial-builds.nodejs.org does not provide arm64 musl binaries, so fall back to the glibc build
- Explicitly discard os.Setenv/os.Remove return values with _
- Use an omitted receiver name instead of _ (staticcheck ST1006)
- Tighten directory permissions from 0o755 to 0o750 (gosec G301)
filepath.Clean the env-sourced socket path before os.Remove to avoid a path-traversal taint warning.
filepath.Clean does not satisfy gosec's taint analysis. The socket path comes from MCP_SOCKET_PATH env (operator-configured) or a compiled-in default, not from end-user input.
Split internal/mcp/ to separate container lifecycle management from
Model Context Protocol connections, eliminating naming confusion:
- internal/mcp/ (container mgmt) → internal/workspace/
- internal/mcp/mcpclient/ → internal/workspace/bridge/
- internal/mcp/mcpcontainer/ → internal/workspace/bridgepb/
- cmd/mcp/ → cmd/bridge/
- config: MCPConfig → WorkspaceConfig, [mcp] → [workspace]
- container prefix: mcp-{id} → workspace-{id}
- labels: mcp.bot_id → memoh.bot_id, add memoh.workspace=v1
- socket: mcp.sock → bridge.sock, env BRIDGE_SOCKET_PATH
- runtime: /opt/memoh/runtime/mcp → /opt/memoh/runtime/bridge
- devenv: mcp-build.sh → bridge-build.sh
Legacy containers (mcp- prefix) detected by container name prefix
and handled via existing fallback path.
…ner name
Legacy containers use mcp-{botID} naming, so bot ID can be derived
via TrimPrefix instead of looking up the mcp.bot_id label.
Force-pushed c7e6dbc to 25ff0ad.
- Remove the synchronous CreateContainer path (SSE-only now)
- Move the flusher check before WriteHeader to avoid a committed 200 on error
- Fix legacy container IP not being cached via the ensureContainerAndTask path
- Add an atomic guard to prevent stale pull_progress after PullImage returns
- Defensively copy the tzEnv slice to avoid mutating the shared backing array
- Restore network failure severity in restartContainer (return + Error)
- Extract the duplicated progress bar into a ContainerCreateProgress component
- Fix codesync comments to use repo-relative paths
- Add a SaaS image validation note and a kernel version comment on the reaper
Unify the Node.js + uv download logic into docker/toolkit/install.sh, used by the production Dockerfile and runnable locally for dev. Dev environment no longer bakes toolkit into the Docker image — it is volume-mounted from .toolkit/ instead, so wrapper script changes take effect immediately without rebuilding. The entrypoint checks for the toolkit directory and prints a clear error if missing.
Force-pushed 25ff0ad to 3b77dbe.
…swallowing errors
Three root causes were identified and fixed:
1. Delete() used hardcoded "workspace-" prefix to look up legacy "mcp-"
containers, causing GetContainer to return NotFound. CleanupBotContainer
then silently skipped the error and deleted the DB record without ever
calling PreserveData. Fix: resolve the actual container ID via
ContainerID() (DB → label → scan) before operating.
2. Multiple restore error paths were silently swallowed (logged as Warn
but not returned), so the user saw HTTP 200/204 with no data and no
error. Fix: all errors in the preserve/restore chain now block the
workflow and propagate to the caller.
3. tarGzDir used cached DirEntry.Info() for tar header size, which on
overlayfs can differ from the actual file size, causing "archive/tar:
write too long". Fix: open the file first, Fstat the fd for a
race-free size, and use LimitReader as a safeguard.
Also adds a "restoring" SSE phase so the frontend shows a progress
indicator ("Restoring data, this may take a while...") during data
migration on container recreation.
Replace the `containerID func(string) string` field with a single `resolveContainerID(ctx, botID)` method that resolves the actual container ID via DB → label → scan → fallback. All ~16 lookup callsites across manager.go, dataio.go, versioning.go, and manager_lifecycle.go now go through this single resolver, which correctly handles both legacy "mcp-" and new "workspace-" containers. Only `ensureBotWithImage` inlines `ContainerPrefix + botID` for creating brand-new containers — every other path resolves dynamically.
The recreate flow (delete with preserve_data + create with restore_data) blocked on the DELETE call while backing up /data, with no progress indication. Add a 'preserving' phase to the progress component so users see "Backing up data..." instead of an unexplained hang.
Clean up all 112 temporary debug log statements added during the legacy container migration investigation, keeping only meaningful warn-level logs for non-fatal errors (network teardown, rename failures).
Force-pushed 7b89d06 to 6c167ec.
sheepbox8646 approved these changes on Mar 18, 2026.