This repository was archived by the owner on Jun 3, 2026. It is now read-only.

feat(models): add WebLLM model provider for on-device browser inference #1036

Draft
jsamuel1 wants to merge 1 commit into strands-agents:main from jsamuel1:feat/webllm-model-provider
@jsamuel1 jsamuel1 commented May 9, 2026

Contributor

Motivation

WebLLM runs quantized LLMs entirely in the browser via WebGPU, with model weights cached in IndexedDB/CacheStorage after the first download. Without a first-class provider, users building browser-based agents have to wire up @mlc-ai/web-llm themselves or reach for the community webllm-ai-provider via VercelModel (0.0.1, ~2 weekly downloads on npm).

This adds a WebLLMModel provider under @strands-agents/sdk/models/webllm so on-device, offline-capable agents are a one-import experience — matching how BedrockModel, AnthropicModel, etc. are shipped today.

Resolves strands-agents/harness-sdk#2481

Public API Changes

New subpath export @strands-agents/sdk/models/webllm with a WebLLMModel class and cache-management helpers.

import { Agent } from '@strands-agents/sdk'
import { WebLLMModel } from '@strands-agents/sdk/models/webllm'

const agent = new Agent({
  model: new WebLLMModel({
    modelId: 'Llama-3.1-8B-Instruct-q4f32_1-MLC',
    onProgress: (report) => console.log(report.text, report.progress),
  }),
})

const result = await agent.invoke('Hello!')

Cache helpers let apps pre-download from a settings UI, check what's cached, and evict models independently of an agent invocation:

import {
  downloadWebLLMModel,
  isWebLLMModelCached,
  deleteWebLLMModel,
  listWebLLMModels,
} from '@strands-agents/sdk/models/webllm'

if (!(await isWebLLMModelCached('Phi-3.5-mini-instruct-q4f16_1-MLC'))) {
  await downloadWebLLMModel({
    modelId: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
    onProgress: (r) => updateProgressBar(r.progress, r.text),
  })
}

@mlc-ai/web-llm is declared as an optional peerDependency, so server-side users are unaffected. Attempting to use the provider outside a browser or without the peer installed raises a typed WebLLMUnavailableError.

Use Cases

  • Fully offline browser agents after the initial model download
  • Privacy-sensitive deployments where prompts/responses must not leave the device
  • Zero per-call cost — inference runs on user hardware
  • Demo/education apps with no cloud credentials required

Testing

  • strands-ts/src/models/webllm/__tests__/model.test.ts — unit tests for streaming/formatting/tool-use paths with a mocked MLCEngine
  • strands-ts/src/models/webllm/__tests__/cache.test.node.ts — Node-side environment guards and error surfaces
  • strands-ts/src/models/webllm/__tests__/browser.test.browser.ts — browser smoke test
  • strands-ts/test/packages/{esm-module,cjs-module} — subpath export resolution for the new ./models/webllm entry

All existing suites pass (2554 passed) alongside the new coverage.

Notes

  • Marked as draft until @mlc-ai/web-llm peer-dep wiring is sanity-checked in CI and the browser-integration job lights up end-to-end.
  • AGENTS.md directory map updated to reflect the new webllm/ module.

Adds a new WebLLMModel provider under @strands-agents/sdk/models/webllm
that runs quantized LLMs entirely in the browser via WebGPU using
@mlc-ai/web-llm. Models are cached in browser storage after the first
download.

Includes cache management helpers (downloadWebLLMModel,
isWebLLMModelCached, deleteWebLLMModel, listWebLLMModels) so apps can
pre-download models from a settings UI and report progress via an
onProgress callback.

@mlc-ai/web-llm is added as an optional peer dependency to keep it
out of the default dependency graph for server-side users.

Resolves #1035
github-actions Bot added the strands-running label (whether or not an agent is currently running) May 9, 2026
*
* @throws {@link WebLLMUnavailableError} when WebLLM cannot be loaded.
*/
export async function listWebLLMModels(appConfig?: AppConfig): Promise<WebLLMModelInfo[]> {


Issue: listWebLLMModels does not call assertBrowserEnvironment() unlike isWebLLMModelCached, deleteWebLLMModel, and downloadWebLLMModel. This is inconsistent — if the module can't be loaded in Node, it will throw WebLLMUnavailableError from loadWebLLMModule() anyway, but the error message won't be the clear "requires a browser" guidance.

Suggestion: Either add assertBrowserEnvironment() for consistency with the other helpers, or add a code comment explaining why listWebLLMModels intentionally skips the check (e.g., if it's designed to work in server-side contexts for listing available models without needing WebGPU).

}

return events
}


Issue: The mapChunkToEvents, extractUsage, and the streaming state management logic here are nearly line-for-line identical to mapChatChunkToEvents in src/models/openai/chat-adapter.ts. This creates a maintenance burden where fixes to one must be duplicated to the other.

Suggestion: Consider extracting the shared OpenAI-compatible chunk-to-event mapping into a shared utility (e.g. src/models/openai-compatible-streaming.ts) that both the OpenAI chat adapter and WebLLM can import. At minimum, leave a // NOTE: comment cross-referencing the OpenAI adapter so future maintainers know to keep them in sync.
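
As a rough illustration of what such a shared utility might look like, here is a minimal sketch. The chunk, event, and state shapes are hypothetical and heavily trimmed (no tool calls, usage, or stop reasons), `createStreamState` and `modelContentBlockDeltaEvent` are names assumed for this example, and the role is simplified to always be 'assistant'. It is not the code from either adapter.

```typescript
// Hypothetical, trimmed-down shapes. The real chunk and event types carry
// more fields (tool calls, usage, stop reasons).
interface ChatChunk {
  choices: Array<{ delta?: { role?: string; content?: string | null } }>
}

type StreamEvent =
  | { type: 'modelMessageStartEvent'; role: 'assistant' }
  | { type: 'modelContentBlockStartEvent' }
  | { type: 'modelContentBlockDeltaEvent'; text: string }

interface StreamState {
  messageStarted: boolean
  textContentBlockStarted: boolean
}

function createStreamState(): StreamState {
  return { messageStarted: false, textContentBlockStarted: false }
}

// Provider-agnostic mapping that both adapters could import. The message
// start is emitted on the first role delta, or on the first content delta if
// no role delta has been seen.
function mapChunkToEvents(chunk: ChatChunk, state: StreamState): StreamEvent[] {
  const events: StreamEvent[] = []
  const delta = chunk.choices[0]?.delta
  if (!delta) return events

  const text = typeof delta.content === 'string' ? delta.content : ''
  if (!state.messageStarted && (delta.role !== undefined || text.length > 0)) {
    state.messageStarted = true
    events.push({ type: 'modelMessageStartEvent', role: 'assistant' })
  }
  if (text.length > 0) {
    if (!state.textContentBlockStarted) {
      state.textContentBlockStarted = true
      events.push({ type: 'modelContentBlockStartEvent' })
    }
    events.push({ type: 'modelContentBlockDeltaEvent', text })
  }
  return events
}

// Usage: one state object per stream, shared across all chunks of that stream.
const state = createStreamState()
const first = mapChunkToEvents({ choices: [{ delta: { content: 'Hi' } }] }, state)
const second = mapChunkToEvents({ choices: [{ delta: { content: ' there' } }] }, state)
console.log(first.map((e) => e.type), second.map((e) => e.type))
```

Keeping the mutable state in a separate object, rather than in the adapter class, is what would let two providers share the function without sharing a base class.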

modelLib: record.model_lib,
}
if (record.vram_required_MB !== undefined) info.vramMB = record.vram_required_MB
if (typeof (record as unknown as { model_type?: string }).model_type === 'string') {


Issue: The model_type access uses a double cast through unknown ((record as unknown as { model_type?: string }).model_type), which is fragile and circumvents type safety.

Suggestion: Since ModelRecord comes from @mlc-ai/web-llm types, either:

  1. Narrow with an in check instead of casting: if ('model_type' in record && typeof record.model_type === 'string'). This relies on the in-operator narrowing of unlisted properties added in TypeScript 4.9.
  2. Or extend the ModelRecord type locally if this field is expected but not yet typed upstream

The current double-cast could silently break if model_type is renamed or restructured.
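
A minimal sketch of the first option follows. `ModelRecordLike` is a hypothetical stand-in for the upstream `ModelRecord` type (only a few fields are declared, and `model_type` is deliberately left out), and `readModelType` is a name invented for this example. It assumes TypeScript 4.9 or later.

```typescript
// Hypothetical stand-in for the upstream ModelRecord type; only the fields
// used here are declared, and model_type is deliberately left undeclared.
interface ModelRecordLike {
  model_id: string
  model_lib: string
  vram_required_MB?: number
}

// Returns model_type when it is present and a string, otherwise undefined.
// The `in` check narrows `record` so the property can be read without a cast.
function readModelType(record: ModelRecordLike): string | undefined {
  if ('model_type' in record && typeof record.model_type === 'string') {
    return record.model_type
  }
  return undefined
}

const withType = { model_id: 'a', model_lib: 'a.wasm', model_type: 'llm' }
const withoutType = { model_id: 'b', model_lib: 'b.wasm' }
console.log(readModelType(withType)) // "llm"
console.log(readModelType(withoutType)) // undefined
```

If the field is later renamed upstream, this version simply returns undefined, and once the field is typed upstream the `in` check becomes redundant rather than wrong.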


if (bufferedUsage) yield bufferedUsage
if (bufferedStop) yield bufferedStop
} catch (error) {


Issue: The stream method catches errors from the engine and re-throws via normalizeError(error), but if the error occurs during iteration of the async iterable (inside for await), the generator will be in a partially-yielded state. The consumer will see the error, but any buffered modelContentBlockStartEvent won't have a matching modelContentBlockStopEvent, which could leave the SDK's message accumulator in an inconsistent state.

Suggestion: Consider emitting content block stop events in the catch/finally block when state.textContentBlockStarted is true or activeToolCalls is non-empty, to ensure the stream is always well-formed even on errors.
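
One way this could look is sketched below, under simplifying assumptions: the event shapes are hypothetical and text-only, `modelContentBlockDeltaEvent`, `streamText`, `failingSource`, and `collect` are names made up for the example, and tool-call blocks are ignored. The point is only that the stop event is yielded before the error is re-thrown.

```typescript
// Hypothetical, text-only event shapes; the SDK's real events are richer.
type StreamEvent =
  | { type: 'modelContentBlockStartEvent' }
  | { type: 'modelContentBlockDeltaEvent'; text: string }
  | { type: 'modelContentBlockStopEvent' }

// Wraps a text source and guarantees that an opened content block is closed
// before any error from the source propagates to the consumer.
async function* streamText(source: AsyncIterable<string>): AsyncGenerator<StreamEvent> {
  let blockOpen = false
  try {
    for await (const text of source) {
      if (!blockOpen) {
        blockOpen = true
        yield { type: 'modelContentBlockStartEvent' }
      }
      yield { type: 'modelContentBlockDeltaEvent', text }
    }
    if (blockOpen) {
      blockOpen = false
      yield { type: 'modelContentBlockStopEvent' }
    }
  } catch (error) {
    // Close the dangling block first, then surface the original error.
    if (blockOpen) {
      blockOpen = false
      yield { type: 'modelContentBlockStopEvent' }
    }
    throw error
  }
}

// A source that fails after producing one piece of text.
async function* failingSource(): AsyncGenerator<string> {
  yield 'Hel'
  throw new Error('engine failure')
}

// Collects the event types seen by a consumer, plus the error message if any.
async function collect(): Promise<{ types: string[]; error?: string }> {
  const types: string[] = []
  try {
    for await (const event of streamText(failingSource())) types.push(event.type)
  } catch (error) {
    return { types, error: (error as Error).message }
  }
  return { types }
}
```

With this shape the consumer sees start, delta, stop, and then the error, so an accumulator that pairs start and stop events is never left with an open block.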

Comment thread strands-ts/package.json
"@aws-sdk/client-s3": "^3.943.0",
"@google/genai": "^1.40.0",
"@modelcontextprotocol/sdk": "^1.25.2",
"@mlc-ai/web-llm": "^0.2.79",


Issue: The peer dependency is specified as "^0.2.79". For a pre-1.0 package, semver's caret operator keeps the left-most non-zero component fixed, so this range is >=0.2.79 <0.3.0 and only admits later 0.2.x releases. This is correctly conservative. However, @mlc-ai/web-llm has a history of breaking changes within a minor version (their API changed between 0.2.x releases).

Suggestion: Consider whether pinning to the exact version 0.2.79 would be safer (note that ~0.2.79 resolves to the same >=0.2.79 <0.3.0 range as ^0.2.79, so it is not tighter), or alternatively document in the module TSDoc which web-llm API surface you depend on. If the intent is to support a range, add a comment in package.json or the README noting the tested/verified version range.
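
When weighing the operators, it can help to see what each one admits. The sketch below computes the half-open range selected by npm-style caret and tilde specifiers for a plain major.minor.patch version. `rangeBounds` is a helper written for this example, not part of any library, and it ignores pre-release tags and partial versions.

```typescript
// Computes the half-open range [min, max) that npm-style caret and tilde
// operators select for a plain "major.minor.patch" version. Pre-release tags
// and partial versions are out of scope for this sketch.
function rangeBounds(spec: string): { min: string; max: string } {
  const match = /^([~^])(\d+)\.(\d+)\.(\d+)$/.exec(spec)
  if (!match) throw new Error(`unsupported range: ${spec}`)
  const op = match[1]
  const major = Number(match[2])
  const minor = Number(match[3])
  const patch = Number(match[4])
  const min = `${major}.${minor}.${patch}`

  // Tilde: patch-level changes only.
  if (op === '~') return { min, max: `${major}.${minor + 1}.0` }

  // Caret: keep the left-most non-zero component fixed.
  if (major > 0) return { min, max: `${major + 1}.0.0` }
  if (minor > 0) return { min, max: `0.${minor + 1}.0` }
  return { min, max: `0.0.${patch + 1}` }
}

console.log(rangeBounds('^0.2.79')) // { min: '0.2.79', max: '0.3.0' }
console.log(rangeBounds('~0.2.79')) // { min: '0.2.79', max: '0.3.0' }
console.log(rangeBounds('^1.2.3')) // { min: '1.2.3', max: '2.0.0' }
```

For a 0.x version with a non-zero minor, the caret and tilde ranges coincide, so only an exact version excludes later 0.2.x releases.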

events.push({ type: 'modelMessageStartEvent', role: delta.role as 'user' | 'assistant' })
}

if (delta?.content && delta.content.length > 0) {


Issue: If the stream starts emitting content deltas without a preceding role delta (e.g. some engines skip the role chunk), no modelMessageStartEvent is ever emitted, but content block events are still produced. This would leave the SDK's stream consumer in an inconsistent state.

Suggestion: Add a guard that emits a synthetic modelMessageStartEvent with role: 'assistant' when content arrives before a role delta, similar to how the text content block start is auto-emitted:

if (delta?.content && delta.content.length > 0) {
  if (!state.messageStarted) {
    state.messageStarted = true
    events.push({ type: 'modelMessageStartEvent', role: 'assistant' })
  }
  // ...
}

return this._enginePromise
}

private async _createEngine(): Promise<MLCEngineInterface> {


Issue: The _createEngine method calls assertBrowserEnvironment() synchronously, then loadWebLLMModule() which also surfaces an environment error. However, when _getEngine() is called, it caches the promise — if the first call fails (e.g., module not found), it correctly resets _enginePromise allowing retry. But assertBrowserEnvironment() will always throw synchronously in Node, meaning the retry logic is unreachable in that scenario. This is fine but worth noting that the catch reset on line 300 only helps for transient loadWebLLMModule failures, not environment failures.

No action required — just noting for clarity that the retry semantics only apply to module loading/engine init failures in a valid browser environment.
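
For readers unfamiliar with the pattern being discussed, here is a minimal, generic sketch of promise caching with reset-on-failure. `LazyEngine` and its members are hypothetical names for illustration and the engine is stood in for by a string. It is not the PR's `_getEngine` implementation.

```typescript
// Caches the in-flight creation promise so concurrent callers share a single
// engine, and clears the cache on failure so that a later call can retry.
class LazyEngine<T> {
  private enginePromise: Promise<T> | undefined
  private readonly create: () => Promise<T>

  constructor(create: () => Promise<T>) {
    this.create = create
  }

  getEngine(): Promise<T> {
    if (!this.enginePromise) {
      this.enginePromise = this.create().catch((error: unknown) => {
        this.enginePromise = undefined // allow a retry after a failed attempt
        throw error
      })
    }
    return this.enginePromise
  }
}

// Usage: the first attempt fails, the second succeeds and is then reused.
let attempts = 0
const lazy = new LazyEngine(async () => {
  attempts += 1
  if (attempts === 1) throw new Error('transient load failure')
  return 'engine'
})

async function demo(): Promise<string[]> {
  const log: string[] = []
  try {
    await lazy.getEngine()
  } catch (error) {
    log.push((error as Error).message)
  }
  log.push(await lazy.getEngine())
  log.push(await lazy.getEngine())
  return log
}
```

A failure that recurs deterministically, such as an environment check that always throws outside a browser, will simply fail again on every retry, which matches the observation above.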

*
* @internal
*/
export function assertBrowserEnvironment(): void {


Issue: assertBrowserEnvironment() checks typeof window === 'undefined' to detect non-browser environments. However, some non-browser environments, such as tests running under jsdom, define window without actually having WebGPU. Conversely, Web Workers (where WebGPU is available) don't have window.

Suggestion: Consider checking for typeof navigator !== 'undefined' && 'gpu' in navigator (or at minimum typeof globalThis.navigator !== 'undefined') as a more accurate browser+WebGPU heuristic, or simply let the CreateMLCEngine call surface WebLLM's own environment check (which it already does) and remove the preemptive check. The error message could also mention Web Workers as a valid environment.
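
A minimal sketch of the navigator-based heuristic follows. `GlobalLike` and `hasWebGPU` are hypothetical names. The function takes the scope as a parameter so it can be exercised with plain objects, and in real code the argument would be `globalThis`. The presence of `navigator.gpu` only indicates that the API is exposed, not that an adapter can actually be acquired.

```typescript
// Minimal structural view of the global scope, so the check can be exercised
// with plain objects instead of a real browser global.
interface GlobalLike {
  navigator?: object
}

// Returns true when the scope exposes `navigator.gpu`. This does not depend
// on `window`, so it behaves the same in window contexts and Web Workers.
function hasWebGPU(scope: GlobalLike): boolean {
  const nav = scope.navigator
  return typeof nav === 'object' && nav !== null && 'gpu' in nav
}

console.log(hasWebGPU({ navigator: { gpu: {} } })) // true: WebGPU-capable scope
console.log(hasWebGPU({ navigator: {} })) // false: navigator without gpu
console.log(hasWebGPU({})) // false: no navigator at all
```

Because the check never touches `window`, it sidesteps both failure modes described above: a `window` that exists without WebGPU, and a worker scope that has WebGPU without `window`.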

github-actions Bot commented May 9, 2026

Contributor

Review Summary

Assessment: Comment (Draft PR - not blocking, providing feedback for iteration)

This is a well-structured addition that follows existing model provider patterns closely. The code is clean, well-documented, and thoroughly tested.

Review Categories
  • Code Duplication: The OpenAI-compatible streaming logic (mapChunkToEvents, state management, usage extraction) is nearly identical to openai/chat-adapter.ts. This is the most impactful improvement opportunity — extracting a shared utility would reduce future maintenance burden across both providers.
  • Robustness: Edge cases around missing role deltas and mid-stream errors could leave the SDK's stream consumer in an inconsistent state. Adding guards for well-formed event sequences would improve reliability.
  • Environment Detection: The assertBrowserEnvironment() check is overly simplistic (window detection) and would incorrectly reject valid environments (Web Workers) while accepting invalid ones (jsdom). Consider relying on WebLLM's own runtime checks.
  • API Review Process: This introduces a new public class with cache management helpers — per the API Bar Raising guidelines, it should carry the needs-api-review label for designated reviewer evaluation before merge.

Good work on the overall design — the cache helper separation, abort signal support, and consistent error class hierarchy are thoughtful touches.

github-actions Bot removed the strands-running label (whether or not an agent is currently running) May 9, 2026
@strands-agent

Collaborator

This repository has been merged into the strands-agents/harness-sdk monorepo and will be archived shortly. All new development happens there.

If this PR is still relevant, please recreate it against the monorepo. The code now lives under strands-ts/. Full commit history was preserved, so your base should be findable.

Apologies for the disruption, and thank you for contributing!

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] WebLLM model provider (on-device inference via WebGPU)

2 participants