fix(copilot): support vision via multi-part content messages#2608
fix(copilot): support vision via multi-part content messages#2608TimStewartJ wants to merge 2 commits intozeroclaw-labs:masterfrom
Conversation
|
Thanks for contributing to ZeroClaw. For faster review, please ensure:
See |
PR intake checks found warnings (non-blocking)Fast safe checks found advisory issues. CI lint/test/build gates still enforce merge quality.
Action items:
Detected Linear keys: none Run logs: https://github.com/zeroclaw-labs/zeroclaw/actions/runs/22686016936 Detected blocking line issues (sample):
Detected advisory line issues (sample):
|
|
Note
|
| Cohort / File(s) | Summary |
|---|---|
Copilot provider content model src/providers/copilot.rs |
Replaced ApiMessage.content: Option<String> with Option<ApiContent>; added ApiContent (Text, Parts), ContentPart (Text, ImageUrl), and ImageUrlDetail; added to_api_content() to parse [IMAGE:...] markers into parts; updated message construction, conversion paths, and tests to emit ApiContent instead of raw strings. |
Sequence Diagram(s)
(omitted)
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related issues
- Anthropic provider sends image content as base64 text, causing token overflow #1626 — Both change provider message content handling to structured multi-part/image variants and convert image markers into image parts.
Possibly related PRs
- fix(channel:discord): handle inbound image attachments as robust [IMAGE] markers #1923 — Touches the same image-marker flow; complements this PR's parsing of
[IMAGE:<url>]into parts. - fix(telegram): route image-extension Documents through vision pipeline #1631 — Emits
[IMAGE:/path]markers for Telegram documents that this PR consumes. - feat(discord): forward inbound image attachments as markers #1872 — Emits
[IMAGE:<url>]markers for Discord attachments; aligns with this PR's to_api_content conversion.
Suggested reviewers
- theonlyhennygod
- chumyin
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title concisely and accurately describes the main change: adding vision support to Copilot via multi-part content messages, which directly matches the changeset. |
| Description check | ✅ Passed | The PR description comprehensively covers all required template sections with detailed explanations of the problem, solution, validation evidence, security/privacy assessment, backward compatibility, and rollback plans. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Post copyable unit tests in a comment
Comment @coderabbitai help to get the list of available commands and usage tips.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/providers/copilot.rs (1)
270-296: Add focused regression tests for multimodal conversion behavior.Given this new message-shaping logic, add deterministic tests that assert
CopilotProvider::to_api_contentreturnsApiContent::Partsfor user image markers andApiContent::Textfor non-user/plain cases.🧪 Suggested test additions
+ #[test] + fn to_api_content_user_with_image_marker_returns_parts() { + let content = "describe this [IMAGE:data:image/png;base64,abc]"; + let converted = CopilotProvider::to_api_content("user", content).unwrap(); + match converted { + ApiContent::Parts(parts) => { + assert!(matches!(parts.first(), Some(ContentPart::Text { .. }))); + assert!(matches!(parts.last(), Some(ContentPart::ImageUrl { .. }))); + } + _ => panic!("expected multipart content"), + } + } + + #[test] + fn to_api_content_non_user_returns_text() { + let converted = CopilotProvider::to_api_content("system", "hello").unwrap(); + assert!(matches!(converted, ApiContent::Text(_))); + }As per coding guidelines,
**/*.rs: “Name tests by behavior/outcome (<subject>_<expected_behavior>) and keep tests deterministic (no flaky timing/network dependence without guardrails)”.Also applies to: 358-358
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/providers/copilot.rs` around lines 270 - 296, Add deterministic unit tests for CopilotProvider::to_api_content to cover multimodal conversion: create a test named to_api_content_user_with_image_returns_parts that passes a user role string and content containing an `[IMAGE:...]` marker and asserts the return equals ApiContent::Parts with appropriate ContentPart::Text and ContentPart::ImageUrl entries; create tests named to_api_content_user_plain_returns_text and to_api_content_non_user_returns_text that pass plain user content (no markers) and a non-"user" role respectively and assert ApiContent::Text is returned; use deterministic inputs (no network/timing) and match on enum variants (ApiContent::Parts / ApiContent::Text) rather than string formatting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/providers/copilot.rs`:
- Around line 667-670: In chat_with_system the new user ApiMessage is being
constructed with ApiContent::Text directly (messages.push ...
ApiContent::Text(message.to_string())), which bypasses multimodal parsing and
leaves [IMAGE:...] markers as plain text; replace the direct ApiContent::Text
creation with a call to the multimodal conversion routine (route the variable
message through the existing multimodal conversion helper — e.g.,
convert_text_to_multimodal or the project’s equivalent) and use its returned
ApiContent when constructing ApiMessage so the messages vector receives parsed
multimodal content instead of raw text.
---
Nitpick comments:
In `@src/providers/copilot.rs`:
- Around line 270-296: Add deterministic unit tests for
CopilotProvider::to_api_content to cover multimodal conversion: create a test
named to_api_content_user_with_image_returns_parts that passes a user role
string and content containing an `[IMAGE:...]` marker and asserts the return
equals ApiContent::Parts with appropriate ContentPart::Text and
ContentPart::ImageUrl entries; create tests named
to_api_content_user_plain_returns_text and to_api_content_non_user_returns_text
that pass plain user content (no markers) and a non-"user" role respectively and
assert ApiContent::Text is returned; use deterministic inputs (no
network/timing) and match on enum variants (ApiContent::Parts /
ApiContent::Text) rather than string formatting.
5b3cc79 to
b149dea
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/providers/copilot.rs (1)
789-818: Good test coverage for coreto_api_content()behavior.Tests cover the main branches: user with image markers, user without markers, and non-user roles.
Consider adding edge-case tests for completeness (optional):
- Multiple images:
"[IMAGE:a][IMAGE:b]"→ verify ordering- Image-only (no text):
"[IMAGE:data:...]"→ verify empty text part is omitted🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/providers/copilot.rs` around lines 789 - 818, Add two unit tests for CopilotProvider::to_api_content to cover edge cases: (1) a test that passes input with multiple image markers like "[IMAGE:data:...][IMAGE:data:...]" and asserts ApiContent::Parts with parts in the correct ordering (text and multiple ContentPart::ImageUrl entries in sequence), and (2) a test that passes image-only input like "[IMAGE:data:...]" and asserts the result is ApiContent::Parts where no empty ContentPart::Text is present (or that only the ContentPart::ImageUrl exists). Reference the existing test functions (to_api_content_user_with_image_returns_parts, to_api_content_user_plain_returns_text) when naming/placing these new tests and use the same pattern of matches! assertions against ApiContent and ContentPart variants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/providers/copilot.rs`:
- Around line 789-818: Add two unit tests for CopilotProvider::to_api_content to
cover edge cases: (1) a test that passes input with multiple image markers like
"[IMAGE:data:...][IMAGE:data:...]" and asserts ApiContent::Parts with parts in
the correct ordering (text and multiple ContentPart::ImageUrl entries in
sequence), and (2) a test that passes image-only input like "[IMAGE:data:...]"
and asserts the result is ApiContent::Parts where no empty ContentPart::Text is
present (or that only the ContentPart::ImageUrl exists). Reference the existing
test functions (to_api_content_user_with_image_returns_parts,
to_api_content_user_plain_returns_text) when naming/placing these new tests and
use the same pattern of matches! assertions against ApiContent and ContentPart
variants.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cd4f6598-0f31-4b8c-b902-037ed09c1849
📒 Files selected for processing (1)
src/providers/copilot.rs
b149dea to
453aad6
Compare
The Copilot provider sent image markers as plain text in the content field. The API expects OpenAI-style multi-part content arrays with separate text and image_url parts for vision input. Add ApiContent enum (untagged) that serializes as either a plain string or an array of ContentPart objects. User messages containing [IMAGE:] markers are converted to multi-part content using the shared multimodal::parse_image_markers() helper, matching the pattern used by the OpenRouter and Compatible providers. Non-user messages and messages without images serialize as plain strings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d tests Fix chat_with_system user message bypassing to_api_content(), which left [IMAGE:] markers as plain text on that code path. Add unit tests for to_api_content() covering image-marker, plain-text, and non-user roles. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
453aad6 to
1eada19
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/providers/copilot.rs`:
- Around line 277-295: The current code uses parse_image_markers(content) which
returns a cleaned text plus a list of image refs and then always emits parts as
"text? followed by all images", losing the original interleaving; update the
logic so ApiContent::Parts contains ContentPart entries in the same source order
as the original content. Either modify parse_image_markers to return ordered
segments (e.g., an enum Segment::Text(String) | Segment::Image(String)) and
iterate those to build ContentPart::Text / ContentPart::ImageUrl
(ImageUrlDetail) in order, or keep parse_image_markers but also return the
positions of markers and rebuild parts by scanning the original content to
interleave text and images; ensure you create ContentPart::Text only for
non-empty text segments and ImageUrlDetail { url: ... } for image refs so
ApiContent::Parts preserves original ordering (and add a regression test for an
interleaved case like "compare [IMAGE:a] and [IMAGE:b]").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c1c8b0ee-bc8f-4cf5-a755-85ea0bc608d6
📒 Files selected for processing (1)
src/providers/copilot.rs
| let (cleaned_text, image_refs) = crate::multimodal::parse_image_markers(content); | ||
| if image_refs.is_empty() { | ||
| return Some(ApiContent::Text(content.to_string())); | ||
| } | ||
|
|
||
| let mut parts = Vec::with_capacity(image_refs.len() + 1); | ||
| let trimmed = cleaned_text.trim(); | ||
| if !trimmed.is_empty() { | ||
| parts.push(ContentPart::Text { | ||
| text: trimmed.to_string(), | ||
| }); | ||
| } | ||
| for image_ref in image_refs { | ||
| parts.push(ContentPart::ImageUrl { | ||
| image_url: ImageUrlDetail { url: image_ref }, | ||
| }); | ||
| } | ||
|
|
||
| Some(ApiContent::Parts(parts)) |
There was a problem hiding this comment.
Preserve text/image ordering when building multipart content.
parse_image_markers() gives you one cleaned text string plus the collected refs, so this always serializes as text?, image1, image2, ... regardless of where each marker originally appeared. That changes the prompt semantics for inputs like compare [IMAGE:a] and [IMAGE:b] or any interleaved text/image sequence. Please build ContentParts in source order here, or extend src/multimodal.rs to return ordered segments, and add a regression test for that case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/providers/copilot.rs` around lines 277 - 295, The current code uses
parse_image_markers(content) which returns a cleaned text plus a list of image
refs and then always emits parts as "text? followed by all images", losing the
original interleaving; update the logic so ApiContent::Parts contains
ContentPart entries in the same source order as the original content. Either
modify parse_image_markers to return ordered segments (e.g., an enum
Segment::Text(String) | Segment::Image(String)) and iterate those to build
ContentPart::Text / ContentPart::ImageUrl (ImageUrlDetail) in order, or keep
parse_image_markers but also return the positions of markers and rebuild parts
by scanning the original content to interleave text and images; ensure you
create ContentPart::Text only for non-empty text segments and ImageUrlDetail {
url: ... } for image refs so ApiContent::Parts preserves original ordering (and
add a regression test for an interleaved case like "compare [IMAGE:a] and
[IMAGE:b]").
Code Review: Copilot Vision SupportReviewer: Claude Opus 4.6 (via dwillitzer) SummaryThe implementation is architecturally sound and follows established patterns from Compatible/OpenRouter providers. The What Works Well
Critical Issue: Missing TestsThe PR introduces new types ( Recommended tests: #[test]
fn api_content_text_serializes_as_plain_string() {
let content = ApiContent::Text("hello".to_string());
let json = serde_json::to_string(&content).unwrap();
assert_eq!(json, r#""hello""#);
}
#[test]
fn api_content_parts_serializes_as_array() {
let content = ApiContent::Parts(vec![
ContentPart::Text { text: "Look at this".to_string() },
ContentPart::ImageUrl {
image_url: ImageUrlDetail { url: "data:image/png;base64,abc".to_string() }
},
]);
let json = serde_json::to_string(&content).unwrap();
assert!(json.starts_with('['));
}
#[test]
fn to_api_content_non_user_returns_text() {
let result = CopilotProvider::to_api_content("assistant", "Hello [IMAGE:x]");
assert!(matches!(result, Some(ApiContent::Text(_))));
}
#[test]
fn to_api_content_user_with_images_returns_parts() {
let result = CopilotProvider::to_api_content("user", "See [IMAGE:data:abc]");
assert!(matches!(result, Some(ApiContent::Parts(_))));
}Minor Suggestions
VerdictThe implementation is correct, but given that:
I recommend adding unit tests before merge. Review Summary:
🤖 Generated with Claude Code |
SimianAstronaut7
left a comment
There was a problem hiding this comment.
LGTM — quality score: 87/100
Summary
main[IMAGE:data:image/jpeg;base64,...]markers as plain text in thecontentstring field. The Copilot API (OpenAI-compatible) expects multi-part content arrays with separatetextandimage_urlparts for vision input. The model receives base64 data as raw text tokens (~50K tokens per image) and hallucinates instead of analyzing the actual image.ApiContentenum andContentParttypes (matching the OpenRouter/Compatible provider pattern). User messages containing[IMAGE:]markers are converted to multi-part content using the sharedmultimodal::parse_image_markers()helper. Non-user messages and messages without images serialize as plain strings.ApiContent::Textserializes identically to the previousOption<String>.How it works
Label Snapshot (required)
risk: lowsize: Sproviderprovider: copilotChange Metadata
bugproviderLinked Issue
Supersede Attribution (required when
Supersedes #is used)N/A — not superseding any PR.
Validation Evidence (required)
cargo fmt/clippy/test— CI will validate. Change is self-contained in one provider file.Security Impact (required)
Privacy and Data Hygiene (required)
passCompatibility / Migration
ApiContent::Textserializes identically to the previousOption<String>via#[serde(untagged)]. Non-image messages produce identical JSON.model_support_vision = trueandallow_remote_fetch = truefor Discord/remote images (pre-existing config options)i18n Follow-Through (required when docs or user-facing wording changes)
Human Verification (required)
Side Effects / Blast Radius (required)
#[serde(untagged)]enum ensuresApiContent::Text("hello")serializes as"hello"(plain string), not{"Text":"hello"}. Existing non-image conversations produce byte-identical JSON.input_tokenscount — ~50K for a single image = broken (base64 as text), ~1K = working (image sent as binary)Agent Collaboration Notes (recommended)
parse_image_markers()helper → verified live on deployed ZeroClaw instanceAGENTS.md+CONTRIBUTING.md). Uses samemultimodal::parse_image_markers()helper as 5 other providers.Rollback Plan (required)
git revert <commit>— reverts to plain string contentmodel_support_visionconfig flaginput_tokens~50K per image in tracesRisks and Mitigations
#[serde(untagged)]enum serialization order — serde tries variants in order, so a string value always matchesTextbeforeParts.Text(String)is listed first in the enum, which is the correct match for plain strings.Parts(Vec<ContentPart>)only matches when explicitly constructed. This is the same pattern used by OpenRouter and Compatible providers.Summary by CodeRabbit
New Features
Tests