Skip to content

Commit f605b97

Browse files
authored
Support vision input for Planner (#472)
- Modified message formatting to support vision input for OpenAI API - Added a role ImageReader to process input images so the Planner can get the URL/content of the image
2 parents 5505890 + b5b8d12 commit f605b97

26 files changed

+332
-44
lines changed

README.md

+4-3
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ Unlike many agent frameworks that only track the chat history with LLMs in text,
2323

2424

2525
## 🆕 News
26+
- 📅2025-03-13: TaskWeaver now supports vision input for the Planner role. Please check the [vision input](https://microsoft.github.io/TaskWeaver/blog/vision) for more details.👀
2627
- 📅2025-01-16: TaskWeaver has been enhanced with an experimental role called [Recepta](https://microsoft.github.io/TaskWeaver/blog/reasoning) for its reasoning power.🧠
2728
- 📅2024-12-23: TaskWeaver has been integrated with the [AgentOps](https://microsoft.github.io/TaskWeaver/docs/observability) for better observability and monitoring.🔍
2829
- 📅2024-09-13: We introduce the shared memory to store information that is shared between the roles in TaskWeaver. Please check the [memory](https://microsoft.github.io/TaskWeaver/docs/memory) for more details.🧠
@@ -31,7 +32,7 @@ Unlike many agent frameworks that only track the chat history with LLMs in text,
3132
- 📅2024-05-07: We have added two blog posts on [Evaluating a LLM agent](https://microsoft.github.io/TaskWeaver/blog/evaluation) and [Adding new roles to TaskWeaver](https://microsoft.github.io/TaskWeaver/blog/role) in the documentation.📝
3233
- 📅2024-03-28: TaskWeaver now offers all-in-one Docker image, providing a convenient one-stop experience for users. Please check the [docker](https://microsoft.github.io/TaskWeaver/docs/usage/docker) for more details.🐳
3334
- 📅2024-03-27: TaskWeaver now switches to `container` mode by default for code execution. Please check the [code execution](https://microsoft.github.io/TaskWeaver/docs/code_execution) for more details.🐳
34-
- 📅2024-03-07: TaskWeaver now supports configuration of different LLMs for various components, such as the Planner and CodeInterpreter. Please check the [multi-llm](https://microsoft.github.io/TaskWeaver/docs/llms/multi-llm) for more details.🔗
35+
<!-- - 📅2024-03-07: TaskWeaver now supports configuration of different LLMs for various components, such as the Planner and CodeInterpreter. Please check the [multi-llm](https://microsoft.github.io/TaskWeaver/docs/llms/multi-llm) for more details.🔗 -->
3536
<!-- - 📅2024-03-04: TaskWeaver now supports a [container](https://microsoft.github.io/TaskWeaver/docs/code_execution) mode, which provides a more secure environment for code execution.🐳 -->
3637
<!-- - 📅2024-02-28: TaskWeaver now offers a [CLI-only](https://microsoft.github.io/TaskWeaver/docs/advanced/cli_only) mode, enabling users to interact seamlessly with the Command Line Interface (CLI) using natural language.📟 -->
3738
<!-- - 📅2024-02-01: TaskWeaver now has a plugin [document_retriever](https://github.com/microsoft/TaskWeaver/blob/main/project/plugins/README.md#document_retriever) for RAG based on a knowledge base.📚 -->
@@ -43,7 +44,8 @@ Unlike many agent frameworks that only track the chat history with LLMs in text,
4344
<!-- - 📅2023-12-21: TaskWeaver now supports a number of LLMs, such as LiteLLM, Ollama, Gemini, and QWen🎈.) -->
4445
<!-- - 📅2023-12-21: TaskWeaver Website is now [available]&#40;https://microsoft.github.io/TaskWeaver/&#41; with more documentations.) -->
4546
<!-- - 📅2023-12-12: A simple UI demo is available in playground/UI folder, try it [here](https://microsoft.github.io/TaskWeaver/docs/usage/webui)! -->
46-
<!-- - 📅2023-11-30: TaskWeaver is released on GitHub🎈. -->
47+
- ......
48+
- 📅2023-11-30: TaskWeaver is released on GitHub🎈.
4749

4850

4951
## 💥 Highlights
@@ -68,7 +70,6 @@ We are looking forward to your contributions to make TaskWeaver better.
6870
- [ ] Support for prompt template management
6971
- [ ] Better plugin experiences, such as displaying updates or stopping in the middle of running the plugin and user confirmation before running the plugin
7072
- [ ] Async interaction with LLMs
71-
- [ ] Support for vision input for Roles such as the Planner and CodeInterpreter
7273
- [ ] Support for remote code execution
7374

7475

taskweaver/chat/console/chat.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -498,7 +498,7 @@ def _reset_session(self, first_session: bool = False):
498498
self.session.stop()
499499
self.session = self.app.get_session()
500500

501-
self._system_message("--- new session starts ---")
501+
self._system_message("--- new session started ---")
502502
self._assistant_message(
503503
"I am TaskWeaver, an AI assistant. To get started, could you please enter your request?",
504504
)

taskweaver/code_interpreter/code_interpreter/code_generator.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -251,7 +251,7 @@ def compose_conversation(
251251
# for code correction
252252
user_message += self.user_message_head_template.format(
253253
FEEDBACK=format_code_feedback(post),
254-
MESSAGE=f"{post.get_attachment(AttachmentType.revise_message)[0]}",
254+
MESSAGE=f"{post.get_attachment(AttachmentType.revise_message)[0].content}",
255255
)
256256

257257
assistant_message = self.post_translator.post_to_raw_text(

taskweaver/code_interpreter/code_interpreter_cli_only/code_interpreter_cli_only.py

+5-2
Original file line numberDiff line numberDiff line change
@@ -60,9 +60,12 @@ def reply(
6060
prompt_log_path=prompt_log_path,
6161
)
6262

63-
code = post_proxy.post.get_attachment(type=AttachmentType.reply_content)[0]
63+
code = post_proxy.post.get_attachment(type=AttachmentType.reply_content)[0].content
6464
if len(code) == 0:
65-
post_proxy.update_message(post_proxy.post.get_attachment(type=AttachmentType.thought)[0], is_end=True)
65+
post_proxy.update_message(
66+
post_proxy.post.get_attachment(type=AttachmentType.thought)[0].content,
67+
is_end=True,
68+
)
6669
return post_proxy.end()
6770

6871
code_to_exec = "! " + code

taskweaver/code_interpreter/code_interpreter_plugin_only/code_interpreter_plugin_only.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,7 @@ def reply(
7878
return post_proxy.end()
7979

8080
functions = json.loads(
81-
post_proxy.post.get_attachment(type=AttachmentType.function)[0],
81+
post_proxy.post.get_attachment(type=AttachmentType.function)[0].content,
8282
)
8383
if len(functions) > 0:
8484
code: List[str] = []

taskweaver/ext_role/image_reader/__init__.py

Whitespace-only changes.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
import base64
2+
import json
3+
import os.path
4+
from mimetypes import guess_type
5+
6+
from injector import inject
7+
8+
from taskweaver.llm import LLMApi, format_chat_message
9+
from taskweaver.logging import TelemetryLogger
10+
from taskweaver.memory import Memory, Post
11+
from taskweaver.memory.attachment import AttachmentType
12+
from taskweaver.module.event_emitter import SessionEventEmitter
13+
from taskweaver.module.tracing import Tracing
14+
from taskweaver.role import Role
15+
from taskweaver.role.role import RoleConfig, RoleEntry
16+
from taskweaver.session import SessionMetadata
17+
18+
19+
# Function to encode a local image into data URL
20+
def local_image_to_data_url(image_path):
21+
# Guess the MIME type of the image based on the file extension
22+
mime_type, _ = guess_type(image_path)
23+
if mime_type is None:
24+
mime_type = "application/octet-stream" # Default MIME type if none is found
25+
26+
try:
27+
# Read and encode the image file
28+
with open(image_path, "rb") as image_file:
29+
base64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")
30+
except FileNotFoundError:
31+
logger.error(f"Error: The file {image_path} does not exist.")
32+
return None
33+
except IOError:
34+
logger.error(f"Error: The file {image_path} could not be read.")
35+
return None
36+
# Construct the data URL
37+
return f"data:{mime_type};base64,{base64_encoded_data}"
38+
39+
40+
class ImageReaderConfig(RoleConfig):
41+
def _configure(self):
42+
pass
43+
44+
45+
class ImageReader(Role):
46+
@inject
47+
def __init__(
48+
self,
49+
config: ImageReaderConfig,
50+
logger: TelemetryLogger,
51+
tracing: Tracing,
52+
event_emitter: SessionEventEmitter,
53+
role_entry: RoleEntry,
54+
llm_api: LLMApi,
55+
session_metadata: SessionMetadata,
56+
):
57+
super().__init__(config, logger, tracing, event_emitter, role_entry)
58+
59+
self.llm_api = llm_api
60+
self.session_metadata = session_metadata
61+
62+
def reply(self, memory: Memory, **kwargs: ...) -> Post:
63+
rounds = memory.get_role_rounds(
64+
role=self.alias,
65+
include_failure_rounds=False,
66+
)
67+
68+
# obtain the query from the last round
69+
last_post = rounds[-1].post_list[-1]
70+
71+
post_proxy = self.event_emitter.create_post_proxy(self.alias)
72+
73+
post_proxy.update_send_to(last_post.send_from)
74+
75+
input_message = last_post.message
76+
prompt = (
77+
f"Input message: {input_message}.\n"
78+
"\n"
79+
"Your response should be a JSON object with the key 'image_url' and the value as the image path. "
80+
"For example, {'image_url': 'c:/images/image.jpg'} or {'image_url': 'http://example.com/image.jpg'}. "
81+
"Do not add any additional information in the response or wrap the JSON with ```json and ```."
82+
)
83+
84+
response = self.llm_api.chat_completion(
85+
messages=[
86+
format_chat_message(
87+
role="system",
88+
message="Your task is to read the image path from the message.",
89+
),
90+
format_chat_message(
91+
role="user",
92+
message=prompt,
93+
),
94+
],
95+
)
96+
97+
image_url = json.loads(response["content"])["image_url"]
98+
if image_url.startswith("http"):
99+
image_content = image_url
100+
attachment_message = f"Image from {image_url}."
101+
else:
102+
if os.path.isabs(image_url):
103+
image_content = local_image_to_data_url(image_url)
104+
else:
105+
image_content = local_image_to_data_url(os.path.join(self.session_metadata.execution_cwd, image_url))
106+
attachment_message = f"Image from {image_url} encoded as a Base64 data URL."
107+
108+
post_proxy.update_attachment(
109+
message=attachment_message,
110+
type=AttachmentType.image_url,
111+
extra={"image_url": image_content},
112+
is_end=True,
113+
)
114+
115+
post_proxy.update_message(
116+
"I have read the image path from the message. The image is attached below.",
117+
)
118+
119+
return post_proxy.end()
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
alias: ImageReader
2+
module: taskweaver.ext_role.image_reader.image_reader.ImageReader
3+
intro : |-
4+
- ImageReader is responsible for helping the Planner to read images.
5+
- The input message must contain the image path, either local or remote.

taskweaver/llm/util.py

+35-5
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,9 @@
11
from typing import Any, Dict, List, Literal, Optional, TypedDict, Union
22

33
ChatMessageRoleType = Literal["system", "user", "assistant", "function"]
4-
ChatMessageType = Dict[Literal["role", "name", "content"], str]
4+
ChatContentType = Dict[Literal["type", "text", "image_url"], str | Dict[Literal["url"], str]]
5+
ChatMessageType = Dict[Literal["role", "name", "content"], str | List[ChatContentType]]
6+
57
PromptTypeSimple = List[ChatMessageType]
68

79

@@ -21,15 +23,43 @@ class PromptTypeWithTools(TypedDict):
2123
tools: Optional[List[PromptToolType]]
2224

2325

26+
def format_chat_message_content(
27+
content_type: Literal["text", "image_url"],
28+
content_value: str,
29+
) -> ChatContentType:
30+
if content_type == "image_url":
31+
return {
32+
"type": content_type,
33+
content_type: {
34+
"url": content_value,
35+
},
36+
}
37+
else:
38+
return {
39+
"type": content_type,
40+
content_type: content_value,
41+
}
42+
43+
2444
def format_chat_message(
2545
role: ChatMessageRoleType,
2646
message: str,
47+
image_urls: Optional[List[str]] = None,
2748
name: Optional[str] = None,
2849
) -> ChatMessageType:
29-
msg: ChatMessageType = {
30-
"role": role,
31-
"content": message,
32-
}
50+
if not image_urls:
51+
msg: ChatMessageType = {
52+
"role": role,
53+
"content": message,
54+
}
55+
else:
56+
msg: ChatMessageType = {
57+
"role": role,
58+
"content": [
59+
format_chat_message_content("text", message),
60+
]
61+
+ [format_chat_message_content("image_url", image) for image in image_urls],
62+
}
3363
if name is not None:
3464
msg["name"] = name
3565
return msg

taskweaver/memory/attachment.py

+3
Original file line numberDiff line numberDiff line change
@@ -47,6 +47,9 @@ class AttachmentType(Enum):
4747
# shared memory entry
4848
shared_memory_entry = "shared_memory_entry"
4949

50+
# vision input
51+
image_url = "image_url"
52+
5053

5154
@dataclass
5255
class Attachment:

taskweaver/memory/post.py

+2-2
Original file line numberDiff line numberDiff line change
@@ -87,9 +87,9 @@ def add_attachment(self, attachment: Attachment) -> None:
8787
"""Add an attachment to the post."""
8888
self.attachment_list.append(attachment)
8989

90-
def get_attachment(self, type: AttachmentType) -> List[Any]:
90+
def get_attachment(self, type: AttachmentType) -> List[Attachment]:
9191
"""Get all the attachments of the given type."""
92-
return [attachment.content for attachment in self.attachment_list if attachment.type == type]
92+
return [attachment for attachment in self.attachment_list if attachment.type == type]
9393

9494
def del_attachment(self, type_list: List[AttachmentType]) -> None:
9595
"""Delete all the attachments of the given type."""

taskweaver/planner/planner.py

+22-23
Original file line numberDiff line numberDiff line change
@@ -133,6 +133,7 @@ def compose_conversation_for_prompt(
133133
for post in chat_round.post_list:
134134
if post.send_from == self.alias:
135135
if post.send_to == "User" or post.send_to in self.recipient_alias_set:
136+
# planner responses
136137
planner_message = self.planner_post_translator.post_to_raw_text(
137138
post=post,
138139
)
@@ -144,47 +145,45 @@ def compose_conversation_for_prompt(
144145
)
145146
elif post.send_to == self.alias:
146147
# self correction for planner response, e.g., format error/field check error
148+
# append the invalid response to chat history
147149
conversation.append(
148150
format_chat_message(
149151
role="assistant",
150152
message=post.get_attachment(
151153
type=AttachmentType.invalid_response,
152-
)[0],
154+
)[0].content,
153155
),
154156
)
155157

156-
# append the invalid response to chat history
158+
# append the self correction instruction message to chat history
157159
conversation.append(
158160
format_chat_message(
159161
role="user",
160162
message=self.format_message(
161163
role="User",
162-
message=post.get_attachment(type=AttachmentType.revise_message)[0],
164+
message=post.get_attachment(type=AttachmentType.revise_message)[0].content,
163165
),
164166
),
165167
)
166-
# append the self correction instruction message to chat history
167-
168168
else:
169-
if conv_init_message is not None:
170-
message = self.format_message(
171-
role=post.send_from,
172-
message=conv_init_message + "\n" + post.message,
173-
)
174-
conversation.append(
175-
format_chat_message(role="user", message=message),
176-
)
177-
conv_init_message = None
178-
else:
179-
conversation.append(
180-
format_chat_message(
181-
role="user",
182-
message=self.format_message(
183-
role=post.send_from,
184-
message=post.message,
185-
),
169+
# messages from user or workers
170+
conversation.append(
171+
format_chat_message(
172+
role="user",
173+
message=self.format_message(
174+
role=post.send_from,
175+
message=post.message
176+
if conv_init_message is None
177+
else conv_init_message + "\n" + post.message,
186178
),
187-
)
179+
image_urls=[
180+
attachment.extra["image_url"]
181+
for attachment in post.get_attachment(type=AttachmentType.image_url)
182+
],
183+
),
184+
)
185+
186+
conv_init_message = None
188187

189188
return conversation
190189

website/blog/authors.yml

+11
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
liqli:
2+
name: Liqun Li
3+
url: https://liqul.github.io
4+
title: Principal Researcher
5+
image_url: https://liqul.github.io/assets/logo_small_bw.png
6+
7+
xu:
8+
name: Xu Zhang
9+
url: https://scholar.google.com/citations?user=bqXdMMMAAAAJ&hl=zh-CN
10+
title: Senior Researcher
11+
image_url: https://scholar.googleusercontent.com/citations?view_op=view_photo&user=bqXdMMMAAAAJ&citpid=3

website/blog/evaluation.md

+5-1
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,8 @@
1-
# How to evaluate a LLM agent?
1+
---
2+
title: How to evaluate a LLM agent?
3+
authors: [liqli, xu]
4+
date: 2024-05-07
5+
---
26

37
## The challenges
48
It is nontrivial to evaluate the performance of a LLM agent.

website/blog/experience.md

+5-1
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,8 @@
1-
# Experience selection
1+
---
2+
title: Experience Selection in TaskWeaver
3+
authors: liqli
4+
date: 2024-09-14
5+
---
26

37
We have introduced the motivation of the `experience` module in [Experience](/docs/customization/experience)
48
and how to create a handcrafted experience in [Handcrafted Experience](/docs/customization/experience/handcrafted_experience).

0 commit comments

Comments
 (0)