Agents with Vision or returning ImageContent from KernelFunction #11145

floboc · 2025-03-24T07:55:15Z


Hello there, I am working on an agentic workflow in which I would like to let agents access and process image content when the planner "thinks" it is useful for the task.
For this task, I am using multi-modal models with vision capabilities.

So the idea would be to have some KernelFunction in a plugin returning an ImageContent, for instance:

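using System.ComponentModel;   // DescriptionAttribute
using Microsoft.SemanticKernel; // KernelFunction, ImageContent

public class MyPlugin
{
  // Exposed to the model as the "render_preview" tool.
  [KernelFunction("render_preview")]
  [Description("Returns a preview image.")]
  public ImageContent RenderPreview()
  {
    // do some rendering here
    var imageContent = new ImageContent(...);
    return imageContent;
  }
}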

Then I would register that plugin on a specific agent.
The hope here is that the planner would use the RenderPreview() method to get some image that would then be used to plan the next operations or better perform the task.

That doesn't seem to work, however. It looks like a returned ImageContent is not handled as an image input, and is instead just serialized like any other custom data type a kernel function could return.

Is there a way to make this work, i.e. to build agents that can properly use vision models inside the planning workflow?
Thanks!

Answered by RogerBarreto

May 8, 2025


sophialagerkranspandey · 2025-05-05T16:46:18Z


Tagging @RogerBarreto

0 replies

RogerBarreto · 2025-05-08T15:09:55Z

Collaborator

@floboc That's a very interesting question, thanks for raising it. Currently, when a plugin is invoked by the AI model through the function calling pattern, the result has to go back to the model as a message with message.role=tool, and there is no option to mark that function result as multimodal content in a way the AI model will recognize as image/audio.

But it is possible to be a bit creative here with Semantic Kernel: you can inject the generated image into the chat history every time you have one in your kernel context.

Here's how I would do it:

1. Inject the Kernel into your RenderPreview(Kernel kernel) function.
2. Once the imageContent is created, use the kernel.Data bag to store the images you want to send to the model; it is a dictionary where you can add anything.
3. Use a ChatCompletionWrapper/Adapter (a class that accepts another existing ChatCompletionService) and, every time you call the model, check whether the kernel argument has images in the Data bag.
4. With the image in your "hands", you can now inject it into the ChatHistory argument by adding it to the last user message: chatHistory.Last(m => m.Role == AuthorRole.User).Items.Add(imageContent).

A somewhat similar wrapper, built around a ChatHistoryReducer, is available here:
- https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/ChatCompletion/ChatHistoryReducers/ChatCompletionServiceWithReducer.cs
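
The steps above could be sketched roughly as follows. This is only an illustration, not an official Semantic Kernel pattern. The names ImageInjectingChatCompletionService and PendingImagesKey, and the Func<byte[]> renderer passed to the plugin, are hypothetical and introduced just for this example. It assumes the Kernel instance that invokes the plugin is the same one passed to the chat completion service, and that the wrapper is actually called before each model request. With automatic function invocation the connector may loop internally, in which case the image would only be attached on the next outer call, so that part is worth verifying in your setup.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public sealed class MyPlugin
{
    // Hypothetical key: any string works as long as plugin and wrapper agree on it.
    public const string PendingImagesKey = "pending_images";

    private readonly Func<byte[]> _renderPng; // hypothetical renderer returning PNG bytes

    public MyPlugin(Func<byte[]> renderPng) => _renderPng = renderPng;

    [KernelFunction("render_preview")]
    [Description("Renders a preview image and attaches it to the conversation.")]
    public string RenderPreview(Kernel kernel) // step 1: the Kernel is injected
    {
        var imageContent = new ImageContent(_renderPng(), "image/png");

        // Step 2: queue the image in the kernel.Data bag.
        if (kernel.Data.TryGetValue(PendingImagesKey, out var value)
            && value is List<ImageContent> pending)
        {
            pending.Add(imageContent);
        }
        else
        {
            kernel.Data[PendingImagesKey] = new List<ImageContent> { imageContent };
        }

        // The tool result itself stays plain text.
        return "Preview rendered; the image is attached to the conversation.";
    }
}

// Step 3: a wrapper that delegates to an existing chat completion service.
public sealed class ImageInjectingChatCompletionService : IChatCompletionService
{
    private readonly IChatCompletionService _inner;

    public ImageInjectingChatCompletionService(IChatCompletionService inner) => _inner = inner;

    public IReadOnlyDictionary<string, object?> Attributes => _inner.Attributes;

    public Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        InjectPendingImages(chatHistory, kernel);
        return _inner.GetChatMessageContentsAsync(chatHistory, executionSettings, kernel, cancellationToken);
    }

    public IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        InjectPendingImages(chatHistory, kernel);
        return _inner.GetStreamingChatMessageContentsAsync(chatHistory, executionSettings, kernel, cancellationToken);
    }

    // Step 4: move queued images onto the last user message, then clear the queue.
    private static void InjectPendingImages(ChatHistory chatHistory, Kernel? kernel)
    {
        if (kernel is null
            || !kernel.Data.TryGetValue(MyPlugin.PendingImagesKey, out var value)
            || value is not List<ImageContent> images
            || images.Count == 0)
        {
            return;
        }

        var lastUserMessage = chatHistory.LastOrDefault(m => m.Role == AuthorRole.User);
        if (lastUserMessage is null)
        {
            return; // nothing to attach to; keep the images queued
        }

        foreach (var image in images)
        {
            lastUserMessage.Items.Add(image);
        }
        images.Clear();
    }
}
```

To try it, wrap whichever IChatCompletionService you already construct and use the wrapper in its place. Clearing the queue after injection keeps the same image from being attached twice on later calls.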

2 replies

floboc May 20, 2025
Author

Yes, this is actually very similar to what I did to work around this limitation. It did the trick for now, but I think we will need something more reliable / official for this kind of use case in the future. Thanks for your reply!

mbenson01 Jun 3, 2025

Hi, I am having the same problem with the ChatCompletionAgent. I would like to return an image from a kernel function and have it processed as ImageContent by the LLM. I am using the ChatCompletionAgent with the InvokeAsync method, which does not have an interface, so the ChatCompletionWrapper suggestion above does not work for me (the example was for the ChatCompletionService, not the ChatCompletionAgent).

@RogerBarreto, do you have any suggestions to get this working with the ChatCompletionAgent?

@floboc, could you elaborate on your solution, if possible?

I second @floboc’s point… SK should add support for this use case so that we don't need these hacky workarounds.

Thanks!
