Skip to content

Add initial support for images in the ai chat #15410

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

eneufeld
Copy link
Contributor

@eneufeld eneufeld commented Apr 6, 2025

What it does

Initial implementation of #15407

How to test

Copy an image into the clipboard. Paste it using CTRL+V.

Follow-ups

  • Untested and probably missing support for Drag&Drop (2.2)
  • Missing support for Upload via Button (2.3)
  • Missing support to select from the workspace (2.4)
  • Missing feedback if selected LLM does not support image data (6)

As this is the first introduction of non text data to the request, it must be decided where and how this kind of data should be stored. All adaptations on the communication from input widget to the llm must be reviewed and discussed also in regard to more kind of input data (e.g. audio).

Breaking changes

  • This PR introduces breaking changes and requires careful review. If yes, the breaking changes section in the changelog has been updated.

Attribution

Review checklist

Reminder for reviewers

@github-project-automation github-project-automation bot moved this to Waiting on reviewers in PR Backlog Apr 6, 2025
Copy link
Contributor

@planger planger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Thank you!

I'm wondering whether we could integrate the handling of images with the existing context concept and extend it with images, rather than doing it alongside it? Conceptually it would feel like a good fit, alongside other context, like files.

@eneufeld
Copy link
Contributor Author

eneufeld commented Apr 7, 2025

yes I thought about that. Basically we need to decide how we want to handle non message data, eg images and audio. as part of the request or just as "additional information" and thus part of the context?

@JonasHelming
Copy link
Contributor

This is awesome! Thank you!

I'm wondering whether we could integrate the handling of images with the existing context concept and extend it with images, rather than doing it alongside it? Conceptually it would feel like a good fit, alongside other context, like files.

Maybe even both should be possible?

Copy link
Contributor

@planger planger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this PR a bit more and I think it is a great feature that I really would like to get in!

Besides cleaning up the change (some unnecessary comments, out-commented code and formatting changes), I'm mostly thinking about how to best represent images in the chat session and in the UI. Currently it is added orthogonal to all available infrastructure we have.

I'm wondering whether we couldn't just reuse the file context variable or create an image context variable adjacent to it. This would give us the UI more or less for free:

  • The drag'n'drop support to add images
  • The + button support to add images
  • Show images just like files in the chat input if they have been attached with the existing context variable UI

A minor disadvantage would be that we wouldn't show a preview of the image in the chat input. But I'm not sure this is a big deal.

We could still handle the images in the generic chat agent code (like now in this PR) but just filter out all image context variables from the context instead of reading the new request.images. I like this as it is customizable by concrete chat agents. It might be worth though to give the getMessages method access to the language model that is being used and add a property to it whether it even supports images. If it doesn't, we could avoid creating the image messages and show a warning to the user?

What do you think?

Great work with this PR, in any case!!! Looking forward to having this available in Theia AI!

@eneufeld
Copy link
Contributor Author

So my thoughts about the context and why I didn't add it directly to it:

  • In my understanding the context is used to provide information to resolve variables/placeholders of the text.
  • An image and in the future audio input is in my opinion an integral part of the user request. It can even be the only thing you send to the llm, eg if the image contains a screenshot of your console. Doing so by using the context sounds weird to me.
  • On the other hand it is nice to reuse all the other features 🤷

Yes I agree language models need a possibility to provide a feature list so that we can check what is supported.

@planger
Copy link
Contributor

planger commented Apr 15, 2025

@eneufeld Thanks for the feedback! That makes sense. The way we use context so far is indeed a bit different in semantics, especially when thinking about audio. For images, though, I feel the advantages of reusing the existing infrastructure outweigh the semantic mismatch.

For audio, I agree—it’s semantically equivalent to the user input, like a spoken prompt. But for images and files, they usually augment the input. Sure, it may also make sense to send an image without a user prompt (spoken or written). Actually, this may actually also be true for any file, e.g. when I copy a log file and want to send it to the LLM without writing a prompt. I don't see it as part of this PR, but we may want to consider enabling the send button if there is at least one context element (image or file).

All in all, going the context-element route seems reasonable to me—especially as it simplifies integrating the UI behaviors (e.g. Drag&Drop, + button, consistent view in the chat input). But I’m totally open to your thoughts here and happy to go with whatever direction you feel is more robust long term.

Thank you!

@eneufeld
Copy link
Contributor Author

I totally see the advantages of the context. We should just define a consistent logic for us what goes into the context (and thus what it is) and what stays outside of it.

@sdirix any ideas here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Waiting on reviewers
Development

Successfully merging this pull request may close these issues.

None yet

3 participants