Add image embedding support to ml/rag by claudevdm · Pull Request #37628 · apache/beam

claudevdm · 2026-02-17T21:29:27Z

Please add a meaningful description for your change here

gemini-code-assist · 2026-02-17T21:29:54Z

Summary of Changes

Hello @claudevdm, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds image embedding support to Apache Beam's RAG components. It introduces the infrastructure needed to embed image content with HuggingFace and Vertex AI models, alongside the existing text embedding functionality. The changes cover the handling of multimodal data across the RAG pipeline, from data representation to ingestion into various vector databases.

Highlights

- Image Embedding Support: Introduced new classes and functionalities to support image embeddings for both HuggingFace and Vertex AI models within the RAG pipeline. This allows users to generate embeddings from image content, expanding the multimodal capabilities of Apache Beam's ML RAG components.
- Enhanced EmbeddableItem: The EmbeddableItem data structure has been updated to include an optional image field in its Content attribute. A new from_image factory method simplifies the creation of image-based EmbeddableItem instances, and a content_string property was added to provide a unified string representation for ingestion, prioritizing text over image URI.
- Ingestion Pipeline Updates: BigQuery, MySQL, PostgreSQL, and Spanner ingestion modules were modified to leverage the new content_string property of EmbeddableItem. This ensures that these pipelines can correctly process and store both text and image URI content, adapting to the expanded EmbeddableItem definition.
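
For readers who want a concrete picture of the data model described in the highlights, here is a minimal standalone sketch. It is not the code from this pull request: the classes below are simplified stand-ins that only mirror the names mentioned in the summary (Content, EmbeddableItem, from_image, content_string). The id and metadata fields, and the choice to return None for raw image bytes, are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass
class Content:
    """Stand-in for the Content dataclass; `image` is bytes or a path/URI."""
    text: Optional[str] = None
    image: Optional[Union[bytes, str]] = None


@dataclass
class EmbeddableItem:
    """Stand-in for EmbeddableItem; `id` and `metadata` are assumed fields."""
    content: Content
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_image(
        cls,
        image: Union[bytes, str],
        *,
        id: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> "EmbeddableItem":
        # Build an item whose only content is the image.
        return cls(
            content=Content(image=image),
            id=id if id is not None else str(uuid.uuid4()),
            metadata=dict(metadata or {}),
        )

    @property
    def content_string(self) -> Optional[str]:
        # Text takes priority over an image URI, as the summary describes.
        if self.content.text is not None:
            return self.content.text
        if isinstance(self.content.image, str):
            return self.content.image
        # Assumption: raw bytes have no storable string form.
        return None


uri_item = EmbeddableItem.from_image("gs://bucket/cat.png", id="img-1")
both_item = EmbeddableItem(content=Content(text="a cat", image="gs://bucket/cat.png"))
bytes_item = EmbeddableItem.from_image(b"\x89PNG\r\n\x1a\n")
```

With a property like this, an ingestion function can write `item.content_string` into a content column without checking which modality the item carries.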

Changelog

sdks/python/apache_beam/ml/rag/embeddings/base_test.py
- Imported EmbeddableItem for testing purposes.
- Added ImageEmbeddableItemTest to validate the EmbeddableItem.from_image factory method.
- Added ContentStringTest to verify the functionality of the new content_string property.
sdks/python/apache_beam/ml/rag/embeddings/huggingface.py
- Added imports for io, Sequence, _add_embedding_fn, EmbeddingTypeAdapter, _ImageEmbeddingHandler, and PIL.Image.
- Updated the docstring for HuggingfaceTextEmbeddings to provide more detailed argument descriptions.
- Implemented _extract_images to convert EmbeddableItem image content into PIL Image objects.
- Created _create_hf_image_adapter to generate an EmbeddingTypeAdapter specifically for HuggingFace image embeddings.
- Introduced HuggingfaceImageEmbeddings class to manage image embedding using HuggingFace models.
sdks/python/apache_beam/ml/rag/embeddings/huggingface_test.py
- Added imports for os, HuggingfaceImageEmbeddings, _create_hf_image_adapter, EmbeddableItem, and PIL.Image.
- Included a PIL_AVAILABLE check to conditionally run tests based on Pillow library availability.
- Added HuggingfaceImageAdapterTest to test the functionality of the HuggingFace image adapter.
- Implemented HuggingfaceImageEmbeddingsTest to verify the end-to-end image embedding pipeline for HuggingFace models, including tests for both image bytes and file paths.
- Added a helper function _create_png_bytes to generate sample PNG image data for tests.
sdks/python/apache_beam/ml/rag/embeddings/vertex_ai.py
- Added imports for Sequence, _add_embedding_fn, EmbeddingTypeAdapter, _ImageEmbeddingHandler, Chunk, _VertexAIImageEmbeddingHandler, and vertexai.vision_models.Image.
- Modified the get_ptransform_for_processing method in VertexAITextEmbeddings to return beam.PCollection[Chunk].
- Implemented _extract_images to convert EmbeddableItem image content into Vertex AI Image objects.
- Created _create_image_adapter to generate an EmbeddingTypeAdapter for Vertex AI image embeddings.
- Introduced VertexAIImageEmbeddings class to manage image embedding using Vertex AI models.
sdks/python/apache_beam/ml/rag/embeddings/vertex_ai_test.py
- Added imports for os, EmbeddableItem, VertexAIImageEmbeddings, and _create_image_adapter.
- Implemented VertexAIImageAdapterTest to test the functionality of the Vertex AI image adapter.
- Added VertexAIImageEmbeddingsTest to verify the end-to-end image embedding pipeline for Vertex AI models, including tests for both image bytes and file paths.
- Included a static helper method _create_png_bytes to generate sample PNG image data for tests.
sdks/python/apache_beam/ml/rag/ingestion/bigquery.py
- Updated example embeddable_to_dict_fn to use item instead of chunk for consistency.
- Modified _default_embeddable_to_dict_fn to utilize item.content_string and adjusted the ValueError message.
- Updated the lambda variable name in the expand method for improved readability.
sdks/python/apache_beam/ml/rag/ingestion/mysql_common.py
- Modified the value_fn within with_content_spec to use chunk.content_string and removed the explicit ValueError check for content.text.
sdks/python/apache_beam/ml/rag/ingestion/postgres_common.py
- Modified the value_fn within with_content_spec to use chunk.content_string and removed the explicit ValueError check for content.text.
sdks/python/apache_beam/ml/rag/ingestion/spanner.py
- Modified the extract_fn within with_content_spec to use embeddable.content_string and updated its return type to Optional[str], removing the ValueError check.
sdks/python/apache_beam/ml/rag/types.py
- Imported Union for type hinting flexibility.
- Added an image: Optional[Union[bytes, str]] = None field to the Content dataclass to support image content.
- Introduced a from_image class method to EmbeddableItem for convenient creation of items with image content.
- Added a content_string property to EmbeddableItem that returns a storable string representation, prioritizing text content or image URI.
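
The changelog mentions `_extract_images` helpers that turn image content (bytes or a file path) into framework image objects. The sketch below illustrates only the normalization step, under stated assumptions: `load_image_bytes` is a hypothetical helper name, it uses plain `open()` and so handles local paths only, and it is not the implementation from this pull request. A HuggingFace-style adapter could then pass the result to `PIL.Image.open(io.BytesIO(data))`.

```python
import os
import tempfile
from typing import Union


def load_image_bytes(image: Union[bytes, str]) -> bytes:
    """Return raw image bytes from either in-memory bytes or a local path.

    Hypothetical helper for illustration. Remote URIs such as gs:// are not
    handled here; a Beam pipeline would need a filesystem-aware open for those.
    """
    if isinstance(image, bytes):
        # Already loaded: pass through unchanged.
        return image
    if isinstance(image, str):
        # Treat the string as a local file path.
        with open(image, "rb") as f:
            return f.read()
    raise TypeError(f"Expected bytes or str, got {type(image).__name__}")


# Usage: both input forms yield the same bytes.
png_signature = b"\x89PNG\r\n\x1a\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.png")
    with open(sample_path, "wb") as out:
        out.write(png_signature)
    from_path = load_image_bytes(sample_path)
from_bytes = load_image_bytes(png_signature)
```

Keeping the bytes-or-path branching in one place lets the model-specific adapters differ only in how they wrap the bytes.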

Activity

The pull request was opened by claudevdm with the title 'image embeddings'.
The initial description is a placeholder, indicating that a more meaningful description is needed from the author.
No further activity, comments, or reviews have been recorded since the pull request's creation.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double-check it and use code with caution.

claudevdm · 2026-02-17T21:35:09Z

/gemini review

gemini-code-assist

Code Review

The pull request successfully introduces support for image embeddings in the RAG pipeline, covering both HuggingFace and Vertex AI implementations. It correctly updates the core types and ingestion logic to handle multimodal content. The main areas for improvement are ensuring robust handling of GCS URIs when loading images and minor cleanup of error messages and docstrings.

sdks/python/apache_beam/ml/rag/embeddings/huggingface.py

sdks/python/apache_beam/ml/rag/embeddings/vertex_ai.py

sdks/python/apache_beam/ml/rag/types.py

claudevdm · 2026-02-18T14:08:18Z

R: @damccorm

github-actions · 2026-02-18T14:09:29Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment `assign set of reviewers`.

damccorm

This generally LGTM, just had one question

sdks/python/apache_beam/ml/rag/embeddings/huggingface.py

damccorm

LGTM once checks complete

Abacn · 2026-02-20T17:01:46Z

This is likely breaking the XVR tests (#30601, #30602, #31418): the Beam 2.72.0 branch is good, and this is the only commit at the first breakage. See https://github.com/apache/beam/actions/workflows/beam_PostCommit_XVR_Flink.yml?query=

claudevdm · 2026-02-20T17:35:33Z

Responded on #30602

tvalentyn · 2026-02-20T18:31:32Z

sdks/python/setup.py

          'numpy>=1.14.3,<2.5.0',  # Update pyproject.toml as well.
          'objsize>=0.6.1,<0.8.0',
          'packaging>=22.0',
+          'pillow',


I would suggest using an upper bound unless we don't expect breaking changes that can affect us in the future, see:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=290982440#DependencymanagementguidelinesforBeamPythonSDKmaintainers-Howtoaddanewdependency?
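
To make the suggestion concrete, here is a small hedged sketch of a check that flags requirement strings with no upper bound. `has_upper_bound` is a hypothetical helper written for this illustration, not part of Beam's tooling, and the bounded pillow line in the usage example uses made-up version numbers, not the bounds chosen for Beam.

```python
import re


def has_upper_bound(requirement: str) -> bool:
    """Return True if a requirement string caps or pins its version.

    Simplified parser for illustration: it treats '<', '<=', '==' and '~='
    clauses as bounding the version from above.
    """
    # Drop environment markers such as "; python_version < '3.12'".
    spec = requirement.split(";", 1)[0]
    # Split off the package name and optional extras, e.g. "pkg[extra]".
    match = re.match(r"\s*[A-Za-z0-9._-]+(\[[^\]]*\])?(.*)", spec)
    if match is None:
        return False
    clauses = [c.strip() for c in match.group(2).split(",") if c.strip()]
    return any(c.startswith(("<", "==", "~=")) for c in clauses)


requirements = [
    "numpy>=1.14.3,<2.5.0",
    "objsize>=0.6.1,<0.8.0",
    "packaging>=22.0",
    "pillow",
]
unbounded = [r for r in requirements if not has_upper_bound(r)]

# Hypothetical bounded form; the version numbers are placeholders.
bounded_example = has_upper_bound("pillow>=1.0,<2.0")
```

Run against the lines quoted in the diff, this flags `pillow` (and `packaging>=22.0`, which also has no cap) as open-ended.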

image embeddings.

bc0af75

github-actions bot added the python label Feb 17, 2026

gemini-code-assist bot reviewed Feb 17, 2026

claudevdm added 2 commits February 18, 2026 07:40

comments.

76fd1aa

lint.

19484bf

claudevdm marked this pull request as ready for review February 18, 2026 14:08

claudevdm changed the title from "image embeddings." to "Add image embedding support to ml/rag" Feb 18, 2026

damccorm reviewed Feb 18, 2026

sdks/python/apache_beam/ml/rag/embeddings/huggingface.py

claudevdm added 2 commits February 18, 2026 11:44

Add pillow to default requires.

20074c2

update images

b1b0100

github-actions bot added the docker label Feb 18, 2026

damccorm approved these changes Feb 18, 2026

Claude added 2 commits February 18, 2026 18:53

lint

88decb2

mypy

a2312af

claudevdm merged commit cecc2a6 into apache:master Feb 18, 2026
108 of 109 checks passed

Amar3tto mentioned this pull request Feb 19, 2026

The PostCommit YAML Xlang Direct job is flaky #35198

Open

claudevdm mentioned this pull request Feb 20, 2026

The PostCommit XVR Spark3 job is flaky #30602

Open

tvalentyn reviewed Feb 20, 2026

4 participants