[Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model #9120

Open
wants to merge 18 commits into
base: dev

Conversation

SEZ9
Contributor

@SEZ9 SEZ9 commented Apr 7, 2025

Purpose of this pull request

Does this PR introduce any user-facing change?

Description
Add support for Amazon Titan models to the embedding model_provider configuration.
Implement batch inference in the embedding process, so that multiple rows are sent to the model API in a single request.
Detect whether a batch request succeeded and handle failures (fault tolerance).
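
The batching and fault-tolerance behavior described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the class `BatchEmbeddingSketch`, the methods `partition` and `embedAll`, and the strategy of retrying a failed batch one text at a time are all assumptions made for illustration. The `embedBatch` function stands in for whatever call reaches the model API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class BatchEmbeddingSketch {

    /** Splits the inputs into consecutive batches of at most batchSize items. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Embeds texts batch by batch (hypothetical helper). A batch counts as
     * successful only if the call returns one vector per input text. On an
     * exception or a size mismatch, the batch is retried one text at a time;
     * a failure during that retry is propagated to the caller.
     */
    public static List<float[]> embedAll(
            List<String> texts,
            int batchSize,
            Function<List<String>, List<float[]>> embedBatch) {
        List<float[]> vectors = new ArrayList<>();
        for (List<String> batch : partition(texts, batchSize)) {
            List<float[]> result = null;
            try {
                result = embedBatch.apply(batch);
            } catch (RuntimeException e) {
                // Treat the batch as failed and fall back to single requests.
            }
            if (result != null && result.size() == batch.size()) {
                vectors.addAll(result);
            } else {
                for (String text : batch) {
                    vectors.addAll(embedBatch.apply(Collections.singletonList(text)));
                }
            }
        }
        return vectors;
    }
}
```

Retrying per text keeps the output aligned with the input order, at the cost of extra requests whenever a batch fails.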
Usage Scenario
For large-scale vectorization, users need to embed text data efficiently and at low cost, then store the resulting vectors in a vector database. For example:

User review analysis: millions or tens of millions of rows need to be sent for vectorization in a single job.
Image search: users often vectorize hundreds of thousands or millions of images and load them into the database for later approximate vector retrieval.

How was this patch tested?

Check list

@hailin0 hailin0 requested a review from Copilot April 7, 2025 14:29

@Copilot Copilot AI left a comment

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Files not reviewed (1)
  • seatunnel-transforms-v2/pom.xml: Language not supported

@hailin0
Member

hailin0 commented Apr 7, 2025

Member

@hailin0 hailin0 left a comment

@SEZ9 SEZ9 changed the title from "Feature][Transform] Support batch mode vectorization using Amazon Titan & cohere embedding mode" to "[Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model" Apr 7, 2025
@SEZ9
Contributor Author

SEZ9 commented Apr 7, 2025

Updated the docs in both English and Chinese.

@corgy-w
Contributor

corgy-w commented Apr 8, 2025

Are the Amazon e2e tests missing?

@corgy-w
Contributor

corgy-w commented Apr 8, 2025

Please update the EmbeddingTransformFactory config.

@SEZ9
Contributor Author

SEZ9 commented Apr 8, 2025

Updated EmbeddingTransformFactory and added the Amazon model config.

@github-actions github-actions bot added the e2e label Apr 8, 2025
@SEZ9
Contributor Author

SEZ9 commented Apr 8, 2025

Updated the Amazon e2e tests in embedding_transform.conf.

Comment on lines +51 to +58
.conditional(
EmbeddingTransformConfig.MODEL_PROVIDER,
ModelProvider.AMAZON,
EmbeddingTransformConfig.API_KEY,
EmbeddingTransformConfig.SECRET_KEY,
EmbeddingTransformConfig.AWS_REGION,
EmbeddingTransformConfig.MODEL,
EmbeddingTransformConfig.DIMENSION)
Contributor

Is region not here

Contributor Author

AWS region is a required parameter when calling the Amazon model.
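
To illustrate what a conditional rule of this kind enforces, here is a minimal sketch that checks a plain key/value config. It does not use SeaTunnel's option-rule API. The class `ConditionalOptionsSketch`, the method `missingOptions`, and every key name other than `model_provider` are placeholders that loosely mirror the constants in the snippet above; the real option keys may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ConditionalOptionsSketch {

    // Placeholder key names (assumptions); the real option keys may differ.
    static final List<String> AMAZON_REQUIRED =
            Arrays.asList("api_key", "secret_key", "region", "model", "dimension");

    /**
     * Returns the required options that are absent or blank.
     * The rule only applies when model_provider is AMAZON.
     */
    public static List<String> missingOptions(Map<String, String> config) {
        List<String> missing = new ArrayList<>();
        if (!"AMAZON".equalsIgnoreCase(config.get("model_provider"))) {
            return missing; // rule not triggered for other providers
        }
        for (String key : AMAZON_REQUIRED) {
            String value = config.get(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

With a rule like this, a job that selects the Amazon provider but omits the region can be rejected at validation time instead of failing on the first API call.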

@SEZ9
Contributor Author

SEZ9 commented Apr 10, 2025

Hi @hailin0 @corgy-w . Is there anything else I need to do before this PR can be merged?

@hailin0
Member

hailin0 commented Apr 11, 2025

You need to fix the CI error.
image

@corgy-w
Contributor

corgy-w commented Apr 15, 2025

image
@SEZ9 Hi, you need to check what is failing in the CI. Sometimes the error is obvious, and you can fix it by following its guidance.

@nielifeng nielifeng requested a review from Copilot April 17, 2025 01:27

@Copilot Copilot AI left a comment

Pull Request Overview

This pull request adds support for single and batch mode vectorization using Amazon Titan and Cohere embedding models. It introduces new test cases for request JSON generation, implements the BedrockModel for Amazon Bedrock integration, and updates configuration and documentation to support the new model provider.
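
For readers unfamiliar with the two request shapes, the sketch below builds minimal JSON bodies for each model family. It is an illustration under assumptions, not the PR's `BedrockModel`: the class and method names are hypothetical, the field names (`inputText` for Titan, `texts` and `input_type` for Cohere) follow the Bedrock request formats as I understand them, and the `search_document` value is just one plausible choice. Titan takes one text per request, while Cohere accepts a list, which is what makes batching possible there.

```java
import java.util.List;
import java.util.stream.Collectors;

public class BedrockRequestJsonSketch {

    /** Wraps a string in double quotes and escapes it for use as a JSON string. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become 4-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Titan-style body: a single text per request (assumed field name). */
    public static String titanBody(String text) {
        return "{\"inputText\":" + quote(text) + "}";
    }

    /** Cohere-style body: a list of texts per request (assumed field names). */
    public static String cohereBody(List<String> texts) {
        String joined = texts.stream()
                .map(BedrockRequestJsonSketch::quote)
                .collect(Collectors.joining(","));
        return "{\"texts\":[" + joined + "],\"input_type\":\"search_document\"}";
    }
}
```

In real code a JSON library would normally handle the escaping; the hand-written `quote` helper only keeps the example dependency-free.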

Reviewed Changes

Copilot reviewed 8 out of 10 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/embedding/EmbeddingRequestJsonTest.java | Adds tests for BedrockModel request JSON generation for both Titan and Cohere. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/remote/amazon/BedrockModel.java | Implements Amazon Bedrock embedding model support with handling for Titan and Cohere. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/EmbeddingTransformFactory.java | Extends option rules to include Amazon Bedrock options. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/EmbeddingTransform.java | Updates the embedding transform to initialize BedrockModel for the AMAZON provider. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelTransformConfig.java | Adds AWS region configuration option for Amazon Bedrock. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelProvider.java | Adds the AMAZON provider with corresponding metadata. |
| docs/zh/transform-v2/embedding.md | Updates documentation to include Amazon Bedrock and region configuration. |
| docs/en/transform-v2/embedding.md | Updates documentation to include Amazon Bedrock and region configuration. |
Files not reviewed (2)
  • seatunnel-e2e/seatunnel-transforms-v2-e2e/seatunnel-transforms-v2-e2e-part-1/src/test/resources/embedding_transform.conf: Language not supported
  • seatunnel-transforms-v2/pom.xml: Language not supported

3 participants