[Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model #9120
base: dev
init bedrock model files
init parameters and configuration
test complete
Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.
Files not reviewed (1)
- seatunnel-transforms-v2/pom.xml: Language not supported
Updated both docs.
Are the Amazon e2e tests missing?
Please update
Updated EmbeddingTransformFactory and added the Amazon model config.
Updated the Amazon e2e tests in
.conditional(
        EmbeddingTransformConfig.MODEL_PROVIDER,
        ModelProvider.AMAZON,
        EmbeddingTransformConfig.API_KEY,
        EmbeddingTransformConfig.SECRET_KEY,
        EmbeddingTransformConfig.AWS_REGION,
        EmbeddingTransformConfig.MODEL,
        EmbeddingTransformConfig.DIMENSION)
Is region not here?
AWS region is a required parameter when calling the Amazon model.
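The conditional rule quoted above makes a group of options mandatory only when the provider is AMAZON. A minimal sketch of that idea in plain Java follows. It is not the SeaTunnel option-rule implementation: the class `ConditionalRequired`, the method `missingKeys`, and the option key strings are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics a "required when provider is AMAZON" rule. */
final class ConditionalRequired {
    // Hypothetical option keys; the real keys live in EmbeddingTransformConfig.
    static final List<String> AMAZON_REQUIRED =
            Arrays.asList("api_key", "secret_key", "region", "model", "dimension");

    /** Returns the required keys that are absent or blank in the given config. */
    static List<String> missingKeys(Map<String, String> config) {
        List<String> missing = new ArrayList<>();
        // The rule applies only when the provider is AMAZON.
        if (!"AMAZON".equalsIgnoreCase(config.get("model_provider"))) {
            return missing;
        }
        for (String key : AMAZON_REQUIRED) {
            String value = config.get(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("model_provider", "AMAZON");
        config.put("api_key", "example-access-key");
        config.put("secret_key", "example-secret-key");
        config.put("model", "example-model-id");
        config.put("dimension", "1024");
        // The region is deliberately left out to show the check firing.
        System.out.println("Missing options: " + missingKeys(config));
    }
}
```

Checking all required keys in one pass lets a job report every missing option at once instead of failing on the first.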
Pull Request Overview
This pull request adds support for single and batch mode vectorization using Amazon Titan and Cohere embedding models. It introduces new test cases for request JSON generation, implements the BedrockModel for Amazon Bedrock integration, and updates configuration and documentation to support the new model provider.
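To illustrate what request JSON generation for the two model families might look like, here is a hedged sketch. It assumes the request shapes from the public Amazon Bedrock documentation (`inputText` for Titan, `texts` plus `input_type` for Cohere); the PR's `BedrockModel` may build its bodies differently. The class `BedrockRequestSketch` and its methods are hypothetical and do not come from the PR.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hand-built request bodies for the two model families. */
final class BedrockRequestSketch {
    /** Wraps a string in quotes and escapes characters that JSON requires escaped. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Assumed Titan shape: one text per request (single mode). */
    static String titanBody(String text) {
        return "{\"inputText\":" + quote(text) + "}";
    }

    /** Assumed Cohere shape: a list of texts per request (batch mode). */
    static String cohereBody(List<String> texts) {
        StringBuilder sb = new StringBuilder("{\"texts\":[");
        for (int i = 0; i < texts.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(texts.get(i)));
        }
        return sb.append("],\"input_type\":\"search_document\"}").toString();
    }

    public static void main(String[] args) {
        System.out.println(titanBody("hello world"));
        System.out.println(cohereBody(Arrays.asList("first row", "second row")));
    }
}
```

A production implementation would normally use a JSON library rather than string concatenation; the manual version is used here only to keep the sketch dependency-free.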
Reviewed Changes
Copilot reviewed 8 out of 10 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/embedding/EmbeddingRequestJsonTest.java | Adds tests for BedrockModel request JSON generation for both Titan and Cohere. |
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/remote/amazon/BedrockModel.java | Implements Amazon Bedrock embedding model support with handling for Titan and Cohere. |
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/EmbeddingTransformFactory.java | Extends option rules to include Amazon Bedrock options. |
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/EmbeddingTransform.java | Updates the embedding transform to initialize BedrockModel for the AMAZON provider. |
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelTransformConfig.java | Adds AWS region configuration option for Amazon Bedrock. |
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelProvider.java | Adds the AMAZON provider with corresponding metadata. |
docs/zh/transform-v2/embedding.md | Updates documentation to include Amazon Bedrock and region configuration. |
docs/en/transform-v2/embedding.md | Updates documentation to include Amazon Bedrock and region configuration. |
Files not reviewed (2)
- seatunnel-e2e/seatunnel-transforms-v2-e2e/seatunnel-transforms-v2-e2e-part-1/src/test/resources/embedding_transform.conf: Language not supported
- seatunnel-transforms-v2/pom.xml: Language not supported
Purpose of this pull request
Does this PR introduce any user-facing change?
Description
Add support for the Amazon Titan model in the embedding model_provider configuration.
Implement batch inference in the embedding process, sending data to the model API in batches instead of one row at a time.
Detect whether each batch request succeeded and apply fault tolerance when it fails.
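The batching and fault-tolerance behaviour described in this list can be sketched as follows. This is an assumption about one reasonable design, not the PR's actual code: the class `BatchEmbedSketch`, the method `embedAll`, and the fallback policy (retry a failed batch one text at a time) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Illustrative only: batch embedding with a per-text fallback. */
final class BatchEmbedSketch {
    /**
     * Embeds texts in batches. A batch counts as failed if the call throws or
     * returns a different number of vectors than texts; a failed batch is
     * retried one text at a time.
     */
    static List<float[]> embedAll(
            List<String> texts,
            int batchSize,
            Function<List<String>, List<float[]>> embedBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<float[]> out = new ArrayList<>(texts.size());
        for (int start = 0; start < texts.size(); start += batchSize) {
            int end = Math.min(start + batchSize, texts.size());
            List<String> batch = texts.subList(start, end);
            List<float[]> vectors = null;
            try {
                vectors = embedBatch.apply(batch);
            } catch (RuntimeException e) {
                // Treated as a failed batch; handled by the fallback below.
            }
            if (vectors != null && vectors.size() == batch.size()) {
                out.addAll(vectors);
                continue;
            }
            // Fallback: resend each text on its own so one bad batch does not lose rows.
            for (String text : batch) {
                List<float[]> single = embedBatch.apply(Collections.singletonList(text));
                if (single == null || single.size() != 1) {
                    throw new IllegalStateException("embedding failed for a single text");
                }
                out.add(single.get(0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Fake embedder: rejects batches larger than 2, otherwise returns one
        // 1-dimensional vector per text holding the text length.
        Function<List<String>, List<float[]>> fake = batch -> {
            if (batch.size() > 2) {
                throw new IllegalStateException("batch too large");
            }
            List<float[]> result = new ArrayList<>();
            for (String s : batch) {
                result.add(new float[] {s.length()});
            }
            return result;
        };
        List<float[]> vectors = embedAll(Arrays.asList("a", "bb", "ccc", "dddd", "eeeee"), 3, fake);
        System.out.println("vectors returned: " + vectors.size());
    }
}
```

Comparing the number of returned vectors with the number of submitted texts is a cheap way to detect a partially failed batch before results are written downstream.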
Usage Scenario
When text is vectorized at large scale and stored in a vector database, users need the vectorization to be efficient and low-cost. For example:
User review analysis, where millions or tens of millions of rows must be sent for vectorization in a single job.
Image search, where users often vectorize hundreds of thousands or millions of images into the database for later approximate vector retrieval.
How was this patch tested?
Check list
New License Guide
release-note