[Improve][Transform-V2][Embedding]Enhance multimodal embeddings#9996

Open
loupipalien wants to merge 18 commits into apache:dev from loupipalien:enhance-multimodal-embeddings

Conversation

Contributor

@loupipalien loupipalien commented Oct 29, 2025

Purpose of this pull request

Multi-field multimodal vectorization: doubao-embedding-vision supports mixed multi-field multimodal input, for example:

vectorization_fields {
  multi_field_text_vector = [product_name, description]

  multi_field_image_vector = [
    {
      field = product_image_url
      modality = jpeg
      format = url
    },
    {
      field = thumbnail_image
      modality = png
      format = url
    }
  ]

  multi_field_video_vector = [
    {
      field = product_video_url
      modality = mp4
      format = url
    },
    {
      field = promotional_video
      modality = mov
      format = url
    }
  ]

  multi_field_mix_vector = [
    product_name,
    {
      field = product_image_url
      modality = jpeg
      format = url
    },
    {
      field = product_video_url
      modality = mp4
      format = url
    }
  ]
}

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Add new test cases

Check list

@loupipalien loupipalien changed the title Enhance multimodal embeddings [Improve][Transform-V2][Embedding]Enhance multimodal embeddings Oct 30, 2025
@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 2f8b47b to bdbd012 Compare November 4, 2025 14:46
Contributor Author

@Hisoka-X @corgy-w @xiaochen-zhou please help review if you have time, thanks

davidzollo
davidzollo previously approved these changes Jan 1, 2026
Contributor

@davidzollo davidzollo left a comment


Happy new year!
+1

Comment on lines +208 to +226
for (Map.Entry<String, Object> fieldConfig : fieldsConfig.entrySet()) {
    VectorFieldSpec vectorFieldSpec = new VectorFieldSpec(fieldConfig);
    log.info("Vector field spec: {}", vectorFieldSpec);
    List<String> srcFieldNames =
            vectorFieldSpec.getSrcFieldSpecs().stream()
                    .map(SrcFieldSpec::getFieldName)
                    .collect(Collectors.toList());
    List<Integer> srcFieldIndexes = new ArrayList<>();
    for (String srcFieldName : srcFieldNames) {
        try {
            srcFieldIndexes.add(inputRowType.indexOf(srcFieldName));
        } catch (IllegalArgumentException e) {
            throw TransformCommonError.cannotFindInputFieldsError(
                    getPluginName(), srcFieldNames);
        }
    }
    isMultimodalFields = vectorFieldSpec.isMultimodalField();
    fieldSpecMap.put(vectorFieldSpec, srcFieldIndexes);
    fieldNames.add(vectorFieldSpec.getFieldName());
Member


isMultimodalFields = vectorFieldSpec.isMultimodalField();

There is a logical issue; currently, only the last value can be obtained.

Contributor Author

Thanks for your time, let me fix it

@davidzollo
Contributor

Thank you for this excellent proposal! The refactoring to use VectorFieldSpec and SrcFieldSpec is a very elegant and extensible design for multimodal processing.

During the review, I found a few data-corruption bugs and missing requirements that must be addressed before this can be merged.

1. Output Field Order Mismatch (EmbeddingTransform.java)

Problem:
In initOutputFields(), fieldSpecMap is initialized as a HashMap. Since HashMap does not preserve insertion order, iterating over fieldSpecMap.keySet() in getOutputFieldValues() builds the fieldValues array in an arbitrary order, which will NOT match the order of fieldNames declared in getOutputColumns().
As a result, vectors can be written to the wrong columns downstream (e.g., text vector data inserted into the image vector column), causing severe data corruption in production.

Suggestion:
Please change it to LinkedHashMap to maintain insertion order:

// EmbeddingTransform.java
Map<VectorFieldSpec, List<Integer>> fieldSpecMap = new LinkedHashMap<>();
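A minimal standalone sketch of the ordering guarantee this relies on; the key names below are hypothetical, not from the PR:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
    // Returns the iteration order of a LinkedHashMap keyed by three
    // hypothetical output-vector column names.
    static String orderedKeys() {
        Map<String, Integer> specs = new LinkedHashMap<>();
        specs.put("text_vector", 0);
        specs.put("image_vector", 1);
        specs.put("video_vector", 2);
        // LinkedHashMap iterates in insertion order, so values built by
        // iterating keySet() line up with the declared output columns.
        return String.join(",", specs.keySet());
    }

    public static void main(String[] args) {
        System.out.println(orderedKeys()); // text_vector,image_vector,video_vector
    }
}
```

A plain HashMap documents no iteration order at all, so the same loop could emit the vectors in any order relative to the declared schema.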

2. CRITICAL: Corrupted Binary payload in toBase64 (SrcField.java)

Problem:
In the toBase64() method, simply calling fieldValue.toString().getBytes() is extremely dangerous for binary data. If fieldValue is a raw byte[] (which happens when reading actual binary image/video streams), Java's .toString() on a byte array returns its type tag and identity hash (e.g., [B@1a2b3c), NOT the content. The model will receive a garbage string instead of the actual image data.

Suggestion:
Add proper type checking for byte arrays to fix the format conversion:

// SrcField.java
public String toBase64() {
    if (fieldSpec == null || !fieldSpec.isBinary()) {
        throw new IllegalArgumentException("Payload format must be binary");
    }
    if (fieldValue == null) {
        throw new IllegalArgumentException("Binary data cannot be null or empty");
    }
    
    if (fieldValue instanceof byte[]) {
        return Base64.getEncoder().encodeToString((byte[]) fieldValue);
    } else {
        return Base64.getEncoder().encodeToString(String.valueOf(fieldValue).getBytes());
    }
}
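To see why the instanceof check matters, here is a self-contained sketch; the encode helper mirrors the suggested branch logic, and the names are illustrative:

```java
import java.util.Base64;

public class Base64Demo {
    // Mirrors the suggested fix: raw bytes go straight to the encoder,
    // everything else is stringified first.
    static String encode(Object fieldValue) {
        if (fieldValue instanceof byte[]) {
            return Base64.getEncoder().encodeToString((byte[]) fieldValue);
        }
        return Base64.getEncoder().encodeToString(String.valueOf(fieldValue).getBytes());
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        // toString() on a byte[] yields something like "[B@1a2b3c",
        // an identity hash rather than the content, so encoding it
        // would silently corrupt the payload.
        System.out.println(payload.toString().startsWith("[B@")); // true
        System.out.println(encode(payload)); // AQID
    }
}
```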

3. Missing Assertions in Tests

In DoubaoMultimodalModelTest.java, you verified Assertions.assertTrue(inputNode.has("image_url")). However, considering the Base64 bug mentioned above, please update the test cases (especially testMultimodalBodyWithBinaryImage) to strictly assert the actual Base64 string value to ensure the image sequence is encoded correctly.

4. Missing Documentation (Blocker)

According to SeaTunnel's contribution guidelines, any user-visible configuration change must be documented. The new config capabilities (like the multi_field_image_vector array configuration) are highly valuable, but they currently lack documentation.
Please add usage examples and explanations in both:

  • docs/en/.../embedding-v2.md or related docs
  • docs/zh/.../embedding-v2.md

Looking forward to your updates! Let me know if you have any questions.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 5f5745e to a54db18 Compare March 10, 2026 14:05
@github-actions github-actions Bot added the CI&CD label Mar 12, 2026
@loupipalien
Copy link
Copy Markdown
Contributor Author

@davidzollo Thanks for your patient guidance and valuable suggestions ❤️. Let me fix these problems.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from b67b4d4 to 568fab0 Compare March 14, 2026 15:56
davidzollo
davidzollo previously approved these changes Mar 15, 2026
Contributor

@davidzollo davidzollo left a comment


+1 if CI passes.
Good job, thanks for the effort.
By the way, testing with ThreadLocalRandom makes the tests non-deterministic; I recommend switching to parameterized tests.
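A deterministic alternative to ThreadLocalRandom inputs is to enumerate the cases explicitly, e.g. as the value source of a parameterized test; the modality values below are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class DeterministicCases {
    // Fixed, enumerated inputs instead of ThreadLocalRandom: every run
    // exercises exactly the same cases, so failures are reproducible.
    static List<String> modalities() {
        return Arrays.asList("jpeg", "png", "mp4", "mov");
    }

    public static void main(String[] args) {
        for (String modality : modalities()) {
            // Each value would drive one test invocation, e.g. via
            // JUnit 5's @ParameterizedTest with @ValueSource.
            System.out.println("case: " + modality);
        }
    }
}
```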

You can feel free to add my LinkedIn https://www.linkedin.com/in/davidzollo or WeChat(davidzollo) to build a connection ^_^

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 4611e95 to 8941cc9 Compare April 10, 2026 16:14
@github-actions github-actions Bot removed the CI&CD label Apr 10, 2026

@DanielLeens DanielLeens left a comment


I pulled the latest branch locally and rechecked the current multimodal embedding path.

The functional blocker I raised earlier is fixed in the current revision:

EmbeddingTransform.getOutputFieldValues()
  -> new SrcField(spec, rowValue)
      -> clone SrcFieldSpec
      -> row-local modality auto-detection
  -> new MultimodalFieldValue(srcFields)
  -> model.vectorization(...)

So the row-to-row shared-state leakage is no longer the blocker.

However, I still do not think this PR is ready to merge as-is, because it still includes unrelated repository-wide changes outside the embedding runtime path:

  • root Spotless configuration change in pom.xml
  • unrelated Kubernetes engine E2E image lookup change in KubernetesIT.java

Those changes are not part of the real embedding execution chain (EmbeddingTransform -> SrcField -> MultimodalFieldValue -> DoubaoModel) and they materially widen the review and rollback surface for this feature PR.

I recommend splitting those infrastructure changes into separate PRs, or dropping them from this one and keeping only the embedding-related code, tests, and docs here.

The Build check is also still pending.

Comment thread on pom.xml (Outdated)
</googleJavaFormat>
<removeUnusedImports />
<formatAnnotations />
<toggleOffOn />
Contributor


Is this necessary?

Contributor Author


Spotless makes the JSON in comments very hard to read (for example DoubaoMultimodalModelTest#testMultimodalBodyWithImage), so I added this as a workaround.


@DanielLeens DanielLeens left a comment


Hi @loupipalien, thanks for the follow-up. I re-pulled the latest head locally and rechecked the actual PR scope against the runtime path.

The embedding-side changes themselves are meaningful, but the current diff still carries repository-level infrastructure changes that are unrelated to the multimodal embedding feature, including:

  • pom.xml
  • seatunnel-e2e/.../KubernetesIT.java

That keeps the PR outside the "one problem per PR" boundary. I do not want to overload this review with a long list, so I will keep the blocker focused:

Conclusion: fix required before merge

  1. Blocking items
  • Please split the unrelated repository / infra changes out of this PR and keep this branch focused on the embedding feature itself.
  2. Suggested improvements
  • After the scope is split, the embedding-specific review can move much faster.

The current Build check is green, but the mixed PR scope is still the blocker from my side.


@DanielLeens DanielLeens left a comment


Hi @loupipalien, thanks for the latest update. I re-pulled the PR locally as seatunnel-review-9996 at 2d0f9da9b0 and reviewed the whole embedding path again.

Runtime path checked:

SeaTunnelRow enters EmbeddingTransform
  -> initOutputFields parses vectorization_fields
  -> getOutputFieldValues()
      -> build SrcField list per output vector
      -> MultimodalFieldValue(srcFields)
  -> DoubaoModel.multimodalVector()
      -> multimodalBody()
      -> inputRawData() expands text/image/video nodes
  -> vector ByteBuffer returned as output column

Conclusion: can merge after fixes

Blocker:

  1. The current PR Build check is red.
    • Location: GitHub Checks / Build
    • GitHub API reports the current head 2d0f9da9b0 as COMPLETED / FAILURE.
    • I did not run local compilation/tests for this review, so the authoritative integration signal is the failed GitHub check.

Recommended next step:

  • Please inspect the failed Build log and either fix the deterministic failure or rerun it if it is confirmed flaky.

Non-blocking:

  • The previous scope blocker about unrelated repo-level infrastructure changes looks resolved in the current diff. The PR now appears focused on embedding transform code, docs, and tests.
  • It would still be nice to reject an empty list in vectorization_fields early, but I do not consider that a merge blocker for this round.
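That empty-list rejection could be sketched as a small up-front guard; the method name and error message here are illustrative, not from the PR:

```java
import java.util.Collections;
import java.util.Map;

public class VectorizationConfigCheck {
    // Rejects an empty vectorization_fields map before any field specs
    // are parsed, failing fast with a configuration error.
    static void validate(Map<String, Object> vectorizationFields) {
        if (vectorizationFields == null || vectorizationFields.isEmpty()) {
            throw new IllegalArgumentException(
                    "vectorization_fields must declare at least one output vector field");
        }
    }

    public static void main(String[] args) {
        try {
            validate(Collections.emptyMap());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```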

Overall, the embedding-side design is much healthier now. VectorFieldSpec / SrcFieldSpec also fixes the shared-spec mutation concern by creating per-row SrcField instances. Once the Build is green, this should be close from the code-review side.
