[Improve][Transform-V2][Embedding]Enhance multimodal embeddings#9996

Open
loupipalien wants to merge 18 commits into apache:dev from loupipalien:enhance-multimodal-embeddings

Conversation

Contributor

@loupipalien loupipalien commented Oct 29, 2025

Purpose of this pull request

Multi-field multimodal vectorization: doubao-embedding-vision supports mixed multi-field multimodal input, for example:

vectorization_fields {
  multi_field_text_vector = [product_name, description]

  multi_field_image_vector = [
    {
      field = product_image_url
      modality = jpeg
      format = url
    },
    {
      field = thumbnail_image
      modality = png
      format = url
    }
  ]

  multi_field_video_vector = [
    {
      field = product_video_url
      modality = mp4
      format = url
    },
    {
      field = promotional_video
      modality = mov
      format = url
    }
  ]

  multi_field_mix_vector = [
    product_name,
    {
      field = product_image_url
      modality = jpeg
      format = url
    },
    {
      field = product_video_url
      modality = mp4
      format = url
    }
  ]
}

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Add new test cases

Check list

@loupipalien loupipalien changed the title Enhance multimodal embeddings [Improve][Transform-V2][Embedding]Enhance multimodal embeddings Oct 30, 2025
@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 2f8b47b to bdbd012 Compare November 4, 2025 14:46
Contributor Author

@Hisoka-X @corgy-w @xiaochen-zhou please help review if you have time, thanks

davidzollo
davidzollo previously approved these changes Jan 1, 2026
Contributor

@davidzollo davidzollo left a comment


Happy new year!
+1

Comment on lines +208 to +226
for (Map.Entry<String, Object> fieldConfig : fieldsConfig.entrySet()) {
    VectorFieldSpec vectorFieldSpec = new VectorFieldSpec(fieldConfig);
    log.info("Vector field spec: {}", vectorFieldSpec);
    List<String> srcFieldNames =
            vectorFieldSpec.getSrcFieldSpecs().stream()
                    .map(SrcFieldSpec::getFieldName)
                    .collect(Collectors.toList());
    List<Integer> srcFieldIndexes = new ArrayList<>();
    for (String srcFieldName : srcFieldNames) {
        try {
            srcFieldIndexes.add(inputRowType.indexOf(srcFieldName));
        } catch (IllegalArgumentException e) {
            throw TransformCommonError.cannotFindInputFieldsError(
                    getPluginName(), srcFieldNames);
        }
    }
    isMultimodalFields = vectorFieldSpec.isMultimodalField();
    fieldSpecMap.put(vectorFieldSpec, srcFieldIndexes);
    fieldNames.add(vectorFieldSpec.getFieldName());
Member


isMultimodalFields = vectorFieldSpec.isMultimodalField();

There is a logical issue; currently, only the last value can be obtained.

Contributor Author

Thanks for your time, let me fix it

@davidzollo
Contributor

Thank you for this excellent proposal! The refactoring to use VectorFieldSpec and SrcFieldSpec is a very elegant and extensible design for multimodal processing.

During the review, I found a few data-corruption bugs and missing requirements that must be addressed before this can be merged.

1. Output Field Order Mismatch (EmbeddingTransform.java)

Problem:
In initOutputFields(), fieldSpecMap is initialized as a HashMap. Since HashMap does not preserve insertion order, iterating over fieldSpecMap.keySet() in getOutputFieldValues() builds the fieldValues array in an arbitrary order, which will NOT match the order of fieldNames declared in getOutputColumns().
As a result, vectors can be written to the wrong columns downstream (e.g., text vector data inserted into the image vector column), causing severe data corruption in production.

Suggestion:
Please change it to LinkedHashMap to maintain insertion order:

// EmbeddingTransform.java
Map<VectorFieldSpec, List<Integer>> fieldSpecMap = new LinkedHashMap<>();
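A minimal standalone sketch of the ordering guarantee this relies on; the key names below are hypothetical, not from the PR:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
    // Returns the iteration order of a LinkedHashMap keyed by three
    // hypothetical output-vector column names.
    static String orderedKeys() {
        Map<String, Integer> specs = new LinkedHashMap<>();
        specs.put("text_vector", 0);
        specs.put("image_vector", 1);
        specs.put("video_vector", 2);
        // LinkedHashMap iterates in insertion order, so values built by
        // iterating keySet() line up with the declared output columns.
        return String.join(",", specs.keySet());
    }

    public static void main(String[] args) {
        System.out.println(orderedKeys()); // text_vector,image_vector,video_vector
    }
}
```

A plain HashMap documents no iteration order at all, so the same loop could emit the vectors in any order relative to the declared schema.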

2. CRITICAL: Corrupted Binary payload in toBase64 (SrcField.java)

Problem:
In the toBase64() method, simply calling fieldValue.toString().getBytes() is extremely dangerous for binary data. If fieldValue is a raw byte[] (which happens when reading actual binary image/video streams), Java's .toString() on a byte array returns its type tag and identity hash (e.g., [B@1a2b3c), NOT the content. The model will receive a garbage string instead of the actual image data.

Suggestion:
Add proper type checking for byte arrays to fix the format conversion:

// SrcField.java
public String toBase64() {
    if (fieldSpec == null || !fieldSpec.isBinary()) {
        throw new IllegalArgumentException("Payload format must be binary");
    }
    if (fieldValue == null) {
        throw new IllegalArgumentException("Binary data cannot be null or empty");
    }
    
    if (fieldValue instanceof byte[]) {
        return Base64.getEncoder().encodeToString((byte[]) fieldValue);
    } else {
        return Base64.getEncoder().encodeToString(String.valueOf(fieldValue).getBytes());
    }
}
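To see why the instanceof check matters, here is a self-contained sketch; the encode helper mirrors the suggested branch logic, and the names are illustrative:

```java
import java.util.Base64;

public class Base64Demo {
    // Mirrors the suggested fix: raw bytes go straight to the encoder,
    // everything else is stringified first.
    static String encode(Object fieldValue) {
        if (fieldValue instanceof byte[]) {
            return Base64.getEncoder().encodeToString((byte[]) fieldValue);
        }
        return Base64.getEncoder().encodeToString(String.valueOf(fieldValue).getBytes());
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        // toString() on a byte[] yields something like "[B@1a2b3c",
        // an identity hash rather than the content, so encoding it
        // would silently corrupt the payload.
        System.out.println(payload.toString().startsWith("[B@")); // true
        System.out.println(encode(payload)); // AQID
    }
}
```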

3. Missing Assertions in Tests

In DoubaoMultimodalModelTest.java, you verified Assertions.assertTrue(inputNode.has("image_url")). However, considering the Base64 bug mentioned above, please update the test cases (especially testMultimodalBodyWithBinaryImage) to strictly assert the actual Base64 string value to ensure the image sequence is encoded correctly.

4. Missing Documentation (Blocker)

According to SeaTunnel's contribution guidelines, any user-visible configuration change must be documented. The new config capabilities (like the multi_field_image_vector array configuration) are highly valuable, but they currently lack documentation.
Please add usage examples and explanations in both:

  • docs/en/.../embedding-v2.md or related docs
  • docs/zh/.../embedding-v2.md

Looking forward to your updates! Let me know if you have any questions.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 5f5745e to a54db18 Compare March 10, 2026 14:05
@github-actions github-actions Bot added the CI&CD label Mar 12, 2026
@loupipalien
Copy link
Copy Markdown
Contributor Author

@davidzollo Thanks for your patient guidance and valuable suggestions ❤️. Let me fix these problems.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from b67b4d4 to 568fab0 Compare March 14, 2026 15:56
davidzollo
davidzollo previously approved these changes Mar 15, 2026
Contributor

@davidzollo davidzollo left a comment


+1 if CI passes.
Good job, thanks for the effort.
By the way, testing with ThreadLocalRandom makes the tests non-deterministic; I recommend switching to parameterized tests.
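A deterministic alternative to ThreadLocalRandom inputs is to enumerate the cases explicitly, e.g. as the value source of a parameterized test; the modality values below are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class DeterministicCases {
    // Fixed, enumerated inputs instead of ThreadLocalRandom: every run
    // exercises exactly the same cases, so failures are reproducible.
    static List<String> modalities() {
        return Arrays.asList("jpeg", "png", "mp4", "mov");
    }

    public static void main(String[] args) {
        for (String modality : modalities()) {
            // Each value would drive one test invocation, e.g. via
            // JUnit 5's @ParameterizedTest with @ValueSource.
            System.out.println("case: " + modality);
        }
    }
}
```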

You can feel free to add my LinkedIn https://www.linkedin.com/in/davidzollo or WeChat(davidzollo) to build a connection ^_^

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 4611e95 to 8941cc9 Compare April 10, 2026 16:14
@github-actions github-actions Bot removed the CI&CD label Apr 10, 2026

@DanielLeens DanielLeens left a comment


I pulled the latest branch locally and rechecked the current multimodal embedding path.

The functional blocker I raised earlier is fixed in the current revision:

EmbeddingTransform.getOutputFieldValues()
  -> new SrcField(spec, rowValue)
      -> clone SrcFieldSpec
      -> row-local modality auto-detection
  -> new MultimodalFieldValue(srcFields)
  -> model.vectorization(...)

So the row-to-row shared-state leakage is no longer the blocker.

However, I still do not think this PR is ready to merge as-is, because it still includes unrelated repository-wide changes outside the embedding runtime path:

  • root Spotless configuration change in pom.xml
  • unrelated Kubernetes engine E2E image lookup change in KubernetesIT.java

Those changes are not part of the real embedding execution chain (EmbeddingTransform -> SrcField -> MultimodalFieldValue -> DoubaoModel) and they materially widen the review and rollback surface for this feature PR.

I recommend splitting those infrastructure changes into separate PRs, or dropping them from this one and keeping only the embedding-related code, tests, and docs here.

The Build check is also still pending.

Comment thread on pom.xml (Outdated)
</googleJavaFormat>
<removeUnusedImports />
<formatAnnotations />
<toggleOffOn />
Contributor


Is this necessary?

Contributor Author


Spotless makes the JSON in comments very hard to read (for example DoubaoMultimodalModelTest#testMultimodalBodyWithImage), so I added this as a workaround.


@DanielLeens DanielLeens left a comment


Hi @loupipalien, thanks for the follow-up. I re-pulled the latest head locally and rechecked the actual PR scope against the runtime path.

The embedding-side changes themselves are meaningful, but the current diff still carries repository-level infrastructure changes that are unrelated to the multimodal embedding feature, including:

  • pom.xml
  • seatunnel-e2e/.../KubernetesIT.java

That keeps the PR outside the "one problem per PR" boundary. I do not want to overload this review with a long list, so I will keep the blocker focused:

Conclusion: fix required before merge

  1. Blocking items
  • Please split the unrelated repository / infra changes out of this PR and keep this branch focused on the embedding feature itself.
  2. Suggested improvements
  • After the scope is split, the embedding-specific review can move much faster.

The current Build check is green, but the mixed PR scope is still the blocker from my side.


@DanielLeens DanielLeens left a comment


Hi @loupipalien, thanks for the latest update. I re-pulled the PR locally as seatunnel-review-9996 at 2d0f9da9b0 and reviewed the whole embedding path again.

Runtime path checked:

SeaTunnelRow enters EmbeddingTransform
  -> initOutputFields parses vectorization_fields
  -> getOutputFieldValues()
      -> build SrcField list per output vector
      -> MultimodalFieldValue(srcFields)
  -> DoubaoModel.multimodalVector()
      -> multimodalBody()
      -> inputRawData() expands text/image/video nodes
  -> vector ByteBuffer returned as output column

Conclusion: can merge after fixes

Blocker:

  1. The current PR Build check is red.
    • Location: GitHub Checks / Build
    • GitHub API reports the current head 2d0f9da9b0 as COMPLETED / FAILURE.
    • I did not run local compilation/tests for this review, so the authoritative integration signal is the failed GitHub check.

Recommended next step:

  • Please inspect the failed Build log and either fix the deterministic failure or rerun it if it is confirmed flaky.

Non-blocking:

  • The previous scope blocker about unrelated repo-level infrastructure changes looks resolved in the current diff. The PR now appears focused on embedding transform code, docs, and tests.
  • It would still be nice to reject an empty list in vectorization_fields early, but I do not consider that a merge blocker for this round.
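That empty-list rejection could be sketched as a small up-front guard; the method name and error message here are illustrative, not from the PR:

```java
import java.util.Collections;
import java.util.Map;

public class VectorizationConfigCheck {
    // Rejects an empty vectorization_fields map before any field specs
    // are parsed, failing fast with a configuration error.
    static void validate(Map<String, Object> vectorizationFields) {
        if (vectorizationFields == null || vectorizationFields.isEmpty()) {
            throw new IllegalArgumentException(
                    "vectorization_fields must declare at least one output vector field");
        }
    }

    public static void main(String[] args) {
        try {
            validate(Collections.emptyMap());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```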

Overall, the embedding-side design is much healthier now. VectorFieldSpec / SrcFieldSpec also fixes the shared-spec mutation concern by creating per-row SrcField instances. Once the Build is green, this should be close from the code-review side.
