[Improve][Transform-V2][Embedding] Enhance multimodal embeddings #9996
loupipalien wants to merge 18 commits into apache:dev
Conversation
Force-pushed 2f8b47b to bdbd012
@Hisoka-X @corgy-w @xiaochen-zhou please help review if you have time, thanks!
```diff
 for (Map.Entry<String, Object> fieldConfig : fieldsConfig.entrySet()) {
     VectorFieldSpec vectorFieldSpec = new VectorFieldSpec(fieldConfig);
     log.info("Vector field spec: {}", vectorFieldSpec);
     List<String> srcFieldNames =
             vectorFieldSpec.getSrcFieldSpecs().stream()
                     .map(SrcFieldSpec::getFieldName)
                     .collect(Collectors.toList());
     List<Integer> srcFieldIndexes = new ArrayList<>();
     for (String srcFieldName : srcFieldNames) {
         try {
             srcFieldIndexes.add(inputRowType.indexOf(srcFieldName));
         } catch (IllegalArgumentException e) {
             throw TransformCommonError.cannotFindInputFieldsError(
                     getPluginName(), srcFieldNames);
         }
     }
-    fieldSpecMap.put(srcFieldIndex, fieldSpec);
-    fieldNames.add(field.getKey());
+    isMultimodalFields = vectorFieldSpec.isMultimodalField();
+    fieldSpecMap.put(vectorFieldSpec, srcFieldIndexes);
+    fieldNames.add(vectorFieldSpec.getFieldName());
```
> `isMultimodalFields = vectorFieldSpec.isMultimodalField();`

There is a logical issue here: as written, the assignment overwrites the flag on every iteration, so only the last spec's value is kept.
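One hedged way to address this, assuming the intent is that the flag should be true when any spec is multimodal (the `SpecStub` class below is a hypothetical stand-in for `VectorFieldSpec`, reduced to the one method that matters here):

```java
import java.util.List;

// Hypothetical stand-in for VectorFieldSpec; only the multimodal flag matters here.
class SpecStub {
    private final boolean multimodal;

    SpecStub(boolean multimodal) {
        this.multimodal = multimodal;
    }

    boolean isMultimodalField() {
        return multimodal;
    }
}

public class MultimodalFlagFix {
    // Buggy shape: plain assignment keeps only the LAST spec's flag.
    static boolean lastWins(List<SpecStub> specs) {
        boolean isMultimodalFields = false;
        for (SpecStub spec : specs) {
            isMultimodalFields = spec.isMultimodalField(); // overwritten each iteration
        }
        return isMultimodalFields;
    }

    // Fixed shape: OR-accumulate so the flag is true if ANY spec is multimodal.
    static boolean anyMultimodal(List<SpecStub> specs) {
        boolean isMultimodalFields = false;
        for (SpecStub spec : specs) {
            isMultimodalFields |= spec.isMultimodalField();
        }
        return isMultimodalFields;
    }
}
```

If instead each output field needs its own flag, a per-spec map (e.g. keyed by the spec, like `fieldSpecMap`) would be the alternative fix.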
Thanks for your time, let me fix it
Force-pushed bdbd012 to d497cba
Force-pushed 0c8a06b to 5f5745e
Thank you for this excellent proposal! During the review, I found a few data-corruption bugs and missing requirements that must be addressed before this can be merged. 1. Output Field Order Mismatch (…)
Force-pushed 5f5745e to a54db18
@davidzollo Thanks for your patient guidance and valuable suggestions ❤️. Let me fix these problems.
Force-pushed b67b4d4 to 568fab0
+1 if CI passes.
Good job, thanks for the efforts.
By the way, testing with ThreadLocalRandom causes non-determinism; I recommend switching to parameterized tests.
You can feel free to add my LinkedIn https://www.linkedin.com/in/davidzollo or WeChat (davidzollo) to build a connection ^_^
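The parameterized-testing suggestion above boils down to replacing random inputs with a fixed table of input/expected pairs, so every CI run exercises exactly the same cases. A minimal plain-Java sketch (the function under test and all names are hypothetical; with JUnit 5 the idiomatic form would be `@ParameterizedTest` with `@CsvSource`):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DeterministicCases {
    // Hypothetical function under test: whitespace-delimited token count.
    static int tokenCount(String input) {
        return input.isEmpty() ? 0 : input.split("\\s+").length;
    }

    // Fixed input -> expected output pairs replace ThreadLocalRandom-generated
    // data, so a failure is always reproducible from the case table alone.
    static Map<String, Integer> cases() {
        Map<String, Integer> cases = new LinkedHashMap<>();
        cases.put("", 0);
        cases.put("hello", 1);
        cases.put("multi modal embedding", 3);
        return cases;
    }
}
```

If randomized coverage is still wanted, seeding the generator with a logged constant keeps runs reproducible without giving up variety.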
Force-pushed 568fab0 to 59cecda
Force-pushed 59cecda to d7bcb01
Force-pushed 4611e95 to 8941cc9
DanielLeens left a comment:
I pulled the latest branch locally and rechecked the current multimodal embedding path.
The functional blocker I raised earlier is fixed in the current revision:

```
EmbeddingTransform.getOutputFieldValues()
  -> new SrcField(spec, rowValue)
  -> clone SrcFieldSpec
  -> row-local modality auto-detection
  -> new MultimodalFieldValue(srcFields)
  -> model.vectorization(...)
```

So the row-to-row shared-state leakage is no longer the blocker.
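The per-row cloning pattern described in the chain above can be reduced to a small sketch (class and field names here are hypothetical, not the PR's actual types): the shared spec template is never mutated, and modality auto-detection writes only to a row-local copy.

```java
// Hypothetical minimal model of the per-row cloning pattern:
// the shared spec template is read-only; each row works on its own copy.
class FieldSpec implements Cloneable {
    String modality = "auto"; // shared template value, never mutated per row

    @Override
    public FieldSpec clone() {
        try {
            return (FieldSpec) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: we implement Cloneable
        }
    }
}

class RowField {
    final FieldSpec spec;

    RowField(FieldSpec shared, Object rowValue) {
        this.spec = shared.clone(); // row-local copy, so no cross-row leakage
        // Row-local modality auto-detection mutates only the copy.
        this.spec.modality = (rowValue instanceof byte[]) ? "image" : "text";
    }
}
```

With this shape, two rows detecting different modalities can never observe each other's state, which is exactly the leakage the earlier review flagged.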
However, I still do not think this PR is ready to merge as-is, because it still includes unrelated repository-wide changes outside the embedding runtime path:
- root Spotless configuration change in `pom.xml`
- unrelated Kubernetes engine E2E image lookup change in `KubernetesIT.java`
Those changes are not part of the real embedding execution chain (EmbeddingTransform -> SrcField -> MultimodalFieldValue -> DoubaoModel) and they materially widen the review and rollback surface for this feature PR.
I recommend splitting those infrastructure changes into separate PRs, or dropping them from this one and keeping only the embedding-related code, tests, and docs here.
The Build check is also still pending.
```xml
</googleJavaFormat>
<removeUnusedImports />
<formatAnnotations />
<toggleOffOn />
```
Spotless makes the JSON in comments very hard to read (for example `DoubaoMultimodalModelTest#testMultimodalBodyWithImage`), so I added this as a workaround.
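For context, Spotless's `toggleOffOn` recognizes `spotless:off` / `spotless:on` marker comments by default, so hand-aligned regions survive formatting. A hedged sketch of how a test fixture might use it (the class name and JSON payload below are hypothetical, not the actual test's contents):

```java
public class JsonFixture {
    // spotless:off -- formatter leaves this region untouched
    static final String EXPECTED_BODY =
            "{\n" +
            "  \"model\": \"doubao-embedding-vision\",\n" +
            "  \"input\": [\n" +
            "    {\"type\": \"text\", \"text\": \"hello\"}\n" +
            "  ]\n" +
            "}";
    // spotless:on
}
```

Without the toggle, googleJavaFormat would reflow the concatenated string and destroy the JSON indentation that makes the expected body readable.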
DanielLeens left a comment:
Hi @loupipalien, thanks for the follow-up. I re-pulled the latest head locally and rechecked the actual PR scope against the runtime path.
The embedding-side changes themselves are meaningful, but the current diff still carries repository-level infrastructure changes that are unrelated to the multimodal embedding feature, including:
- `pom.xml`
- `seatunnel-e2e/.../KubernetesIT.java`
That keeps the PR outside the "one problem per PR" boundary. I do not want to overload this review with a long list, so I will keep the blocker focused:
Conclusion: fix required before merge
- Blocking items
- Please split the unrelated repository / infra changes out of this PR and keep this branch focused on the embedding feature itself.
- Suggested improvements
- After the scope is split, the embedding-specific review can move much faster.
The current Build check is green, but the mixed PR scope is still the blocker from my side.
DanielLeens left a comment:
Hi @loupipalien, thanks for the latest update. I re-pulled the PR locally as seatunnel-review-9996 at 2d0f9da9b0 and reviewed the whole embedding path again.
Runtime path checked:

```
SeaTunnelRow enters EmbeddingTransform
  -> initOutputFields parses vectorization_fields
  -> getOutputFieldValues()
  -> build SrcField list per output vector
  -> MultimodalFieldValue(srcFields)
  -> DoubaoModel.multimodalVector()
  -> multimodalBody()
  -> inputRawData() expands text/image/video nodes
  -> vector ByteBuffer returned as output column
```
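The `inputRawData()` expansion step in the path above can be sketched as follows. This is not the PR's actual implementation; it is a hedged model of the idea that each source field becomes one typed node in the request body's `input` array, with the node shape for `image`/`video` assumed here for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InputNodes {
    // Each field is a {modality, value} pair; each pair becomes one input node.
    static List<Map<String, Object>> expand(List<Object[]> fields) {
        List<Map<String, Object>> nodes = new ArrayList<>();
        for (Object[] field : fields) {
            String modality = (String) field[0];
            Map<String, Object> node = new LinkedHashMap<>();
            node.put("type", modality);
            if ("text".equals(modality)) {
                node.put("text", field[1]);
            } else {
                // image / video carried as a URL wrapper (assumed node shape)
                node.put(modality + "_url", Collections.singletonMap("url", field[1]));
            }
            nodes.add(node);
        }
        return nodes;
    }
}
```

Because the expansion is per-row and order-preserving, a text field followed by an image field always produces nodes in that same order, which matters for position-sensitive embedding models.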
Conclusion: can merge after fixes
Blocker:
- The current PR `Build` check is red.
  - Location: GitHub Checks / `Build`
  - GitHub API reports the current head `2d0f9da9b0` as `COMPLETED / FAILURE`.
  - I did not run local compilation/tests for this review, so the authoritative integration signal is the failed GitHub check.
Recommended next step:
- Please inspect the failed Build log and either fix the deterministic failure or rerun it if it is confirmed flaky.
Non-blocking:
- The previous scope blocker about unrelated repo-level infrastructure changes looks resolved in the current diff. The PR now appears focused on embedding transform code, docs, and tests.
- It would still be nice to reject an empty list in `vectorization_fields` early, but I do not consider that a merge blocker for this round.
Overall, the embedding-side design is much healthier now. The VectorFieldSpec / SrcFieldSpec split also fixes the shared-spec mutation concern by creating per-row SrcField instances. Once the Build is green, this should be close from the code-review side.
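The empty-list validation suggested above is a one-method guard. A hedged sketch (the method name and error message are hypothetical; SeaTunnel's own config validation utilities would be the natural home for this):

```java
import java.util.List;

public class ConfigValidation {
    // Fail fast at transform init time rather than on the first row,
    // so a misconfigured vectorization_fields surfaces immediately.
    static void requireNonEmptyVectorizationFields(List<String> fields) {
        if (fields == null || fields.isEmpty()) {
            throw new IllegalArgumentException(
                    "vectorization_fields must contain at least one field");
        }
    }
}
```

Failing at initialization keeps the error close to the user's config instead of deep in the per-row embedding path.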
Purpose of this pull request
Support multi-field multimodal vectorization: doubao-embedding-vision now accepts multiple fields of mixed modalities as input.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Added new test cases.
Check list
New License Guide