[Feature][Connector-JDBC] Fix Oracle BLOB data format preservation issue #9270

Open. Wants to merge 8 commits into base: dev

yzeng1618

Purpose of this pull request

#9268

This PR fixes an issue with the JDBC connector where Oracle BLOB data is not properly preserved during synchronization. Currently, when transferring BLOB fields (containing text, XML, HTML, etc.) from Oracle to target systems like Doris, the data is converted to Base64-encoded strings, making it unusable in its original format.

The implementation enhances the OracleTypeConverter to properly handle BLOB data based on the handle_blob_as_string configuration parameter:

  1. When handle_blob_as_string=true, BLOB data is treated as STRING type, preserving the original content format
  2. When handle_blob_as_string=false (default), BLOB data is treated as BYTES type as before

Does this PR introduce any user-facing change?

Yes, this PR introduces a user-facing change in how Oracle BLOB data is handled during synchronization. Users can now configure the JDBC connector to preserve the original format of BLOB data by setting handle_blob_as_string=true in their connector configuration.

Before this change, Oracle BLOB data containing structured content like XML or HTML would be converted to Base64-encoded strings in the target system, making it difficult to use. With this change, users can choose to preserve the original content format.

How was this patch tested?

  1. Manually tested with Oracle tables containing BLOB fields with various data types (text, XML, HTML)
  2. Verified that the original data format is preserved when handle_blob_as_string=true

For local verification: when handle_blob_as_string is false or not set, the behavior is as follows. The Oracle source table (TEST_BLOB_TABLE) contains BLOB data with different content types:

Row 1: Simple text "Hello, World!"

Row 2: XML content

Row 3: HTML content

However, after synchronization to the Doris target table, all BLOB data is converted to Base64-encoded strings:

Row 1: "SGVsbG8sIFdvcmxkIQ=="

Row 2: "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4..."

Row 3: "PCFET0NUWVBFIGh0bWw+PGh0bWwgc3R5bGU9Im92..."

When the parameter is set to true, the BLOB fields in the Doris target table maintain data consistent with the source data.
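
The difference between the two outcomes can be illustrated with a minimal Java sketch. It is not taken from the PR: the class and method names are hypothetical, and UTF-8 is an assumed charset, since the description does not say which encoding the BLOB content uses. It only shows the same bytes rendered once as Base64 and once as decoded text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; not part of the SeaTunnel code base.
public class BlobEncodingDemo {

    // What a sink typically shows when raw bytes are written into a text column.
    public static String asBase64(byte[] blobBytes) {
        return Base64.getEncoder().encodeToString(blobBytes);
    }

    // What the user expects when the BLOB actually holds text.
    // Assumption: the content is UTF-8 encoded.
    public static String asText(byte[] blobBytes) {
        return new String(blobBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] blobBytes = "Hello, World!".getBytes(StandardCharsets.UTF_8);
        System.out.println(asBase64(blobBytes)); // SGVsbG8sIFdvcmxkIQ==
        System.out.println(asText(blobBytes));   // Hello, World!
    }
}
```

The Base64 output matches Row 1 of the Doris result above, which suggests the bytes themselves arrive intact and only their representation differs.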

Check list

hailin0 requested a review from Copilot May 7, 2025 01:51
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This pull request fixes the Oracle BLOB data preservation issue by adding a configuration parameter (handle_blob_as_string) to control whether BLOB data is converted to a string or remains as bytes. The changes span multiple modules, including source conversion methods, dialect factories, config builders, and catalog creation for both JDBC and CDC connectors.

Reviewed Changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/utils/JdbcFieldTypeUtils.java | Added special handling for BLOB conversion in getString(). |
| seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/utils/JdbcCatalogUtils.java | Configured catalog options to include handle_blob_as_string. |
| seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/JdbcSourceFactory.java | Propagated the new configuration parameter to the dialect loader. |
| seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/* | Updated OracleTypeMapper, OracleTypeConverter, OracleDialect, and OracleCatalog to support handle_blob_as_string. |
| seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/* | Updated configuration options and builders to include handle_blob_as_string. |
| seatunnel-connectors-v2/connector-cdc/connector-cdc-oracle/src/main/java/org/apache/seatunnel/connectors/seatunnel/cdc/oracle/utils/OracleTypeUtils.java | Adjusted conversion call for CDC connector with the new parameter. |
Files not reviewed (1)
  • .github/actions/get-workflow-origin: Language not supported

@@ -69,6 +69,6 @@ public static org.apache.seatunnel.api.table.catalog.Column convertToSeaTunnelCo
         builder.scale(column.length());
     }

-    return new OracleTypeConverter(false).convert(builder.build());
+    return new OracleTypeConverter(false, false).convert(builder.build());

Copilot AI May 7, 2025

The CDC OracleTypeUtils conversion hardcodes handleBlobAsString to false. Consider exposing this configuration so that the CDC connector's handling of BLOB data can be aligned with the JDBC connector if needed.

Suggested change
-    return new OracleTypeConverter(false, false).convert(builder.build());
+    return new OracleTypeConverter(handleBlobAsString, false).convert(builder.build());
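
The idea behind this suggestion, reading the flag from configuration instead of hardcoding it, can be sketched as follows. Everything here is hypothetical: `SketchTypeConverter` and `fromOptions` are stand-ins invented for illustration, not the real OracleTypeConverter or any SeaTunnel API, and the option map is a plain `java.util.Map` rather than the project's config classes.

```java
import java.util.Map;

// Hypothetical sketch; none of these types exist in SeaTunnel.
public class BlobOptionSketch {

    // Stand-in for a type converter that receives the flag at construction time.
    public static final class SketchTypeConverter {
        private final boolean handleBlobAsString;

        public SketchTypeConverter(boolean handleBlobAsString) {
            this.handleBlobAsString = handleBlobAsString;
        }

        // Maps a source type name to a simplified target type name.
        public String convert(String sourceType) {
            if ("BLOB".equals(sourceType)) {
                return handleBlobAsString ? "STRING" : "BYTES";
            }
            return sourceType;
        }
    }

    // Reads the option from user configuration, defaulting to false
    // so existing jobs keep the BYTES behavior.
    public static SketchTypeConverter fromOptions(Map<String, String> options) {
        boolean flag =
                Boolean.parseBoolean(options.getOrDefault("handle_blob_as_string", "false"));
        return new SketchTypeConverter(flag);
    }

    public static void main(String[] args) {
        System.out.println(
                fromOptions(Map.of("handle_blob_as_string", "true")).convert("BLOB")); // STRING
        System.out.println(fromOptions(Map.of()).convert("BLOB")); // BYTES
    }
}
```

Passing the flag through the constructor keeps the converter free of any configuration lookup, so each connector decides for itself whether to expose the option.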

Hisoka-X added the First-time contributor label May 7, 2025
Hisoka-X linked an issue May 7, 2025 that may be closed by this pull request
3 tasks
Member

Please do not update the version of the action.

Author

OK, I will restore the previous version.

Comment on lines 216 to 226
if (handleBlobAsString) {
    builder.dataType(BasicType.STRING_TYPE);
    builder.columnLength(BYTES_4GB - 1);
    log.info("Converted BLOB to STRING_TYPE with length: {}", BYTES_4GB - 1);
} else {
    builder.dataType(PrimitiveByteArrayType.INSTANCE);
    builder.columnLength(BYTES_4GB - 1);
    log.info(
            "Converted BLOB to PrimitiveByteArrayType with length: {}",
            BYTES_4GB - 1);
}
Member

Why not use CLOB to store strings? BLOB was originally intended for storing bytes.

Author

Thank you for your question. The design choice to handle BLOB data as strings is based on real customer use cases. We've encountered numerous situations where customers store XML files or other textual content in Oracle BLOB fields rather than CLOB, particularly in legacy systems or specific application scenarios.

When attempting to process this BLOB data directly in SQL queries using type conversions (like TO_CLOB), users often face byte size limitations, especially with large XML files. These limitations can result in data truncation or conversion failures.

This is why we implemented the handleBlobAsString option, providing users with flexibility during the ETL process:

  1. When the data is actually textual content (like XML), it can be converted to STRING_TYPE
  2. When the data is genuinely binary, it maintains its original binary form

This approach circumvents Oracle's internal type conversion limitations, offering more reliable processing capabilities for large textual data while maintaining backward compatibility.
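
As a rough illustration of converting on the client side rather than with TO_CLOB in SQL, the sketch below reads a `java.sql.Blob` through its binary stream and decodes the bytes in Java. It is not the PR's implementation: `blobToString` is a hypothetical helper, UTF-8 is an assumed charset, and `SerialBlob` is used only so the example runs without a database.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.sql.Blob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;

// Hypothetical helper class; not part of the SeaTunnel code base.
public class BlobToStringSketch {

    // Reads the whole BLOB via its stream and decodes it with the given charset.
    // Assumption: the caller knows the BLOB holds text in that charset.
    public static String blobToString(Blob blob, Charset charset)
            throws SQLException, IOException {
        if (blob == null) {
            return null;
        }
        try (InputStream in = blob.getBinaryStream()) {
            return new String(in.readAllBytes(), charset);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc/>"
                        .getBytes(StandardCharsets.UTF_8);
        // In-memory Blob standing in for a column read from a ResultSet.
        Blob blob = new SerialBlob(xml);
        System.out.println(blobToString(blob, StandardCharsets.UTF_8));
    }
}
```

Decoding genuinely binary content this way would be lossy, because invalid byte sequences are replaced during decoding, which is one reason to keep the option off by default.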

Member

Makes sense to me. cc @hailin0 @corgy-w

Member

Hisoka-X commented May 7, 2025

  1. Please add a test case.
  2. Please enable GitHub Actions on your fork repository. Please refer to https://github.com/apache/seatunnel/pull/9270/checks?check_run_id=41700794268

Development

Successfully merging this pull request may close these issues.

[Feature][connector-jdbc] Fix Oracle BLOB data corruption in JDBC connector
3 participants