[fix] support staged doris insert overwrite #660

Open
liujiwen-up wants to merge 1 commit into apache:master from liujiwen-up:insert-overwrite-staging

@liujiwen-up
Contributor

Proposed changes

Issue Number: close #xxx

Problem Summary:

This PR changes the INSERT OVERWRITE sink behavior of the Doris Flink Connector so that the target table is no longer truncated before the new data has been written successfully.

Previously, INSERT OVERWRITE executed TRUNCATE TABLE before writing. If the Flink job failed after truncate but before the write completed, the target table could be left empty.

The new implementation uses a staging-table based flow:

  1. Create a staging table with CREATE TABLE staging LIKE target.
  2. Write data into the staging table through Stream Load 2PC.
  3. After all committed data is available, finalize the overwrite with:
    ALTER TABLE target REPLACE WITH TABLE staging PROPERTIES('swap'='false').
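For reviewers who want to see the two DDL statements of the flow above spelled out, here is a minimal sketch of how they could be assembled. The class name `StagedOverwriteSql`, its methods, and the sample table names are hypothetical and are not taken from this PR. The sketch only builds the statement strings; executing them over JDBC and the Stream Load 2PC write in step 2 are left out.

```java
/**
 * Hypothetical helper that builds the DDL for steps 1 and 3 of the
 * staging-table flow. Names are illustrative, not the connector's API.
 */
final class StagedOverwriteSql {
    private StagedOverwriteSql() {}

    /** Wraps an identifier in backticks; rejects names that contain a backtick. */
    public static String quote(String identifier) {
        if (identifier == null || identifier.isEmpty() || identifier.contains("`")) {
            throw new IllegalArgumentException("Invalid identifier: " + identifier);
        }
        return "`" + identifier + "`";
    }

    /** Step 1: create an empty staging table with the same schema as the target. */
    public static String createStaging(String db, String target, String staging) {
        return "CREATE TABLE " + quote(db) + "." + quote(staging)
                + " LIKE " + quote(db) + "." + quote(target);
    }

    /**
     * Step 3: replace the target with the staging table. With 'swap'='false'
     * the staging table takes over the target name instead of being swapped.
     */
    public static String finalizeOverwrite(String db, String target, String staging) {
        return "ALTER TABLE " + quote(db) + "." + quote(target)
                + " REPLACE WITH TABLE " + quote(staging)
                + " PROPERTIES('swap'='false')";
    }
}
```

Calling `StagedOverwriteSql.createStaging("demo", "orders", "orders__staging")` yields the CREATE TABLE ... LIKE statement, and `finalizeOverwrite` with the same arguments yields the ALTER TABLE ... REPLACE statement. Neither touches the target table until the final statement runs.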

This implementation is currently limited to bounded INSERT OVERWRITE with STREAM_LOAD and 2PC enabled. It rejects unsafe configurations such as streaming overwrite, non-Stream Load write modes, sink.ignore.commit-error=true, missing sink.label-prefix, missing jdbc-url, and pre-existing staging tables.
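The rejection rules above can be sketched as a single up-front validation. This is an illustration only: `OverwriteSettings` and `OverwriteGuard` are hypothetical names, the fields are a simplified stand-in for the connector's real option objects, and the messages are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified view of the settings the guard inspects. */
final class OverwriteSettings {
    public boolean bounded;             // true for batch (bounded) execution
    public String writeMode;            // e.g. "STREAM_LOAD"
    public boolean twoPhaseCommit;      // 2PC enabled
    public boolean ignoreCommitError;   // sink.ignore.commit-error
    public String labelPrefix;          // sink.label-prefix
    public String jdbcUrl;              // jdbc-url
    public boolean stagingTableExists;  // result of a prior existence check
}

/** Hypothetical guard that collects every unsafe setting before any DDL runs. */
final class OverwriteGuard {
    private OverwriteGuard() {}

    public static List<String> violations(OverwriteSettings s) {
        List<String> problems = new ArrayList<>();
        if (!s.bounded) problems.add("streaming INSERT OVERWRITE is not supported");
        if (!"STREAM_LOAD".equalsIgnoreCase(s.writeMode)) problems.add("write mode must be STREAM_LOAD");
        if (!s.twoPhaseCommit) problems.add("2PC must be enabled");
        if (s.ignoreCommitError) problems.add("sink.ignore.commit-error must be false");
        if (isBlank(s.labelPrefix)) problems.add("sink.label-prefix is required");
        if (isBlank(s.jdbcUrl)) problems.add("jdbc-url is required");
        if (s.stagingTableExists) problems.add("staging table already exists");
        return problems;
    }

    /** Fails fast with all problems listed, so nothing is created or written. */
    public static void validate(OverwriteSettings s) {
        List<String> problems = violations(s);
        if (!problems.isEmpty()) {
            throw new IllegalArgumentException(
                    "Unsafe INSERT OVERWRITE configuration: " + String.join("; ", problems));
        }
    }

    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }
}
```

Collecting all violations before throwing is a design choice of this sketch: the user sees every problem in one error rather than fixing them one at a time.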

Additional guards were added to require Doris table id metadata from information_schema.metadata_name_ids, so the connector can detect target-table changes before finalization and identify already-finalized overwrite attempts.
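One way to picture the table id guard is as a three-way decision made just before finalization. The sketch below rests on assumptions that are not confirmed by this PR description: that `information_schema.metadata_name_ids` exposes `TABLE_ID`, `DATABASE_NAME` and `TABLE_NAME` columns, and that after a successful REPLACE with 'swap'='false' the target name resolves to the id the staging table had. `TableIdCheck` and its members are hypothetical names.

```java
/** Hypothetical decision logic for the pre-finalization table id check. */
final class TableIdCheck {
    private TableIdCheck() {}

    public enum Decision { FINALIZE, ALREADY_FINALIZED, TARGET_CHANGED }

    /** Assumed lookup query; the column names are an assumption of this sketch. */
    public static final String LOOKUP_SQL =
            "SELECT TABLE_ID FROM information_schema.metadata_name_ids "
                    + "WHERE DATABASE_NAME = ? AND TABLE_NAME = ?";

    /**
     * @param targetIdAtStart id of the target table recorded when the job started
     * @param stagingId       id of the staging table created for this attempt
     * @param currentTargetId id the target name resolves to right now
     */
    public static Decision decide(long targetIdAtStart, long stagingId, long currentTargetId) {
        if (currentTargetId == targetIdAtStart) {
            return Decision.FINALIZE;          // target untouched, safe to replace
        }
        if (currentTargetId == stagingId) {
            return Decision.ALREADY_FINALIZED; // an earlier attempt already replaced it
        }
        return Decision.TARGET_CHANGED;        // dropped, recreated, or replaced by someone else
    }
}
```

Under these assumptions, a retry after a crash between the REPLACE and the job's completion would see `ALREADY_FINALIZED` and skip the DDL, while a concurrent change to the target would surface as `TARGET_CHANGED` and abort instead of overwriting the table.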

Checklist(Required)

  1. Does it affect the original behavior: Yes
  2. Has unit tests been added: Yes
  3. Has document been added or modified: No Need
  4. Does it need to update dependencies: No
  5. Are there any changes that cannot be rolled back: No

Further comments

This change intentionally chooses a conservative first version:

  • Only bounded overwrite is supported.
  • Only Stream Load 2PC is supported.
  • Existing staging tables are never reused, so stale or mixed data cannot be published.
  • Table id metadata is required instead of silently degrading safety checks.

Alternatives considered include continuing to use TRUNCATE TABLE, reusing existing staging tables during recovery, or supporting more write modes immediately. These were not chosen because they either preserve the original data-loss risk or make failure/recovery semantics harder to prove safe.

Validation performed:

mvn -Pflink1 -pl flink-doris-connector-base -Dtest=TestDorisOverwriteManager test
mvn -Pflink1 -pl flink-doris-connector-flink1 -am -DskipTests compile
mvn -Pflink2 -pl flink-doris-connector-flink2 -am -DskipTests compile

liujiwen-up changed the title from "[fix] Support staged Doris insert overwrite" to "[fix] support staged doris insert overwrite" on May 16, 2026
JNSimba requested a review from Copilot on May 18, 2026 09:37

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
