
[Improve](streaming job) support custom table name mapping for CDC streaming job #61317

Open
JNSimba wants to merge 2 commits into apache:master from JNSimba:pg-tablemapping

Conversation

@JNSimba
Member

@JNSimba JNSimba commented Mar 13, 2026

What problem does this PR solve?

Summary

Add support for mapping upstream (PostgreSQL) table names to custom downstream (Doris) table names
in CDC streaming jobs. Without this feature, the Doris target table must have the same name as the
upstream source table.

New configuration

Key format: "table.<srcTable>.target_table" = "<dstTable>" in the FROM clause properties.

CREATE JOB my_job
  ON STREAMING
  FROM POSTGRES (
    ...
    "include_tables" = "pg_orders",
    "table.pg_orders.target_table" = "doris_orders"
  )
  TO DATABASE mydb (...)

When not configured, behavior is unchanged (target table name = source table name).
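The fallback rule above amounts to a one-line lookup. A minimal sketch (the class and method names here are illustrative, not the PR's actual signatures):

```java
import java.util.Map;

public class TargetTableResolver {
    // Hypothetical helper mirroring the documented fallback:
    // use the mapped target name when present, otherwise the source name.
    public static String resolveTargetTable(Map<String, String> mappings, String srcTable) {
        return mappings.getOrDefault(srcTable, srcTable);
    }

    public static void main(String[] args) {
        Map<String, String> mappings = Map.of("pg_orders", "doris_orders");
        System.out.println(resolveTargetTable(mappings, "pg_orders")); // doris_orders
        System.out.println(resolveTargetTable(mappings, "pg_users"));  // pg_users (unmapped)
    }
}
```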

Key design decisions

  • generateCreateTableCmds returns LinkedHashMap<srcName, CreateTableCommand> so callers can
    distinguish source names (for CDC monitoring) from target names (for DDL) — this fixes a bug
    where the CDC split assigner would look up the Doris target table name in PostgreSQL
  • Multi-table merge is supported: two source tables can map to the same Doris table
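The source-keyed map design described above can be sketched as follows; this is an assumption-laden illustration (class name and the use of a target-name String in place of CreateTableCommand are mine), showing why keying by source name keeps CDC monitoring correct even when two sources merge into one target:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CreateCmdMapSketch {
    // Stand-in for LinkedHashMap<srcName, CreateTableCommand>: the value here is just
    // the target table name the generated CREATE TABLE command would use.
    public static LinkedHashMap<String, String> buildCmds() {
        LinkedHashMap<String, String> cmds = new LinkedHashMap<>();
        cmds.put("pg_orders_a", "doris_orders"); // two source tables merged
        cmds.put("pg_orders_b", "doris_orders"); // into one Doris target
        return cmds;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : buildCmds().entrySet()) {
            // Key: source name, handed to the CDC split assigner (must exist in PostgreSQL).
            // Value: target name, used only for DDL against Doris.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```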

Test plan

  • test_streaming_postgres_job_table_mapping: basic mapping (INSERT/UPDATE/DELETE land in mapped table; Doris table
    created with target name, not source name)
  • test_streaming_postgres_job_table_mapping: multi-table merge (two PG tables → one Doris table, snapshot +
    incremental)

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 13, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally including the specific error message) and how it was fixed.
  2. Which behaviors were modified: what was the previous behavior, what is it now, why was it changed, and what impact might the change have.
  3. What features were added, and why.
  4. Which code was refactored, and why.
  5. Which functions were optimized, and what is the difference before and after the optimization?

@JNSimba JNSimba requested a review from Copilot March 13, 2026 09:37
@JNSimba
Member Author

JNSimba commented Mar 13, 2026

/review


Copilot AI left a comment


Pull request overview

Adds support for streaming CDC jobs (Postgres) to map upstream source table names to different Doris target table names via per-table config (table.<src>.target_table), including schema-change DDL routing and regression coverage.

Changes:

  • Introduce per-table config key constants and validation for table.<tableName>.<suffix> (adds target_table suffix).
  • Update FE table auto-creation to create Doris tables using mapped target names while keeping CDC monitoring based on source table names.
  • Update CDC client to route stream-load writes and schema-change DDLs to mapped target tables; add regression tests for mapping + multi-source merge.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_table_mapping.groovy New regression suite covering table name mapping and two-source-to-one-target merge.
regression-test/data/job_p0/streaming_job/cdc/test_streaming_postgres_job_table_mapping.out Expected results for the new mapping regression suite.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java Adds helper to parse all table.<src>.target_table mappings from config.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/PostgresDebeziumJsonDeserializer.java Route schema-change DDLs to the mapped Doris target table.
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/DebeziumJsonDeserializer.java Cache parsed source→target mappings and provide resolveTargetTable().
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/service/PipelineCoordinator.java Route stream-load writes to mapped target table names.
fe/fe-core/src/main/java/org/apache/doris/job/util/StreamingJobUtils.java Generate CREATE TABLE commands keyed by source table name and create Doris tables using mapped target names.
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java Use the new source→CreateTableCommand mapping and ensure CDC monitors source tables.
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/DataSourceConfigValidator.java Validate per-table config key format and allowlisted suffixes.
fe/fe-common/src/main/java/org/apache/doris/job/cdc/DataSourceConfigKeys.java Adds per-table key constants (table, exclude_columns, target_table).


Comment on lines +126 to +132

Excerpt from ConfigUtil.java:

    public static Map<String, String> parseAllTargetTableMappings(Map<String, String> config) {
        String prefix = DataSourceConfigKeys.TABLE + ".";
        String suffix = "." + DataSourceConfigKeys.TABLE_TARGET_TABLE_SUFFIX;
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : config.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(prefix) && key.endsWith(suffix)) {
                String srcTable = key.substring(prefix.length(), key.length() - suffix.length());
                String dstTable = entry.getValue().trim();

Excerpt from DataSourceConfigValidator.java:

    // Known suffixes for per-table config keys (format: "table.<tableName>.<suffix>")
    private static final Set<String> ALLOW_TABLE_LEVEL_SUFFIXES = Sets.newHashSet(
            DataSourceConfigKeys.TABLE_TARGET_TABLE_SUFFIX);

    String suffix = parts[parts.length - 1];
    if (!ALLOW_TABLE_LEVEL_SUFFIXES.contains(suffix)) {
        throw new IllegalArgumentException("Unknown per-table config key: '" + key + "'");
    }

@github-actions github-actions bot left a comment


Code Review Summary

This PR adds support for mapping upstream (PostgreSQL) table names to custom downstream (Doris) table names in CDC streaming jobs. The design is generally sound — the mapping is applied consistently across FE table creation, CDC client DML writes, and schema change DDL generation.

Critical Checkpoints

Goal achievement: The code accomplishes the stated goal. Table name mapping is applied in all three necessary places: (1) FE CreateTableCommand uses the mapped target name, (2) PipelineCoordinator routes DML records to the mapped Doris table, (3) PostgresDebeziumJsonDeserializer generates DDLs with the mapped name. Regression tests cover both basic mapping and multi-table merge scenarios.

Modification focus: The change is focused and touches only the necessary files. Good.

Concurrency: targetTableMappingsCache in DebeziumJsonDeserializer is a plain HashMap written once in init() and only read afterwards. This is safe in practice but could be made more robust with Collections.unmodifiableMap(). Low risk, not blocking.
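The hardening suggested above (publish the cache as an unmodifiable view after the single write in init()) can be sketched as follows; the class and method names are hypothetical, not the PR's actual code:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class InitOnceCache {
    // Written once in init(), read-only afterwards, per the pattern described above.
    private Map<String, String> targetTableMappingsCache = Collections.emptyMap();

    public void init(Map<String, String> parsedMappings) {
        // Defensive copy wrapped in an unmodifiable view, so any post-init
        // mutation attempt fails fast with UnsupportedOperationException.
        targetTableMappingsCache = Collections.unmodifiableMap(new HashMap<>(parsedMappings));
    }

    public String resolve(String srcTable) {
        return targetTableMappingsCache.getOrDefault(srcTable, srcTable);
    }
}
```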

Lifecycle management: No special lifecycle concerns.

Configuration items: New config key format table.<src>.target_table is validated in DataSourceConfigValidator. Dynamic changes not applicable (job creation time only).

Incompatible changes: The return type of generateCreateTableCmds changed from List<CreateTableCommand> to LinkedHashMap<String, CreateTableCommand>. This has only one caller (createTableIfNotExists), so no compatibility concern.

Parallel code paths: MySqlDebeziumJsonDeserializer exists as a parallel path. Its DML write routing goes through the same PipelineCoordinator code (correctly mapped). Its handleSchemaChangeEvent() is a TODO stub that returns empty, so no mapping needed yet. When implemented, it will need to use resolveTargetTable() — the infrastructure is already in the base class. Acceptable.

Test coverage: Good. Two regression test cases cover basic mapping (INSERT/UPDATE/DELETE) and multi-table merge (two PG tables → one Doris table). Tests use ORDER BY for deterministic output. Tables are dropped before use, not after. .out file appears auto-generated.

Observability: No new critical paths requiring additional logging. Existing log messages correctly use source table identifiers.

Persistence/transactions: Not applicable — no EditLog or transaction modifications.

Issues Found

  1. [CRITICAL] Compilation error in ConfigUtil.java: Missing import java.util.HashMap and import org.apache.doris.job.cdc.DataSourceConfigKeys. The new parseAllTargetTableMappings method uses HashMap and DataSourceConfigKeys but neither is imported. This file will not compile.

  2. [Minor] Dead code: TABLE_EXCLUDE_COLUMNS_SUFFIX is declared in DataSourceConfigKeys but never referenced anywhere in the codebase (not in the validator's ALLOW_TABLE_LEVEL_SUFFIXES, not in any consumer). If this is a placeholder for a future feature, it should be removed from this PR and added when actually needed to avoid confusion.

  3. [Minor] Validator rejects table names containing dots: The DataSourceConfigValidator splits on . and requires exactly 3 parts. A table name containing a dot (e.g., my.table) would produce more parts and be rejected. PostgreSQL allows dots in quoted identifiers. Consider documenting this limitation or using indexOf/lastIndexOf instead of split.

    for (Map.Entry<String, String> entry : config.entrySet()) {
        String key = entry.getKey();
        if (key.startsWith(prefix) && key.endsWith(suffix)) {
            String srcTable = key.substring(prefix.length(), key.length() - suffix.length());

[CRITICAL] Compilation error: HashMap is used here but java.util.HashMap is not imported. Similarly, DataSourceConfigKeys (used on the two lines above) is not imported either. This file will fail to compile.

Add the following imports:

import org.apache.doris.job.cdc.DataSourceConfigKeys;
import java.util.HashMap;

public static final String TABLE_TARGET_TABLE_SUFFIX = "target_table";

// target properties
public static final String TABLE_PROPS_PREFIX = "table.create.properties.";

[Minor] Dead code: TABLE_EXCLUDE_COLUMNS_SUFFIX is declared here but is never used anywhere in the codebase — not in DataSourceConfigValidator.ALLOW_TABLE_LEVEL_SUFFIXES, not in any consumer code. If this is a placeholder for a future feature, consider removing it from this PR to avoid confusion and adding it when the feature is actually implemented.

throw new IllegalArgumentException("Malformed per-table config key: '" + key
+ "'. Expected format: table.<tableName>.<suffix>");
}
String suffix = parts[parts.length - 1];
The reason will be displayed to describe this comment to others. Learn more.

[Minor] Table names with dots will be rejected: split("\\.", -1) combined with the parts.length != 3 check means a table name containing a dot (e.g., table.my.dotted.table.target_table) produces more than 3 parts and fails validation. PostgreSQL allows dots in quoted identifiers.

Consider treating the last dot as the suffix separator instead. A sketch, assuming TABLE_LEVEL_PREFIX is the "table." prefix:

    int lastDot = key.lastIndexOf('.');
    if (lastDot <= TABLE_LEVEL_PREFIX.length()) {
        throw new IllegalArgumentException("Malformed per-table config key: '" + key
                + "'. Expected format: table.<tableName>.<suffix>");
    }
    // Everything between the prefix and the last dot is the table name,
    // so dots inside the name are preserved.
    String tableName = key.substring(TABLE_LEVEL_PREFIX.length(), lastDot);
    String suffix = key.substring(lastDot + 1);

Or at minimum, document this limitation (no dots in source table names).

@JNSimba
Member Author

JNSimba commented Mar 13, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 26847 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e59494ac1f4eeff35a2aa03ad4f761dfe2b1e70b, data reload: false

------ Round 1 ----------------------------------
q1	17604	4504	4304	4304
q2
q3	10655	782	523	523
q4	4686	375	267	267
q5	7551	1216	1013	1013
q6	182	182	152	152
q7	794	833	677	677
q8	9306	1554	1361	1361
q9	4731	4752	4712	4712
q10	6246	1874	1669	1669
q11	453	271	254	254
q12	707	559	470	470
q13	18023	2931	2153	2153
q14	235	226	213	213
q15
q16	750	727	670	670
q17	726	870	433	433
q18	5899	5374	5252	5252
q19	1121	970	617	617
q20	543	491	369	369
q21	4370	1860	1443	1443
q22	338	295	392	295
Total cold run time: 94920 ms
Total hot run time: 26847 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4895	4622	4619	4619
q2
q3	3895	4328	3809	3809
q4	869	1196	763	763
q5	4100	4400	4312	4312
q6	187	180	144	144
q7	1726	1667	1514	1514
q8	2498	2742	2510	2510
q9	7610	7412	7593	7412
q10	3793	3970	3645	3645
q11	528	431	421	421
q12	501	602	462	462
q13	2712	3224	2370	2370
q14	285	292	272	272
q15
q16	709	740	737	737
q17	1130	1314	1374	1314
q18	7258	6855	6589	6589
q19	987	1116	938	938
q20	2080	2158	2038	2038
q21	3950	3458	3362	3362
q22	456	440	396	396
Total cold run time: 50169 ms
Total hot run time: 47627 ms

@doris-robot

TPC-DS: Total hot run time: 169055 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e59494ac1f4eeff35a2aa03ad4f761dfe2b1e70b, data reload: false

query5	4321	653	527	527
query6	317	257	208	208
query7	4221	477	272	272
query8	346	253	240	240
query9	8685	2758	2770	2758
query10	510	382	342	342
query11	7036	5110	4909	4909
query12	185	130	130	130
query13	1266	466	354	354
query14	5770	3754	3522	3522
query14_1	2892	2888	2856	2856
query15	206	197	177	177
query16	1002	473	455	455
query17	1122	736	610	610
query18	2453	453	356	356
query19	212	213	188	188
query20	139	127	134	127
query21	231	149	116	116
query22	13292	14127	14831	14127
query23	16066	16056	15518	15518
query23_1	15635	15573	15519	15519
query24	7106	1648	1264	1264
query24_1	1257	1243	1253	1243
query25	545	523	402	402
query26	1245	262	149	149
query27	2771	488	299	299
query28	4482	1847	1843	1843
query29	822	551	479	479
query30	300	230	195	195
query31	1002	952	883	883
query32	88	71	78	71
query33	506	331	276	276
query34	910	870	528	528
query35	651	688	605	605
query36	1115	1139	1028	1028
query37	137	99	86	86
query38	2962	2917	2857	2857
query39	870	843	800	800
query39_1	793	795	778	778
query40	228	157	140	140
query41	64	60	57	57
query42	266	252	259	252
query43	241	259	230	230
query44	
query45	199	192	193	192
query46	887	984	619	619
query47	2085	2113	2059	2059
query48	321	331	224	224
query49	625	461	383	383
query50	704	281	214	214
query51	4048	4085	3978	3978
query52	262	272	258	258
query53	307	343	289	289
query54	304	275	268	268
query55	98	86	92	86
query56	323	322	312	312
query57	1968	1820	1733	1733
query58	286	285	270	270
query59	2816	2936	2766	2766
query60	351	338	323	323
query61	152	148	158	148
query62	613	593	550	550
query63	323	289	281	281
query64	5036	1259	979	979
query65	
query66	1463	458	348	348
query67	24361	24349	24309	24309
query68	
query69	411	319	280	280
query70	1007	942	975	942
query71	349	316	316	316
query72	2835	2732	2618	2618
query73	546	557	320	320
query74	9649	9621	9423	9423
query75	2887	2783	2483	2483
query76	2314	1055	702	702
query77	380	417	350	350
query78	10862	11140	10446	10446
query79	1165	778	585	585
query80	1321	640	558	558
query81	548	259	222	222
query82	1012	156	122	122
query83	342	264	247	247
query84	248	118	104	104
query85	905	505	448	448
query86	426	298	274	274
query87	3176	3123	3077	3077
query88	3612	2688	2688	2688
query89	434	378	343	343
query90	2029	180	176	176
query91	170	160	143	143
query92	81	75	76	75
query93	978	860	502	502
query94	640	333	303	303
query95	594	340	382	340
query96	642	539	231	231
query97	2429	2489	2368	2368
query98	240	223	222	222
query99	1013	991	897	897
Total cold run time: 249134 ms
Total hot run time: 169055 ms
