
Commit 7d311ce

ThorneANNThorne
andcommitted
[FLINK-39001][doc][Flink-source]supple NewlyAddTable's doc with mongodb,postgres,oracle connectors (apache#4247)
Co-authored-by: Thorne <syyfffy@email> Co-authored-by: Thorne <syyfffy@163.com>
1 parent 984ce8c commit 7d311ce

6 files changed

Lines changed: 366 additions & 0 deletions


docs/content.zh/docs/connectors/flink-sources/mongodb-cdc.md

Lines changed: 57 additions & 0 deletions
@@ -489,6 +489,63 @@ MongoDB 的`oplog.rs` 集合没有在状态之前保持更改记录的更新,
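By the way, Debezium's exploration of MongoDB change streams, mentioned in [DBZ-435](https://issues.redhat.com/browse/DBZ-435), is on the roadmap.<br>
Once it is done, we can consider integrating both kinds of source connector for users to choose from.

### Dynamic Table Addition

**Note:** This feature is supported since Flink CDC 3.1.0.

The dynamic table addition feature lets you add new collections to be monitored by a running job. A newly added collection first has its snapshot data read, and then its change stream is read automatically.

Imagine this scenario: at first, a Flink job monitors the collections `[product, user, address]`, but a few days later we want the job to also monitor the collections `[order, custom]`, which contain historical data, and we still need the job to reuse its existing state. The dynamic table addition feature handles this case gracefully.

The following steps show how to enable this feature for the scenario above. Take an existing job that uses the MongoDB CDC Source, as follows:
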
```java
MongoDBSource<String> mongoSource = MongoDBSource.<String>builder()
    .hosts("yourHostname:27017")
    .databaseList("db") // set the captured database
    .collectionList("db.product", "db.user", "db.address") // set the captured collections
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true) // enable the scan newly added tables feature
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to a JSON string
    .build();
// your business code
```

If we want to add the new collections `[order, custom]` to the existing Flink job, we only need to add them to the job's `collectionList()` and restore the job from an existing savepoint.

_Step 1_: Stop the existing Flink job with a savepoint.
```shell
$ ./bin/flink stop $Existing_Flink_JOB_ID
```
```shell
Suspending job "cca7bc1061d61cf15238e92312c2fc20" with a savepoint.
Savepoint completed. Path: file:/tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab
```
_Step 2_: Update the collection list option of the existing Flink job.
1. Update the `collectionList()` argument.
2. Build the updated job, for example:
```java
MongoDBSource<String> mongoSource = MongoDBSource.<String>builder()
    .hosts("yourHostname:27017")
    .databaseList("db")
    .collectionList("db.product", "db.user", "db.address", "db.order", "db.custom") // set the captured collections [product, user, address, order, custom]
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true)
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to a JSON string
    .build();
// your business code
```
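
Before restoring, it can help to double-check which collections the job will treat as newly added. The sketch below is only an illustration in plain Java: the class `NewlyAddedCollections` and its method `newlyAdded` are hypothetical helpers, not part of Flink CDC, and they simply compare the old and updated `collectionList()` arguments.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a Flink CDC API): compares two collection lists.
class NewlyAddedCollections {

    /** Returns the entries of {@code updated} that are absent from {@code previous}, in order. */
    static List<String> newlyAdded(List<String> previous, List<String> updated) {
        Set<String> added = new LinkedHashSet<>(updated); // keeps the order of the updated list
        added.removeAll(previous);                        // drop collections that were already captured
        return new ArrayList<>(added);
    }

    public static void main(String[] args) {
        List<String> previous = List.of("db.product", "db.user", "db.address");
        List<String> updated =
                List.of("db.product", "db.user", "db.address", "db.order", "db.custom");
        System.out.println(newlyAdded(previous, updated)); // prints [db.order, db.custom]
    }
}
```

In the scenario above, the two collections printed are the ones whose snapshot data would be read first after the restore.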
_Step 3_: Restore the updated Flink job from the savepoint.
```shell
$ ./bin/flink run \
    --detached \
    --from-savepoint /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab \
    ./FlinkCDCExample.jar
```
**Note:** See [Restore the job from previous savepoint](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/cli/#command-line-interface) for more details.

### DataStream Source

The MongoDB CDC connector can also be a DataStream source. You can create a SourceFunction as follows:

docs/content.zh/docs/connectors/flink-sources/oracle-cdc.md

Lines changed: 61 additions & 0 deletions
@@ -558,6 +558,67 @@ _Note: the mechanism of `scan.startup.mode` option relying on Debezium's `snapsh
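
The Oracle CDC source can't read in parallel, because only one task can receive change events.

### Dynamic Table Addition

**Note:** This feature is supported since Flink CDC 3.1.0.

The dynamic table addition feature lets you add new tables to be monitored by a running job. A newly added table first has its snapshot data read, and then its redo log is read automatically.

Imagine this scenario: at first, a Flink job monitors the tables `[product, user, address]`, but a few days later we want the job to also monitor the tables `[order, custom]`, which contain historical data, and we still need the job to reuse its existing state. The dynamic table addition feature handles this case gracefully.

The following steps show how to enable this feature for the scenario above. Take an existing job that uses the Oracle CDC Source, as follows:
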
```java
JdbcIncrementalSource<String> oracleSource = new OracleSourceBuilder()
    .hostname("yourHostname")
    .port(1521)
    .databaseList("ORCLCDB") // set the captured database
    .schemaList("INVENTORY") // set the captured schema
    .tableList("INVENTORY.PRODUCT", "INVENTORY.USER", "INVENTORY.ADDRESS") // set the captured tables
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true) // enable the scan newly added tables feature
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to a JSON string
    .build();
// your business code
```

If we want to add the new tables `[INVENTORY.ORDER, INVENTORY.CUSTOM]` to the existing Flink job, we only need to add them to the job's `tableList()` and restore the job from an existing savepoint.

_Step 1_: Stop the existing Flink job with a savepoint.
```shell
$ ./bin/flink stop $Existing_Flink_JOB_ID
```
```shell
Suspending job "cca7bc1061d61cf15238e92312c2fc20" with a savepoint.
Savepoint completed. Path: file:/tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab
```
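
If the stop and restore steps are scripted, the savepoint location has to be taken from the CLI output. The following is a minimal sketch that assumes the output looks exactly like the sample above (the wording may differ between Flink versions); `SavepointPathParser` and its methods are hypothetical names, not part of Flink.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the savepoint location from `flink stop` output.
class SavepointPathParser {

    // Assumes the line format shown in the sample output above.
    private static final Pattern SAVEPOINT_LINE =
            Pattern.compile("Savepoint completed\\. Path: (\\S+)");

    /** Returns the savepoint URI reported by the CLI, if the expected line is present. */
    static Optional<String> savepointUri(String cliOutput) {
        Matcher matcher = SAVEPOINT_LINE.matcher(cliOutput);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** Strips the URI scheme, e.g. "file:/tmp/x" becomes "/tmp/x". */
    static String localPath(String savepointUri) {
        return URI.create(savepointUri).getPath();
    }

    public static void main(String[] args) {
        String output = "Suspending job \"cca7bc1061d61cf15238e92312c2fc20\" with a savepoint.\n"
                + "Savepoint completed. Path: file:/tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab\n";
        savepointUri(output)
                .map(SavepointPathParser::localPath)
                .ifPresent(System.out::println);
        // prints /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab
    }
}
```

The printed path is the value that is later passed to `--from-savepoint`.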
_Step 2_: Update the table list option of the existing Flink job.
1. Update the `tableList()` argument.
2. Build the updated job, for example:
```java
JdbcIncrementalSource<String> oracleSource = new OracleSourceBuilder()
    .hostname("yourHostname")
    .port(1521)
    .databaseList("ORCLCDB")
    .schemaList("INVENTORY")
    .tableList("INVENTORY.PRODUCT", "INVENTORY.USER", "INVENTORY.ADDRESS", "INVENTORY.ORDER", "INVENTORY.CUSTOM") // set the captured tables [PRODUCT, USER, ADDRESS, ORDER, CUSTOM]
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true)
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to a JSON string
    .build();
// your business code
```
_Step 3_: Restore the updated Flink job from the savepoint.
```shell
$ ./bin/flink run \
    --detached \
    --from-savepoint /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab \
    ./FlinkCDCExample.jar
```
**Note:** See [Restore the job from previous savepoint](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/cli/#command-line-interface) for more details.

### DataStream Source

The Oracle CDC connector can also be a DataStream source. There are two modes for the DataStream source:

docs/content.zh/docs/connectors/flink-sources/postgres-cdc.md

Lines changed: 65 additions & 0 deletions
@@ -510,6 +510,71 @@ The config option `scan.startup.mode` specifies the startup mode for PostgreSQL
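- `committed-offset`: Skips the snapshot phase and starts reading events from the `confirmed_flush_lsn` offset of the replication slot.
- `snapshot`: Only the snapshot phase is performed, and the job exits after the snapshot phase reading is completed.

### Dynamic Table Addition

**Note:** This feature is supported since Flink CDC 3.1.0.

The dynamic table addition feature lets you add new tables to be monitored by a running job. A newly added table first has its snapshot data read, and then its WAL (Write-Ahead Log), that is, the changes from the replication slot, is read automatically.

Imagine this scenario: at first, a Flink job monitors the tables `[product, user, address]`, but a few days later we want the job to also monitor the tables `[order, custom]`, which contain historical data, and we still need the job to reuse its existing state. The dynamic table addition feature handles this case gracefully.

The following steps show how to enable this feature for the scenario above. Take an existing job that uses the PostgreSQL CDC Source, as follows:
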
```java
JdbcIncrementalSource<String> postgresSource =
    PostgresSourceBuilder.PostgresIncrementalSource.<String>builder()
        .hostname("yourHostname")
        .port(5432)
        .database("postgres") // set the captured database
        .schemaList("inventory") // set the captured schema
        .tableList("inventory.product", "inventory.user", "inventory.address") // set the captured tables
        .username("yourUsername")
        .password("yourPassword")
        .slotName("flink")
        .scanNewlyAddedTableEnabled(true) // enable the scan newly added tables feature
        .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to a JSON string
        .build();
// your business code
```

If we want to add the new tables `[inventory.order, inventory.custom]` to the existing Flink job, we only need to add them to the job's `tableList()` and restore the job from an existing savepoint.

_Step 1_: Stop the existing Flink job with a savepoint.
```shell
$ ./bin/flink stop $Existing_Flink_JOB_ID
```
```shell
Suspending job "cca7bc1061d61cf15238e92312c2fc20" with a savepoint.
Savepoint completed. Path: file:/tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab
```
_Step 2_: Update the table list option of the existing Flink job.
1. Update the `tableList()` argument.
2. Build the updated job, for example:
```java
JdbcIncrementalSource<String> postgresSource =
    PostgresSourceBuilder.PostgresIncrementalSource.<String>builder()
        .hostname("yourHostname")
        .port(5432)
        .database("postgres")
        .schemaList("inventory")
        .tableList("inventory.product", "inventory.user", "inventory.address", "inventory.order", "inventory.custom") // set the captured tables [product, user, address, order, custom]
        .username("yourUsername")
        .password("yourPassword")
        .slotName("flink")
        .scanNewlyAddedTableEnabled(true)
        .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to a JSON string
        .build();
// your business code
```
_Step 3_: Restore the updated Flink job from the savepoint.
```shell
$ ./bin/flink run \
    --detached \
    --from-savepoint /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab \
    ./FlinkCDCExample.jar
```
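
For deployments that launch the restore from a Java tool rather than a shell, the same command can be assembled as an argument list. This is only a sketch: `RestoreCommand` and `build` are hypothetical names, the Flink home directory is an assumed parameter, and the flags are exactly the ones used in the command above.

```java
import java.util.List;

// Hypothetical helper: assembles the `flink run` arguments used in Step 3.
class RestoreCommand {

    /** Builds the argument list; nothing is executed here. */
    static List<String> build(String flinkHome, String savepointPath, String jobJar) {
        return List.of(
                flinkHome + "/bin/flink",
                "run",
                "--detached",
                "--from-savepoint", savepointPath,
                jobJar);
    }

    public static void main(String[] args) {
        List<String> command = build(
                ".",
                "/tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab",
                "./FlinkCDCExample.jar");
        System.out.println(String.join(" ", command));
    }
}
```

The resulting list could be handed to `java.lang.ProcessBuilder`, which avoids shell quoting issues when the savepoint path contains spaces.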
**Note:** See [Restore the job from previous savepoint](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/cli/#command-line-interface) for more details.

### DataStream Source

The Postgres CDC connector can also be a DataStream source. There are two modes for the DataStream source:

docs/content/docs/connectors/flink-sources/mongodb-cdc.md

Lines changed: 57 additions & 0 deletions
@@ -512,6 +512,63 @@ Applications can use change streams to subscribe to all data changes on a single
By the way, Debezium's exploration of MongoDB change streams, mentioned in [DBZ-435](https://issues.redhat.com/browse/DBZ-435), is on the roadmap.<br>
Once it is done, we can consider integrating both kinds of source connector for users to choose from.

### Scan Newly Added Collections

**Note:** This feature is available since Flink CDC 3.1.0.

The Scan Newly Added Collections feature lets you add new collections to monitor for an existing running pipeline. The newly added collections first read their snapshot data and then read their change stream automatically.

Imagine this scenario: at the beginning, a Flink job monitors collections `[product, user, address]`, but after some days we would like the job to also monitor collections `[order, custom]`, which contain historical data, and we need the job to still reuse its existing state. This feature resolves this case gracefully.

The following steps show how to enable this feature for the above scenario. Consider an existing Flink job that uses the MongoDB CDC Source like this:

```java
MongoDBSource<String> mongoSource = MongoDBSource.<String>builder()
    .hosts("yourHostname:27017")
    .databaseList("db") // set captured database
    .collectionList("db.product", "db.user", "db.address") // set captured collections
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true) // enable the scan newly added collections feature
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
    .build();
// your business code
```

If we would like to add the new collections `[order, custom]` to an existing Flink job, we just need to update the `collectionList()` value of the job to include `[order, custom]` and restore the job from a previous savepoint.

_Step 1_: Stop the existing Flink job with a savepoint.
```shell
$ ./bin/flink stop $Existing_Flink_JOB_ID
```
```shell
Suspending job "cca7bc1061d61cf15238e92312c2fc20" with a savepoint.
Savepoint completed. Path: file:/tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab
```
_Step 2_: Update the collection list option for the existing Flink job.
1. Update the `collectionList()` value.
2. Build the JAR of the updated job.
```java
MongoDBSource<String> mongoSource = MongoDBSource.<String>builder()
    .hosts("yourHostname:27017")
    .databaseList("db")
    .collectionList("db.product", "db.user", "db.address", "db.order", "db.custom") // set captured collections [product, user, address, order, custom]
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true) // enable the scan newly added collections feature
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
    .build();
// your business code
```
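
The examples pass each collection as a `database.collection` name whose database also appears in `databaseList()`. A small pre-flight check along those lines might look like the sketch below. It is an assumption-laden illustration: `CollectionNamespaceCheck` is a hypothetical helper, not part of Flink CDC, and it treats every entry as a literal name, so it does not understand regular-expression patterns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags collection names that are not "<database>.<collection>".
class CollectionNamespaceCheck {

    /** Returns the entries that do not start with "<db>." for any listed database. */
    static List<String> unqualified(List<String> databases, List<String> collections) {
        List<String> rejected = new ArrayList<>();
        for (String collection : collections) {
            boolean qualified = false;
            for (String db : databases) {
                // Require a non-empty collection part after the "db." prefix.
                if (collection.startsWith(db + ".") && collection.length() > db.length() + 1) {
                    qualified = true;
                    break;
                }
            }
            if (!qualified) {
                rejected.add(collection);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        List<String> databases = List.of("db");
        List<String> collections = List.of("db.product", "db.user", "db.order", "custom");
        System.out.println(unqualified(databases, collections)); // prints [custom]
    }
}
```

An empty result means every name follows the `db.collection` form used in the examples.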
_Step 3_: Restore the updated Flink job from the savepoint.
```shell
$ ./bin/flink run \
    --detached \
    --from-savepoint /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab \
    ./FlinkCDCExample.jar
```
**Note:** Please refer to [Restore the job from previous savepoint](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/cli/#command-line-interface) for more details.

### DataStream Source

The MongoDB CDC connector can also be a DataStream source. You can create a SourceFunction as follows:

docs/content/docs/connectors/flink-sources/oracle-cdc.md

Lines changed: 61 additions & 0 deletions
@@ -559,6 +559,67 @@ _Note: the mechanism of `scan.startup.mode` option relying on Debezium's `snapsh

The Oracle CDC source can't read in parallel, because only one task can receive change events.

### Scan Newly Added Tables

**Note:** This feature is available since Flink CDC 3.1.0.

The Scan Newly Added Tables feature lets you add new tables to monitor for an existing running pipeline. The newly added tables first read their snapshot data and then read their redo log automatically.

Imagine this scenario: at the beginning, a Flink job monitors tables `[product, user, address]`, but after some days we would like the job to also monitor tables `[order, custom]`, which contain historical data, and we need the job to still reuse its existing state. This feature resolves this case gracefully.

The following steps show how to enable this feature for the above scenario. Consider an existing Flink job that uses the Oracle CDC Source like this:

```java
JdbcIncrementalSource<String> oracleSource = new OracleSourceBuilder()
    .hostname("yourHostname")
    .port(1521)
    .databaseList("ORCLCDB") // set captured database
    .schemaList("INVENTORY") // set captured schema
    .tableList("INVENTORY.PRODUCT", "INVENTORY.USER", "INVENTORY.ADDRESS") // set captured tables
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true) // enable scan newly added tables feature
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
    .build();
// your business code
```

If we would like to add the new tables `[INVENTORY.ORDER, INVENTORY.CUSTOM]` to an existing Flink job, we just need to update the `tableList()` value of the job to include `[INVENTORY.ORDER, INVENTORY.CUSTOM]` and restore the job from a previous savepoint.

_Step 1_: Stop the existing Flink job with a savepoint.
```shell
$ ./bin/flink stop $Existing_Flink_JOB_ID
```
```shell
Suspending job "cca7bc1061d61cf15238e92312c2fc20" with a savepoint.
Savepoint completed. Path: file:/tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab
```
_Step 2_: Update the table list option for the existing Flink job.
1. Update the `tableList()` value.
2. Build the JAR of the updated job.
```java
JdbcIncrementalSource<String> oracleSource = new OracleSourceBuilder()
    .hostname("yourHostname")
    .port(1521)
    .databaseList("ORCLCDB")
    .schemaList("INVENTORY")
    .tableList("INVENTORY.PRODUCT", "INVENTORY.USER", "INVENTORY.ADDRESS", "INVENTORY.ORDER", "INVENTORY.CUSTOM") // set captured tables [PRODUCT, USER, ADDRESS, ORDER, CUSTOM]
    .username("yourUsername")
    .password("yourPassword")
    .scanNewlyAddedTableEnabled(true)
    .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
    .build();
// your business code
```
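
As the table list grows, repeating the schema prefix by hand becomes error-prone. One possible way to build the schema-qualified names is sketched below; `QualifiedTableNames` and `qualify` are hypothetical helpers, not part of Flink CDC. The sketch assumes that `tableList()` takes a `String...` argument, so the returned array could be passed to it directly.

```java
import java.util.List;

// Hypothetical helper: prefixes each table name with its schema.
class QualifiedTableNames {

    /** Returns "<schema>.<table>" for every table, preserving the input order. */
    static String[] qualify(String schema, List<String> tables) {
        return tables.stream()
                .map(table -> schema + "." + table)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] names = qualify("INVENTORY", List.of("PRODUCT", "USER", "ADDRESS", "ORDER", "CUSTOM"));
        System.out.println(String.join(", ", names));
        // prints INVENTORY.PRODUCT, INVENTORY.USER, INVENTORY.ADDRESS, INVENTORY.ORDER, INVENTORY.CUSTOM
    }
}
```

With this approach, adding a table in Step 2 means appending one short name to the list instead of editing a long argument line.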
_Step 3_: Restore the updated Flink job from the savepoint.
```shell
$ ./bin/flink run \
    --detached \
    --from-savepoint /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab \
    ./FlinkCDCExample.jar
```
**Note:** Please refer to [Restore the job from previous savepoint](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/cli/#command-line-interface) for more details.

### DataStream Source

The Oracle CDC connector can also be a DataStream source. There are two modes for the DataStream source:
