You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: website/docs/engine-flink/deltajoins.md
+8-10Lines changed: 8 additions & 10 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -8,14 +8,14 @@ sidebar_position: 6
8
8
Beginning with **Apache Flink 2.1**, a new operator called **Delta Join** was introduced.
9
9
Compared to traditional streaming joins, the delta join operator significantly reduces the amount of state that needs to be maintained during execution. This improvement helps mitigate several common issues associated with large state sizes, including:
10
10
11
-
- Excessive memory and storage consumption
11
+
- Excessive computing resource and storage consumption
12
12
- Long checkpointing durations
13
13
- Extended recovery times after failures or restarts
14
14
15
15
Starting with **Apache Fluss 0.8**, streaming join jobs running on **Flink 2.1 or later** will be automatically optimized into **delta joins** whenever applicable. This optimization happens transparently at query planning time, requiring no manual configuration.
16
16
17
17
## How Delta Join Works
18
-
Traditional streaming joins in Flink require maintaining both input sides entirely in state to match updates across streams. Delta join, by contrast, uses a **prefix-based lookup mechanism**that only retains *relevant subsets*of one table’s data in state. This drastically reduces memory pressure and improves performance for many streaming analytics and enrichment workloads.
18
+
Traditional streaming joins in Flink require maintaining both input sides entirely in state to match records across streams. Delta join, by contrast, uses a **prefix-based lookup mechanism**to transform the behavior of querying data from the state into querying data from the Fluss source table, thereby avoiding redundant storage of the same data in both the Fluss source table and the state. This drastically reduces state size and improves performance for many streaming analytics and enrichment workloads.
If the physical plan includes `DeltaJoin`, it indicates that the optimizer has successfully transformed the traditional streaming join into a delta join.
86
+
If the printed plan includes `DeltaJoin` as shown below, it indicates that the optimizer has successfully transformed the traditional streaming join into a delta join.
This confirms that the delta join optimization is active.
116
114
117
115
## Understanding Prefix Keys
118
116
A prefix key defines the portion of a table’s primary key that can be used for efficient key-based lookups or index pruning.
@@ -125,9 +123,9 @@ For example:
125
123
126
124
In this setup:
127
125
* The delta join operator uses the prefix key (`city_id`) to retrieve only relevant right-side records matching each left-side event.
128
-
* This eliminates the need to hold all records for every city in memory, significantly reducing state size.
126
+
* This eliminates the need to hold all records for every city in Flink state, significantly reducing state size.
129
127
130
-
Prefix keys thus form the foundation for state-efficient lookups in delta joins, enabling Flink to scale join workloads efficiently even under high throughput.
128
+
Prefix keys thus form the foundation for efficient lookups in delta joins, enabling Flink to scale join workloads efficiently even under high throughput.
131
129
132
130
## Flink Version Support
133
131
@@ -144,9 +142,9 @@ Refer to the [Delta Join](https://issues.apache.org/jira/browse/FLINK-37836) for
144
142
145
143
#### Limitations
146
144
147
-
- The primary key or the prefix lookup key of the tables must be included as part of the equivalence conditions in the join.
145
+
- The primary key or the prefix key of the tables must be included as part of the equivalence conditions in the join.
148
146
- The join must be a INNER join.
149
-
- The downstream nodes of the join can accept duplicate changes, such as a sink that provides UPSERT mode.
147
+
- The downstream nodes of the join can accept duplicate changes, such as a sink that provides UPSERT mode without `upsertMaterialize`.
150
148
- All join inputs should be INSERT-ONLY streams.
151
149
- This is why the option `'table.merge-engine' = 'first_row'` is added to the source table DDL.
152
150
- All upstream nodes of the join should be `TableSourceScan` or `Exchange`.
0 commit comments