**website/docs/concepts/architecture.md**
# Architecture

A Fluss cluster consists of two main processes: **CoordinatorServer** and **TabletServer**.

![Fluss' Architecture](assets/architecture.png)

## CoordinatorServer

The **CoordinatorServer** serves as the central control and management component of the cluster. It is responsible for maintaining metadata, managing tablet allocation, listing nodes, and handling permissions.

Additionally, it coordinates critical operations such as:

- Rebalancing data during node scaling (upscaling or downscaling).
- Managing data migration and service node switching in the event of node failures.
- Overseeing table management tasks, including creating or deleting tables and updating bucket counts.

As the "brain" of the cluster, the **CoordinatorServer** ensures efficient cluster operation and seamless management of resources.

## TabletServer

The **TabletServer** is responsible for data storage, persistence, and providing I/O services directly to users. It comprises two key components: **LogStore** and **KvStore**.

- For **tables that support updates** (PrimaryKey tables), both `LogStore` and `KvStore` are enabled to support updates efficiently.
- For **append-only tables** (Log tables), only the `LogStore` is activated, optimizing performance for write-heavy workloads.

This architecture ensures the **TabletServer** delivers tailored data handling capabilities based on table types.

### LogStore

The **LogStore** is designed to store log data, functioning similarly to a database binlog. Messages can only be appended, not modified, ensuring data integrity. Its primary purposes are to enable low-latency streaming reads and to serve as the write-ahead log (WAL) for restoring data in the **KvStore**.

### KvStore

The **KvStore** is used to store table data, functioning similarly to database tables. It supports data updates and deletions, enabling efficient querying and table management. Additionally, it generates comprehensive changelogs to track data modifications.

### Tablet / Bucket

Table data is divided into multiple buckets based on the defined bucketing policy.

Data for the **LogStore** and **KvStore** is stored within tablets. Each tablet consists of a **LogTablet** and, optionally, a **KvTablet**, depending on whether the table supports updates.

Both **LogStore** and **KvStore** adhere to the same bucket-splitting and tablet allocation policies. As a result, **LogTablets** and **KvTablets** with the same `tablet_id` are always allocated to the same **TabletServer** for efficient data management.

The **LogTablet** supports multiple replicas based on the table's configured replication factor, ensuring high availability and fault tolerance. **Currently, replication is not supported for KvTablets.**

## ZooKeeper

Fluss currently utilizes **ZooKeeper** for cluster coordination, metadata storage, and cluster configuration management.

In upcoming releases, **ZooKeeper will be replaced** by **KvStore** for metadata storage and **Raft** for cluster coordination and ensuring consistency. This transition aims to streamline operations and enhance system reliability.

## Remote Storage

**Remote Storage** serves two primary purposes:

- **Hierarchical Storage for LogStores:** By offloading LogStore data, it reduces storage costs and accelerates scaling operations.
- **Persistent Storage for KvStores:** It ensures durable storage for KvStore data and collaborates with LogStore to enable fault recovery.

Additionally, **Remote Storage** allows clients to perform bulk read operations on Log and Kv data, enhancing data analysis efficiency. It also supports bulk write operations, optimizing data import workflows for greater scalability and performance.

## Client

Fluss clients/SDKs support streaming reads and writes, batch reads and writes, DDL, and point queries.
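
The engine-flink sections later in this changeset illustrate each of these access patterns; a condensed Flink SQL sketch (the table and column names here are only examples):

```sql
-- DDL: create a primary-key table
CREATE TABLE pk_table (
    id BIGINT,
    amount INT,
    PRIMARY KEY (id) NOT ENFORCED
);

-- streaming or batch write
INSERT INTO pk_table VALUES (1, 100);

-- batch read with a limit (preview records)
SELECT * FROM pk_table LIMIT 10;

-- point query on the primary key
SELECT * FROM pk_table WHERE id = 1;
```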

**website/docs/concepts/storage-model.md**

# Storage Model

## Database
A Database is a collection of Table objects. You can create/delete databases or create/modify/delete tables under a database.
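
In Flink SQL these operations look roughly as follows (a sketch; the database and table names are illustrative):

```sql
-- create a database and a table under it
CREATE DATABASE my_db;
CREATE TABLE my_db.my_table (id BIGINT, name STRING);

-- drop them again
DROP TABLE my_db.my_table;
DROP DATABASE my_db;
```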
## Table

In Fluss, a Table is the fundamental unit of user data storage, organized into rows and columns. Tables are stored within specific databases, adhering to a hierarchical structure (database -> table).

Tables are classified into two types based on the presence of a primary key:

- **Log Tables:**
  - Designed for append-only scenarios.
  - Support only INSERT operations.
- **PrimaryKey Tables:**
  - Used for updating and managing data in business databases.
  - Support INSERT, UPDATE, and DELETE operations based on the defined primary key.

A Table becomes a [Partitioned Table](../table-design/data-distribution/partitioning.md) when a partition column is defined. Data with the same partition value is stored in the same partition. Partition columns can be applied to both Log Tables and PrimaryKey Tables, but with specific considerations:

- **For Log Tables**, partitioning is commonly used for log data, typically based on date columns, to facilitate data separation and cleaning.
- **For PrimaryKey Tables**, the partition column must be a subset of the primary key to ensure uniqueness.

This design ensures efficient data organization, flexibility in handling different use cases, and adherence to data integrity constraints.
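
As a Flink SQL sketch of the two table types (column names are illustrative; the `PRIMARY KEY ... NOT ENFORCED` clause is standard Flink DDL):

```sql
-- Log Table: no primary key, append-only (INSERT only)
CREATE TABLE access_log (
    user_id BIGINT,
    page STRING,
    ts TIMESTAMP(3)
);

-- PrimaryKey Table: INSERT, UPDATE and DELETE keyed on the primary key
CREATE TABLE user_profile (
    user_id BIGINT,
    city STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
);
```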

## Table Data Organization

![Table Data Organization](assets/table_data_organization.png)

### Partition

A **partition** is a logical division of a table's data into smaller, more manageable subsets based on the values of one or more specified columns, known as partition columns.
Each unique value (or combination of values) in the partition column(s) defines a distinct partition.
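
A sketch of a partitioned Log Table, assuming Flink SQL's standard `PARTITIONED BY` clause (column names are illustrative; see the partitioning docs linked above for the authoritative syntax):

```sql
-- each distinct value of dt (e.g. '2024-01-01') forms its own partition
CREATE TABLE access_log (
    user_id BIGINT,
    page STRING,
    dt STRING
) PARTITIONED BY (dt);
```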

### Bucket

A **bucket** horizontally divides the data of a table/partition into `N` buckets according to the bucketing policy.
The number of buckets `N` can be configured per table. A bucket is the smallest unit of data migration and backup.
The data of a bucket consists of a LogTablet and an optional KvTablet.

### LogTablet

A **LogTablet** needs to be generated for each bucket of Log and PrimaryKey tables.
For Log Tables, the LogTablet is both the primary table data and the log data. For PrimaryKey tables, the LogTablet acts as the log data for the primary table data.

- **Segment:** The smallest unit of log storage in the **LogTablet**. A segment consists of an **.index** file and a **.log** data file.
  - **.index:** An offset sparse index that stores the mapping from a message's relative offset to its physical byte address in the **.log** file.
  - **.log:** The compactly arranged log data.

### KvTablet

Each bucket of a PrimaryKey table needs to generate a KvTablet; append-only Log tables do not need one.

**website/docs/engine-flink/lookups.md**

# Flink Lookup Joins

Flink lookup joins are important because they enable efficient, real-time enrichment of streaming data with reference data, a common requirement in many real-time analytics and processing scenarios.

## Instructions

- Use a primary key table as a dimension table, and the join condition must include all primary keys of the dimension table.
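
A minimal sketch of such a lookup join, modeled on the quickstart later in this changeset (it assumes `cust_key` is the only primary key of the dimension table `fluss_customer`; the selected columns are illustrative):

```sql
SELECT o.cust_key, o.total_amount, c.name
FROM fluss_order AS o
    LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF o.ptime AS c
        ON o.cust_key = c.cust_key;
```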

## Limit Read

The Fluss source supports limit reads for both primary-key tables and log tables, making it convenient to preview the latest `N` records in a table.

### Example

1. Create a table and prepare data
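
   The statements for this step are elided in this diff; a minimal sketch (the table name and columns are illustrative, and the final query matches the limit read referenced in this section):

```sql
CREATE TABLE log_table (
    order_id BIGINT,
    item_id BIGINT,
    amount INT
);

INSERT INTO log_table VALUES (1, 101, 2), (2, 102, 5), (3, 103, 1);

-- preview the latest records with a limit read
SELECT * FROM log_table LIMIT 10;
```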

## Point Query

The Fluss source supports point queries for primary-key tables, allowing you to inspect specific records efficiently. Currently, this functionality is exclusive to primary-key tables.

### Example

1. Create a table and prepare data
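
   The statements for this step are likewise elided; a sketch (the `c_custkey` key matches the point query shown in this section, the other column is illustrative):

```sql
CREATE TABLE pk_table (
    c_custkey INT NOT NULL,
    c_name STRING,
    PRIMARY KEY (c_custkey) NOT ENFORCED
);

INSERT INTO pk_table VALUES (1, 'Customer#000000001');

-- point query by primary key
SELECT * FROM pk_table WHERE c_custkey = 1;
```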

## Read Options

### scan.startup.mode

The scan startup mode enables you to specify the starting point for data consumption. Fluss currently supports the following `scan.startup.mode` options:

- `initial` (default): For primary key tables, it first consumes the full data set and then consumes incremental data. For log tables, it starts consuming from the earliest offset.
- `earliest`: For primary key tables, it starts consuming from the earliest changelog offset; for log tables, it starts consuming from the earliest log offset.
- `latest`: For primary key tables, it starts consuming from the latest changelog offset; for log tables, it starts consuming from the latest log offset.
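
For example, the mode can be set per query with Flink SQL's dynamic table options hint (a sketch; `log_table` is an illustrative table name):

```sql
-- start consuming this table from the earliest offset for this query only
SELECT * FROM log_table /*+ OPTIONS('scan.startup.mode' = 'earliest') */;
```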

**website/docs/engine-flink/writes.md**

# Flink Writes

You can directly insert or update data into a Fluss table using the `INSERT INTO` statement.
Fluss primary key tables can accept all types of messages (`INSERT`, `UPDATE_BEFORE`, `UPDATE_AFTER`, `DELETE`), while Fluss log tables can only accept `INSERT` type messages.

## INSERT INTO

`INSERT INTO` statements are used to write data to Fluss tables.
They support both streaming and batch modes and are compatible with primary-key tables (for upserting data) as well as log tables (for appending data).

### Appending Data to the Log Table

#### Create a Log table.

```sql
CREATE TABLE log_table (
  order_id BIGINT,
  -- remaining columns elided in this diff
);
```

#### Insert data into the Log table.

```sql
CREATE TEMPORARY TABLE source (
  order_id BIGINT,
  -- remaining source columns elided in this diff
) WITH ('connector'='datagen');

INSERT INTO log_table
SELECT * FROM source;
```

### Perform Data Upserts to the PrimaryKey Table.

#### Create a primary key table.

```sql
CREATE TABLE pk_table (
  shop_id BIGINT,
  -- remaining columns and the primary key definition elided in this diff
);
```

#### Updates All Columns
```sql
CREATE TEMPORARY TABLE source (
  shop_id BIGINT,
  user_id BIGINT,
  num_orders INT,
  total_amount INT
) WITH ('connector'='datagen');

INSERT INTO pk_table
SELECT * FROM source;
```

#### Partial Updates

```sql
CREATE TEMPORARY TABLE source (
  shop_id BIGINT,
  user_id BIGINT,
  num_orders INT,
  total_amount INT
) WITH ('connector'='datagen');

-- only partial-update the num_orders column
INSERT INTO pk_table (shop_id, user_id, num_orders)
SELECT shop_id, user_id, num_orders FROM source;
-- The DELETE section that follows here is elided in this diff; its example statement (preserved from the diff context):
-- DELETE FROM pk_table WHERE shop_id = 10000 and user_id = 123456;
```

## UPDATE

Fluss enables data updates for primary-key tables in batch mode using the `UPDATE` statement. Currently, only single-row updates based on the primary key are supported.

```sql
-- Execute the flink job in batch mode for current session context
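-- The statements below are an illustrative continuation; the guide's exact statements are elided in this diff.
-- 'execution.runtime-mode' is Flink SQL's standard setting for switching the session to batch execution.
SET 'execution.runtime-mode' = 'batch';

-- Single-row update keyed on the full primary key (the key and value here are examples).
UPDATE pk_table SET total_amount = 0 WHERE shop_id = 10000 AND user_id = 123456;
```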

**website/docs/quickstart/flink.md**

# Real-Time Analytics With Flink

This guide will get you up and running with Apache Flink to do real-time analytics, covering some powerful features of Fluss.
The guide is derived from [TPC-H](https://www.tpc.org/tpch/) **Q5**.

For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section.

## Environment Setup

### Prerequisites

Before proceeding with this guide, ensure that [Docker](https://docs.docker.com/engine/install/) is installed on your machine.

### Starting the required components

We will use `docker-compose` to spin up all the required components for this tutorial.

1. Create a directory to serve as your working directory for this guide and add the `docker-compose.yml` file to it.

```shell
mkdir fluss-quickstart-flink
cd fluss-quickstart-flink
```

2. Create a `docker-compose.yml` file with the following content:

```yaml
services:
  coordinator-server:
    # configuration for this and the remaining services (and the volumes) is elided in this diff
```

The Docker Compose environment consists of the following containers:

- **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server.
- **Flink Cluster:** a Flink `JobManager` and a Flink `TaskManager` container to execute queries.

**Note:** The `fluss/quickstart-flink` image is based on [flink:1.20.0-java17](https://hub.docker.com/layers/library/flink/1.20-java17/images/sha256-381ed7399c95b6b03a7b5ee8baca91fd84e24def9965ce9d436fb22773d66717) and includes the [fluss-connector-flink](engine-flink/getting-started.md) and [flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide.

3. To start all containers, run the following command in the directory that contains the `docker-compose.yml` file:

```shell
docker-compose up -d
```

This command automatically starts all the containers defined in the Docker Compose configuration in detached mode.
Run `docker ps` to check whether these containers are running properly.

You can also visit http://localhost:8081/ to see if Flink is running normally.

:::note
- If you want to run with your own Flink environment, remember to download the [fluss-connector-flink](engine-flink/getting-started.md) and [flink-connector-faker](https://github.com/knaufk/flink-faker/releases) connector jars and put them into `FLINK_HOME/lib/`.
:::

First, use the following command to enter the Flink SQL CLI container:

```shell
docker-compose exec jobmanager ./sql-client
```
**Note**:
To simplify this guide, three temporary tables have been pre-created with `faker` to generate data. You can view their schemas by running the following commands: `DESCRIBE TABLE source_customer`, `DESCRIBE TABLE source_order`, and `DESCRIBE TABLE source_nation`.

## Create Fluss Tables

### Create Fluss Catalog

First, run the following SQL to sync data from the source tables to the Fluss tables:

```sql title="Flink SQL Client"
EXECUTE STATEMENT SET
BEGIN
    INSERT INTO fluss_nation SELECT * FROM `default_catalog`.`default_database`.source_nation;
    INSERT INTO fluss_customer SELECT * FROM `default_catalog`.`default_database`.source_customer;
    INSERT INTO fluss_order SELECT * FROM `default_catalog`.`default_database`.source_order;
END;
```

```sql
-- The beginning of this query (which enriches fluss_order using lookup joins against the
-- primary-key tables fluss_customer and fluss_nation) is elided in this diff:
    LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
        ON o.cust_key = c.cust_key
    LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF `o`.`ptime` AS `n`
        ON c.nation_key = n.nation_key;
```
## Real-Time Analytics on Fluss Tables
After finishing the tutorial, run `exit` to exit Flink SQL CLI Container and then run `docker-compose down` to stop all containers.
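
For reference, the teardown as commands (run `exit` inside the SQL CLI first, then the compose command from the directory containing `docker-compose.yml`):

```shell
exit                   # leave the Flink SQL CLI container
docker-compose down    # stop and remove all quickstart containers
```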
## Learn more

Now that you're up and running with Fluss and Flink, check out the [Apache Flink Engine](engine-flink/getting-started.md) docs to learn more features with Flink!