
Commit 62a6ed5

[docs] update docs concepts/data distribution and provide some suggestions for updates
1 parent cf79476 commit 62a6ed5

File tree

14 files changed: +201 / -70 lines changed
(Binary image assets updated, 342 KB and 272 KB; previews not shown.)

website/docs/concepts/architecture.md

Lines changed: 52 additions & 0 deletions
@@ -3,3 +3,55 @@ sidebar_position: 1
---

# Architecture
A Fluss cluster consists of two main processes: **CoordinatorServer** and **TabletServer**.

![Fluss Architecture](../assets/architecture.png)

## CoordinatorServer
The **Coordinator Server** serves as the central control and management component of the cluster. It is responsible for maintaining metadata, managing tablet allocation, listing nodes, and handling permissions.

Additionally, it coordinates critical operations such as:
- Rebalancing data during node scaling (upscaling or downscaling).
- Managing data migration and service node switching in the event of node failures.
- Overseeing table management tasks, including creating or deleting tables and updating bucket counts.

As the "brain" of the cluster, the **Coordinator Server** ensures efficient cluster operation and seamless management of resources.

## TabletServer
The **Tablet Server** is responsible for data storage, persistence, and providing I/O services directly to users. It comprises two key components: **LogStore** and **KvStore**.
- For **update tables**, both `LogStore` and `KvStore` are enabled to support updates efficiently.
- For **append-only tables**, only the `LogStore` is activated, optimizing performance for write-heavy workloads.

This architecture ensures the **Tablet Server** delivers tailored data handling capabilities based on table types.

### LogStore
The **LogStore** is designed to store log data, functioning similarly to a database binlog.
Messages can only be appended, not modified, ensuring data integrity.
Its primary purposes are to enable low-latency streaming reads and to serve as the write-ahead log (WAL) for restoring the **KvStore**.

### KvStore
The **KvStore** is used to store table data, functioning similarly to database tables. It supports data updates and deletions, enabling efficient querying and table management. Additionally, it generates comprehensive changelogs to track data modifications.

### Tablet / Bucket
Table data is divided into multiple buckets based on the defined bucketing policy.

Data for the **LogStore** and **KvStore** is stored within tablets. Each tablet consists of a **LogTablet** and, optionally, a **KvTablet**, depending on whether the table supports updates.
Both **LogStore** and **KvStore** adhere to the same bucket-splitting and tablet-allocation policies. As a result, **LogTablets** and **KvTablets** with the same `tablet_id` are always allocated to the same **Tablet Server** for efficient data management.

The **LogTablet** supports multiple replicas based on the table's configured replication factor, ensuring high availability and fault tolerance. **Currently, replication is not supported for KvTablets.**

## ZooKeeper
Fluss currently utilizes **ZooKeeper** for cluster coordination, metadata storage, and cluster configuration management.
In upcoming releases, **ZooKeeper will be replaced** by **KvStore** for metadata storage and **Raft** for cluster coordination and consistency. This transition aims to streamline operations and enhance system reliability.

## Remote Storage
**Remote Storage** serves two primary purposes:
- **Hierarchical storage for LogStores:** By offloading LogStore data, it reduces storage costs and accelerates scaling operations.
- **Persistent storage for KvStores:** It ensures durable storage for KvStore data and works with the LogStore to enable fault recovery.

Additionally, **Remote Storage** allows clients to perform bulk read operations on log and KV data, enhancing data analysis efficiency. It also supports bulk write operations, optimizing data import workflows for greater scalability and performance.

## Client
Fluss clients/SDKs support streaming reads/writes, batch reads/writes, DDL, and point queries.
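For a concrete sense of this client surface through the Flink connector, here is a minimal sketch (the table name and schema are illustrative; the statement forms mirror the Flink engine pages in these docs):

```sql
-- DDL: create a primary-key table (illustrative schema)
CREATE TABLE user_profile (
    user_id BIGINT,
    name STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
);

-- streaming or batch write
INSERT INTO user_profile VALUES (1, 'Alice');

-- point query on the primary key
SELECT * FROM user_profile WHERE user_id = 1;
```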

website/docs/concepts/storage-model.md

Lines changed: 47 additions & 0 deletions
@@ -3,3 +3,50 @@ sidebar_position: 2
---

# Storage Model

## Database
A Database is a collection of Table objects. You can create/delete databases, or create/modify/delete tables under a database.

## Table
In Fluss, a Table is the fundamental unit of user data storage, organized into rows and columns. Tables are stored within specific databases, adhering to a hierarchical structure (database -> table).

Tables are classified into two types based on the presence of a primary key:
- **Log Tables:**
  - Designed for append-only scenarios.
  - Support only INSERT operations.
- **PrimaryKey Tables:**
  - Used for updating and managing data in business databases.
  - Support INSERT, UPDATE, and DELETE operations based on the defined primary key.

A Table becomes a [Partitioned Table](../table-design/data-distribution/partitioning.md) when a partition column is defined. Data with the same partition value is stored in the same partition. Partition columns can be applied to both Log Tables and PrimaryKey Tables, but with specific considerations:
- **For Log Tables**, partitioning is commonly used for log data, typically based on date columns, to facilitate data separation and cleaning.
- **For PrimaryKey Tables**, the partition column must be a subset of the primary key to ensure uniqueness.

This design ensures efficient data organization, flexibility in handling different use cases, and adherence to data integrity constraints.
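As a minimal Flink SQL sketch of the two table types (the names and columns are illustrative; note that for the partitioned PrimaryKey Table the partition column is part of the primary key):

```sql
-- Log Table: no primary key, append-only, partitioned by date for easy cleanup
CREATE TABLE access_log (
    ts TIMESTAMP(3),
    url STRING,
    dt STRING
) PARTITIONED BY (dt);

-- PrimaryKey Table: supports INSERT/UPDATE/DELETE;
-- the partition column `dt` must be a subset of the primary key
CREATE TABLE user_snapshot (
    user_id BIGINT,
    name STRING,
    dt STRING,
    PRIMARY KEY (user_id, dt) NOT ENFORCED
) PARTITIONED BY (dt);
```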
## Table Data Organization

![Table Data Organization](../assets/data_organization.png)

### Partition
A **partition** is a logical division of a table's data into smaller, more manageable subsets based on the values of one or more specified columns, known as partition columns.
Each unique value (or combination of values) in the partition column(s) defines a distinct partition.

### Bucket
A **bucket** horizontally divides the data of a table/partition into `N` buckets according to the bucketing policy.
The number of buckets `N` can be configured per table. A bucket is the smallest unit of data migration and backup.
The data of a bucket consists of a LogTablet and an (optional) KvTablet.
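For illustration, the bucket count is typically fixed when the table is created; a minimal sketch, assuming the Flink connector exposes a `bucket.num` table option (the table name and schema are illustrative):

```sql
CREATE TABLE orders (
    order_id BIGINT,
    item_id BIGINT,
    amount INT,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    -- divide the table's data into 4 buckets
    'bucket.num' = '4'
);
```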
### LogTablet
A **LogTablet** is generated for each bucket of both Log and PrimaryKey tables.
For Log Tables, the LogTablet is both the primary table data and the log data. For PrimaryKey tables, the LogTablet acts
as the log data for the primary table data.
- **Segment:** The smallest unit of log storage in the **LogTablet**. A segment consists of an **.index** file and a **.log** data file.
  - **.index:** A sparse offset index that maps a message's relative offset to its physical byte address in the **.log** file.
  - **.log:** Compact arrangement of log data.

### KvTablet
Each bucket of a PrimaryKey table generates a KvTablet; buckets of append-only Log tables do not.

website/docs/engine-flink/getting-started.md

Lines changed: 4 additions & 3 deletions
@@ -5,7 +5,8 @@ sidebar_position: 1

# Getting Started with Flink Engine
## Quick Start
-If you want to quickly start running with Flink, see the [Quick Start](../quickstart/flink.md).
+For a quick introduction to running Flink, refer to the [Quick Start](../quickstart/flink.md) guide.

## Support Flink Versions
| Fluss Connector Versions | Supported Flink Versions |
@@ -145,7 +146,7 @@ SELECT * FROM pk_table /*+ OPTIONS('scan.startup.mode' = 'timestamp',
## Type Conversion
Fluss's integration for Flink automatically converts between Flink and Fluss types.

-### Fluss to Flink
+### Fluss -> Apache Flink

| Fluss | Flink |
|---------------|---------------|
@@ -165,7 +166,7 @@ Fluss's integration for Flink automatically converts between Flink and Fluss typ
| TIMESTAMP_LTZ | TIMESTAMP_LTZ |
| BYTES | BYTES |

-### Flink to Fluss
+### Apache Flink -> Fluss

| Flink | Fluss |
|---------------|----------------|

website/docs/engine-flink/lookups.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ sidebar_position: 5
---

# Flink Lookup Joins
+Flink lookup joins are important because they enable efficient, real-time enrichment of streaming data with reference data, a common requirement in many real-time analytics and processing scenarios.

## Instructions
- Use a primary key table as a dimension table, and the join condition must include all primary keys of the dimension table.
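As a minimal sketch of such a lookup join (the probe-side table and column names are illustrative; `fluss_customer` follows the quickstart's schema, whose primary key is `cust_key`):

```sql
CREATE TEMPORARY TABLE orders (
    order_id BIGINT,
    cust_key BIGINT,
    ptime AS PROCTIME()
) WITH ('connector' = 'datagen');

-- enrich the stream against the primary-key (dimension) table;
-- the join condition covers its full primary key
SELECT o.order_id, c.name
FROM orders AS o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF o.ptime AS c
    ON o.cust_key = c.cust_key;
```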

website/docs/engine-flink/reads.md

Lines changed: 3 additions & 4 deletions
@@ -31,7 +31,7 @@ SELECT * FROM my_table /*+ OPTIONS('scan.startup.mode' = 'latest') */;

## Limit Read

-The Fluss source support limit read for both the primary-key table and the log table. It is useful to preview the latest N records in a table.
+The Fluss source supports limit read for both primary-key tables and log tables, making it convenient to preview the latest `N` records in a table.

### Example
1. Create a table and prepare data
@@ -65,8 +65,7 @@ SELECT * FROM log_table LIMIT 10;

## Point Query

-The Fluss source supports point query for primary-key tables. It is useful to inspect a specific record in a table. Currently, the point query only supports the primary-key table.
-
+The Fluss source supports point queries for primary-key tables, allowing you to inspect specific records efficiently. Currently, this functionality is exclusive to primary-key tables.

### Example
1. Create a table and prepare data
@@ -103,7 +102,7 @@ SELECT * FROM pk_table WHERE c_custkey = 1;
## Read Options

### scan.startup.mode
-Currently, Fluss supports the following `scan.startup.mode`:
+The scan startup mode enables you to specify the starting point for data consumption. Fluss currently supports the following `scan.startup.mode` options:
- `initial` (default): For primary key tables, it first consumes the full data set and then consumes incremental data. For log tables, it starts consuming from the earliest offset.
- `earliest`: For primary key tables, it starts consuming from the earliest changelog offset; for log tables, it starts consuming from the earliest log offset.
- `latest`: For primary key tables, it starts consuming from the latest changelog offset; for log tables, it starts consuming from the latest log offset.
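For illustration, a startup mode can be set per query with a dynamic option hint, mirroring the `latest` example shown earlier on this page:

```sql
-- start reading from the earliest offset/changelog instead of the default 'initial' mode
SELECT * FROM my_table /*+ OPTIONS('scan.startup.mode' = 'earliest') */;
```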

website/docs/engine-flink/writes.md

Lines changed: 14 additions & 19 deletions
@@ -5,17 +5,16 @@ sidebar_position: 3

# Flink Writes

-You can directly insert or update data into the Fluss table using the `INSERT INTO` statement.
-The Fluss primary key table can accept all types of messages (`INSERT`, `UPDATE_BEFORE`, `UPDATE_AFTER`, `DELETE`), while the Fluss log table can only accept insert type messages.
+You can directly insert or update data into a Fluss table using the `INSERT INTO` statement.
+Fluss primary key tables can accept all types of messages (`INSERT`, `UPDATE_BEFORE`, `UPDATE_AFTER`, `DELETE`), while Fluss log tables can only accept `INSERT` type messages.


## INSERT INTO
-`INSERT INTO` statement can be used to writing data to Fluss tables. This statement can both work in
-streaming mode and batch mode, and both work on primary-key tables (upserting data) and log tables (appending data).
+`INSERT INTO` statements are used to write data to Fluss tables.
+They support both streaming and batch modes and are compatible with primary-key tables (for upserting data) as well as log tables (for appending data).

-### Appending on Log Table
-
-First, create a log table.
+### Appending Data to the Log Table
+#### Create a Log table.
```sql
CREATE TABLE log_table (
    order_id BIGINT,
@@ -25,7 +24,7 @@ CREATE TABLE log_table (
);
```

-Then insert the data into the log table.
+#### Insert data into the Log table.
```sql
CREATE TEMPORARY TABLE source (
    order_id BIGINT,
@@ -39,9 +38,9 @@ SELECT * FROM source;
```


-### Upserting on PrimaryKey Table
+### Perform Data Upserts to the PrimaryKey Table.

-First, create a primary key table.
+#### Create a primary key table.
```sql
CREATE TABLE pk_table (
    shop_id BIGINT,
@@ -54,14 +53,12 @@ CREATE TABLE pk_table (

#### Updates All Columns
```sql
-CREATE TEMPORARY TABLE source
-(
+CREATE TEMPORARY TABLE source (
    shop_id BIGINT,
    user_id BIGINT,
    num_orders INT,
    total_amount INT
-)
-WITH ('connector' = 'datagen');
+) WITH ('connector' = 'datagen');

INSERT INTO pk_table
SELECT * FROM source;
@@ -70,14 +67,12 @@ SELECT * FROM source;
#### Partial Updates

```sql
-CREATE TEMPORARY TABLE source
-(
+CREATE TEMPORARY TABLE source (
    shop_id BIGINT,
    user_id BIGINT,
    num_orders INT,
    total_amount INT
-)
-WITH ('connector' = 'datagen');
+) WITH ('connector' = 'datagen');

-- only partial-update the num_orders column
INSERT INTO pk_table (shop_id, user_id, num_orders)
@@ -97,7 +92,7 @@ DELETE FROM pk_table WHERE shop_id = 10000 and user_id = 123456;
```

## UPDATE
-Fluss supports updating data for the primary-key tables in batch mode via `UPDATE` statement. Currently, only single data updates based on the primary key are supported.
+Fluss enables data updates for primary-key tables in batch mode using the `UPDATE` statement. Currently, only single-row updates based on the primary key are supported.

```sql
-- Execute the flink job in batch mode for current session context

website/docs/quickstart/flink.md

Lines changed: 38 additions & 23 deletions
@@ -5,24 +5,26 @@ sidebar_position: 1

# Real-Time Analytics With Flink

-The guide will get you up and running with Flink to do real-time analytics, covering some powerful features of Fluss.
-The guide is derived from from [TPC-H](https://www.tpc.org/tpch/) Q5. You can learn more about running with Flink by
-checking out the [Engine Flink](engine-flink/getting-started.md) section.
+This guide will get you up and running with Apache Flink to do real-time analytics, covering some powerful features of Fluss.
+The guide is derived from [TPC-H](https://www.tpc.org/tpch/) **Q5**.
+
+For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section.

## Environment Setup
### Prerequisites
-To go through this guide, [Docker](https://docs.docker.com/engine/install/) needs to be already installed in your machine.
+Before proceeding with this guide, ensure that [Docker](https://docs.docker.com/engine/install/) is installed on your machine.

### Starting components required
-The components required in this tutorial are all managed in containers, so we will use `docker-compose` to start them.
+We will use `docker-compose` to spin up all the required components for this tutorial.
+
+1. Create a directory to serve as your working directory for this guide and add the `docker-compose.yml` file to it.

-1. Create a directory to put the `docker-compose.yaml` file, it will be your working directory in this guide.
```shell
mkdir fluss-quickstart-flink
cd fluss-quickstart-flink
```

-2. Create `docker-compose.yml` file using following contents:
+2. Create a `docker-compose.yml` file with the following content:
```yaml
services:
  coordinator-server:
@@ -84,17 +86,20 @@ volumes:
```

The Docker Compose environment consists of the following containers:
-- Fluss Cluster: a Fluss CoordinatorServer, a Fluss TabletServer and a ZooKeeper server.
-- Flink Cluster: a Flink JobManager and a Flink TaskManager container to execute queries.
-The image `fluss/quickstart-flink` is from [flink:1.20.0-java17](https://hub.docker.com/layers/library/flink/1.20-java17/images/sha256-381ed7399c95b6b03a7b5ee8baca91fd84e24def9965ce9d436fb22773d66717), but
-has packaged the [fluss-connector-flink](engine-flink/getting-started.md), [flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide.
+- **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server.
+- **Flink Cluster:** a Flink `JobManager` and a Flink `TaskManager` container to execute queries.
+
+**Note:** The `fluss/quickstart-flink` image is based on [flink:1.20.0-java17](https://hub.docker.com/layers/library/flink/1.20-java17/images/sha256-381ed7399c95b6b03a7b5ee8baca91fd84e24def9965ce9d436fb22773d66717) and
+includes the [fluss-connector-flink](engine-flink/getting-started.md) and [flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide.

3. To start all containers, run the following command in the directory that contains the `docker-compose.yml` file:
```shell
docker-compose up -d
```
This command automatically starts all the containers defined in the Docker Compose configuration in a detached mode.
-Run `docker ps` to check whether these containers are running properly. You can also visit http://localhost:8081/ to see if Flink is running normally.
+Run `docker ps` to check whether these containers are running properly.
+
+You can also visit http://localhost:8081/ to see if Flink is running normally.

:::note
- If you want to run with your own Flink environment, remember to download the [fluss-connector-flink](engine-flink/getting-started.md), [flink-connector-faker](https://github.com/knaufk/flink-faker/releases) connector jars and then put them to `FLINK_HOME/lib/`.
@@ -107,9 +112,8 @@ First, use the following command to enter the Flink SQL CLI Container:
docker-compose exec jobmanager ./sql-client
```

-**NOTE**:
-To simplify this guide, it has prepared three temporary `faker` tables to generate data, you can use `describe table source_customer`
-, `describe table source_order` and `describe table source_nation` to see the schema of the pre-created tables.
+**Note**:
+To simplify this guide, three temporary tables have been pre-created with `faker` to generate data. You can view their schemas by running the following commands: `DESCRIBE TABLE source_customer`, `DESCRIBE TABLE source_order`, and `DESCRIBE TABLE source_nation`.

## Create Fluss Tables
### Create Fluss Catalog
@@ -175,9 +179,9 @@ First, run the following sql to sync data from source tables to Fluss tables:
```sql title="Flink SQL Client"
EXECUTE STATEMENT SET
BEGIN
-INSERT INTO fluss_nation SELECT * FROM `default_catalog`.`default_database`.source_nation;
-INSERT INTO fluss_customer SELECT * FROM `default_catalog`.`default_database`.source_customer;
-INSERT INTO fluss_order SELECT * FROM `default_catalog`.`default_database`.source_order;
+    INSERT INTO fluss_nation SELECT * FROM `default_catalog`.`default_database`.source_nation;
+    INSERT INTO fluss_customer SELECT * FROM `default_catalog`.`default_database`.source_customer;
+    INSERT INTO fluss_order SELECT * FROM `default_catalog`.`default_database`.source_order;
END;
```

@@ -187,11 +191,22 @@ primary-key tables `fluss_customer` and `fluss_nation` to enrich the `fluss_orde

```sql title="Flink SQL Client"
INSERT INTO enriched_orders
-SELECT o.order_key, o.cust_key, o.total_price, o.order_date, o.order_priority, o.clerk,
-       c.name, c.phone, c.acctbal, c.mktsegment, n.name
+SELECT o.order_key,
+       o.cust_key,
+       o.total_price,
+       o.order_date,
+       o.order_priority,
+       o.clerk,
+       c.name,
+       c.phone,
+       c.acctbal,
+       c.mktsegment,
+       n.name
FROM fluss_order o
-LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c` ON o.cust_key = c.cust_key
-LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF `o`.`ptime` AS `n` ON c.nation_key = n.nation_key;
+LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
+    ON o.cust_key = c.cust_key
+LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF `o`.`ptime` AS `n`
+    ON c.nation_key = n.nation_key;
```

## Real-Time Analytics on Fluss Tables
@@ -222,4 +237,4 @@ The result should be returned quickly since Fluss supports fast lookup by primar
After finishing the tutorial, run `exit` to exit Flink SQL CLI Container and then run `docker-compose down` to stop all containers.

## Learn more
-Now that you're up an running with Fluss and Flink, check out the [Engine Flink](engine-flink/getting-started.md) docs to learn more features with Flink!
+Now that you're up and running with Fluss and Flink, check out the [Apache Flink Engine](engine-flink/getting-started.md) docs to learn more features with Flink!
