
Commit 98af40c

[docs] Fix typos streaming lakehouse page (#1581)
1 parent 7ffbae3 commit 98af40c

2 files changed: +10 -11 lines changed

website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md

Lines changed: 1 addition & 1 deletion

@@ -125,7 +125,7 @@ Key behavior for data retention:

 ### Reading with other Engines

-Since the data tiered to Paimon from Fluss is stored as a standard Paimon table, you can use any engine that supports Paimon to read it. Below is an example using [StarRocks](https://paimon.apache.org/docs/master/engines/starrocks/):
+Since the data tiered to Paimon from Fluss is stored as a standard Paimon table, you can use any engine that supports Paimon to read it. Below is an example using [StarRocks](https://paimon.apache.org/docs/1.2/ecosystem/starrocks/):

 First, create a Paimon catalog in StarRocks:

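The catalog-creation example itself sits outside this diff's context window. A minimal sketch of what it plausibly looks like in StarRocks SQL follows; the catalog name, database, table, and warehouse path are illustrative assumptions, not values taken from the commit:

```sql
-- Illustrative only: catalog name, database/table names, and the
-- warehouse path are assumptions, not values from the Fluss docs.
CREATE EXTERNAL CATALOG paimon_catalog
PROPERTIES
(
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "paimon.catalog.warehouse" = "s3://my-bucket/fluss/warehouse"
);

-- The tiered table can then be read like any other Paimon table:
SELECT COUNT(*) FROM paimon_catalog.fluss.my_table;
```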
website/docs/streaming-lakehouse/overview.md

Lines changed: 9 additions & 10 deletions

@@ -9,17 +9,16 @@ sidebar_position: 1

 Lakehouse represents a new, open architecture that combines the best elements of data lakes and data warehouses.
 It combines data lake scalability and cost-effectiveness with data warehouse reliability and performance.
-The well-known data lake format such like [Apache Iceberg](https://iceberg.apache.org/), [Apache Paimon](https://paimon.apache.org/), [Apache Hudi](https://hudi.apache.org/) and [Delta Lake](https://delta.io/) play key roles in the Lakehouse architecture,
+The well-known data lake formats such as [Apache Iceberg](https://iceberg.apache.org/), [Apache Paimon](https://paimon.apache.org/), [Apache Hudi](https://hudi.apache.org/) and [Delta Lake](https://delta.io/) play key roles in the Lakehouse architecture,
 facilitating a harmonious balance between data storage, reliability, and analytical capabilities within a single, unified platform.

 Lakehouse, as a modern architecture, is effective in addressing the complex needs of data management and analytics.
-But they can hardly meet the scenario of real-time analytics requiring sub-second-level data freshness limited by their implementation.
+However, they struggle to meet real-time analytics scenarios that require sub-second-level data freshness due to limitations in their implementation.
 With these data lake formats, you will get into a contradictory situation:

-1. If you require low latency, then you write and commit frequently, which means many small Parquet files. This becomes inefficient for
+1. If you require low latency, then you must write and commit frequently, resulting in many small Parquet files. This becomes inefficient for
 reads which must now deal with masses of small files.
-2. If you require reading efficiency, then you accumulate data until you can write to large Parquet files, but this introduces
-much higher latency.
+2. If you require reading efficiency, then you accumulate data until you can write to large Parquet files, but this results in much higher latency.

 Overall, these data lake formats typically achieve data freshness at best within minute-level granularity, even under optimal usage conditions.

@@ -31,17 +30,17 @@ This not only brings low latency to data Lakehouse, but also adds powerful analy

 To build a Streaming Lakehouse, Fluss maintains a tiering service that compacts real-time data from the Fluss cluster into the data lake format stored in the Lakehouse Storage.
 The data in the Fluss cluster, stored in streaming Arrow format, is optimized for low-latency read and write operations, making it ideal for short-term data storage. In contrast, the compacted data in the Lakehouse, stored in Parquet format with higher compression, is optimized for efficient analytics and long-term storage.
-So the data in Fluss cluster serves real-time data layer which retains days with sub-second-level freshness, and the data in Lakehouse serves historical data layer which retains months with minute-level freshness.
+The data in the Fluss cluster serves as a real-time data layer, retaining days of data with sub-second-level freshness. In contrast, the data in the Lakehouse serves as a historical data layer, retaining months of data with minute-level freshness.

 ![streamhouse](../assets/streamhouse.png)

 The core idea of Streaming Lakehouse is shared data and shared metadata between stream and Lakehouse, avoiding data duplication and metadata inconsistency.
-Some powerful features it provided are:
+Some powerful features it provides are:

-- **Unified Metadata**: Fluss provides a unified table metadata for both data in Stream and Lakehouse. So users only need to handle one table, but can access either the real-time streaming data, or the historical data, or the union of them.
-- **Union Reads**: Compute engines perform queries on the table will read the union of the real-time streaming data and Lakehouse data. Currently, only Flink supports union reads, but more engines are on the roadmap.
+- **Unified Metadata**: Fluss provides unified table metadata for both data in Stream and Lakehouse. Users only need to manage one table and can access real-time streaming data, historical data, or both combined.
+- **Union Reads**: Compute engines that perform queries on the table will read the union of the real-time streaming data and Lakehouse data. Currently, only Flink supports union reads, but more engines are on the roadmap.
 - **Real-Time Lakehouse**: The union reads help Lakehouse evolving from near-real-time analytics to truly real-time analytics. This empowers businesses to gain more valuable insights from real-time data.
 - **Analytical Streams**: The union reads help data streams to have the powerful analytics capabilities. This reduces complexity when developing streaming applications, simplifies debugging, and allows for immediate access to live data insights.
-- **Connect to Lakehouse Ecosystem**: Fluss keeps the table metadata in sync with data lake catalogs while compacting data into Lakehouse. This allows external engines like Spark, StarRocks, Flink, Trino to read the data directly by connecting to the data lake catalog.
+- **Connect to Lakehouse Ecosystem**: Fluss keeps the table metadata in sync with data lake catalogs while compacting data into Lakehouse. As a result, external engines like Spark, StarRocks, Flink, and Trino can read the data directly. They simply connect to the data lake catalog.

 Currently, Fluss supports [Paimon](integrate-data-lakes/paimon.md), [Iceberg](integrate-data-lakes/iceberg.md), and [Lance](integrate-data-lakes/lance.md) as Lakehouse Storage, more kinds of data lake formats are on the roadmap.
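To make the tiering service described in the hunk above concrete, here is a hedged Flink SQL sketch of a datalake-enabled Fluss table. The table name and schema are invented, and the `table.datalake.enabled` property is assumed from Fluss conventions rather than shown in this commit:

```sql
-- Sketch under assumptions: the schema is illustrative, and the
-- 'table.datalake.enabled' property is taken from Fluss conventions,
-- not from this commit. Once enabled, the tiering service compacts
-- this table's real-time data into the configured Lakehouse Storage.
CREATE TABLE orders (
    order_id BIGINT,
    amount   DECIMAL(10, 2),
    ts       TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true'
);
```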

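The union-read feature in the list above can likewise be sketched in Flink SQL. The `$lake` suffix for reading only the tiered data is an assumption based on Fluss conventions and is not shown in this commit:

```sql
-- Union read (illustrative): querying the plain table name
-- transparently combines fresh rows in the Fluss cluster with rows
-- already tiered to Lakehouse Storage.
SET 'execution.runtime-mode' = 'batch';
SELECT COUNT(*) FROM orders;

-- Assumed $lake suffix: reads only the historical Parquet data in
-- the Lakehouse, skipping the real-time layer.
SELECT COUNT(*) FROM orders$lake;
```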