Commit 722a39e

Mehul Batra authored and committed
address jark comments
1 parent 66d74c3 commit 722a39e

17 files changed, +16 -17 lines changed

website/blog/2025-12-02-fluss-x-iceberg-why-your-lakehouse-is-not-streamhouse-yet.md

Lines changed: 16 additions & 17 deletions
@@ -1,18 +1,18 @@
 ---
-title: "Fluss × Iceberg: Why Your Lakehouse Isn't Streamhouse Yet"
+title: "Fluss × Iceberg: Why Your Lakehouse Isn't Streamhouse Yet - Part 1"
 authors: [mehulbatra, yuxia]
-date: 2025-12-02
-tags: [streaming-lakehouse, apache-iceberg, real-time-analytics]
+date: 2025-12-08
+tags: [streaming-lakehouse, apache-iceberg, real-time-analytics, apache-fluss]
 ---
 
 
-As software/data engineers, we've witnessed Apache Iceberg revolutionize analytical data lakes with ACID transactions, time travel, and schema evolution. Yet when we try to push Iceberg into real-time workloads - sub-second streaming queries, high-frequency CDC updates, and primary key semantics - we hit fundamental architectural walls. This blog explores how Fluss × Iceberg integration works and delivers a true real-time lakehouse.
+As software and data engineers, we've witnessed Apache Iceberg revolutionize analytical data lakes with ACID transactions, time travel, and schema evolution. Yet when we try to push Iceberg into real-time workloads such as sub-second streaming queries, high-frequency CDC updates, and primary key semantics, we hit fundamental architectural walls. This blog explores how Fluss × Iceberg integration works and delivers a true real-time lakehouse.
 
 Apache Fluss represents a new architectural approach: the **Streamhouse** for real-time lakehouses. Instead of stitching together separate streaming and batch systems, the Streamhouse unifies them under a single architecture. In this model, Apache Iceberg continues to serve exactly the role it was designed for: a highly efficient, scalable cold storage layer for analytics, while Fluss fills the missing piece: a hot streaming storage layer with sub-second latency, columnar storage, and built-in primary-key semantics.
 
 After working on Fluss–Iceberg lakehouse integration and deploying this architecture at a massive scale, including Alibaba's 3 PB production deployment processing 40 GB/s, we're ready to share the architectural lessons learned. Specifically, why existing systems fall short, how Fluss and Iceberg naturally complement each other, and what this means for finally building true real-time lakehouses.
 
-![Banner](assets/fluss-x-iceberg/fluss-realtime-lakehouse.png)
+![Banner](assets/fluss-x-iceberg/fluss-lakehouse-streaming.png)
 
 <!-- truncate -->
 
@@ -30,7 +30,7 @@ Four converging forces are driving the need for sub-second data infrastructure:
 
 **4. Agentic AI Requires Real-Time Context:** AI agents need immediate access to the current system state to make decisions. Whether it's autonomous trading systems, intelligent routing agents, or customer service bots, agents can't operate effectively on stale data.
 
-![Use Cases](assets/fluss-x-iceberg/lakehouse_usecases.png)
+![Use Cases](assets/fluss-x-iceberg/lakehouse-usecases.png)
 
 ### The Evolution of Data Freshness
 
@@ -44,8 +44,7 @@ Four converging forces are driving the need for sub-second data infrastructure:
 
 Yet critical use cases demand sub-second to second-level latency: search and recommendation systems with real-time personalization, advertisement attribution tracking, anomaly detection for fraud and security monitoring, operational intelligence for manufacturing/logistics/ride-sharing, and Gen AI model inference requiring up-to-the-second features. The industry needs a **hot real-time layer** sitting in front of the lakehouse.
 
-![Evolution Timeline](assets/fluss-x-iceberg/evolution.png)
-
+![Evolution Timeline](assets/fluss-x-iceberg/untitled.png)
 ## What is Fluss × Iceberg?
 
 ### The Core Concept: Hot/Cold Unified Storage
@@ -78,7 +77,7 @@ Traditional architectures force you to maintain **separate systems** for these z
 
 **Query flexibility:** Run streaming queries on hot data (Fluss), analytical queries on cold data (Iceberg), or union queries that transparently span both tiers.
 
-![Tiering Service](assets/fluss-x-iceberg/fluss-tiering.png)
+![Tiering Service](assets/fluss-x-iceberg/fluss-tiering-lake.png)
 
 ## What Iceberg Misses Today
 
@@ -295,7 +294,7 @@ The architectural breakthrough enabling a real-time lakehouse is **client-side s
 
 ### How Union Read Works
 
-![Union Read Architecture](assets/fluss-x-iceberg/fluss-unionread.png)
+![Union Read Architecture](assets/fluss-x-iceberg/fluss-union-read.png)
 
 Union Read seamlessly combines hot and cold data through intelligent offset coordination, as illustrated above:
 
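Reviewer note on the Union Read hunk above: the post describes Union Read as combining hot and cold data through offset coordination. A minimal Python sketch of that idea follows, assuming a single bucket whose tiered offset marks the first log offset not yet written to Iceberg. The names `Record`, `union_read`, and `tiered_offset` are hypothetical and are not Fluss or Iceberg APIs; this only illustrates how one boundary offset lets a client stitch the two tiers without gaps or duplicates.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Record:
    offset: int  # position of the record in the bucket's log (assumed model)
    key: str
    value: str


def union_read(
    cold_rows: Iterable[Record],
    hot_log: Iterable[Record],
    tiered_offset: int,
) -> List[Record]:
    """Stitch cold and hot data at a single boundary offset.

    Assumption: every record with offset < tiered_offset is already in the
    cold tier, and the hot tier still holds everything from tiered_offset on
    (it may also retain some older records, which are skipped here).
    """
    cold = [r for r in cold_rows if r.offset < tiered_offset]
    hot = [r for r in hot_log if r.offset >= tiered_offset]
    return cold + hot
```

For example, if the cold tier holds offsets 0 to 4 and the hot log still retains offsets 3 to 7, calling `union_read(cold, hot, tiered_offset=5)` yields offsets 0 to 7 exactly once each, with the overlapping offsets 3 and 4 served from the cold tier.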
@@ -349,13 +348,13 @@ Fluss coordinator persists this mapping. When clients query, they receive the ex
 
 ## Architecture Benefits
 
-### Cost-Efficient Storage
-![Historical Analysis](assets/fluss-x-iceberg/fluss-lake-history.png)
+### Cost-Efficient Historical Storage
+![Historical Analysis](assets/fluss-x-iceberg/fluss-lakehouse-history.png)
 
 Automatic tiering optimizes storage and analytics: efficient backfill, projection/filter pushdown, high Parquet compression, and S3 throughput.
 
 ### Real-Time Analytics
-![Real-time Analytics](assets/fluss-x-iceberg/fluss-lake-realtime.png)
+![Real-time Analytics](assets/fluss-x-iceberg/fluss-lakehouse-realtime.png)
 
 Union Read delivers sub-second lakehouse freshness: union delta log on Fluss, Arrow-native exchange, and seamless integration with Flink, Spark *, Trino, and StarRocks.
 
@@ -374,7 +373,7 @@ This gives you a working streaming lakehouse environment in minutes. Visit: [htt
 
 ## Conclusion: The Path Forward
 
-Apache Fluss and Apache Iceberg represent a fundamental rethinking of real-time lakehouse architecture. Instead of forcing Iceberg to become a streaming platform (which architecturally it was never designed to be), Fluss embraces Iceberg for its strengths - cost-efficient analytical storage with ACID guarantees - while adding the missing hot streaming layer.
+Apache Fluss and Apache Iceberg represent a fundamental rethinking of real-time lakehouse architecture. Instead of forcing Iceberg to become a streaming platform (which architecturally it was never designed to be), Fluss embraces Iceberg for its strengths cost-efficient analytical storage with ACID guarantees, while adding the missing hot streaming layer.
 
 The result is a Streamhouse that delivers:
 
@@ -384,7 +383,7 @@ The result is a Streamhouse that delivers:
 - **Single write path** ending dual-write consistency problems
 - **Automatic lifecycle management** from hot to cold tiers
 
-For software/data engineers building real-time analytics platforms, the question isn't whether to use Fluss or Iceberg - it's recognizing they solve complementary problems. Fluss handles what happens in the last hour (streaming, updates, real-time queries). Iceberg handles everything before that (historical analytics, ML training, compliance).
+For software/data engineers building real-time analytics platforms, the question isn't whether to use Fluss or Iceberg, it's recognizing they solve complementary problems. Fluss handles what happens in the last hour (streaming, updates, real-time queries). Iceberg handles everything before that (historical analytics, ML training, compliance).
 
 ### When to Adopt
 
@@ -395,7 +394,7 @@ For software/data engineers building real-time analytics platforms, the question
 - Need for primary key semantics with indexed lookups
 - Large Flink stateful jobs (10TB+ state) that could be externalized
 - Desire to unify real-time and historical queries
-- Tired of maintaining dual infrastructure - one for batch, another for real-time
+- Tired of maintaining dual infrastructure one for batch, another for real-time
 
 ### Next Steps
 
@@ -407,4 +406,4 @@ For software/data engineers building real-time analytics platforms, the question
 
 The future of real-time analytics isn't Lambda architecture with separate streaming and batch systems. It's unified lakehouse storage where hot and cold are simply tiers of the same table, with data flowing automatically between them.
 
-**Apache Fluss makes this vision real - it transforms your lakehouse into a streaming lakehouse.**
+**Apache Fluss makes this vision real, it transforms your lakehouse into a streaming lakehouse.**