As software/data engineers, we've witnessed Apache Iceberg revolutionize analytical data lakes with ACID transactions, time travel, and schema evolution. Yet when we try to push Iceberg into real-time workloads—sub-second streaming queries, high-frequency CDC updates, and primary key semantics—we hit fundamental architectural walls. This blog explores how Fluss × Iceberg integration works and delivers a true real-time lakehouse.
Apache Fluss represents a new architectural approach: the **Streamhouse** for real-time lakehouses. Instead of stitching together separate streaming and batch systems, the Streamhouse unifies them under a single architecture. In this model, Apache Iceberg continues to serve exactly the role it was designed for: a highly efficient, scalable cold storage layer for analytics, while Fluss fills the missing piece: a hot streaming storage layer with sub-second latency, columnar storage, and built-in primary-key semantics.
After working on the Fluss–Iceberg lakehouse integration and deploying this architecture at massive scale, including Alibaba's 3 PB production deployment processing 40 GB/s, we're ready to share the architectural lessons learned: why existing systems fall short, how Fluss and Iceberg naturally complement each other, and what this means for finally building true real-time lakehouses.
@@ -29,7 +30,7 @@ Four converging forces are driving the need for sub-second data infrastructure:
**4. Agentic AI Requires Real-Time Context:** AI agents need immediate access to the current system state to make decisions. Whether it's autonomous trading systems, intelligent routing agents, or customer service bots, agents can't operate effectively on stale data.
@@ -43,14 +44,15 @@ Four converging forces are driving the need for sub-second data infrastructure:
Yet critical use cases demand sub-second to second-level latency: search and recommendation systems with real-time personalization, advertisement attribution tracking, anomaly detection for fraud and security monitoring, operational intelligence for manufacturing/logistics/ride-sharing, and Gen AI model inference requiring up-to-the-second features. The industry needs a **hot real-time layer** sitting in front of the lakehouse.
The Fluss architecture delivers millisecond-level end-to-end latency for real-time data writing and reading. Its **Tiering Service** continuously offloads data into standard lakehouse formats like Apache Iceberg, enabling external query engines to analyze data directly. This streaming/lakehouse unification simplifies the ecosystem, ensures data freshness for critical use cases, and combines real-time and historical data seamlessly for comprehensive analytics.
**Unified Data Locality:** Fluss aligns partitions and buckets across both streaming and lakehouse layers, ensuring consistent data layout. This alignment enables direct Arrow-to-Parquet conversion without network shuffling or repartitioning, dramatically reducing I/O overhead and improving pipeline performance.
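To make the alignment idea concrete, here is a minimal sketch, assuming both tiers derive a row's slot from the same partition column and the same bucket function. The names `bucket_for`, `slot_for`, `NUM_BUCKETS`, the `dt` and `user` columns, and the use of CRC32 are all illustrative assumptions, not Fluss or Iceberg APIs.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count, configured once and shared by both tiers


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a primary key to a bucket. CRC32 is a stand-in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def slot_for(row: dict) -> tuple:
    """Return the (partition, bucket) slot a row belongs to in either tier."""
    return (row["dt"], bucket_for(row["user"]))


def group_by_slot(rows: list) -> dict:
    """Group rows by slot, as the hot layer would when it stores them."""
    groups = {}
    for row in rows:
        groups.setdefault(slot_for(row), []).append(row)
    return groups


rows = [
    {"dt": "2025-01-01", "user": "Jark", "score": 30},
    {"dt": "2025-01-01", "user": "Mehul", "score": 20},
    {"dt": "2025-01-02", "user": "Yuxia", "score": 20},
]
hot_groups = group_by_slot(rows)

# Each hot group maps to exactly one cold slot, so it can be written out
# as-is with no shuffle or repartition step in between.
cold_targets = {
    slot: {slot_for(r) for r in group} for slot, group in hot_groups.items()
}
```

Because the slot is a pure function of the row, a group of rows stored together in the hot layer never has to be split across cold files, which is the property that removes the shuffle.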
@@ -59,9 +61,10 @@ Think of your data as having two thermal zones:
Traditional architectures force you to maintain **separate systems** for these zones: Kafka/Kinesis for streaming (hot), Iceberg for analytics (cold), complex ETL pipelines to move data between them, and applications writing to both systems (dual-write problem).
**Fluss × Iceberg unifies these as tiered storage with Kappa architecture:** Applications write once to Fluss. A stateless Tiering Service (Flink job) automatically moves data from hot to cold storage based on configured freshness (e.g., 30 seconds, 5 minutes). Query engines see a single table that seamlessly spans both tiers—eliminating the dual-write complexity of Lambda architecture.
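The freshness-driven hand-off can be sketched as a small loop. This is a toy model under stated assumptions, not the actual Tiering Service: `TieringSketch`, `ColdSnapshot`, `maybe_tier`, and the in-memory `hot_log` list are hypothetical names, and the fields kept on the class stand in for metadata that a coordinator would track on behalf of a stateless job.

```python
from dataclasses import dataclass, field


@dataclass
class ColdSnapshot:
    snapshot_id: int
    rows: list
    end_offset: int  # first hot-log offset NOT contained in this snapshot


@dataclass
class TieringSketch:
    freshness_s: float            # configured freshness, e.g. 30 seconds
    last_commit_ts: float = 0.0   # time of the previous cold commit
    tiered_offset: int = 0        # hot-log offset already moved to cold storage
    snapshots: list = field(default_factory=list)

    def maybe_tier(self, hot_log: list, now: float) -> bool:
        """Commit a cold snapshot once the freshness interval has elapsed."""
        pending = hot_log[self.tiered_offset:]
        if not pending or now - self.last_commit_ts < self.freshness_s:
            return False
        end = self.tiered_offset + len(pending)
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(ColdSnapshot(snapshot_id, list(pending), end))
        self.tiered_offset = end
        self.last_commit_ts = now
        return True


hot_log = ["e0", "e1", "e2"]          # applications write once, to the hot log
tiering = TieringSketch(freshness_s=30)

too_early = tiering.maybe_tier(hot_log, now=10)   # interval not yet elapsed
first = tiering.maybe_tier(hot_log, now=31)       # commits e0..e2
hot_log.append("e3")
second = tiering.maybe_tier(hot_log, now=61)      # commits e3 only
```

Recording `end_offset` with every snapshot is the design choice that matters here: it gives readers an exact boundary between what is already cold and what is still only in the hot log.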
### Why This Architecture Matters
@@ -75,6 +78,8 @@ Traditional architectures force you to maintain **separate systems** for these z
**Query flexibility:** Run streaming queries on hot data (Fluss), analytical queries on cold data (Iceberg), or union queries that transparently span both tiers.
Apache Iceberg was architected for batch-optimized analytics. While it supports streaming ingestion, fundamental design decisions create unavoidable limitations for real-time workloads.
@@ -282,43 +287,31 @@ While tiering data, the service optionally performs bin-packing compaction:
**Result:** Streaming workloads avoid small file proliferation without separate maintenance jobs.
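For readers unfamiliar with bin-packing, the idea can be illustrated with a first-fit-decreasing heuristic. This is a generic sketch, not the Tiering Service's actual algorithm; the function name `bin_pack`, the 128 MB target, and the file sizes are assumptions chosen for the example.

```python
def bin_pack(file_sizes_mb, target_mb=128):
    """First-fit-decreasing: group files into bins of at most target_mb each."""
    bins = []  # each bin lists the files that would be rewritten as one file
    for size in sorted(file_sizes_mb, reverse=True):
        for current in bins:
            if sum(current) + size <= target_mb:
                current.append(size)
                break
        else:
            # No existing bin has room (or the file alone exceeds the target).
            bins.append([size])
    return bins


small_files = [8, 16, 100, 24, 64, 40, 4]  # sizes in MB, hypothetical
plan = bin_pack(small_files)
```

With these inputs, seven small files collapse into two rewrite groups, each within the target size, which is how compaction keeps the file count from growing with every streaming commit.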
### Solution 4: Union Read for Seamless Query Across Tiers
**Enables:** Querying hot + cold data as a single logical table
The architectural breakthrough enabling a real-time lakehouse is **client-side stitching with metadata coordination**. This is what makes Fluss a true **Streaming Lakehouse**: queries see real-time data because the most recent minutes of delta log, still held in Fluss, are unioned with the data already tiered to the lakehouse.
Union Read seamlessly combines hot and cold data through intelligent offset coordination, as illustrated above:

**The Example:** Consider a query that needs records for users Jark, Mehul, and Yuxia:

1. **Offset Coordination:** Fluss CoordinatorServer provides Snapshot 06 as the Iceberg boundary. At this snapshot, Iceberg contains `{Jark: 30, Yuxia: 20}`.

2. **Hot Data Supplement:** Fluss's real-time layer holds the latest updates beyond the snapshot: `{Jark: 30, Mehul: 20, Yuxia: 20}` (including Mehul's new record).

3. **Union Read in Action:** The query engine performs a union read:
   - Reads `{Jark: 30, Yuxia: 20}` from Iceberg (Snapshot 06)
   - Supplements with `{Mehul: 20}` from Fluss (new data after the snapshot)

4. **Sort Merge:** Results are merged and deduplicated by primary key, producing the final unified view: `{Jark: 30, Mehul: 20, Yuxia: 20}`. Jark's and Yuxia's records come from Iceberg, since they were already tiered; only Mehul's comes from Fluss.

**Key Benefit:** The application queries a single logical table while the system intelligently routes between Iceberg (historical) and Fluss (real-time) with zero gaps or overlaps.
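The stitching logic in this example can be sketched in a few lines. This is a minimal illustration, assuming the hot side is an offset-ordered changelog of `(offset, key, value)` tuples and that the coordinator hands the client a single boundary offset for the snapshot; `union_read`, the offsets, and the `None` delete marker are hypothetical, not the Fluss client API.

```python
def union_read(cold_snapshot, hot_changelog, boundary_offset):
    """Merge a cold snapshot with hot changes at or after boundary_offset."""
    view = dict(cold_snapshot)  # start from the Iceberg side
    for offset, key, value in hot_changelog:
        if offset < boundary_offset:
            continue  # already tiered into the snapshot; skip to avoid overlap
        if value is None:
            view.pop(key, None)  # None models a delete (assumption)
        else:
            view[key] = value  # newer hot value wins for the same primary key
    return dict(sorted(view.items()))  # deterministic, key-ordered result


iceberg_snapshot_06 = {"Jark": 30, "Yuxia": 20}
fluss_log = [
    (0, "Jark", 30),   # already contained in Snapshot 06
    (1, "Yuxia", 20),  # already contained in Snapshot 06
    (2, "Mehul", 20),  # arrived after the snapshot
]
result = union_read(iceberg_snapshot_06, fluss_log, boundary_offset=2)
```

Filtering on the boundary offset is what prevents double counting: entries below it are served only from Iceberg, entries at or above it only from Fluss.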
**Union Read Capabilities:**
@@ -353,24 +346,19 @@ Fluss coordinator persists this mapping. When clients query, they receive the ex
Union Read delivers sub-second lakehouse freshness: union delta log on Fluss, Arrow-native exchange, and seamless integration with Flink, Spark *, Trino, and StarRocks.
@@ -388,7 +376,7 @@ This gives you a working streaming lakehouse environment in minutes. Visit: [htt
Apache Fluss and Apache Iceberg represent a fundamental rethinking of real-time lakehouse architecture. Instead of forcing Iceberg to become a streaming platform (which architecturally it was never designed to be), Fluss embraces Iceberg for its strengths—cost-efficient analytical storage with ACID guarantees—while adding the missing hot streaming layer.

-The result is a Kappa architecture that delivers:
+The result is a Streamhouse that delivers:

- **Sub-second query latency** for real-time workloads
- **Second-level freshness** for analytical queries (versus T+1 hour)