blog: Introducing OLake Fusion #408
Closed
siddharth-chevella
wants to merge
9
commits into
datazip-inc:master
from
siddharth-chevella:fusion-intro-blog
28d59b3
blog: Introducing OLake Fusion
siddharth-chevella 83cc17b
changes-01
siddharth-chevella abafd0d
comments resolved
siddharth-chevella f402dfe
updated links
siddharth-chevella 1edd801
resolved comments -2
siddharth-chevella 9093931
chore: minor changes
siddharth-chevella 6bfbb66
chore: resolved badal comments
siddharth-chevella 7af0a24
update: delete file image
siddharth-chevella 8ba9ecd
reolved: badal comments 2
siddharth-chevella
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,139 @@ | ||
| --- | ||
| slug: olake-fusion-introduction | ||
| title: "We Built an Easier Way to Maintain Iceberg Tables" | ||
| description: Why Iceberg table maintenance tends to become a full-time headache, what teams are doing today to cope with it, and what we built at OLake to actually solve it. | ||
| date: 2026-04-28 | ||
| tags: [iceberg, olake, fusion, compaction, optimization, apache-iceberg, iceberg-tables, iceberg-maintenance, small-files, binpack-compaction, sort-compaction, manifest-rewrite, lakehouse, metadata-optimization] | ||
| authors: [siddharth] | ||
| image: /img/blog/2026/5/introducing-olake-fusion.webp | ||
| --- | ||
|
|
||
| # We Built a Better Way to Maintain Apache Iceberg Tables | ||
|
|
||
| Apache Iceberg is the right choice for most modern lakehouses ([read more](https://olake.io/blog/apache-iceberg-features-benefits/)). It gives you ACID guarantees, schema evolution, time travel, and genuinely fast analytical queries — without locking you into any single vendor or engine. The adoption numbers back it up: Iceberg has quietly become the default open table format for teams building serious data infrastructure. | ||
|
|
||
| But here's what nobody tells you when you're getting started: picking the right table format is only half the job. The other half is *keeping those tables healthy*. And that part? It's a lot harder than it looks. | ||
|
|
||
| This blog is about that second half — specifically, why Iceberg table maintenance tends to become a full-time headache, what teams are doing today to cope with it, and what we built at OLake to actually solve it. | ||
|
|
||
| ## The Problem That Sneaks Up on You | ||
|
|
||
| When you first set up Iceberg, everything feels fast and clean. Queries return in seconds. Pipelines run smoothly. The team is happy. | ||
|
|
||
| Then, slowly, things start to drift. | ||
|
|
||
| Queries that used to finish in seconds start taking minutes. Your dashboards feel a little sluggish. File counts are climbing. Nothing is *broken*, exactly — but something is clearly off. | ||
|
|
||
| What's happening is almost always the same thing: **Small Files**. | ||
|
|
||
| Every time a CDC pipeline writes data to Iceberg, it creates new files in object storage. That's how the format works. The trouble is that modern CDC pipelines write constantly: row-level changes stream in every few seconds, and each batch produces a tiny new file. What should have been 50 well-sized Parquet files turns into 50,000 tiny ones spread across your table. | |
|
|
||
|  | ||
|
|
||
| This is the small files problem, and it triggers a cascade of issues. | ||
|
|
||
| **Query engines have to work much harder.** Engines like Spark, Trino, or Athena don't read a table as a single unit. They read individual files. With 50,000 small files, every query involves thousands of extra file listings, metadata reads, and I/O round trips. The total data size hasn't changed, but the work has grown by orders of magnitude. | ||
|
|
||
|  | ||
|
|
||
| **Metadata becomes a bottleneck on its own.** Iceberg lists every data file in manifests. The more files you have, the heavier those lists get. Planning a query or committing a write then takes longer, because the engine has to scan a much larger inventory before it can do real work. | ||
|
|
||
| **Delete file accumulation makes this even worse.** In CDC-heavy pipelines, every sync doesn't just create new data files — it also creates delete files that track which rows were updated or deleted. These delete files are how Iceberg handles upserts without rewriting entire data files on every change. But delete files have a cost: every query has to apply them at read time to get the correct view of the data. As they pile up, the overhead of applying deletes during reads becomes significant. A table with thousands of delete files will be noticeably slower than the same table after they've been resolved. | ||
|
|
||
|  | ||
|
|
||
| **Object storage costs creep up silently.** Cloud storage doesn't just charge for how much data you store — it also charges per API request. More files means more reads, more listings, more API calls on every operation. You won't notice it until the bill shows up, and by then you've been overpaying for weeks. | ||
|
|
||
|  | ||
|
|
||
| None of this happens suddenly. It builds up quietly, which is exactly why it catches teams off guard. By the time performance is obviously degraded, the tables are already in rough shape. | ||
|
|
||
| ## What Teams Do Today (And Why It's a Grind) | ||
|
|
||
| The standard fix for small files and delete accumulation is **compaction** — periodically rewriting fragmented small files into larger, well-organized ones and resolving accumulated deletes into the data. Iceberg's ecosystem supports this, and Apache Spark has become the de facto tool for it via `rewrite_data_files`. | ||
|
|
||
| So most teams end up doing something like this: | ||
|
|
||
| They write a Python script that calls `rewrite_data_files(...)` with the right parameters. They figure out executor counts, memory settings, file size bounds, and parallelism through trial and error. They wire it up to Airflow or a cron job to run every 20 or 30 minutes. A few weeks later, their ingestion rate changes, and the parameters they tuned are no longer appropriate for the table's current state. | ||
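A minimal sketch of what such a script tends to look like is shown below. It only builds the Spark SQL `CALL` statement for Iceberg's `rewrite_data_files` procedure. The catalog name `lakehouse`, the table `sales.orders`, and the option values are hypothetical, and the actual `spark.sql(...)` call is left commented out because it needs a SparkSession configured for your Iceberg catalog.

```python
from typing import Mapping


def build_rewrite_call(catalog: str, table: str, options: Mapping[str, str]) -> str:
    """Build the Spark SQL statement for Iceberg's rewrite_data_files procedure."""
    # Options are passed to the procedure as a flat map('key', 'value', ...) literal.
    pairs = ", ".join(f"'{key}', '{value}'" for key, value in sorted(options.items()))
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', strategy => 'binpack', options => map({pairs}))"
    )


# Hand-tuned values (hypothetical). These are the numbers teams end up
# revisiting whenever the ingestion rate changes.
OPTIONS = {
    "target-file-size-bytes": str(512 * 1024 * 1024),
    "min-input-files": "5",
    "max-concurrent-file-group-rewrites": "4",
}

statement = build_rewrite_call("lakehouse", "sales.orders", OPTIONS)
print(statement)

# With a SparkSession configured for the Iceberg catalog, the job would then run:
# spark.sql(statement).show()
```

Every value in `OPTIONS` is a tuning decision made for one table at one point in time, which is why scripts like this tend to drift out of date.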
|
|
||
| This works. Teams do make it work. But look at what they're actually doing: | ||
|
|
||
| **Writing custom Spark scripts to maintain Iceberg tables.** The compaction script itself becomes a thing that needs documentation, version control, incident response, and occasional debugging at 2am when a job fails and nobody knows why. That's before you account for the fact that most teams have more than one Iceberg table. | |
|
|
||
| **Running one compaction setup for situations that need different treatment.** Some tables need frequent, aggressive compaction, while others only need lighter compaction on a slower schedule. Spark's `rewrite_data_files` doesn't differentiate — it processes whatever files fall within your size bounds, regardless of whether that's the right level of intervention for the current table state. | ||
|
|
||
| **Figuring out what happened by digging through scattered logs.** Most setups leave some record behind: Airflow history, Spark driver logs, or files on disk. The hard part is connecting that output back to the table itself: whether the file layout improved, whether deletes were absorbed, and whether the job really helped. When a run fails, or reports success while queries stay slow, the **why** is often still unclear. Errors and exit codes alone rarely say what went wrong; you hunt through executor logs and piece the picture together by hand. | |
|
|
||
| ## Introducing OLake Fusion | ||
|
|
||
|  | ||
|
|
||
| OLake Fusion is a dedicated Iceberg table maintenance service. It handles compaction for your Iceberg tables on a per-table cron schedule you configure — with tiered compaction levels, built-in metrics, and enough observability to actually understand what's happening to your tables. | ||
|
|
||
| No custom Spark scripts. No wondering if last night's compaction job did anything useful. | ||
|
|
||
| ### Tiered Compaction: The Right Level of Work at the Right Time | ||
|
|
||
| The most important thing about OLake Fusion's approach is that it doesn't treat all compaction as the same operation. It offers three compaction tiers that you can schedule independently, each designed for a different kind of table maintenance need. | ||
|
|
||
| **Lite** — Designed for small, frequent cleanup tasks. It keeps tables from slowly sliding into bad shape without using much compute, so you can run it often. | ||
|
|
||
| **Medium** — Designed for regular cleanup when small files and deletes are starting to slow reads. It does more work than Lite, but avoids the cost of rewriting the whole table. | ||
|
|
||
| **Full** — Designed for deep cleanup tasks where the whole table needs to be laid out fresh. It uses the most compute, so it makes sense for occasional resets, not frequent runs. | ||
|
|
||
| One important detail: if multiple tiers are scheduled to run at the same time, Fusion automatically runs only the highest one. Medium overrides Lite. Full overrides both. You don't end up doing redundant work when schedules overlap. | ||
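The override rule can be sketched in a few lines. This is an illustration of the behavior described above, not Fusion's actual implementation; the `Tier` enum and `tier_to_run` function are hypothetical names.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Tier(IntEnum):
    """Compaction tiers ordered from least to most work."""
    LITE = 1
    MEDIUM = 2
    FULL = 3


def tier_to_run(due: Iterable[Tier]) -> Optional[Tier]:
    """Return the single tier to execute when several are due at the same time."""
    # The highest tier wins; an empty schedule tick runs nothing.
    return max(due, default=None)


if __name__ == "__main__":
    print(tier_to_run([Tier.LITE, Tier.MEDIUM]))
    print(tier_to_run([Tier.LITE, Tier.MEDIUM, Tier.FULL]))
    print(tier_to_run([]))
```

Ordering the tiers as integers keeps the rule to a single `max` call, so adding a tier would only require a new enum member.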
|
|
||
| For the exact details of what each tier does, see the [Types of Compaction](https://olake.io/docs/iceberg-maintenance/compaction/overview/). | ||
|
|
||
| This tiered approach matters because it lets you optimize for cost and efficiency at the same time. Running Full compaction every few minutes on a CDC table is wasteful because you're rewriting data that doesn't need rewriting. Running only Lite is insufficient if delete files are building up and impacting read performance. The right answer is to run Lite frequently, Medium regularly, and Full occasionally. Fusion makes it easy to express exactly that. | |
|
|
||
| ### Cheaper than Spark compaction | ||
|
|
||
| On comparable infrastructure, Fusion costs about **50% less** than Apache Spark’s `rewrite_data_files` for the same compaction workload, without giving up table layout quality. Run-by-run timings, query checks, methodology, and the cost breakdown are covered in the [OLake Fusion vs Spark compaction benchmark](https://olake.io/blog/iceberg-compaction-spark-vs-fusion-benchmark/). | |
|
|
||
| ### Observability That Actually Tells You Something | ||
|
|
||
| Here's a problem that doesn't get talked about enough: with custom Spark compaction scripts, visibility is usually something you have to build and maintain yourself. | ||
|
|
||
| You can query Iceberg metadata tables before and after a Spark job to calculate file counts and delete counts. But in practice, teams still have to wire that into the job, store the results, connect them to run history, and make them easy to inspect when something feels slow. Fusion makes that visibility part of the product instead of yet another script to maintain. | |
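As a rough sketch of the do-it-yourself version, the snippet below summarizes rows shaped like Iceberg's `files` metadata table, using its `content` column (0 for data files, 1 for position deletes, 2 for equality deletes) and `file_size_in_bytes`. The sample rows are made up for illustration; in a real job they would come from something like `SELECT content, file_size_in_bytes FROM db.orders.files`, where `db.orders` is a hypothetical table name.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Values of the `content` column in Iceberg's `files` metadata table.
CONTENT_LABELS = {0: "data", 1: "position_deletes", 2: "equality_deletes"}


def summarize_files(rows: Iterable[Mapping[str, int]]) -> Dict[str, Dict[str, int]]:
    """Count files and total bytes per content type."""
    counts: Counter = Counter()
    sizes: Counter = Counter()
    for row in rows:
        label = CONTENT_LABELS[row["content"]]
        counts[label] += 1
        sizes[label] += row["file_size_in_bytes"]
    return {
        label: {"files": counts[label], "bytes": sizes[label]}
        for label in CONTENT_LABELS.values()
    }


MB = 1024 * 1024

# Made-up snapshots of the metadata table before and after a compaction run.
before_rows = [{"content": 0, "file_size_in_bytes": MB}] * 4 + [
    {"content": 1, "file_size_in_bytes": 1024}
] * 2
after_rows = [{"content": 0, "file_size_in_bytes": 4 * MB}]

before = summarize_files(before_rows)
after = summarize_files(after_rows)

for label in CONTENT_LABELS.values():
    print(f"{label}: {before[label]['files']} -> {after[label]['files']} files")
```

Even this small summary still has to be scheduled around the job, stored somewhere, and linked back to the run that produced it, which is the plumbing the paragraph above describes.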
|
|
||
| Fusion comes with observability built in, at two levels. | ||
|
|
||
| **Per-run logs and metrics.** Fusion keeps logs and metrics for each compaction run, so you can see what happened and dig in without starting from unrelated job noise. More in [Runs and logs](https://olake.io/docs/iceberg-maintenance/runs-and-logs). | ||
|
|
||
|  | ||
|
|
||
| **Input vs output for each run.** After each compaction run, Fusion shows metrics for inputs and outputs: counts and sizes for data files and deletes, recorded before and after the run. You read them straight from the UI instead of reconstructing totals from unstructured logs. | |
|
|
||
|  | ||
|
|
||
| **Table-level health metrics.** Fusion shows metrics for each table's current state so you can understand and decide whether it needs compaction. | ||
|
|
||
|  | ||
|
|
||
| The **Tables** page shows an overall **health score** for each table, so you get a first-pass view of whether compaction looks necessary before you dive into detailed metrics. | ||
|
|
||
|  | ||
|
|
||
| This is the kind of visibility that makes the difference between proactively maintaining your tables and reactively debugging performance issues after users are already complaining. | ||
|
|
||
| To learn more about these metrics, see the [metrics documentation](https://olake.io/docs/iceberg-maintenance/metrics). | |
|
|
||
| ## Where OLake Fusion fits in | ||
|
|
||
| Fusion connects to your Iceberg catalog. For each table, you configure the compaction schedule — which tiers to enable, and how often each one should run. You can think of it like cron: you define the cadence, Fusion executes it. | ||
|
|
||
| A typical setup depends on your CDC ingestion frequency. For example, if ingestion runs every 2 minutes, you might schedule Lite every 30 minutes, Medium every 6 hours, and Full every 2 days. Fusion handles the execution, the logging, and the metrics. If a run fails, you see it immediately in the runs view without having to dig through Airflow task logs or SSH into a Spark driver node. | ||
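To see what that example cadence adds up to, here is a small simulation over a 48-hour window. It assumes all three schedules are aligned to start at minute 0 and applies the highest-tier-wins rule when schedules overlap; the names and the simulation itself are illustrative, not part of Fusion.

```python
from collections import Counter

# Example cadence from the text, expressed in minutes.
INTERVAL_MINUTES = {"lite": 30, "medium": 6 * 60, "full": 2 * 24 * 60}
PRIORITY = ["lite", "medium", "full"]  # lowest to highest tier


def simulate(window_minutes: int) -> Counter:
    """Count the runs per tier when only the highest due tier executes."""
    runs: Counter = Counter()
    step = min(INTERVAL_MINUTES.values())
    for minute in range(0, window_minutes, step):
        due = [tier for tier in PRIORITY if minute % INTERVAL_MINUTES[tier] == 0]
        if due:
            runs[due[-1]] += 1  # the last entry is the highest tier that is due
    return runs


runs = simulate(window_minutes=48 * 60)
print(dict(runs))
```

Under these assumptions the window contains 96 schedule ticks and exactly 96 runs, since each overlapping tick collapses into a single run of the highest tier.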
|
|
||
| If you're already using OLake for CDC ingestion, Fusion integrates naturally — same catalog, same UI. But it also works as a standalone service if you're using a different ingestion tool. | ||
|
|
||
| For a step-by-step walkthrough, see [Getting Started with Fusion](https://olake.io/docs/getting-started/configure-first-compaction). | |
|
|
||
| ## TL;DR | ||
|
|
||
| If you're running Iceberg with CDC pipelines, table maintenance isn't optional — it's the difference between a lakehouse that stays fast and one that gradually becomes unusable. The small files problem and delete file accumulation are real, they compound over time, and they're hard to notice until performance is already degraded. | ||
|
|
||
| Spark-based compaction works, but only if you build and run those jobs yourself. They are often slow and expensive, and it can be hard to tell if each run really helped. | ||
|
|
||
| OLake Fusion is built specifically for this. Tiered compaction that matches the level of work to what the table actually needs. 2x faster than Spark. About half the cost. And enough observability to actually understand what's happening to your tables — before your users start asking why queries are slow. | ||