chore: blog on dest refactor #209
base: master
Conversation
Colors in some of the headings are invisible due to the background color, e.g. Global Schema.
> 2. **Slow throughput**: inefficient data processing pipeline resulted in poor ingestion performance
> 3. **High memory consumption**: excessive serialization/deserialization and large JSON envelopes increased CPU and memory pressure
> 4. **No proper file sizing**: inconsistent file sizes led to suboptimal query performance and storage efficiency
> 5. **Schema unaware Go path**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
Suggested change:
> Before: 5. **Schema unaware Go path**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
> After: 5. **Go managed schema evolution/orchestration**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
> **Benefits of this approach:**
>
> - **Reduced RPC chatter**: Fewer, larger batches mean fewer network round trips
It's just a local transfer; it won't matter much.
> - **Reduced RPC chatter**: Fewer, larger batches mean fewer network round trips
> - **Bounded working sets**: Memory usage is predictable and controlled
> - **Consistent file sizes**: Better query performance and storage efficiency
We haven't explained how this is achieved.
> **Key insight**: The commit is atomic. Either the entire batch becomes visible in the Iceberg table, or nothing does. No partial state. This atomicity is crucial because it means readers will never see inconsistent intermediate states, even during concurrent operations.
What is a batch? (We currently explain batch = 10k records in the previous image.)
Ideally we should explain it: it has to be something like a chunk for full-load, or for CDC, reaching the latest checkpoint.
> | Scenario | Behavior | Why It Works |
> |----------|----------|--------------|
> | **Crash before Iceberg commit** | Nothing visible; source not acked; retry produces same batch | Atomic commit ensures no partial state is visible; source replay provides idempotency |
> | **Crash after commit, before ack** | Replay converges via upsert; no double-materialization | Iceberg's upsert semantics handle duplicate data correctly; no background jobs needed |
We don't support this case for now. We need to handle it by checking whether the commit id is already present in Iceberg or not, something like a two-phase commit.
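The idempotent-retry idea from this comment could be sketched roughly as below. This is a minimal in-memory illustration, not the actual pipeline: `Table`, `Commit`, and the batch ids are all hypothetical stand-ins for checking a commit id against the Iceberg snapshot before applying a replayed batch.

```go
package main

import (
	"fmt"
	"sync"
)

// Table is a hypothetical in-memory stand-in for an Iceberg table: a
// snapshot is the set of data files plus the ids of batches that produced them.
type Table struct {
	mu        sync.Mutex
	files     []string
	committed map[string]bool // batch ids already committed
}

func NewTable() *Table {
	return &Table{committed: make(map[string]bool)}
}

// Commit applies a batch atomically and idempotently: if the batch id is
// already present in the snapshot, the replayed commit is a no-op, which
// covers the "crash after commit, before ack" case from the table above.
func (t *Table) Commit(batchID string, files []string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.committed[batchID] {
		return false // duplicate replay; nothing becomes visible twice
	}
	t.files = append(t.files, files...)
	t.committed[batchID] = true
	return true
}

func main() {
	tbl := NewTable()
	first := tbl.Commit("batch-42", []string{"f1.parquet", "f2.parquet"})
	replay := tbl.Commit("batch-42", []string{"f1.parquet", "f2.parquet"}) // retry after crash
	fmt.Println(first, replay, len(tbl.files))
}
```

A real two-phase variant would persist the batch id inside the Iceberg snapshot metadata itself so the check survives process restarts.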
> The Go side of our architecture is responsible for the high-level data plane operations: concurrency management, intelligent batching, and schema coordination. This design leverages Go's strengths in concurrent programming while keeping the complex Iceberg I/O operations in Java where the native libraries are most mature.
>
> The key responsibilities of the Go data plane include:
Flattening as well.
> ### Parallel Normalization and Schema Detection
>
> Normalization and schema evolution checks run in parallel at the thread level. Each thread builds a local candidate schema from its batch, compares it against the stream's global schema, and only acquires the stream-level schema lock if a difference is detected. This approach minimizes contention and allows for efficient parallel processing.
What is normalisation?
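The compare-then-lock pattern the quoted paragraph describes could look roughly like this. It's a minimal sketch with hypothetical names (`Schema`, `Stream`, `Reconcile`); the real writer is more involved, but the point is the same: identical schemas never touch the write lock.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
	"sync"
)

// Schema is a hypothetical stand-in for a table schema: column name -> type.
type Schema map[string]string

// Stream holds the shared (global) schema for one stream. Threads compare
// against it under a read lock and only take the write lock when their
// local candidate actually differs, which keeps contention low.
type Stream struct {
	mu     sync.RWMutex
	global Schema
}

// Reconcile is the compare-then-lock step each writer thread runs on its batch.
func (s *Stream) Reconcile(local Schema) bool {
	s.mu.RLock()
	same := reflect.DeepEqual(local, s.global)
	s.mu.RUnlock()
	if same {
		return false // fast path: no schema change, no write lock taken
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for col, typ := range local {
		if _, ok := s.global[col]; !ok {
			s.global[col] = typ // evolve: add the newly observed column
		}
	}
	return true
}

func main() {
	st := &Stream{global: Schema{"id": "long"}}
	evolved := st.Reconcile(Schema{"id": "long", "email": "string"})
	cols := make([]string, 0, len(st.global))
	for c := range st.global {
		cols = append(cols, c)
	}
	sort.Strings(cols)
	fmt.Println(evolved, cols)
}
```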
> The key responsibilities of the Go data plane include:
>
> - **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
Suggested change:
> Before: - **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
> After: - **Concurrent processing**: Managing multiple writer threads that can process different full-load chunks, partitions, or incremental/CDC streams simultaneously
> The key responsibilities of the Go data plane include:
>
> - **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
> - **Intelligent batching**: Collecting records into optimal batch sizes for efficient processing
I don't think there is any intelligence here; we have fixed the batch size to 10k.
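Per this comment, batching is a fixed threshold rather than anything adaptive. A minimal sketch, assuming the 10k-record batch size mentioned in the review (`Batcher` and its methods are hypothetical names):

```go
package main

import "fmt"

// batchSize mirrors the fixed 10k-record threshold mentioned in the review.
const batchSize = 10000

// Batcher accumulates records and flushes whenever the fixed threshold is hit.
type Batcher struct {
	buf     []string
	flushed int // number of batches handed off so far
}

// Add buffers one record and flushes when the batch is full.
func (b *Batcher) Add(rec string) {
	b.buf = append(b.buf, rec)
	if len(b.buf) >= batchSize {
		b.flush()
	}
}

// Close flushes any trailing partial batch at end of stream.
func (b *Batcher) Close() {
	if len(b.buf) > 0 {
		b.flush()
	}
}

func (b *Batcher) flush() {
	// In the real pipeline this is where a batch would cross the gRPC
	// boundary to the Java writer; here we just count the handoff.
	b.flushed++
	b.buf = b.buf[:0]
}

func main() {
	b := &Batcher{}
	for i := 0; i < 25000; i++ {
		b.Add(fmt.Sprintf("r%d", i))
	}
	b.Close()
	fmt.Println(b.flushed) // 25k records -> 2 full batches + 1 partial = 3
}
```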
> - **Schema sharing**: All threads for a stream share the same schema artifact to ensure consistency
> - **Thread isolation**: Each thread has its own buffer and processing context
> - **Resource management**: Proper initialization and cleanup of writer resources
> - **Configuration**: Thread-specific options like batch sizes and timeouts
I think we need to remove this, as we don't have it.
> Key aspects of the thread setup:
>
> - **Schema sharing**: All threads for a stream share the same schema artifact to ensure consistency
> - **Thread isolation**: Each thread has its own buffer and processing context
Suggested change:
> Before: - **Thread isolation**: Each thread has its own buffer and processing context
> After: - **Thread isolation**: Each thread has its own buffer and processing context for error handling
> The flush process follows this sequence:
>
> 1. **Data flattening and schema detection**: Records are flattened and analyzed for schema changes
Flattening or normalisation? We need to stick to one keyword.
> ---
>
> ## gRPC Contract
I think the main benefit is that we only mention the schema/types once and then attach just records/values. This enhances the write as less data needs to be read, just like pgoutput vs wal2json.
> ---
>
> ## Partition Fanout Writer
We need to mention that it uses a rolling writer: basically it closes one file before writing another file for that partition.
Also, this approach works on partition-key hashing (vs the previous approach where we sort the records); we are trading slightly higher memory use for faster speed.
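The two comments above can be combined into one sketch: route records to a per-partition writer by key (hashed routing via a map, rather than sorting the whole input), where each writer rolls its file once a size threshold is reached. All names (`rollingWriter`, `fanout`, the threshold) are hypothetical; the trade-off is one resident writer per partition (memory) in exchange for a single streaming pass (speed).

```go
package main

import (
	"fmt"
	"sort"
)

// rollingWriter stands in for one partition's writer: it closes the current
// file once maxRecords is reached, and only then starts the next one, which
// keeps per-partition file sizes consistent.
type rollingWriter struct {
	maxRecords int
	inFile     int // records in the currently open file
	filesDone  int // files already closed for this partition
}

func (w *rollingWriter) write(rec string) {
	w.inFile++
	if w.inFile >= w.maxRecords {
		w.filesDone++ // close this file before opening another
		w.inFile = 0
	}
}

// fanout keeps one open rolling writer per partition key and routes each
// record by hashing the key (Go's map does the hashing), instead of
// sorting all records by partition first.
type fanout struct {
	writers    map[string]*rollingWriter
	maxRecords int
}

func (f *fanout) write(partitionKey, rec string) {
	w, ok := f.writers[partitionKey]
	if !ok {
		w = &rollingWriter{maxRecords: f.maxRecords}
		f.writers[partitionKey] = w
	}
	w.write(rec)
}

func main() {
	f := &fanout{writers: map[string]*rollingWriter{}, maxRecords: 50}
	regions := []string{"us", "eu", "in"}
	for i := 0; i < 550; i++ {
		f.write(regions[i%3], fmt.Sprintf("r%d", i))
	}
	keys := make([]string, 0, len(f.writers))
	for k := range f.writers {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "files closed:", f.writers[k].filesDone)
	}
}
```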
> 1. **Batch size optimization** reduces the frequency of expensive operations (RPC calls, file writes)
> 2. **Typed serialization** makes each operation faster and uses less memory
> 3. **Thread-scoped schema** eliminates contention that would otherwise limit concurrency
> 4. **Native Iceberg I/O** leverages the most efficient data structures and algorithms
I'm not able to understand this.
> The performance improvements compound because they address different bottlenecks in the pipeline:
>
> 1. **Batch size optimization** reduces the frequency of expensive operations (RPC calls, file writes)
> 2. **Typed serialization** makes each operation faster and uses less memory
Suggested change:
> Before: 2. **Typed serialization** makes each operation faster and uses less memory
> After: 2. **Typed serialization using protobuf** makes each operation faster and uses less memory
> - **Improves query performance**: Larger, consistently-sized files enable better query planning and execution
> - **Optimizes storage efficiency**: Reduces the number of small files that can hurt storage performance
> - **Enables better compaction**: Consistent file sizes make compaction strategies more predictable and efficient
No need for compaction after full-load
> The destination refactor represents a fundamental shift in how we approach data pipeline architecture. By carefully separating concerns between Go and Java components and eliminating unnecessary complexity, we've achieved both significant performance improvements and stronger correctness guarantees.
>
> ### Key Achievements
Merge this and the Benefits section; currently it feels like too much duplication.
> ### Immediate Priorities (Next 3-6 months)
>
> - **Iceberg merge-on-read optimizations**: Implementing intelligent file pruning and predicate pushdown to reduce query latency
Is this how we would implement it? Equality to positional deletes?
> - **Iceberg merge-on-read optimizations**: Implementing intelligent file pruning and predicate pushdown to reduce query latency
> - **Real-time metrics and observability**: Building comprehensive monitoring dashboards for throughput, latency, and error rates
> - **DLQ/SMT hooks**: Leveraging the typed contract and schema awareness to add dead-lettering and smart transforms without reintroducing JSON envelopes
Arrow-based writes?
> ### Medium-term Goals (6-12 months)
>
> - **Multi-region replication**: Extending the atomic commit model to support cross-region data replication
Any supporting references/links showing that this is even needed/possible?
> - **Multi-region replication**: Extending the atomic commit model to support cross-region data replication
> - **Advanced compression**: Implementing columnar compression optimizations for better storage efficiency
> - **Query acceleration**: Pre-computed aggregations and materialized views for common query patterns
> - **Kubernetes-native deployment**: Operator-based deployment and management for cloud-native environments
We already have this as a Helm chart, right?
Instead, we can mention auto-scaling for the managed offering.
> - **Query acceleration**: Pre-computed aggregations and materialized views for common query patterns
> - **Kubernetes-native deployment**: Operator-based deployment and management for cloud-native environments
>
> ### Long-term Vision (12+ months)
Let's remove this; it's fully AI-generated.