chore: blog on dest refactor #209
base: master
Conversation
Colors in some of the headings are invisible due to the background color, e.g. Global Schema.
> 2. **Slow throughput**: inefficient data processing pipeline resulted in poor ingestion performance
> 3. **High memory consumption**: excessive serialization/deserialization and large JSON envelopes increased CPU and memory pressure
> 4. **No proper file sizing**: inconsistent file sizes led to suboptimal query performance and storage efficiency
> 5. **Schema unaware Go path**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
Suggested change:
> Before: 5. **Schema unaware Go path**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
> After: 5. **Go managed schema evolution/orchestration**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
> **Benefits of this approach:**
>
> - **Reduced RPC chatter**: Fewer, larger batches mean fewer network round trips
It's just a local transfer; it won't matter much.
> - **Reduced RPC chatter**: Fewer, larger batches mean fewer network round trips
> - **Bounded working sets**: Memory usage is predictable and controlled
> - **Consistent file sizes**: Better query performance and storage efficiency
We haven't explained how this is achieved.
> **Key insight**: The commit is atomic. Either the entire batch becomes visible in the Iceberg table, or nothing does. No partial state. This atomicity is crucial because it means readers will never see inconsistent intermediate states, even during concurrent operations.
What is a batch? (We currently explain batch = 10k records in the previous image.)
Ideally we should explain it: it has to be something like a chunk for full-load, or for CDC, reaching the latest checkpoint.
> | Scenario | Behavior | Why It Works |
> |----------|----------|--------------|
> | **Crash before Iceberg commit** | Nothing visible; source not acked; retry produces same batch | Atomic commit ensures no partial state is visible; source replay provides idempotency |
> | **Crash after commit, before ack** | Replay converges via upsert; no double-materialization | Iceberg's upsert semantics handle duplicate data correctly; no background jobs needed |
We don't support this case for now. We need to handle it by checking whether the commit id is already present in Iceberg or not, something like a two-phase commit.
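The idempotent-retry idea from this comment could be sketched roughly as below. This is a minimal in-memory illustration, not the actual pipeline: `Table`, `Commit`, and the batch ids are all hypothetical stand-ins for checking a commit id against the Iceberg snapshot before applying a replayed batch.

```go
package main

import (
	"fmt"
	"sync"
)

// Table is a hypothetical in-memory stand-in for an Iceberg table: a
// snapshot is the set of data files plus the ids of batches that produced them.
type Table struct {
	mu        sync.Mutex
	files     []string
	committed map[string]bool // batch ids already committed
}

func NewTable() *Table {
	return &Table{committed: make(map[string]bool)}
}

// Commit applies a batch atomically and idempotently: if the batch id is
// already present in the snapshot, the replayed commit is a no-op, which
// covers the "crash after commit, before ack" case from the table above.
func (t *Table) Commit(batchID string, files []string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.committed[batchID] {
		return false // duplicate replay; nothing becomes visible twice
	}
	t.files = append(t.files, files...)
	t.committed[batchID] = true
	return true
}

func main() {
	tbl := NewTable()
	first := tbl.Commit("batch-42", []string{"f1.parquet", "f2.parquet"})
	replay := tbl.Commit("batch-42", []string{"f1.parquet", "f2.parquet"}) // retry after crash
	fmt.Println(first, replay, len(tbl.files))
}
```

A real two-phase variant would persist the batch id inside the Iceberg snapshot metadata itself so the check survives process restarts.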
> The Go side of our architecture is responsible for the high-level data plane operations: concurrency management, intelligent batching, and schema coordination. This design leverages Go's strengths in concurrent programming while keeping the complex Iceberg I/O operations in Java where the native libraries are most mature.
>
> The key responsibilities of the Go data plane include:
Flattening as well.
> ### Parallel Normalization and Schema Detection
>
> Normalization and schema evolution checks run in parallel at the thread level. Each thread builds a local candidate schema from its batch, compares it against the stream's global schema, and only acquires the stream-level schema lock if a difference is detected. This approach minimizes contention and allows for efficient parallel processing.
What is normalisation?
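The compare-then-lock pattern the quoted paragraph describes could look roughly like this. It's a minimal sketch with hypothetical names (`Schema`, `Stream`, `Reconcile`); the real writer is more involved, but the point is the same: identical schemas never touch the write lock.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
	"sync"
)

// Schema is a hypothetical stand-in for a table schema: column name -> type.
type Schema map[string]string

// Stream holds the shared (global) schema for one stream. Threads compare
// against it under a read lock and only take the write lock when their
// local candidate actually differs, which keeps contention low.
type Stream struct {
	mu     sync.RWMutex
	global Schema
}

// Reconcile is the compare-then-lock step each writer thread runs on its batch.
func (s *Stream) Reconcile(local Schema) bool {
	s.mu.RLock()
	same := reflect.DeepEqual(local, s.global)
	s.mu.RUnlock()
	if same {
		return false // fast path: no schema change, no write lock taken
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for col, typ := range local {
		if _, ok := s.global[col]; !ok {
			s.global[col] = typ // evolve: add the newly observed column
		}
	}
	return true
}

func main() {
	st := &Stream{global: Schema{"id": "long"}}
	evolved := st.Reconcile(Schema{"id": "long", "email": "string"})
	cols := make([]string, 0, len(st.global))
	for c := range st.global {
		cols = append(cols, c)
	}
	sort.Strings(cols)
	fmt.Println(evolved, cols)
}
```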
> The key responsibilities of the Go data plane include:
>
> - **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
Suggested change:
> Before: - **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
> After: - **Concurrent processing**: Managing multiple writer threads that can process different full-load chunks, partitions, or incremental/CDC streams simultaneously
> The key responsibilities of the Go data plane include:
>
> - **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
> - **Intelligent batching**: Collecting records into optimal batch sizes for efficient processing
I don't think there is any intelligence here; we have fixed the batch size to 10k.
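Per this comment, batching is a fixed threshold rather than anything adaptive. A minimal sketch, assuming the 10k-record batch size mentioned in the review (`Batcher` and its methods are hypothetical names):

```go
package main

import "fmt"

// batchSize mirrors the fixed 10k-record threshold mentioned in the review.
const batchSize = 10000

// Batcher accumulates records and flushes whenever the fixed threshold is hit.
type Batcher struct {
	buf     []string
	flushed int // number of batches handed off so far
}

// Add buffers one record and flushes when the batch is full.
func (b *Batcher) Add(rec string) {
	b.buf = append(b.buf, rec)
	if len(b.buf) >= batchSize {
		b.flush()
	}
}

// Close flushes any trailing partial batch at end of stream.
func (b *Batcher) Close() {
	if len(b.buf) > 0 {
		b.flush()
	}
}

func (b *Batcher) flush() {
	// In the real pipeline this is where a batch would cross the gRPC
	// boundary to the Java writer; here we just count the handoff.
	b.flushed++
	b.buf = b.buf[:0]
}

func main() {
	b := &Batcher{}
	for i := 0; i < 25000; i++ {
		b.Add(fmt.Sprintf("r%d", i))
	}
	b.Close()
	fmt.Println(b.flushed) // 25k records -> 2 full batches + 1 partial = 3
}
```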
> - **Schema sharing**: All threads for a stream share the same schema artifact to ensure consistency
> - **Thread isolation**: Each thread has its own buffer and processing context
> - **Resource management**: Proper initialization and cleanup of writer resources
> - **Configuration**: Thread-specific options like batch sizes and timeouts
I think we need to remove this, as we don't have it.
> Key aspects of the thread setup:
>
> - **Schema sharing**: All threads for a stream share the same schema artifact to ensure consistency
> - **Thread isolation**: Each thread has its own buffer and processing context
Suggested change:
> Before: - **Thread isolation**: Each thread has its own buffer and processing context
> After: - **Thread isolation**: Each thread has its own buffer and processing context for error handling
> The flush process follows this sequence:
>
> 1. **Data flattening and schema detection**: Records are flattened and analyzed for schema changes
Flattening or normalisation? We need to stick to one keyword.
> ---
>
> ## gRPC Contract
I think the main benefit is that we only mention the schema/types once and then attach just records/values. This enhances the write as less data needs to be read, just like pgoutput vs wal2json.
> ---
>
> ## Partition Fanout Writer
We need to mention that it uses a rolling writer: basically it closes one file before writing another file for that partition.
Also, this approach works on partition-key hashing (vs the previous approach where we sort the records); we are trading slightly higher memory use for faster speed.
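The two comments above can be combined into one sketch: route records to a per-partition writer by key (hashed routing via a map, rather than sorting the whole input), where each writer rolls its file once a size threshold is reached. All names (`rollingWriter`, `fanout`, the threshold) are hypothetical; the trade-off is one resident writer per partition (memory) in exchange for a single streaming pass (speed).

```go
package main

import (
	"fmt"
	"sort"
)

// rollingWriter stands in for one partition's writer: it closes the current
// file once maxRecords is reached, and only then starts the next one, which
// keeps per-partition file sizes consistent.
type rollingWriter struct {
	maxRecords int
	inFile     int // records in the currently open file
	filesDone  int // files already closed for this partition
}

func (w *rollingWriter) write(rec string) {
	w.inFile++
	if w.inFile >= w.maxRecords {
		w.filesDone++ // close this file before opening another
		w.inFile = 0
	}
}

// fanout keeps one open rolling writer per partition key and routes each
// record by hashing the key (Go's map does the hashing), instead of
// sorting all records by partition first.
type fanout struct {
	writers    map[string]*rollingWriter
	maxRecords int
}

func (f *fanout) write(partitionKey, rec string) {
	w, ok := f.writers[partitionKey]
	if !ok {
		w = &rollingWriter{maxRecords: f.maxRecords}
		f.writers[partitionKey] = w
	}
	w.write(rec)
}

func main() {
	f := &fanout{writers: map[string]*rollingWriter{}, maxRecords: 50}
	regions := []string{"us", "eu", "in"}
	for i := 0; i < 550; i++ {
		f.write(regions[i%3], fmt.Sprintf("r%d", i))
	}
	keys := make([]string, 0, len(f.writers))
	for k := range f.writers {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "files closed:", f.writers[k].filesDone)
	}
}
```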
> 1. **Batch size optimization** reduces the frequency of expensive operations (RPC calls, file writes)
> 2. **Typed serialization** makes each operation faster and uses less memory
> 3. **Thread-scoped schema** eliminates contention that would otherwise limit concurrency
> 4. **Native Iceberg I/O** leverages the most efficient data structures and algorithms
I'm not able to understand this.
> The performance improvements compound because they address different bottlenecks in the pipeline:
>
> 1. **Batch size optimization** reduces the frequency of expensive operations (RPC calls, file writes)
> 2. **Typed serialization** makes each operation faster and uses less memory
Suggested change:
> Before: 2. **Typed serialization** makes each operation faster and uses less memory
> After: 2. **Typed serialization using protobuf** makes each operation faster and uses less memory
> - **Improves query performance**: Larger, consistently-sized files enable better query planning and execution
> - **Optimizes storage efficiency**: Reduces the number of small files that can hurt storage performance
> - **Enables better compaction**: Consistent file sizes make compaction strategies more predictable and efficient
No need for compaction after full-load
> The destination refactor represents a fundamental shift in how we approach data pipeline architecture. By carefully separating concerns between Go and Java components and eliminating unnecessary complexity, we've achieved both significant performance improvements and stronger correctness guarantees.
>
> ### Key Achievements
Merge this and the Benefits section; currently it feels like too much duplication.
> ### Immediate Priorities (Next 3-6 months)
>
> - **Iceberg merge-on-read optimizations**: Implementing intelligent file pruning and predicate pushdown to reduce query latency
Is this how we would implement it? Equality to positional deletes?
> - **Iceberg merge-on-read optimizations**: Implementing intelligent file pruning and predicate pushdown to reduce query latency
> - **Real-time metrics and observability**: Building comprehensive monitoring dashboards for throughput, latency, and error rates
> - **DLQ/SMT hooks**: Leveraging the typed contract and schema awareness to add dead-lettering and smart transforms without reintroducing JSON envelopes
Arrow-based writes?
> ### Medium-term Goals (6-12 months)
>
> - **Multi-region replication**: Extending the atomic commit model to support cross-region data replication
Any supporting references/links showing that this is even needed/possible?
> - **Multi-region replication**: Extending the atomic commit model to support cross-region data replication
> - **Advanced compression**: Implementing columnar compression optimizations for better storage efficiency
> - **Query acceleration**: Pre-computed aggregations and materialized views for common query patterns
> - **Kubernetes-native deployment**: Operator-based deployment and management for cloud-native environments
We already have this as a Helm chart, right?
Instead, we can mention auto-scaling for the managed offering.
> - **Query acceleration**: Pre-computed aggregations and materialized views for common query patterns
> - **Kubernetes-native deployment**: Operator-based deployment and management for cloud-native environments
>
> ### Long-term Vision (12+ months)
Let's remove this; it's fully AI-generated.