
Conversation

@hash-data (Collaborator):

Destination Refactor Blog

Contributor:

Colors in some of the headings are invisible due to the background color, e.g. Global Schema.

2. **Slow throughput**: inefficient data processing pipeline resulted in poor ingestion performance
3. **High memory consumption**: excessive serialization/deserialization and large JSON envelopes increased CPU and memory pressure
4. **No proper file sizing**: inconsistent file sizes led to suboptimal query performance and storage efficiency
5. **Schema unaware Go path**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
Contributor:

Suggested change
5. **Schema unaware Go path**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema
5. **Go managed schema evolution/orchestration**: parallel schema evolutions could conflict because Go lacked first-class knowledge of table schema


**Benefits of this approach:**

- **Reduced RPC chatter**: Fewer, larger batches mean fewer network round trips
Contributor:

It's just a local transfer; it won't matter much.


- **Reduced RPC chatter**: Fewer, larger batches mean fewer network round trips
- **Bounded working sets**: Memory usage is predictable and controlled
- **Consistent file sizes**: Better query performance and storage efficiency
Contributor:

We have not explained how this is being achieved?


**Key insight**: The commit is atomic. Either the entire batch becomes visible in the Iceberg table, or nothing does. No partial state. This atomicity is crucial because it means readers will never see inconsistent intermediate states, even during concurrent operations.
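For illustration, here is a minimal Go sketch of the ordering this guarantee gives us on the flush path; the interfaces and names below are hypothetical, not the actual implementation:

```go
package destination

import "context"

// Hypothetical stand-ins for the Java writer RPC and the source connector.
type IcebergWriter interface {
	WriteBatch(ctx context.Context, records [][]byte) error // stage data files for the batch
	Commit(ctx context.Context) error                       // one atomic Iceberg commit
}

type Source interface {
	Ack(ctx context.Context, checkpoint string) error // advance the source checkpoint
}

// flushBatch shows why readers never observe partial state: the batch is staged,
// made visible in a single atomic commit, and only then acknowledged upstream.
func flushBatch(ctx context.Context, w IcebergWriter, src Source, records [][]byte, checkpoint string) error {
	if err := w.WriteBatch(ctx, records); err != nil {
		return err // nothing visible, nothing acked; a retry resends the same batch
	}
	if err := w.Commit(ctx); err != nil {
		return err // the commit fails as a whole; the table is unchanged
	}
	return src.Ack(ctx, checkpoint) // ack only after the data is durably visible
}
```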
Contributor:

what is a batch? (currently we have explained batch = 10k records in the previous image)

Contributor:

Ideally we should explain it: a batch is like a chunk for full-load, or, for CDC, the records up to the latest checkpoint.

| Scenario | Behavior | Why It Works |
|----------|----------|--------------|
| **Crash before Iceberg commit** | Nothing visible; source not acked; retry produces same batch | Atomic commit ensures no partial state is visible; source replay provides idempotency |
| **Crash after commit, before ack** | Replay converges via upsert; no double-materialization | Iceberg's upsert semantics handle duplicate data correctly; no background jobs needed |
@shubham19may (Contributor), Oct 13, 2025:

We don't support this case for now. We need to handle it by checking whether the commit id is already present in Iceberg, something like a two-phase commit.


The Go side of our architecture is responsible for the high-level data plane operations: concurrency management, intelligent batching, and schema coordination. This design leverages Go's strengths in concurrent programming while keeping the complex Iceberg I/O operations in Java where the native libraries are most mature.

The key responsibilities of the Go data plane include:
Contributor:

flattening as well


### Parallel Normalization and Schema Detection

Normalization and schema evolution checks run in parallel at the thread level. Each thread builds a local candidate schema from its batch, compares it against the stream's global schema, and only acquires the stream-level schema lock if a difference is detected. This approach minimizes contention and allows for efficient parallel processing.
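A rough Go sketch of that check-then-lock pattern (types and names are hypothetical; the real code will differ):

```go
package destination

import "sync"

// Schema maps column names to detected types; a deliberately simplified representation.
type Schema map[string]string

type Stream struct {
	mu     sync.RWMutex
	global Schema
}

// needsEvolution reports whether the candidate contains a column or type
// that the global schema does not already know about.
func needsEvolution(global, candidate Schema) bool {
	for col, typ := range candidate {
		if existing, ok := global[col]; !ok || existing != typ {
			return true
		}
	}
	return false
}

// reconcile is called per thread with that thread's local candidate schema.
// The exclusive lock is only taken when an actual difference is detected,
// which keeps contention low on the common no-change path.
func (s *Stream) reconcile(candidate Schema) {
	s.mu.RLock()
	evolve := needsEvolution(s.global, candidate)
	s.mu.RUnlock()
	if !evolve {
		return
	}

	s.mu.Lock()
	defer s.mu.Unlock()
	for col, typ := range candidate {
		if _, ok := s.global[col]; !ok {
			s.global[col] = typ // add newly detected columns (type promotion omitted here)
		}
	}
}
```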
Contributor:

what is normalisation?


The key responsibilities of the Go data plane include:

- **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
Contributor:

Suggested change
- **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
- **Concurrent processing**: Managing multiple writer threads that can process different full-load-chunks or partitions or incremental/cdc at streams level simultaneously

The key responsibilities of the Go data plane include:

- **Concurrent processing**: Managing multiple writer threads that can process different partitions or streams simultaneously
- **Intelligent batching**: Collecting records into optimal batch sizes for efficient processing
Contributor:

I don't think there is any intelligence here; we have fixed the batch size to 10k.
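For reference, that fixed-threshold behavior is simple to show; a minimal sketch assuming the 10k-record batch size mentioned above (all names hypothetical):

```go
package destination

import "context"

const batchSize = 10_000 // fixed batch size referenced in the post; nothing adaptive

// Buffer accumulates records for a single writer thread and flushes whenever
// the fixed threshold is reached.
type Buffer struct {
	records [][]byte
	flush   func(ctx context.Context, records [][]byte) error // e.g. the gRPC write call
}

// Add appends one record and flushes the whole batch once it hits batchSize.
func (b *Buffer) Add(ctx context.Context, rec []byte) error {
	b.records = append(b.records, rec)
	if len(b.records) < batchSize {
		return nil
	}
	if err := b.flush(ctx, b.records); err != nil {
		return err // keep the buffer so the same batch can be retried
	}
	b.records = b.records[:0] // reuse the backing array to keep memory bounded
	return nil
}
```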

- **Schema sharing**: All threads for a stream share the same schema artifact to ensure consistency
- **Thread isolation**: Each thread has its own buffer and processing context
- **Resource management**: Proper initialization and cleanup of writer resources
- **Configuration**: Thread-specific options like batch sizes and timeouts
Contributor:

I think we need to remove this, as we don't have it.

Key aspects of the thread setup:

- **Schema sharing**: All threads for a stream share the same schema artifact to ensure consistency
- **Thread isolation**: Each thread has its own buffer and processing context
Contributor:

Suggested change
- **Thread isolation**: Each thread has its own buffer and processing context
- **Thread isolation**: Each thread has its own buffer and processing context for error handling


The flush process follows this sequence:

1. **Data flattening and schema detection**: Records are flattened and analyzed for schema changes
Contributor:

flattening or normalisation? we need to stick to one keyword


---

## gRPC Contract
Contributor:

I think the main benefit is that we mention the schema/types only once and then attach just records/values. This speeds up the write path, as less data needs to be read, just like pgoutput vs wal2json.
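In other words, the schema travels once per stream and subsequent messages carry only positional values. A hypothetical Go view of those two message shapes (the real contract is protobuf and differs in detail):

```go
package destination

// SchemaMessage is sent once per stream (and again only when the schema evolves).
// It fixes column names, types, and positions up front.
type SchemaMessage struct {
	StreamID string
	Columns  []Column
}

type Column struct {
	Name string
	Type string // e.g. "long", "string", "timestamp"
}

// RecordBatch messages then carry only positional values; names and types are
// never repeated per record, unlike a self-describing JSON envelope.
type RecordBatch struct {
	StreamID string
	Rows     [][]any // values aligned with the positions declared in SchemaMessage
}
```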


---

## Partition Fanout Writer
Contributor:

We need to mention that it uses a rolling writer: it basically closes one file before writing another file for that partition.

Contributor:

Also, this approach works on partition-key hashing (vs. the previous approach where we sorted the records); we are trading slightly higher memory use for faster speed.
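A simplified Go sketch of that idea: records are routed by partition key (a hash-map lookup rather than a sort), and each partition's writer rolls to a new file once a target size is crossed. The names and the threshold are hypothetical:

```go
package destination

const targetFileSize = 512 << 20 // hypothetical roll threshold (~512 MiB)

// FileWriter stands in for one open data file belonging to a single partition.
type FileWriter interface {
	Write(rec []byte) error
	BytesWritten() int64
	Close() error
}

// FanoutWriter keeps one rolling writer per partition key.
type FanoutWriter struct {
	open    map[string]FileWriter
	newFile func(partition string) FileWriter // opens the next file for a partition
}

// Write routes a record to its partition's writer; when the current file
// reaches the target size it is closed before a new one is opened, which is
// what keeps output file sizes consistent without pre-sorting the batch.
func (f *FanoutWriter) Write(partitionKey string, rec []byte) error {
	w, ok := f.open[partitionKey]
	if !ok {
		w = f.newFile(partitionKey)
		f.open[partitionKey] = w
	}
	if err := w.Write(rec); err != nil {
		return err
	}
	if w.BytesWritten() >= targetFileSize {
		if err := w.Close(); err != nil {
			return err
		}
		delete(f.open, partitionKey) // the next record for this partition opens a fresh file
	}
	return nil
}
```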

1. **Batch size optimization** reduces the frequency of expensive operations (RPC calls, file writes)
2. **Typed serialization** makes each operation faster and uses less memory
3. **Thread-scoped schema** eliminates contention that would otherwise limit concurrency
4. **Native Iceberg I/O** leverages the most efficient data structures and algorithms
Contributor:

not able to understand

The performance improvements compound because they address different bottlenecks in the pipeline:

1. **Batch size optimization** reduces the frequency of expensive operations (RPC calls, file writes)
2. **Typed serialization** makes each operation faster and uses less memory
Contributor:

Suggested change
2. **Typed serialization** makes each operation faster and uses less memory
2. **Typed serialization using protobuf** makes each operation faster and uses less memory


- **Improves query performance**: Larger, consistently-sized files enable better query planning and execution
- **Optimizes storage efficiency**: Reduces the number of small files that can hurt storage performance
- **Enables better compaction**: Consistent file sizes make compaction strategies more predictable and efficient
Contributor:

No need for compaction after full-load


The destination refactor represents a fundamental shift in how we approach data pipeline architecture. By carefully separating concerns between Go and Java components and eliminating unnecessary complexity, we've achieved both significant performance improvements and stronger correctness guarantees.

### Key Achievements
Contributor:

Merge this and the Benefits section; currently it feels like too much duplication.


### Immediate Priorities (Next 3-6 months)

- **Iceberg merge-on-read optimizations**: Implementing intelligent file pruning and predicate pushdown to reduce query latency
Contributor:

Is this how we would implement it, by moving from equality deletes to positional deletes?


- **Iceberg merge-on-read optimizations**: Implementing intelligent file pruning and predicate pushdown to reduce query latency
- **Real-time metrics and observability**: Building comprehensive monitoring dashboards for throughput, latency, and error rates
- **DLQ/SMT hooks**: Leveraging the typed contract and schema awareness to add dead-lettering and smart transforms without reintroducing JSON envelopes
Contributor:

Arrow-based writes?


### Medium-term Goals (6-12 months)

- **Multi-region replication**: Extending the atomic commit model to support cross-region data replication
Contributor:

Any supporting references/links showing that this is even needed/possible?

- **Multi-region replication**: Extending the atomic commit model to support cross-region data replication
- **Advanced compression**: Implementing columnar compression optimizations for better storage efficiency
- **Query acceleration**: Pre-computed aggregations and materialized views for common query patterns
- **Kubernetes-native deployment**: Operator-based deployment and management for cloud-native environments
Contributor:

We already have this as a Helm chart, right?

Contributor:

Instead, we can mention auto-scaling for the managed offering.

- **Query acceleration**: Pre-computed aggregations and materialized views for common query patterns
- **Kubernetes-native deployment**: Operator-based deployment and management for cloud-native environments

### Long-term Vision (12+ months)
Contributor:

Let's remove this; it is fully AI-generated.
