@@ -26,6 +26,7 @@ This document explains how the Iceberg native scan optimization ensures that **e
## The Problem: Broadcasting Waste

### Old Approach (Before Optimization)
+
In a traditional distributed query execution:

1. **Driver serializes ALL partition tasks** into a protobuf message
@@ -35,12 +36,14 @@ In a traditional distributed query execution:
5. **Result: 99% waste for large N**

### Example
+
- Table with **1000 partitions**
- Each partition has **100KB of task data** (file paths, partition values, schemas, etc.)
- Total task data: **100MB**
- **Problem**: EVERY executor receives all 100MB, but only uses ~100KB

For a cluster with 100 executors:
+
- **Total network transfer**: 100 executors × 100MB = **10GB**
- **Useful data**: 100 executors × 100KB = **10MB**
- **Waste**: 99% of transferred data is discarded!
@@ -82,6 +85,7 @@ scan.wrapped.inputRDD match {
```

**What happens here:**
+
1. During query planning on the **driver**, the code iterates through each Spark partition
2. For each partition `i`, it extracts **only the FileScanTasks that belong to that partition**
3. These tasks are serialized to protobuf bytes: `IcebergFilePartition` → `Array[Byte]`
@@ -120,6 +124,7 @@ class IcebergScanRDD(
```

**What happens here:**
+
1. **Custom Partition class**: `IcebergScanPartition` carries its own `taskBytes: Array[Byte]`
2. **getPartitions()**: Creates N partition objects, each with only its own task data
3. **Spark's RDD serialization**: When Spark schedules tasks, it serializes the `Partition` object and sends it to the executor
@@ -130,6 +135,7 @@ class IcebergScanRDD(
#### Why This Works: Spark's Task Serialization

Spark's task scheduling works as follows:
+
1. **Driver** calls `getPartitions()` → creates array of Partition objects
2. **Scheduler** assigns tasks to executors: "Executor A: compute partition 5", "Executor B: compute partition 8", etc.
3. **Task serialization**: When sending the task to an executor, Spark serializes:
@@ -176,6 +182,7 @@ if (useJniTaskRetrieval) {
**What happens here:**

#### On the Executor (JVM side):
+
1. **Receive**: Executor receives `IcebergScanPartition(5, taskBytes)` from Spark
2. **Thread-local storage**: Task bytes stored in `ThreadLocal[Array[Byte]]` via `Native.setIcebergPartitionTasks(taskBytes)`
3. **Create iterator**: Native execution plan is initialized
@@ -203,6 +210,7 @@ object Native {
```

**Why Thread-local?**
+
- Multiple tasks may run concurrently on the same executor JVM
- Each task runs in its own thread
- Thread-local storage ensures each task only accesses **its own partition data**
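
The isolation property described above can be illustrated with a minimal sketch. It is written in Rust rather than on the JVM side where the real `ThreadLocal[Array[Byte]]` lives, so it is only an analogy: `PARTITION_TASKS`, `set_partition_tasks`, `get_partition_tasks`, and `run_tasks` are hypothetical names and not part of Comet.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Hypothetical slot: every thread sees its own independent copy.
    static PARTITION_TASKS: RefCell<Option<Vec<u8>>> = RefCell::new(None);
}

/// Store the current thread's task bytes (stand-in for a "set" call).
fn set_partition_tasks(bytes: Vec<u8>) {
    PARTITION_TASKS.with(|slot| *slot.borrow_mut() = Some(bytes));
}

/// Read back the current thread's task bytes, if any were stored.
fn get_partition_tasks() -> Option<Vec<u8>> {
    PARTITION_TASKS.with(|slot| slot.borrow().clone())
}

/// Run one simulated task per input on its own thread and
/// return what each thread read back from its own slot.
fn run_tasks(inputs: Vec<Vec<u8>>) -> Vec<Option<Vec<u8>>> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|bytes| {
            thread::spawn(move || {
                set_partition_tasks(bytes);
                get_partition_tasks()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let seen = run_tasks(vec![vec![5, 5, 5], vec![8, 8]]);
    println!("{:?}", seen);
    // The main thread never stored anything, so its slot is still empty.
    println!("{:?}", get_partition_tasks());
}
```

Each spawned thread reads back only the bytes it stored itself, which is the same guarantee the JVM-side thread-local is relied on to provide for concurrent tasks.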
@@ -234,6 +242,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_getIcebergPartitionTa
```

**What happens here:**
+
1. Native Iceberg planner calls `getIcebergPartitionTasks()` via JNI
2. This calls back to `Native.getIcebergPartitionTasksInternal()` on JVM side
3. Retrieves the `Array[Byte]` from thread-local storage
@@ -277,12 +286,14 @@ override def convertBlock(): CometNativeExec = {
```

The `serializedPlanOpt` contains the **operator DAG structure**:
+
- Scan → Filter → Project, etc.
- Schema definitions
- Filter predicates
- Projection columns

But it does **NOT** contain partition-specific FileScanTasks because:
+
1. It's created **once** on the driver
2. It's **shared** by all executors
3. It's the same for partition 0, partition 5, partition 1000, etc.
@@ -360,9 +371,9 @@ You might ask: **"Why not include partition-specific data in the protobuf?"**

If we embedded partition-specific data in protobuf, we'd need:

-| Approach | Implications |
-|----------|--------------|
-| **Current: JNI Callback** | ✓ One shared plan protobuf<br>✓ Leverages existing Comet architecture<br>✓ Partition data via RDD (our optimization)<br>⚠ Extra JNI roundtrip (minimal overhead) |
+| Approach | Implications |
+| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Current: JNI Callback** | ✓ One shared plan protobuf<br>✓ Leverages existing Comet architecture<br>✓ Partition data via RDD (our optimization)<br>⚠ Extra JNI roundtrip (minimal overhead) |
| **Alternative: Embed in Protobuf** | ✗ Would need N different protobuf plans (one per partition)<br>✗ Each executor receives different protobuf<br>✗ Breaks Comet's shared plan model<br>✗ Major architectural restructuring required |

### The JNI Callback as a Bridge
@@ -414,6 +425,7 @@ The JNI callback overhead is **minimal** compared to the optimization benefits:
- **Memory savings**: 100-200× reduction in executor memory (ongoing)

For a table with 10,000 partitions:
+
- JNI overhead: 10,000 partitions × 10μs = **0.1 seconds total**
- Network savings: 100GB → 500MB = **99.5 GB saved**
- Memory savings: 100GB → 500MB executor memory = **99.5 GB saved**
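
The JNI overhead figure can be checked with a short calculation. This is a sketch that takes the document's 10μs per-call cost as given; `total_jni_overhead` is a hypothetical helper, not a Comet function.

```rust
use std::time::Duration;

/// Total callback cost if every partition pays one fixed-cost JNI round trip.
fn total_jni_overhead(partitions: u32, per_call: Duration) -> Duration {
    per_call * partitions
}

fn main() {
    // 10,000 partitions at an assumed 10 microseconds per callback.
    let total = total_jni_overhead(10_000, Duration::from_micros(10));
    println!("total JNI overhead: {:?}", total);
}
```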
@@ -499,14 +511,17 @@ Without the JNI callback, we would have to fundamentally restructure how Comet s
### Network Transfer Savings

**Before optimization:**
+
- Total data per executor = N × avg_task_size
- Total cluster network = num_executors × N × avg_task_size

**After optimization:**
+
- Total data per executor = avg_tasks_per_executor × avg_task_size
- Total cluster network = num_executors × avg_tasks_per_executor × avg_task_size

**Savings ratio:**
+
```
savings = 1 - (avg_tasks_per_executor / N)
```
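
As a minimal sketch, the savings ratio can be written as a small function. The names `savings_ratio` and `even_tasks_per_executor` are illustrative, and the second helper assumes tasks are spread evenly across executors.

```rust
/// Fraction of task data an executor no longer receives:
/// 1 - avg_tasks_per_executor / N.
fn savings_ratio(total_partitions: u64, avg_tasks_per_executor: f64) -> f64 {
    1.0 - avg_tasks_per_executor / total_partitions as f64
}

/// Even-distribution assumption: each executor handles about
/// N / num_executors tasks.
fn even_tasks_per_executor(total_partitions: u64, num_executors: u64) -> f64 {
    total_partitions as f64 / num_executors as f64
}

fn main() {
    let n = 10_000;
    let avg = even_tasks_per_executor(n, 200);
    println!("avg tasks/executor = {}, savings = {:.3}", avg, savings_ratio(n, avg));
}
```

When tasks are skewed toward a few executors, `avg_tasks_per_executor` should be measured rather than derived from the even-split helper.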
@@ -516,18 +531,21 @@ For evenly distributed data: `avg_tasks_per_executor ≈ N / num_executors`
### Example: Large Table Scan

**Scenario:**
+
- Table with 10,000 partitions
- 200 executors
- 50KB average task data per partition
- Total task metadata: 10,000 × 50KB = **500MB**

**Before optimization:**
+
- Each executor receives: **500MB** (all partition data)
- Total network transfer: 200 × 500MB = **100GB**
- Each executor uses: ~50 partitions × 50KB = **2.5MB** (0.5%)
- Wasted transfer: **99.5%**

**After optimization:**
+
- Each executor receives: ~50 × 50KB = **2.5MB** (only its partitions)
- Total network transfer: 200 × 2.5MB = **500MB**
- Each executor uses: **2.5MB** (100%)
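
The scenario's totals can be reproduced with a short sketch. It uses decimal units (1 MB = 1,000 KB) and assumes partitions divide evenly across executors; the `Scenario` type and its methods are hypothetical.

```rust
/// Scenario parameters, with task size in decimal kilobytes.
struct Scenario {
    partitions: u64,
    executors: u64,
    task_kb: u64,
}

impl Scenario {
    /// Before: every executor receives the task data for all partitions.
    fn cluster_transfer_before_kb(&self) -> u64 {
        self.executors * self.partitions * self.task_kb
    }

    /// After: each executor receives only its share (assumes an even split).
    fn cluster_transfer_after_kb(&self) -> u64 {
        let tasks_per_executor = self.partitions / self.executors;
        self.executors * tasks_per_executor * self.task_kb
    }
}

fn main() {
    let s = Scenario { partitions: 10_000, executors: 200, task_kb: 50 };
    println!("before: {} GB", s.cluster_transfer_before_kb() / 1_000_000);
    println!("after:  {} MB", s.cluster_transfer_after_kb() / 1_000);
}
```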
@@ -536,11 +554,13 @@ For evenly distributed data: `avg_tasks_per_executor ≈ N / num_executors`
### Memory Pressure Reduction

**Before:**
+
- Driver memory: 500MB (serialize all tasks)
- Executor memory: 500MB × 200 = **100GB** across cluster
- GC pressure: High (500MB objects per executor)

**After:**
+
- Driver memory: 500MB (same, but partitioned)
- Executor memory: 2.5MB × 200 = **500MB** across cluster
- GC pressure: Low (2.5MB objects per executor)
@@ -559,6 +579,7 @@ Spark's broadcast variables would still send all data to all executors. The opti
**Problem**: Need to pass partition-specific data from JVM to native code during execution.

**Options considered:**
+
1. **Pass as function parameter**: Would require modifying the entire call chain
2. **Global state**: Unsafe with concurrent tasks
3. **Thread-local**: ✓ Safe, simple, minimal API changes
@@ -572,12 +593,14 @@ Spark's broadcast variables would still send all data to all executors. The opti
### 4. What About Protobuf Deduplication?

The code still uses deduplication pools (CometIcebergNativeScan.scala:696-705) to reduce redundancy **within each partition's task data**:
+
- Schema pool
- Partition spec pool
- Delete files pool
- etc.

This is **orthogonal** to the partition distribution optimization. Both work together:
+
- **Deduplication**: Reduces task data size within each partition
- **Partition-specific distribution**: Ensures executors only receive their partition data

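
To make the idea of a deduplication pool concrete, here is a generic interning sketch. It does not reproduce the pools in `CometIcebergNativeScan.scala`, whose layout is not shown here; `Pool`, `intern`, and `dedup_schemas` are hypothetical names, and schemas are modelled as plain strings.

```rust
use std::collections::HashMap;

/// Hypothetical pool: stores each distinct value once and hands out indices.
#[derive(Default)]
struct Pool {
    values: Vec<String>,
    index: HashMap<String, usize>,
}

impl Pool {
    /// Return the index of `value`, inserting it only if it is not pooled yet.
    fn intern(&mut self, value: &str) -> usize {
        if let Some(&i) = self.index.get(value) {
            return i;
        }
        let i = self.values.len();
        self.values.push(value.to_string());
        self.index.insert(value.to_string(), i);
        i
    }
}

/// Replace each task's schema string with an index into a shared pool.
fn dedup_schemas(task_schemas: &[&str]) -> (Pool, Vec<usize>) {
    let mut pool = Pool::default();
    let refs: Vec<usize> = task_schemas.iter().map(|s| pool.intern(s)).collect();
    (pool, refs)
}

fn main() {
    let (pool, refs) = dedup_schemas(&["schema_v1", "schema_v1", "schema_v2", "schema_v1"]);
    println!("pooled {} schemas for {} tasks: {:?}", pool.values.len(), refs.len(), refs);
}
```

Tasks that share a schema carry only a small index, so the repeated value is serialized once per partition rather than once per task.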
@@ -586,6 +609,7 @@ This is **orthogonal** to the partition distribution optimization. Both work tog
## Code Flow Summary

### Query Planning (Driver)
+
1. `CometScanRule` → creates `CometBatchScanExec` with Iceberg metadata
2. `CometIcebergNativeScan.convert()` → serializes plan to protobuf
   - Extracts FileScanTasks per partition
@@ -600,6 +624,7 @@ This is **orthogonal** to the partition distribution optimization. Both work tog
   - Passes `partitionTasks` map to RDD constructor

### Task Execution (Executors)
+
1. Spark schedules task for partition `i` on executor
2. Spark serializes and sends `IcebergScanPartition(i, taskBytes_i)` to executor
3. `IcebergScanRDD.compute()` called with partition object
@@ -620,6 +645,7 @@ This is **orthogonal** to the partition distribution optimization. Both work tog
To verify the optimization is working:

1. **Check logs for partition data distribution:**
+
   ```
   INFO CometIcebergNativeScan: Cached N partitions (avg X bytes/partition)
   ```