Core: Support incremental compute for partition stats #12629


Merged: 3 commits into apache:main, May 14, 2025

Conversation

ajantha-bhat
Member

@ajantha-bhat ajantha-bhat commented Mar 24, 2025

If a previous stats file exists, there is no need to compute the stats from scratch.

Identify the latest snapshot for which a partition stats file exists. Read the previous stats, incrementally compute stats for the new snapshots, merge them, and write the result to a new file.

PartitionStatsHandler.computeAndWriteStats() -- incrementally computes stats if previous stats are available; otherwise performs a full compute.
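The read-merge-write step above can be sketched with plain Java collections. This is a hypothetical simplification: the real implementation uses Iceberg's PartitionMap and PartitionStats types, while here a partition key maps to a {recordCount, dataFileCount} pair.

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalStatsSketch {
    // Hypothetical stand-in for PartitionStats: partition key -> {recordCount, dataFileCount}.
    // Folds the incrementally computed delta into the previous snapshot's stats.
    static Map<String, long[]> merge(Map<String, long[]> previous, Map<String, long[]> delta) {
        Map<String, long[]> merged = new HashMap<>();
        previous.forEach((k, v) -> merged.put(k, v.clone()));
        delta.forEach((partition, counts) ->
            merged.merge(partition, counts.clone(), (a, b) -> {
                a[0] += b[0]; // record count
                a[1] += b[1]; // data file count
                return a;
            }));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, long[]> previous = new HashMap<>();
        previous.put("p=1", new long[] {100L, 2L}); // stats as of the last stats snapshot
        Map<String, long[]> delta = new HashMap<>();
        delta.put("p=1", new long[] {50L, 1L});     // computed only from the new snapshots
        delta.put("p=2", new long[] {10L, 1L});
        Map<String, long[]> merged = merge(previous, delta);
        System.out.println(merged.get("p=1")[0]); // 150
        System.out.println(merged.get("p=2")[1]); // 1
    }
}
```

The point of the sketch is that only the delta requires reading manifests; the previous stats come from the old stats file.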

@ajantha-bhat
Member Author

Some engines want to write the partition stats synchronously (similar to how Trino synchronously writes the Puffin files during insert). Reading all the manifests in a table to compute partition stats can be avoided if we compute the stats incrementally and merge them with the previous stats.

@aokolnychyi, @pvary, @deniskuzZ, @gaborkaszab : Let me know what you guys think.

PartitionMap<PartitionStats> statsMap = PartitionMap.create(table.specs());
// read previous stats
try (CloseableIterable<PartitionStats> oldStats =
readPartitionStatsFile(statsFileSchema, Files.localInput(statisticsFile.path()))) {
Member Author

Since the new unified tuple is used for reading the old stats file, schema evolution is handled automatically.

@ajantha-bhat ajantha-bhat force-pushed the incremental branch 3 times, most recently from 9b9f5ad to 4f973a3 Compare March 24, 2025 16:10
@ajantha-bhat ajantha-bhat added this to the Iceberg 1.9.0 milestone Mar 27, 2025
@ajantha-bhat
Copy link
Member Author

I have added the 1.9.0 milestone for this PR as it is a small change (excluding refactoring), and we still have some time for 1.9.0 due to open issues in the milestone.

manifestFilePredicate =
manifestFile ->
snapshotIdsRange.contains(manifestFile.snapshotId())
&& !manifestFile.hasExistingFiles();
Contributor

Don't we want this as a default predicate?

manifestFile -> !manifestFile.hasExistingFiles()

Member

@deniskuzZ deniskuzZ Mar 27, 2025

we could add it as default filter:

if (fromSnapshot != null) {
  manifestFilePredicate =
      manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId());
}
List<ManifestFile> manifests =
  currentSnapshot.allManifests(table.io()).stream()
      .filter(manifestFilePredicate)
      .filter(manifestFile -> !manifestFile.hasExistingFiles())
      .collect(Collectors.toList());
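The predicate composition suggested above can be demonstrated with a runnable sketch. ManifestStub is a hypothetical stand-in for Iceberg's ManifestFile; note the PR ultimately moved the existing-entries check down to the entry level, so this only illustrates the filter-composition idea from the review.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ManifestFilterSketch {
    // Hypothetical stand-in for Iceberg's ManifestFile
    record ManifestStub(long snapshotId, boolean hasExistingFiles) {}

    // The existing-files filter is always applied; the snapshot-id-range filter
    // is composed in only when an incremental range is given (fromSnapshot != null).
    static List<ManifestStub> liveManifests(
            List<ManifestStub> manifests, Set<Long> snapshotIdsRange) {
        Predicate<ManifestStub> predicate = m -> !m.hasExistingFiles();
        if (snapshotIdsRange != null) {
            predicate = predicate.and(m -> snapshotIdsRange.contains(m.snapshotId()));
        }
        return manifests.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ManifestStub> all = Arrays.asList(
            new ManifestStub(1L, false),
            new ManifestStub(2L, true),   // carries EXISTING entries -> skipped
            new ManifestStub(3L, false));
        System.out.println(liveManifests(all, Set.of(2L, 3L)).size()); // incremental: 1
        System.out.println(liveManifests(all, null).size());           // full compute: 2
    }
}
```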

Member Author

Good point.

While computing incrementally, I observed that it can produce duplicate counts, so I added this filter.
I still have some gaps; I need to fully understand when a manifest entry is marked as existing.
Is there any scenario where "existing" entries must be considered, or is "added" enough?

There is another check further down that considers both added and existing (added long ago).

I will update the code to keep just the added-entry filter and also add a test case for rewrite data files to ensure stats are the same after the rewrite.

Member Author

Also, it looks like a ManifestFile can contain both added and existing entries. So instead of filtering here, I will keep the filtering at the entry level further down, in collectStatsForManifest.

Member

What if we have compaction and expired snapshots? Wouldn't the new manifests have EXISTING entries?

Contributor

@pvary pvary Mar 27, 2025

What do we do with the stats of the removed files?

Let's say:

  • S1 adds data
  • Execute the stats collection
  • S2 adds more data
  • S3 compacts data from S1, and S2 - This removes files created by S1, and S2 and creates new files
  • Execute incremental/normal stats collection

What happens with the stats in this case?

Member

@deniskuzZ deniskuzZ Mar 27, 2025

Compaction doesn't remove the data. If we expire S1 and S2, we don't have the previous snapshots/stats and start fresh (i.e., full compute).

Contributor

If we don't expire data, could we detect that S3 is only a compaction commit, and the stats don't need to be changed?

What if S3 instead is a MoW commit? Can we detect the changes and calculate stats incrementally?

Member Author

  1. Compaction will have the snapshot operation REPLACE, and we can reuse the old stats for that scenario. But we need to write a new stats file with the same data to handle clean GC of snapshot files.

Compaction will be tested end to end while adding the Spark procedure.

  2. About the live (existing + added) entries:

For a full compute, old manifest files will be marked as deleted and their entries reused as existing in the new manifest files, possibly alongside additional added entries. So a full compute needs to consider both existing and added entries.

For an incremental compute, the old stats file has some entries which are now existing, so the existing entries should be considered.

This all leads to the next question: what happens when a manifest is deleted? In that case we just update the snapshot entry (last modified) and do not decrement the stats. Hence, we should skip it for incremental compute as well.

All this logic is present in collectStatsForManifest, and the existing test cases (full compute and incremental) cover it, as they use mergeAppend, which produces manifests with a mix of added and existing entries.
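The entry-status handling described above can be sketched as a small fold. The types and counts layout are hypothetical simplifications (the real logic lives in collectStatsForManifest): live entries (ADDED/EXISTING) contribute to the counts, while DELETED entries only refresh the last-modified snapshot and never decrement.

```java
import java.util.HashMap;
import java.util.Map;

public class EntryFoldSketch {
    // Hypothetical stand-in for ManifestEntry status
    enum Status { ADDED, EXISTING, DELETED }

    // counts[0] = record count, counts[1] = last-modified snapshot id
    static void fold(Map<String, long[]> stats, String partition,
                     Status status, long records, long snapshotId) {
        long[] entry = stats.computeIfAbsent(partition, k -> new long[] {0L, -1L});
        if (status == Status.ADDED || status == Status.EXISTING) {
            entry[0] += records; // live entries contribute to the counts
        }
        // DELETED entries only refresh the last-modified snapshot; no decrement
        entry[1] = Math.max(entry[1], snapshotId);
    }

    public static void main(String[] args) {
        Map<String, long[]> stats = new HashMap<>();
        fold(stats, "p=1", Status.ADDED, 10L, 1L);
        fold(stats, "p=1", Status.EXISTING, 5L, 1L);
        fold(stats, "p=1", Status.DELETED, 5L, 2L);
        System.out.println(stats.get("p=1")[0]); // 15
        System.out.println(stats.get("p=1")[1]); // 2
    }
}
```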

Member Author

We didn't need to decrement stats for a full compute because we were discarding the deleted manifests and considering only live manifests.

Now, I am not really sure the current code will work for compaction. We may need to decrement stats just for incremental compute. I will test the compaction scenario tomorrow and handle this.

Member

@deniskuzZ deniskuzZ left a comment

LGTM +1

Comment on lines 193 to 194
PartitionStatisticsFile statisticsFile = latestStatsFile(table, snapshot.snapshotId());
if (statisticsFile == null) {
  LOG.info("Previous stats not found. Computing the stats for whole table.");
  return PartitionStatsUtil.computeStats(table, null, snapshot);
}
Contributor

Could this throw an error instead?

Member

@deniskuzZ deniskuzZ Mar 27, 2025

Why? That is to handle the case when no stats files existed before and we need to execute a full computation.
We enter here when computing stats for the first time.

Contributor

If I understand correctly, the user requested an incremental stats compute, but with wrong parameters. In this case we could either "correct" the mistake or throw an error.

The question is how frequent the problem is, and how easy it is to detect from the user side.

Member

What do you mean by wrong parameters? A non-existing snapshotId?

Contributor

What do you mean by wrong parameters? A non-existing snapshotId?

Exactly

Member Author

Throwing an error now, and added a test case.

Member

I don't agree with that design, see #12629 (comment)

@ajantha-bhat
Member Author

ajantha-bhat commented Mar 27, 2025

@deniskuzZ, @pvary: Thanks guys for the review. I have addressed all the comments. You can take a fresh look again tomorrow :D (after some break :D)

Table table, Snapshot snapshot, StructType partitionType) throws IOException {
PartitionStatisticsFile statisticsFile = latestStatsFile(table, snapshot.snapshotId());
if (statisticsFile == null) {
throw new RuntimeException(
Member

@deniskuzZ deniskuzZ Mar 27, 2025

I don't think it's user-friendly, and the recompute flag loses its purpose (you can call computeAndWriteStats directly).
Now every client needs to implement either the same previous-stats-file check or a try-catch:

try {
  computeAndWriteStatsFileIncremental();
} catch (RuntimeException e) {
  if (e.getMessage().equals("bla-bla")) {
    computeAndWriteStats();
  }
}

I would expect computeAndWriteStatsFileIncremental to do what's needed instead of throwing a "Previous stats not found" exception.

A non-existent snapshotId is a different situation. We should validate whether snapshot == null and throw "Snapshot doesn't exist".

Member Author

@ajantha-bhat ajantha-bhat Mar 27, 2025

recompute flag loses its purpose

There is no recompute flag exposed to the user. The private method (incrementalComputeAndMerge) which throws this exception also always computes incrementally.

I would expect computeAndWriteStatsFileIncremental to do what's needed instead of throwing a "Previous stats not found" exception.

computeAndWriteStatsFileIncremental says incremental compute. Forcefully recomputing on error is not a good idea, as the method's responsibility is just to attempt an incremental compute.

Maybe I can expose another method called computeAndWriteStatsWithFallback(), which internally calls it?

public void computeAndWriteStatsIncrementalWithFallback() {
   try {
       computeAndWriteStatsFileIncremental();
   } catch (RuntimeException e) {
       if ("bla-bla".equals(e.getMessage())) {
           computeAndWriteStats(); // Fallback in case of a specific error
       } else {
           throw e; // Re-throw unexpected errors
       }
   }
}

Member

@deniskuzZ deniskuzZ Mar 27, 2025

I liked how you did it initially. Please disregard the recompute flag comment; it has nothing to do with the incremental workflow.

Think about what changes are needed on the client side. I was planning to just replace the existing call with the incremental one unless it's ANALYZE TABLE (force recompute).

What are the use-cases where we would benefit from the "previous stats file missing" exception?

Member Author

Let's see what @pvary thinks.

Collaborator

I'm very late to this conversation, sorry about that :) I think we should talk a bit about how a user would use these APIs to compute stats and then we might be able to sort this disagreement out too.

I see the other PR to introduce a compute_partition_stats stored proc for the "full-compute" path. I'd assume there would be another proc, incremental_compute_partition_stats or similar, that will execute the "incremental-compute" path. If my assumption is correct, I think the question is why a user would decide to call one instead of the other. The expectation here is that the "full-compute" path is more expensive than the "incremental-compute" path. So if the user's motivation is to run the cheaper operation, then falling back to the full compute could be misleading.
Or from a different angle: if the incremental path is expected to try the cheap computation first and then fall back to the more expensive one, then what would be the point of having the compute_partition_stats procedure execute "full-compute"? Why would one call it? So in general I'm in favour of throwing an exception if the incremental computation is not feasible.

Well, if the plan is to have that single Spark procedure for both approaches, then everything I wrote above is irrelevant :)

Member

@deniskuzZ deniskuzZ Apr 4, 2025

My take on this is to always use incremental unless you explicitly need to recompute.
Incremental should be smart enough to decide whether it needs to start from scratch (first stats compute) or reuse the previous stats, compute the diff, and merge.
What is the benefit of "Previous stats not found for incremental compute. Try full compute"? It just moves the need for exception handling to the client.

Collaborator

I think there are different approaches that could work well. I understand that the most convenient would be to offer a single computeStats function that can decide between incremental and full computation. On the other hand, that could hide some details from the users. I had the impression that Iceberg APIs are designed to have clear boundaries and not mix functionalities like incremental and full stat computation.
I believe that to come to a conclusion we might want to raise this question on dev@ for wider visibility. People are probably busy with the upcoming Iceberg Summit, but we could still get good insights. @ajantha-bhat WDYT?

Member

@gaborkaszab, please check the code closely; incremental doesn't do a full computation unless there were no prior stats.
Maybe I am wrong, but creating a dev mail thread to discuss every minor change seems counterproductive to me, when we already have people interested in this change here.
Maybe it's worth dropping computeAndWrite, since the recompute flag introduces confusion.

Member Author

Maybe I am wrong, but creating a dev mail thread to discuss every minor change seems counterproductive to me, when we already have people interested in this change here.

Strongly agree.

@ajantha-bhat ajantha-bhat force-pushed the incremental branch 2 times, most recently from feb41f4 to e3bcb3f Compare May 9, 2025 04:49
@ajantha-bhat
Member Author

@pvary, @gaborkaszab, @deniskuzZ, @nastra: PR is ready for review.

@pvary
Contributor

pvary commented May 9, 2025

Discussed this with @ajantha-bhat offline:

  • We think that the need to recompute the stats is very, very rare. Ajantha mentioned that if the stats file was accidentally deleted, it could lead to corruption when stats removal is needed.
  • He also mentioned that using the Java API, the user can already remove the stats:
    table.refresh();
    UpdatePartitionStatistics update = table.updatePartitionStatistics();
    table.snapshots().forEach(s -> update.removePartitionStatistics(s.snapshotId()));
    update.commit();
  • Based on this, if there are no immediate needs from the compute engines, we could just omit the computeAndWriteStatsFullRefresh method.
  • We can always add the new method to the API if the need arises.

What do you think @deniskuzZ, @gaborkaszab?

@gaborkaszab
Collaborator

What do you think

Not introducing a specific API for the "force full computation" path makes sense to me.

Thanks for the heads-up on the "drop stats" topic. I think it's fine. We might want to introduce a parameterless UpdatePartitionStatistics.removePartitionStatistics() later on, but we'll see.

Thanks for your work @ajantha-bhat !

@deniskuzZ
Member

deniskuzZ commented May 9, 2025

Hive provides QL to force a stats recompute. The use-cases may be limited and mainly related to recovery, but still, I think it won't harm to have an API for that.
We can work around it with removePartitionStatistics and then compute, but that would create 2 snapshots; I'm not sure that is a good approach.
I am envisioning the following interfaces:

PartitionStatsHandler.computeAndWriteStats() -- incremental
PartitionStatsHandler.computeAndWriteStats(boolean forceRefresh)

@pvary
Contributor

pvary commented May 9, 2025

Do we know someone from Trino / Spark who can chime in with their requirements?

@ajantha-bhat
Member Author

@deniskuzZ:

We can work around it with removePartitionStatistics and then compute, but that would create 2 snapshots, not sure if that is a good approach.

Iceberg doesn't create 2 snapshots for this. Updating table metadata with the partition stats is a PendingUpdate, not a SnapshotUpdate, so it just creates a new table metadata file, not a new snapshot. It is a lightweight operation.

@pvary: Thanks for discussing this in depth. I agree that for now one interface is enough:
PartitionStatsHandler.computeAndWriteStats() -- incrementally compute stats if previous stats are available; if not, do a full compute.

In the future we can add UpdatePartitionStatistics.removePartitionStatistics() as @gaborkaszab suggested, if users want to bulk remove stats. If stats are corrupted, we already have a way to recompute by unregistering the existing stats. So I am fine keeping one interface for now.

Do we know someone from Trino / Spark who can chime in with their requirements?

I didn't see much participation directly from these communities. Maybe in the future we can discuss again adding a new interface to force a refresh, if it is needed.

@ajantha-bhat
Member Author

I have updated the PR with just one interface (as a new commit)

@deniskuzZ
Member

deniskuzZ commented May 10, 2025

@ajantha-bhat, I meant two metadata files, though they might still be large.

Could you clarify the concern around keeping the API to trigger full partition stats recompute? Clients have to rely on workarounds, even though Iceberg internally supports this through a private method.

In Hive, we have a concrete need for this functionality. So what's the suggested approach - should we hack it on the client side or bring this capability into the engine repo directly?

PS: what's the Iceberg view on the fact that we are changing the behavior of the existing API (full recompute -> incremental)? From a client's perspective, it might be considered a breaking change.

I just checked the Impala docs (https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html) and they mention support for both options:

  • COMPUTE STATS
  • COMPUTE INCREMENTAL STATS

@ajantha-bhat
Member Author

Could you clarify the concern around keeping the API to trigger full partition stats recompute? Clients have to rely on workarounds, even though Iceberg internally supports this through a private method.

There is no strong concern. We felt it is redundant to have many APIs. Plus, the reason for a full compute again is very rare (maybe only during corruption), and there are ways to achieve a full compute with the single API by clearing the stats.

PS: what's the iceberg view on the fact that we are changing the behavior of the existing API (full recompute -> incremental)? From a client's perspective, it might be considered a breaking change.

The full stats are still available to the user. Only the way they are computed internally has changed; there is no difference in the output for the user.

I just checked the Impala docs (https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html) and they mentions support for both options:

This is a little different: per partition or the whole table (not based on the snapshot).

@deniskuzZ: I have a question for Hive users: if the user calls incremental for the first time (a table without previous stats), do you expect it to throw an error or do a full compute?

@deniskuzZ
Member

deniskuzZ commented May 10, 2025

reason for full compute again is very rare (maybe only during corruption)

It's rare, but it exists, so why not expose a clear API to resolve it instead of suggesting workarounds? In my opinion, we are overthinking here.

This is little different. Per partition or whole table. (Not based on the snapshot)

If the partition spec is not provided, stats are computed for all partitions individually, as we do here.

if the user calls incremental first time (table without previous stats) are you expecting it to throw error or do full compute?

do a full compute.

@pvary
Contributor

pvary commented May 10, 2025

@ebyhr, @findepi: Do I remember correctly that you work on Trino?
We are debating the API required by the engines for partition stats calculation.
We will have an API which calculates the stats for a partition. It will either calculate from scratch or, if a previous snapshot had some outdated stats, reuse the stats for the old snapshot and calculate the current stats incrementally.
Do you think Trino would need/use a different API which fully recalculates the stats?

Thanks,
Peter

@ebyhr
Contributor

ebyhr commented May 12, 2025

I believe @raunaqmorarka is the best person to answer the question.

@pvary
Contributor

pvary commented May 12, 2025

@deniskuzZ: How does Hive INCREMENTAL stats work? I'm not able to find the doc 😢

About the Impala, here is what I have found:

For a particular table, use either COMPUTE STATS or COMPUTE INCREMENTAL STATS, but never combine the two or alternate between them. If you switch from COMPUTE STATS to COMPUTE INCREMENTAL STATS during the lifetime of a table, or vice versa, drop all statistics by running DROP STATS before making the switch.

For me, this means that Impala uses incremental stats in a very different way than we do in Iceberg. Having a full recompute would not help them; they would need a "drop stats".

@ajantha-bhat
Member Author

@raunaqmorarka: Do you have any opinion on whether Trino needs force refresh API for stats?

@pvary and @deniskuzZ: I also checked Spark, and it doesn't have an incremental or force-refresh option. One API should be enough. https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html

I think there is no difference of opinion on the current code (one API). We are just debating whether another force-refresh API is needed. So I think we can go ahead with this PR and add a force-refresh option later if really needed. This PR has been pending for a long time; merging it will enable further work on the Spark action and other engine integrations.

@deniskuzZ
Member

Please don't take my comment as a blocker for merging the PR. However, why not be a bit more flexible and retain the existing force refresh API for recovery, especially since it's already in use by some of the engines?
We're not introducing a brand-new API; it's something that was already in place.

"Using full compute as previous statistics file is not present for incremental compute.");
stats = computeStats(table, snapshot, file -> true, false /* incremental */).values();
} else {
stats = incrementalComputeAndMerge(table, snapshot, partitionType, statisticsFile);
Member

@deniskuzZ deniskuzZ May 13, 2025

maybe computeAndMergeStatsIncremental?

Member Author

updated as computeAndMergeStatsIncremental

*
* @param snapshot the snapshot corresponding to the deleted manifest entry.
*/
public void deletedEntryForIncrementalCompute(ContentFile<?> file, Snapshot snapshot) {
Contributor

Maybe this and friends could be package private

Contributor

The public methods should be deprecated first

Member Author

done

Contributor

@pvary pvary left a comment

Last minor changes

@ajantha-bhat
Member Author

Please don't take my comment as a blocker for merging the PR. However, why not be a bit more flexible and retain the existing force refresh API for recovery, especially since it's already in use by some of the engines?
We're not introducing a brand-new API; it's something that was already in place.

Looks like we (mainly Peter) have decided to go with just one API for now. If there are strong requirements, we will add force refresh in the future.

@ajantha-bhat
Member Author

PR is ready to be merged, I will rebase the spark actions and procedure PR once this is merged.

@pvary pvary merged commit d94b09b into apache:main May 14, 2025
42 checks passed
@pvary
Contributor

pvary commented May 14, 2025

Merged to main.
Thanks @ajantha-bhat for the PR, and @deniskuzZ, @nastra, @gaborkaszab for the reviews!

Here we went for the narrowest possible API. Feel free to revisit the need for the "invalidate"/"recompute" API from time to time, and if multiple engines request the feature, we can add it.
