@rgruener
Contributor

Summary

Currently the average implementation computes sum / count. For large aggregations, the sum can overflow. This change switches the implementation to a running average that does not overflow.
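To illustrate the difference, here is a minimal sketch, not the code in this PR. The names `AverageSketch`, `naiveAverage`, and `runningAverage` are hypothetical. It contrasts accumulating a full sum with keeping only a running mean and a count.

```scala
object AverageSketch {
  // Naive approach: accumulate the full sum, then divide by the count.
  def naiveAverage(values: Seq[Double]): Double =
    if (values.isEmpty) Double.NaN else values.sum / values.size

  // Running approach: keep only the current mean and a Long count.
  // Each step nudges the mean toward the new value by 1 / count.
  def runningAverage(values: Seq[Double]): Double = {
    var mean = 0.0
    var count = 0L
    values.foreach { x =>
      count += 1
      mean += (x - mean) / count
    }
    if (count == 0L) Double.NaN else mean
  }
}
```

For inputs such as four copies of `Double.MaxValue / 2`, the naive sum exceeds the Double range while the running mean stays finite.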

Why / Goal

We use large (global) aggregations when doing tensor computations. This change ensures that average works even with these larger aggregations.

Test Plan

  • [] Added Unit Tests
  • Covered by existing CI
  • Integration tested

Reviewers

@rgruener force-pushed the running-average branch 2 times, most recently from fdbde1e to 89e4735 on November 12, 2025 16:04
  StructType(
    "AvgIr",
-   Array(StructField("sum", DoubleType), StructField("count", IntType))
+   Array(StructField("running_average", DoubleType), StructField("count", IntType))
Collaborator

unfortunately, this might break pipelines that are already in prod.

best way to deal with this is to add a new aggregation and leave this as is.

Contributor Author

The concern is that changing the implementation would introduce skew?

As a minimal fix, changing count to a Long would be helpful
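A small sketch of why the count type matters, using hypothetical helper names (`CountOverflowSketch`, `nextIntCount`, `nextLongCount`, `average`). Int arithmetic on the JVM wraps silently, so a count that passes `Int.MaxValue` turns negative and flips the sign of sum / count.

```scala
object CountOverflowSketch {
  // Int arithmetic wraps on overflow instead of failing.
  def nextIntCount(count: Int): Int = count + 1

  // A Long count has room for far more records before wrapping.
  def nextLongCount(count: Long): Long = count + 1L

  // Plain sum / count, with the count widened to Long.
  def average(sum: Double, count: Long): Double = sum / count
}
```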

Contributor Author

Though I do think adding another implementation is warranted to unblock certain use cases

Collaborator

there is Avro-encoded data sitting in kvStore in the old format. the new aggregation logic will probably fail to parse it.
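A rough sketch of that failure mode, with no real Avro involved. A decoded record is modeled here as a plain `Map` keyed by field name, and `IrSchemaSketch` and `readNewIr` are hypothetical names. A reader that expects the new `running_average` field finds nothing to read in a record written with the old `sum` field.

```scala
object IrSchemaSketch {
  // Stand-in for a decoded record; real code would go through Avro.
  type Record = Map[String, Any]

  // Reads the new IR layout (running_average, count) or reports what is missing.
  def readNewIr(record: Record): Either[String, (Double, Long)] =
    for {
      avg <- record.get("running_average")
        .collect { case d: Double => d }
        .toRight("missing field: running_average")
      count <- record.get("count")
        .collect {
          case i: Int  => i.toLong
          case l: Long => l
        }
        .toRight("missing field: count")
    } yield (avg, count)
}
```

A record stored in the old layout, such as `Map("sum" -> 10.0, "count" -> 2)`, comes back as a `Left` rather than a usable IR.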

Collaborator

echo Nikhil's comment.

Contributor Author

I see. After a bit of further investigation, I believe I will add this under RunningAverage, since I believe we will hit overflow issues (especially with count being an Int).

override def isDeletable: Boolean = true
}

class RunningAverage extends SimpleAggregator[Double, Array[Any], Double] {
Contributor Author

Instead of a new operator, would it make sense to add an argument (prevent_overflow or running_average) to the Average operator that defaults to false?

Collaborator

yeah i like that! defaults to None that gets interpreted as false - to keep the semantic hashes as they were
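One way to picture the None default is the sketch below. `AverageConfig`, `useRunningAverage`, and `hashInput` are hypothetical, and the real semantic hash is computed differently. The idea is that an unset optional flag contributes nothing to the hashed representation, so existing configs hash as before.

```scala
object OptionalFlagSketch {
  final case class AverageConfig(inputColumn: String, runningAverage: Option[Boolean] = None) {
    // None is interpreted as false.
    def useRunningAverage: Boolean = runningAverage.getOrElse(false)

    // Hypothetical stand-in for the input to a semantic hash:
    // an unset flag is omitted entirely, so old configs are unchanged.
    def hashInput: String =
      (Seq(s"inputColumn=$inputColumn") ++ runningAverage.map(v => s"runningAverage=$v"))
        .mkString(",")
  }
}
```

Under this scheme an explicit `Some(false)` would still change the hashed string, which is why the default is None rather than false.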

@rgruener
Contributor Author

Will update docs assuming the change looks ok

Collaborator

@nikhil-zlai left a comment

i found some weird behavior

* Uses a more stable online algorithm which should be suitable for large numbers of records similar to:
* http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
*/
private def computeRunningAverage(ir: Array[Any], right: Double, rightWeight: Double): Array[Any] = {
Collaborator

there is apparently already a getCombinedMean below in the moments stuff

Contributor Author

Good to know

Comment on lines +198 to +203
val scaling = rightWeight / newCount
if (scaling < STABILITY_CONSTANT) {
left + (right - left) * scaling
} else {
(leftWeight * left + rightWeight * right) / newCount
}
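For readers without the full diff, the excerpt above can be read as the following self-contained sketch. It is an assumption-laden reconstruction, not the PR's code: `MergeMeansSketch` and `mergeMeans` are hypothetical names, the value 0.1 for the stability threshold is a guess (the real `STABILITY_CONSTANT` is not shown in this thread), and `newCount` is assumed to be the sum of both weights.

```scala
object MergeMeansSketch {
  // Assumed threshold; the actual STABILITY_CONSTANT is not visible here.
  val StabilityConstant: Double = 0.1

  /** Merges two (mean, weight) pairs and returns the combined (mean, weight). */
  def mergeMeans(left: Double, leftWeight: Double,
                 right: Double, rightWeight: Double): (Double, Double) = {
    val newCount = leftWeight + rightWeight
    if (newCount == 0.0) (0.0, 0.0) // nothing to average
    else {
      val scaling = rightWeight / newCount
      val mean =
        if (scaling < StabilityConstant) {
          // Right side is small relative to the total: nudge the left mean.
          left + (right - left) * scaling
        } else {
          // Comparable sizes: use the weighted average directly.
          (leftWeight * left + rightWeight * right) / newCount
        }
      (mean, newCount)
    }
  }
}
```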
Collaborator

I tried to do this in place (replacing the logic in average directly), and it actually makes the tests fail: the average operation stops being commutative due to slight errors in the double multiply and double divide. I also tried (lw*la + rw*ra) / (lw + rw), without luck.

we ended up merging the following change instead: zipline-ai/chronon#1292
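The order sensitivity can be explored with a sketch like the one below. `MergeOrderSketch`, `Ir`, `merge`, and `groupingGap` are hypothetical names, and the merge uses the (lw*la + rw*ra) / (lw + rw) form mentioned above. A plain swap of the two arguments gives the same bits for this symmetric formula, so the sketch instead regroups three inputs, which is one way rounding in the multiply and divide can make the result depend on merge order.

```scala
object MergeOrderSketch {
  // Intermediate result: a partial mean and how many records it covers.
  final case class Ir(mean: Double, count: Long)

  // Weighted merge of two partial averages.
  def merge(l: Ir, r: Ir): Ir = {
    val n = l.count + r.count
    if (n == 0L) Ir(0.0, 0L)
    else Ir((l.count * l.mean + r.count * r.mean) / n, n)
  }

  // Absolute difference between two groupings of the same three inputs.
  // Mathematically zero; in floating point it may be a few ulps.
  def groupingGap(a: Double, b: Double, c: Double): Double = {
    val (x, y, z) = (Ir(a, 1L), Ir(b, 1L), Ir(c, 1L))
    val leftFirst = merge(merge(x, y), z).mean
    val rightFirst = merge(x, merge(y, z)).mean
    math.abs(leftFirst - rightFirst)
  }
}
```

Calling `MergeOrderSketch.groupingGap(0.1, 0.2, 0.3)` gives a value that is tiny but not guaranteed to be exactly zero, which is enough to break tests that compare results with strict equality.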

Collaborator

@nikhil-zlai left a comment


approving to negate my request changes. given the loss of commutativity with the running average computation (due to double multiply / divide errors) - i don't know what is the right thing to do :-/

we ended up just changing the denominator to long in our fork.

@rgruener
Contributor Author

> approving to negate my request changes. given the loss of commutativity with the running average computation (due to double multiply / divide errors) - i don't know what is the right thing to do :-/
>
> we ended up just changing the denominator to long in our fork.

I understand how this isn't strictly commutative (I hit that issue in the tests originally and introduced a tolerance in the tests to get them to pass). Are there larger implications to that?
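For context, a tolerance check of the kind described might look like the sketch below. `ToleranceSketch`, `approxEqual`, and the default tolerances are assumptions for illustration, not the values used in this PR's tests.

```scala
object ToleranceSketch {
  // Relative comparison with an absolute floor for values near zero.
  def approxEqual(a: Double, b: Double,
                  relTol: Double = 1e-9, absTol: Double = 1e-12): Boolean =
    math.abs(a - b) <= math.max(absTol, relTol * math.max(math.abs(a), math.abs(b)))
}
```

A check like this accepts results that differ only by rounding, such as `0.1 + 0.2` versus `0.3`, while still rejecting genuinely different averages.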
