fix: use compressed event size to close chunk #7517


Closed
wants to merge 7 commits

Conversation

sspaink
Contributor

@sspaink sspaink commented Apr 15, 2025

The chunk encoder writes gzipped content to a buffer. The bug is that enc.buf.Len() doesn't represent the total chunk size, only the compressed size. Currently enc.WriteBytes compares enc.bytesWritten to enc.softLimit to determine whether the chunk should be closed and returned, while enc.reset() uses enc.buf.Len() to adjust enc.softLimit. It seems to me that enc.bytesWritten is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Something to think about: is decision_logs.reporting.upload_size_limit_bytes meant to limit the final compressed or uncompressed size? I wrote this pull request assuming it represents the final uncompressed size. I'd also assume this is what a user would expect when configuring the limit, because they'd be more concerned with how the configured service will deal with the uncompressed size. Of course this is just speculation, so it's open for discussion; it could be that the intent was to reduce network packet size while maximizing the number of events. Could make this configurable 🤔 although to use enc.buf.Len() I think you also have to call enc.w.Flush() to make sure all pending data is written.
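
To make the mismatch concrete, here is a minimal, self-contained toy encoder. The field and method names (buf, bytesWritten, softLimit, WriteBytes, reset) follow this description, but everything else is assumed for illustration and is not the actual OPA implementation:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// Toy encoder: buf holds the gzipped chunk, bytesWritten counts the
// uncompressed bytes written into it, softLimit is the adaptive threshold.
type chunkEncoder struct {
	softLimit    int64
	bytesWritten int
	buf          bytes.Buffer
	w            *gzip.Writer
}

func (enc *chunkEncoder) WriteBytes(bs []byte) {
	enc.w.Write(bs)
	enc.bytesWritten += len(bs)
	// The decision to close the chunk is made on the UNCOMPRESSED size...
	if int64(enc.bytesWritten) >= enc.softLimit {
		enc.reset()
	}
}

func (enc *chunkEncoder) reset() {
	enc.w.Flush()
	// ...but the soft limit is adapted using the COMPRESSED size. Because gzip
	// shrinks the data so much, the chunk always looks underutilized here and
	// the soft limit keeps growing, even though the uncompressed chunk was full.
	if int64(enc.buf.Len()) < enc.softLimit {
		enc.softLimit *= 2 // simplified stand-in for the exponential scale-up
	}
	fmt.Printf("compressed=%dB uncompressed=%dB newSoftLimit=%d\n",
		enc.buf.Len(), enc.bytesWritten, enc.softLimit)
	enc.buf.Reset()
	enc.w = gzip.NewWriter(&enc.buf)
	enc.bytesWritten = 0
}

func main() {
	enc := &chunkEncoder{softLimit: 32 * 1024}
	enc.w = gzip.NewWriter(&enc.buf)
	event := []byte(`{"decision_id":"example","path":"authz/allow","input":{"user":"alice"},"result":true}`)
	for i := 0; i < 5000; i++ {
		enc.WriteBytes(event)
	}
}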

This bug definitely made me tear some of my hair out finding it, haha. I found it while working on #7455. Currently the buffers reset the chunk encoder frequently, hiding the problem because the soft limit never gets out of control. I was working on updating the event buffer to reuse the same chunk encoder throughout its lifecycle. This is where the problem revealed itself, because enc.softLimit began to overflow due to frequent calls to enc.reset()! 🚨

What was happening is that the encoder kept increasing the soft limit: because it was checking against the compressed size, it assumed the chunk buffer was constantly being underutilized. I also added a check to prevent enc.softLimit from overflowing, setting an upper limit on its growth (twice the hard limit, or math.MaxInt64 - 1). This might not be required because enc.reset() shouldn't be called so aggressively anymore, but I added a unit test showing that the overflow is possible.
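
To illustrate how the scale-up can wrap around when reset() fires often enough, here is a standalone calculation. The 32 KiB starting limit and the base factor of 2 are assumed values for illustration; the multiplier mirrors the mul line quoted in the review thread below, and applying it directly to the limit is a simplification:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumed values: a 32 KiB upload limit and a scale-up base factor of 2,
	// with the exponent bumped once per consecutive underutilized reset.
	const softLimitBaseFactor = 2.0
	uploadLimit := int64(32 * 1024)

	for exponent := 0; exponent < 64; exponent++ {
		mul := int64(math.Pow(softLimitBaseFactor, float64(exponent+1)))
		softLimit := uploadLimit * mul
		if softLimit < 0 {
			// int64 wrapped around: the soft limit is now negative.
			fmt.Printf("soft limit overflowed after %d consecutive scale-ups: %d\n",
				exponent+1, softLimit)
			return
		}
	}
}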

sspaink added 2 commits April 14, 2025 22:21
The chunk encoder writes gzipped content to a buffer. enc.buf.Len() doesn't represent the total chunk size, only the compressed size, while enc.bytesWritten is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Signed-off-by: sspaink <[email protected]>
@sspaink sspaink marked this pull request as ready for review April 15, 2025 03:44
@johanfylling
Contributor

From the description of decision_logs.reporting.upload_size_limit_bytes:

Decision log upload size limit in bytes. OPA will chunk uploads to cap message body to this limit.

Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form.

Perhaps @ashutosh-narkar can shed some more light on this, as I believe he implemented the soft limit. (It's been a while since then though, so he'll be forgiven if he doesn't recall 😄)

Contributor

@johanfylling johanfylling left a comment

Gosh, math 😵‍💫. You're gonna need to lead me by the hand on this one 😄.

if enc.metrics != nil {
	enc.metrics.Counter(encSoftLimitScaleUpCounterName).Incr()
}

mul := int64(math.Pow(float64(softLimitBaseFactor), float64(enc.softLimitScaleUpExponent+1)))
// this can cause enc.softLimit to overflow into a negative value
Contributor

What are the circumstances for a scenario where we reach an overflow? Since the thing we're exponentially increasing is upload bytes, for us to overflow, wouldn't the previous successful reset need to have had a soft-limit already terabytes in size?

This is intuition talking, and not me doing actual calculus, though, so I may be way off in my estimates. It's very likely I'm missing something here, since you've encountered this in your work and had to fix it.

Contributor Author

As we discussed, I updated the PR to now enforce a maximum configurable limit of 4294967296 instead, removing the need to check whether the soft limit will ever overflow.

if limit < 0 {
	limit = math.MaxInt64 - 1
}
enc.softLimit = limit
Contributor

As always when it comes to math, I'm a bit confused 😅.
Why are we setting the soft-limit to 2x the configured limit (or even higher) here? Won't that cause us to write past the configured limit in WriteBytes()? There is probably some detail I'm missing.

Contributor Author

Deleted the math, it won't hurt us anymore 😜

@ashutosh-narkar
Member

Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form.

Yes that's correct. It's been a while since I looked into this. But the goal is to pack as much as possible in the uploaded packet. We have some explanation of the algorithm in the section on Decision Logs. It's possible there could be a bug in some calculation which we haven't seen before.

…unk body should be closed.

Also enforce a maximum allowed upload limit of 2^32.

Signed-off-by: sspaink <[email protected]>

…s dropping the nd cache is less likely

Signed-off-by: sspaink <[email protected]>
@sspaink
Contributor Author

sspaink commented Apr 15, 2025

@ashutosh-narkar thank you for the clarification! In that case the bug is in the WriteBytes function, which uses the non-compressed size to determine when to close the chunk.

The latest changes update WriteBytes to check the compressed size instead, matching the reset function. You can see in the TestPluginTriggerManual unit test that because of this change it is now possible to send a lot more events in a single chunk: instead of spreading them over 3 chunks, it sends all the events in 1. The TestChunkMaxUploadSizeLimitNDBCacheDropping test also shows that the ND cache is less likely to be dropped, because the comparison is now against the compressed size. I think this could be a noticeable improvement!

Also added a maximum configurable upload size limit of 4294967296 bytes; if the configured value exceeds it, a warning is printed that the value was capped. Thank you @johanfylling for this suggestion.
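
For illustration, a rough sketch of such a cap; only the 4294967296 value and the wording of the warning come from this PR, while the constant and function names here are made up for the sketch:

package main

import "fmt"

// Illustrative only: mirrors the 2^32 cap discussed in this PR.
const maxUploadSizeLimitBytes = int64(1) << 32 // 4294967296

func capUploadLimit(configured int64) int64 {
	if configured > maxUploadSizeLimitBytes {
		fmt.Printf("warning: the configured `upload_size_limit_bytes` (%d) has been set to the maximum limit (%d)\n",
			configured, maxUploadSizeLimitBytes)
		return maxUploadSizeLimitBytes
	}
	return configured
}

func main() {
	fmt.Println(capUploadLimit(4294967296000000)) // warns, then prints 4294967296
}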

A unit test was added, but I also tested it locally:

decision_logs:
  service: fakeservice
  reporting:
    upload_size_limit_bytes: 4294967296000000
➜  buffertest ./opa_darwin_arm64 run -c opa-conf.yaml --server ./example.rego --log-level=error
{"level":"warning","msg":"the configured `upload_size_limit_bytes` (4294967296000000) has been set to the maximum limit (4294967296)","plugin":"discovery","time":"2025-04-15T15:28:29-05:00"}

@sspaink sspaink changed the title from "fix: use enc.bytesWritten to update chunk encoder soft limit" to "fix: use compressed event size to close chunk" Apr 15, 2025
Contributor

@johanfylling johanfylling left a comment

Are there any existing tests for verifying that the soft-limit is properly modified over the lifetime of the encoder? If so, I'd have expected to see adjustments to keep them green. Maybe some should be added?

} else if int64(len(bs)+2) > enc.limit {
}

// Compress the incoming event by itself in order to verify its final size
Contributor

final size

This isn't strictly true, though, as incomingEventBuf will additionally contain the gzip header. However, we know the header is 10 bytes (unless there are extra optional headers), so we could just subtract it. gzip also has an 8-byte trailer, but I suppose that's not part of incomingEventBuf.Len() unless w is flushed? (Haven't seen the trailer being accounted for anywhere, but maybe that's alright since we're a bit fuzzy with the soft-limit and all 🤔)
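
For anyone curious about those numbers, here is a tiny standalone snippet (not part of this PR) showing how the buffered length changes between Flush and Close, which is where the 10-byte header and the 8-byte trailer come in:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

func main() {
	event := []byte(`{"decision_id":"example","input":{"user":"alice"},"result":true}`)

	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)

	w.Write(event)
	w.Flush() // flush pending deflate data; the 10-byte gzip header has been written by now
	fmt.Println("after Flush (header + compressed event):", buf.Len())

	w.Close() // writes the final deflate block plus the 8-byte trailer (CRC-32 and length)
	fmt.Println("after Close:", buf.Len())
}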

Contributor Author

Good point! I updated the logic to include the header, opening/closing brackets, and trailer. There are two different lengths we need to consider: when the event is too big by itself, and when the event is the last event added.

}

// if the compressed incoming event is bigger than the upload size limit reject it
if int64(incomingEventBuf.Len()+2) > enc.limit {
Contributor

Are we confident that the correctness (disregarding gzip header) of checking the compressed event against the limit is a good tradeoff for the impact on memory footprint and cpu time of compressing the event twice?

Contributor Author

I don't think there is an alternative, is there? If the limit is the final compressed size, then there isn't a way to know whether a single incoming event would be too big without compressing it first.

chunk3 := <-fixture.server.ch
expLen1 := 122
expLen2 := 242
expLen3 := 36
Contributor

For this test to remain equivalent, maybe we should raise the number of events fired, or tweak the upload limit, so that we keep producing multiple chunks here?

// close the current chunk and reset it so the incoming event can be written into the next chunk.
// Note that the soft limit enforces the final compressed size.
var result [][]byte
if int64(incomingEventBuf.Len()+enc.buf.Len()+1) > enc.softLimit {
Contributor

I know this isn't your invention, but this 1 does a lot of heavy lifting. Depending on what's already written to enc.buf, we'll either need to write an extra [, ,, ], or a combination thereof. Assuming a chunk will contain >2 events most of the time, 1 (,) is mostly right 😄.

Contributor

Considering things like this, and the various gzip headers and trailers that will be added, and the fact that the compressed forms of [,] might not be exactly one byte each, it doesn't look like we'll ever firmly respect the configured upload limit. But given that the soft-limit adds even more variability, I suppose that's ok.

@sspaink
Contributor Author

sspaink commented Apr 17, 2025

@johanfylling I decided to close this PR in favor of a new one: #7521.
Just because it changed so much from the original attempt, I thought it would be easier to review without the previous clutter. Hope that is ok 👍
