fix: use compressed event size to close chunk #7517
Conversation
The chunk encoder writes gzipped content to a buffer, so `enc.buf.Len()` doesn't represent the total chunk size, only the compressed size. `enc.bytesWritten`, on the other hand, is the expected (uncompressed) size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Signed-off-by: sspaink <[email protected]>
From the description of `upload_size_limit_bytes`:

Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form. Perhaps @ashutosh-narkar can shed some more light on this, as I believe he implemented the soft limit. (It's been a while since then though, so he'll be forgiven if he doesn't recall 😄)
Gosh, math 😵💫. You're gonna need to lead me by the hand on this one 😄.
v1/plugins/logs/encoder.go (outdated)

```go
if enc.metrics != nil {
    enc.metrics.Counter(encSoftLimitScaleUpCounterName).Incr()
}

mul := int64(math.Pow(float64(softLimitBaseFactor), float64(enc.softLimitScaleUpExponent+1)))
// this can cause enc.softLimit to overflow into a negative value
```
What are the circumstances for a scenario where we reach an overflow? Since the thing we're exponentially increasing is upload bytes, for us to overflow, wouldn't the previous successful reset need to have had a soft limit already terabytes in size?
This is intuition talking, and not me doing actual calculus, though, so I may be way off in my estimates. It's very likely I'm missing something here, since you've encountered this in your work and had to fix it.
As we discussed, I updated the PR to now enforce a maximum configurable limit of 4294967296 (2^32) instead, removing the need to check if the soft limit will ever overflow.
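A minimal sketch of that cap, assuming illustrative names (`maxUploadSizeLimitBytes` and `clampUploadLimit` are not OPA's actual identifiers): configured values above 2^32 are clamped, so the later soft-limit scaling arithmetic stays far away from `int64` overflow.

```go
package main

import "fmt"

// Hypothetical constant mirroring the cap discussed above: 2^32 bytes.
const maxUploadSizeLimitBytes int64 = 1 << 32 // 4294967296

// clampUploadLimit caps a user-configured limit so that subsequent
// soft-limit scaling arithmetic can never overflow int64.
func clampUploadLimit(configured int64) int64 {
	if configured > maxUploadSizeLimitBytes {
		fmt.Printf("upload_size_limit_bytes (%d) capped to %d\n",
			configured, maxUploadSizeLimitBytes)
		return maxUploadSizeLimitBytes
	}
	return configured
}

func main() {
	fmt.Println(clampUploadLimit(4294967296000000)) // over the cap
	fmt.Println(clampUploadLimit(32768))            // within the cap
}
```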
v1/plugins/logs/encoder.go
Outdated
if limit < 0 { | ||
limit = math.MaxInt64 - 1 | ||
} | ||
enc.softLimit = limit |
As always when it comes to math, I'm a bit confused 😅.
Why are we setting the soft limit to 2x the configured limit (or even higher) here? Won't that cause us to write past the configured limit in WriteBytes()? There is probably some detail I'm missing.
Deleted the math, it won't hurt us anymore 😜
Yes, that's correct. It's been a while since I looked into this, but the goal is to pack as much as possible into the uploaded packet. We have some explanation of the algorithm in the section on Decision Logs. It's possible there could be a bug in some calculation which we haven't seen before.
…unk body should be closed. Also enforce a maximum allowed upload limit of 2^32. Signed-off-by: sspaink <[email protected]>
…s dropping the nd cache is less likely Signed-off-by: sspaink <[email protected]>
@ashutosh-narkar thank you for the clarification! In that case the bug is in the size the encoder compares against the limit; the latest changes now use the compressed size.

Also added a maximum configurable upload size limit of 4294967296 bytes (2^32) that prints a warning if the user exceeds it, noting that it was capped. Thank you @johanfylling for this suggestion. A unit test was added, but I also tested it out locally:

```yaml
decision_logs:
  service: fakeservice
  reporting:
    upload_size_limit_bytes: 4294967296000000
```

```
➜ buffertest ./opa_darwin_arm64 run -c opa-conf.yaml --server ./example.rego --log-level=error
{"level":"warning","msg":"the configured `upload_size_limit_bytes` (4294967296000000) has been set to the maximum limit (4294967296)","plugin":"discovery","time":"2025-04-15T15:28:29-05:00"}
```
Are there any existing tests for verifying that the soft limit is properly modified over the lifetime of the encoder? If so, I'd expect to see adjustments to keep them green. Maybe some should be added?
v1/plugins/logs/encoder.go (outdated)

```go
} else if int64(len(bs)+2) > enc.limit {
}

// Compress the incoming event by itself in order to verify its final size
```
> final size

This isn't strictly true, though, as `incomingEventBuf` will additionally contain the gzip header. However, we know the header is 10 bytes (unless there are extra optional headers), so we could just subtract it. gzip also has an 8-byte trailer, but I suppose that's not part of `incomingEventBuf.Len()` unless `w` is flushed? (Haven't seen the trailer being accounted for anywhere, but maybe that's alright since we're a bit fuzzy with the soft limit and all 🤔)
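For reference, the fixed gzip framing can be observed directly by compressing an empty payload with Go's standard library: the output is the 10-byte header (beginning with the `0x1f 0x8b` magic), an empty deflate stream, and the 8-byte trailer (CRC32 plus uncompressed size).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// gzipOverhead returns the bytes gzip emits for an empty payload:
// the 10-byte header, an empty deflate stream, and the 8-byte trailer.
func gzipOverhead() []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Close() // Close finalizes the deflate stream and writes the trailer
	return buf.Bytes()
}

func main() {
	out := gzipOverhead()
	fmt.Printf("framing bytes: %d, magic: %#x %#x\n", len(out), out[0], out[1])
}
```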
Good point! I updated the logic to include the header, opening/closing brackets, and trailer. There are two different lengths we need to consider: when the event is too big by itself, and when the event is the last one added.
v1/plugins/logs/encoder.go (outdated)

```go
}

// if the compressed incoming event is bigger than the upload size limit reject it
if int64(incomingEventBuf.Len()+2) > enc.limit {
```
Are we confident that the correctness (disregarding gzip header) of checking the compressed event against the limit is a good tradeoff for the impact on memory footprint and cpu time of compressing the event twice?
I don't think there is an alternative, is there? If the limit is the final compressed size, then there isn't a way to know if a single incoming event would be too big without compressing it first.
```go
chunk3 := <-fixture.server.ch

expLen1 := 122
expLen2 := 242
expLen3 := 36
```
For this test to remain equivalent, maybe we should raise the number of events fired, or tweak the upload limit, so that we keep producing multiple chunks here?
v1/plugins/logs/encoder.go (outdated)

```go
// close the current chunk and reset it so the incoming event can be written into the next chunk.
// Note that the soft limit enforces the final compressed size.
var result [][]byte
if int64(incomingEventBuf.Len()+enc.buf.Len()+1) > enc.softLimit {
```
I know this isn't your invention, but this `1` does a lot of heavy lifting. Depending on what's already written to `enc.buf`, we'll either need to write an extra `[`, `,`, `]`, or a combination thereof. Assuming a chunk will contain >2 events most of the time, `1` (`,`) is mostly right 😄.
Considering things like this, and the various gzip headers and trailers that will be added, and the fact that the compressed forms of `[`, `,`, and `]` might not be exactly one byte each, it doesn't look like we'll ever firmly respect the configured upload limit. But given that the soft limit adds even more variability, I suppose that's ok.
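The framing overhead itself is easy to pin down before compression; it's only after gzip that the per-byte accounting gets fuzzy. A small illustration (`frame` is a hypothetical helper, not OPA's code):

```go
package main

import (
	"bytes"
	"fmt"
)

// frame joins pre-serialized JSON events into one array payload. For n
// events the framing adds exactly n+1 uncompressed bytes: one '[',
// n-1 commas, and one ']'. After gzip, those bytes are not guaranteed
// to cost one byte each, which is the imprecision discussed above.
func frame(events [][]byte) []byte {
	return append(append([]byte("["), bytes.Join(events, []byte(","))...), ']')
}

func main() {
	events := [][]byte{[]byte(`{"a":1}`), []byte(`{"b":2}`), []byte(`{"c":3}`)}
	fmt.Println(string(frame(events))) // [{"a":1},{"b":2},{"c":3}]
}
```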
…byte. Signed-off-by: sspaink <[email protected]>
…stead of the uncompressed size. Signed-off-by: sspaink <[email protected]>
@johanfylling decided to close this PR in favor of this new one: #7521
The chunk encoder writes gzipped content to a buffer. The bug is that `enc.buf.Len()` doesn't represent the total chunk size, but the compressed size. Currently `enc.WriteBytes` compares `enc.bytesWritten` to `enc.softLimit` to determine if the chunk should be closed and returned, while within `enc.reset()` it uses `enc.buf.Len()` to adjust `enc.softLimit`. It seems to me that `enc.bytesWritten` is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Something to think about: is `decision_logs.reporting.upload_size_limit_bytes` meant to limit the final compressed OR uncompressed size? I wrote this pull request with the assumption that it is meant to represent the final uncompressed size. I'd also assume this is what a user would expect when configuring the limit, because they'd be more concerned with how the configured service will deal with the uncompressed size. Of course this is just speculative, so open for discussion; it could be this was meant to reduce network packet size while maximizing the number of events. Could make this configurable 🤔, although to use `enc.buf.Len()` I think you also have to call `enc.w.Flush()` to make sure all pending data is written.

This bug definitely made me tear some of my hair out finding it haha. Found it while working on: #7455. Currently the buffers reset the chunk encoder frequently, hiding the problem because the soft limit never gets out of control. I was working on updating the event buffer to reuse the same chunk encoder throughout its lifecycle. This is where the problem revealed itself, because the `enc.softLimit` began to overflow due to frequent calls to `enc.reset()`! 🚨

What was happening is that because the encoder was checking against the compressed size, it assumed the chunk buffer was constantly being underutilized and kept increasing the soft limit. I also added a check to prevent `enc.softLimit` from overflowing, setting an upper limit on the growth (twice the hard limit or `math.MaxInt64 - 1`). This might not be required because `enc.reset()` shouldn't be called so aggressively anymore, but I added a unit test showing that it is possible.