
plugin/decision: check if event is too large after compression #7521

Merged
merged 26 commits into from
Jun 25, 2025

Conversation

sspaink
Contributor

@sspaink sspaink commented Apr 17, 2025

Why the changes in this PR are needed?

resolve: #7526

What are the changes in this PR?

This PR changes what happens when an event is written to the chunk encoder via Write and WriteBytes. Originally, the uncompressed size of the incoming event was compared to the compressed limit, which caused the issue. To fix this, the logic now relies on the adaptive uncompressed limit to prevent large events from sneaking into a chunk. In case the uncompressed limit is wrong, the events are decoded and written recursively into a chunk. The base case is that the incoming event is the first event being written into a chunk. This is when the event is compressed, and the ND cache or the entire event can be dropped. The benefit is that if the event is too big even after compression, only a single event had to be compressed multiple times.
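To make the base case concrete, here is a minimal sketch of checking a single event against the compressed limit, dropping the ND cache first and the whole event only as a last resort. It is not the encoder code from this PR: the `event` struct, `compressedSize`, `fitFirstEvent`, and the outcome strings are hypothetical names chosen for illustration, and gzip over the JSON encoding is assumed as the compression step.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// event is a hypothetical, trimmed-down stand-in for a decision log event.
type event struct {
	DecisionID string         `json:"decision_id"`
	NDCache    map[string]any `json:"nd_builtin_cache,omitempty"`
}

// compressedSize returns the gzip-compressed size of e's JSON encoding.
func compressedSize(e event) (int, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := json.NewEncoder(zw).Encode(e); err != nil {
		return 0, err
	}
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// fitFirstEvent handles the first event of a chunk. The event is compressed
// before any decision is made, so the comparison is compressed size against
// compressed limit. Outcomes: "keep", "dropped_nd_cache", "dropped_event".
func fitFirstEvent(e event, compressedLimit int) (string, error) {
	n, err := compressedSize(e)
	if err != nil {
		return "", err
	}
	if n <= compressedLimit {
		return "keep", nil
	}
	if e.NDCache != nil {
		// Retry without the ND cache. e is a copy, so the caller's
		// event is left untouched.
		e.NDCache = nil
		if n, err = compressedSize(e); err != nil {
			return "", err
		}
		if n <= compressedLimit {
			return "dropped_nd_cache", nil
		}
	}
	return "dropped_event", nil
}

func main() {
	e := event{
		DecisionID: "example",
		NDCache:    map[string]any{"time.now_ns": []int64{1745000000000000000}},
	}
	for _, limit := range []int{1 << 20, 1} {
		outcome, err := fitFirstEvent(e, limit)
		fmt.Println(limit, outcome, err)
	}
}
```

Only this one event is compressed more than once, which is the cost the description refers to.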

Moving the logic for when to drop the ND cache into the encoder also has the benefit that the size and event buffers can reuse it.

The variable soft limit has also been renamed to uncompressed limit throughout the code and documentation to help clarify what it is meant to represent.

Notes to assist PR review:

Repeating the reproduction steps outlined in #7526 with a build that includes the changes in this PR logs no error.

…D cache sparingly

Renamed the "soft" limit to "uncompressed limit" throughout the code and documentation for clarity.
In the size and event buffers the uncompressed limit was being dropped after each upload; now it is carried over. The event buffer doesn't reset the encoder at all. The check for whether an individual event is too big was comparing its uncompressed size to the compressed limit, causing events to be dropped or to lose the ND cache unnecessarily. This is now fixed: if the uncompressed limit allows it, the event is compressed, and multiple attempts are made before dropping the ND cache or the event. The configurable upload size limit is used to calculate the uncompressed limit by growing it exponentially, which could overflow if it was set too high. Added a max.
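As an illustration of why a maximum matters when a limit grows exponentially, the sketch below doubles a limit and saturates at a cap instead of wrapping around. The function name `growLimit`, the doubling factor, and the cap value are assumptions for the example, not the values or scaling formula used in the plugin.

```go
package main

import "fmt"

// growLimit doubles limit but saturates at limitCap. The comparison is done
// against limitCap/2 so the multiplication itself can never overflow int64.
func growLimit(limit, limitCap int64) int64 {
	if limit > limitCap/2 {
		return limitCap
	}
	return limit * 2
}

func main() {
	const limitCap = int64(1) << 32 // hypothetical cap, not the plugin's actual maximum
	limit := int64(32768)
	for i := 0; i < 20; i++ {
		limit = growLimit(limit, limitCap)
	}
	fmt.Println(limit == limitCap)
}
```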

Signed-off-by: sspaink <[email protected]>

netlify bot commented Apr 17, 2025

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit 4a17e02
🔍 Latest deploy log https://app.netlify.com/projects/openpolicyagent/deploys/685bfbfd2c2653000821ffdf
😎 Deploy Preview https://deploy-preview-7521--openpolicyagent.netlify.app
Contributor

@johanfylling johanfylling left a comment

Some thoughts and questions.

@sspaink
Contributor Author

sspaink commented Apr 29, 2025

@johanfylling I've updated the logic for finding an event that is too big so that it makes use of the recursion that splits chunks when the uncompressed limit grows too large. Now the uncompressed limit is taken into account, and the first event that is written helps adjust the uncompressed limit to a reasonable starting point, as opposed to growing from the upload size limit.
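The recursion described above can be sketched roughly as follows. This is a simplified model rather than the encoder in this PR: `pack` and `gzipSize` are hypothetical helpers, events are treated as opaque byte slices, and halving the uncompressed limit on each retry is an assumption made to keep the example short.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// gzipSize returns the compressed size of the events written back to back.
func gzipSize(events [][]byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	for _, e := range events {
		zw.Write(e) // writes into a bytes.Buffer do not fail
	}
	zw.Close()
	return buf.Len()
}

// pack groups events into chunks that fit compressedLimit once compressed.
// Events are first grouped using uncompressedLimit, which is only a guess.
// A group that turns out too large after compression is re-packed with half
// the guess. The base case is a group holding a single event: if that still
// does not fit, the event is dropped.
func pack(events [][]byte, uncompressedLimit, compressedLimit int) (chunks [][][]byte, dropped int) {
	var cur [][]byte
	size := 0
	flush := func() {
		if len(cur) == 0 {
			return
		}
		group := cur
		cur, size = nil, 0
		switch {
		case gzipSize(group) <= compressedLimit:
			chunks = append(chunks, group)
		case len(group) == 1:
			dropped++
		default:
			sub, d := pack(group, uncompressedLimit/2, compressedLimit)
			chunks = append(chunks, sub...)
			dropped += d
		}
	}
	for _, e := range events {
		// Close the current group when the guess is exhausted. A guess of
		// zero forces one event per group, which guarantees termination.
		if len(cur) > 0 && (uncompressedLimit <= 0 || size+len(e) > uncompressedLimit) {
			flush()
		}
		cur = append(cur, e)
		size += len(e)
	}
	flush()
	return chunks, dropped
}

func main() {
	var events [][]byte
	for i := 0; i < 50; i++ {
		events = append(events, []byte(fmt.Sprintf(`{"decision_id":"%d","path":"example/allow"}`, i)))
	}
	chunks, dropped := pack(events, 4096, 200)
	fmt.Println("chunks:", len(chunks), "dropped:", dropped)
}
```

In this model a wrong guess costs extra compression passes for the affected group only, and an event is dropped solely after it has been compressed on its own.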

Also added a new histogram metric to track the number of events in each chunk. Not sure how useful this is for users 🤔 at the moment I am just using it in TestChunkEncoderAdaptive to find the maximum.

Thanks!

@sspaink sspaink changed the title fix: don't drop adaptive uncompressed size limit on upload and drop ND cache sparingly plugin/decision: check if event is too large after compression and don't drop adaptive uncompressed size limit on upload Apr 30, 2025
@sspaink sspaink added the monitoring Issues related to decision log and status plugins label Apr 30, 2025
Contributor

@johanfylling johanfylling left a comment

Some questions.

It's been a while since I looked at this last, so sorry if I'm rehashing old stuff 😅.

@sspaink sspaink changed the title plugin/decision: check if event is too large after compression and don't drop adaptive uncompressed size limit on upload plugin/decision: check if event is too large after compression May 9, 2025
* revert not dropping adaptive uncompressed limit

Signed-off-by: sspaink <[email protected]>
@sspaink
Contributor Author

sspaink commented May 9, 2025

@johanfylling I think I put too much into one pull request, so I decided to split up the changes into separate issues/PRs.

I think this should help make the reviewing a little easier. I also added better documentation describing the specific problem and how to reproduce it in each issue/PR. Sorry for not doing this to begin with, I don't think it affects any of your most recent review comments. Thank you for bearing with me 😄

sspaink and others added 8 commits May 9, 2025 17:33
Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
@sspaink sspaink requested a review from johanfylling June 4, 2025 17:17
Contributor

@johanfylling johanfylling left a comment

Some additional comments.

Contributor

@johanfylling johanfylling left a comment

Ran some perf tests of my own; performance looks largely unchanged.

Screenshot 2025-06-19 at 15 39 03

main:

Requests: 200
Requests total: 100000
Duration: 1m18.516058833s
Max concurrency: 500
Average req/s: 1273.6248034646687
timer_rego_query_eval_ns: Min: 5.375µs, Max: 817.916µs, Mean: 21.718µs, P50: 13.479µs, P75: 17.489µs, P90: 39.924µs, P95: 61.302µs, P99: 161.134µs, P99.9: 809.603µs, P99.99: 817.916µs
duration: Min: 180.792µs, Max: 88.047416ms, Mean: 678.158µs, P50: 385.375µs, P75: 477.729µs, P90: 796.262µs, P95: 1.218756ms, P99: 3.42696ms, P99.9: 87.394911ms, P99.99: 88.047416ms
timer_server_handler_ns: Min: 68.542µs, Max: 11.609167ms, Mean: 231.362µs, P50: 170.854µs, P75: 217.093µs, P90: 328.608µs, P95: 526.806µs, P99: 1.156035ms, P99.9: 11.396883ms, P99.99: 11.609167ms
timer_rego_external_resolve_ns: Min: 41ns, Max: 3.958µs, Mean: 115ns, P50: 84ns, P75: 125ns, P90: 167ns, P95: 208ns, P99: 334ns, P99.9: 3.868µs, P99.99: 3.958µs
timer_rego_query_compile_ns: Min: 10.875µs, Max: 385.125µs, Mean: 30.839µs, P50: 25.187µs, P75: 30.656µs, P90: 41.271µs, P95: 58.918µs, P99: 235.747µs, P99.9: 384.389µs, P99.99: 385.125µs
Peaks:
duration: 93.717667ms
timer_server_handler_ns: 93.573541ms
timer_rego_external_resolve_ns: 343.375µs
timer_rego_query_compile_ns: 7.865875ms
timer_rego_query_eval_ns: 4.087875ms
Peak duration: 93.717667ms

PR:

Global metrics:
Requests: 200
Requests total: 100000
Duration: 1m18.222317792s
Max concurrency: 500
Average req/s: 1278.4075289856378
duration: Min: 164.5µs, Max: 63.79725ms, Mean: 698.156µs, P50: 398.854µs, P75: 497.385µs, P90: 905.012µs, P95: 1.423327ms, P99: 4.176799ms, P99.9: 63.694633ms, P99.99: 63.79725ms
timer_rego_query_eval_ns: Min: 5.75µs, Max: 408.417µs, Mean: 20.925µs, P50: 13.833µs, P75: 17.364µs, P90: 39.675µs, P95: 55.189µs, P99: 180.39µs, P99.9: 406.198µs, P99.99: 408.417µs
timer_server_handler_ns: Min: 70.666µs, Max: 33.640792ms, Mean: 247.448µs, P50: 180.375µs, P75: 218.073µs, P90: 293.366µs, P95: 499.251µs, P99: 1.030884ms, P99.9: 32.711279ms, P99.99: 33.640792ms
timer_rego_external_resolve_ns: Min: 41ns, Max: 2.583µs, Mean: 116ns, P50: 125ns, P75: 125ns, P90: 167ns, P95: 167ns, P99: 292ns, P99.9: 2.547µs, P99.99: 2.583µs
timer_rego_query_compile_ns: Min: 11.208µs, Max: 739.959µs, Mean: 32.406µs, P50: 26.145µs, P75: 30.833µs, P90: 39.595µs, P95: 57.502µs, P99: 227.093µs, P99.9: 733.351µs, P99.99: 739.959µs
Peaks:
timer_rego_query_eval_ns: 2.902666ms
timer_server_handler_ns: 91.444459ms
timer_rego_external_resolve_ns: 299.792µs
timer_rego_query_compile_ns: 2.533792ms
duration: 92.222292ms
Peak duration: 92.222292ms

Looks like we're nearing the end of this story 🙂. I think there might be just the one thing left to fix.

Signed-off-by: sspaink <[email protected]>
Contributor

@johanfylling johanfylling left a comment

Nice! 😃

Let's get this baby in! 🎉

@sspaink sspaink merged commit d917e3a into open-policy-agent:main Jun 25, 2025
31 checks passed
Labels
monitoring Issues related to decision log and status plugins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decision log plugin uses the upload_size_limit_bytes to represent both the compressed and uncompressed limit
2 participants