Description
Expected Behavior
My team would like to use decision logs for auditing purposes, but we noticed a gap in OPA's fault tolerance. If OPA crashes, all the logs still in the decision log plugin's buffer are lost.
Ideally, if OPA crashes, decision logs that haven't yet been sent to the control plane would persist and could be sent to the control plane when another OPA instance starts running.
Actual Behavior
Currently, if OPA crashes, any decision logs still in the in-memory buffer are lost.
Steps to Reproduce the Problem
We're not sure of the exact steps, but here are some scenarios where this could happen:
- An OPA instance runs out of memory
- An OPA instance is shut down so that a new OPA sidecar can be deployed
Additional Info
Some of my team members have been thinking about how best to close this fault tolerance gap. One idea we had was to create our own decision log plugin that includes a step to persist decision log data to a robust datastore.
We're opening this issue to ask the community about any other possible ways of tackling this problem.
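
To make the persistence idea a bit more concrete, here is a rough sketch of the kind of step we have in mind. It is not OPA code and does not use any OPA API: `DurableSpool`, `drain`, and the `upload` callback are hypothetical names, a local file stands in for the "robust datastore", and the `decision_id` field in the usage note is only a placeholder for an event. It is also single-threaded and ignores batching and size limits. Each event is fsynced to disk before it counts as buffered, and it is only removed after the upload succeeds, so a new process pointed at the same path can pick up whatever the crashed one left behind.

```python
import json
import os


class DurableSpool:
    """Hypothetical on-disk buffer: one JSON event per line."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # Write the event and fsync before returning, so it survives a crash
        # of this process once append() has completed.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        # Return every event that was persisted but not yet acknowledged.
        if not os.path.exists(self.path):
            return []
        events = []
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    events.append(json.loads(line))
                except json.JSONDecodeError:
                    # A torn final line from a crash mid-write: stop here.
                    break
        return events

    def ack(self, count):
        # Drop the first `count` events by rewriting the file atomically:
        # write a temp file, fsync it, then rename it over the original.
        remaining = self.pending()[count:]
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            for event in remaining:
                f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)


def drain(spool, upload):
    # Try to upload everything pending; only acknowledge on success.
    # `upload` is a caller-supplied function returning True on success.
    events = spool.pending()
    if events and upload(events):
        spool.ack(len(events))
        return len(events)
    return 0
```

On startup, a new instance would construct `DurableSpool` with the same path and call `drain` before accepting new events, for example `drain(DurableSpool("/var/lib/opa/spool.jsonl"), send_to_control_plane)`, where both the path and `send_to_control_plane` are made up for illustration. Note that this gives at-least-once delivery: if the process dies after a successful upload but before `ack`, the same events are sent again, so the receiving side would need to deduplicate.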