
Decision Log fault tolerance #3043

Open
@cknakal

Expected Behavior

My team would like to use decision logs for auditing, but we noticed a gap in OPA's fault tolerance. If OPA crashes, every log still held in the decision log plugin's buffer is lost. Ideally, decision logs that have not yet been sent to the control plane would survive a crash and be sent once another OPA instance starts running.

Actual Behavior

Currently, if OPA crashes, any decision logs still in the in-memory buffer are lost.

Steps to Reproduce the Problem

We are not sure of the exact conditions that would trigger this, but some possibilities:

  • An OPA instance runs out of memory
  • An OPA instance is shut down so that a new OPA sidecar can be deployed

Additional Info

Some of my team members have been thinking about how best to close this fault tolerance gap. One idea was to write our own decision log plugin that includes a step to persist decision log data to a robust datastore.

We are opening this issue to ask the community about any other possible ways of tackling this problem.
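
To make the persistence idea a bit more concrete, here is a rough sketch of the behavior we have in mind: each decision is appended to an on-disk journal before it is considered buffered, and a newly started instance drains whatever the previous one left behind. This is not OPA's plugin API. The `DurableDecisionLog` class, the JSON-lines file, and the `upload` callback are all hypothetical names we made up for illustration, and a local file stands in for the "robust datastore". It also assumes a single writer and gives at-least-once delivery, so the control plane would need to tolerate duplicates.

```python
import json
import os
import tempfile


class DurableDecisionLog:
    """Hypothetical write-ahead journal for decision log events (illustration only)."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # Persist the event before treating it as buffered, so a crash
        # after this call returns cannot lose it.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        # Events written by this process or by a previous, crashed one.
        if not os.path.exists(self.path):
            return []
        events = []
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    events.append(json.loads(line))
                except json.JSONDecodeError:
                    # Skip a torn write left behind by a crash mid-append.
                    continue
        return events

    def drain(self, upload):
        # Send everything that is pending. The journal is removed only after
        # upload() returns, so a failed upload leaves the events on disk.
        events = self.pending()
        if events:
            upload(events)
            os.remove(self.path)
        return len(events)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "decisions.jsonl")

    # First instance records two decisions, then "crashes" before uploading.
    first = DurableDecisionLog(path)
    first.append({"decision_id": "a1", "result": True})
    first.append({"decision_id": "b2", "result": False})
    del first

    # A replacement instance starts up and drains the leftover journal.
    second = DurableDecisionLog(path)
    sent = []
    count = second.drain(sent.extend)
    return count, sent, second.pending()


if __name__ == "__main__":
    print(demo())
```

The `fsync` on every append is the expensive part of this approach, so a real implementation would probably need batching or a different datastore. We would be interested to hear whether something along these lines could live in the existing decision log plugin instead of a custom one.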
