Background
Some organizations require data retention policies to be enforced to meet legal and/or regulatory requirements. This proposal provides a few ways that this functionality could be implemented within Marquez.
Retention Scope
Data retention will be an opt-in feature, disabled by default. This ensures that existing deployments are unaffected while making the functionality available to those who need it.
When targeting data for retention, the scope will be focused on dataset/job versions and job runs. Furthermore, the Marquez configurations will allow retention policies to be set individually for each type of entity (dataset version, job version, job run) to allow for flexible enforcement.
This will ensure that core entities (namespaces, sources, datasets, jobs) are not impacted and will not magically disappear from the system. Lineage events are also out of scope, so that we don't lose the ability to replay those events over time.
Retention Behavior
Retention rules can be configured to purge items that are older than a given timeframe, more than X versions behind the current version, or both. Other systems have implemented retention using only one of these conditions; however, I think supporting both is required for this to work in many real-world scenarios.
Here is how this could be configured within a Marquez deployment.
# Rules can be combined under a section to ensure both conditions are met prior to data being purged.
retention:
  enabled: true # Enables retention enforcement, for the entities configured below. Omit sections when retention is not desired.
  schedule: "0 3 * * *" # When deployed into Kubernetes, enforce retention policies on this schedule.
  datasetVersions:
    recentItemsToKeep: 10 # Purges any dataset version records older than the 10 most recent.
  jobVersions:
    daysToKeep: 365 # Purges any job version records older than 1 year.
  jobRuns: # Purges job run records that are both older than the 10 most recent AND older than 1 year.
    recentItemsToKeep: 10
    daysToKeep: 365
Retention Enforcement
Retention enforcement can be applied on a configurable schedule, using a Java container-based batch job. We can provide a script to execute this job via Docker to support local development, and also deploy this into Kubernetes as a CronJob as part of the Helm chart. The CronJob will execute on the schedule defined within the YAML configurations, and will purge the applicable records based on the retention policy rules.
As records are identified and purged, applicable information can be logged and incorporated into the metrics framework.
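To make the rule semantics concrete, here is a rough sketch of the selection step such a batch job might perform. It is written in Python for brevity, although the proposal calls for a Java job, and it is not Marquez code: `Record`, `RetentionPolicy`, and `select_for_purge` are hypothetical names, and the policy fields simply mirror `recentItemsToKeep` and `daysToKeep` from the configuration above. It assumes that an omitted setting imposes no constraint, that both conditions must hold when both are set (as in the `jobRuns` example), and that the records passed in all belong to a single dataset or job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a dataset version, job version, or job run."""
    id: str
    created_at: datetime


@dataclass(frozen=True)
class RetentionPolicy:
    """Mirrors the proposed recentItemsToKeep / daysToKeep settings."""
    recent_items_to_keep: Optional[int] = None
    days_to_keep: Optional[int] = None


def select_for_purge(records: List[Record], policy: RetentionPolicy,
                     now: datetime) -> List[Record]:
    """Return the records that every configured rule agrees can be purged."""
    if policy.recent_items_to_keep is None and policy.days_to_keep is None:
        return []  # No rules configured, so nothing is purged (assumption).

    newest_first = sorted(records, key=lambda r: r.created_at, reverse=True)
    purge = []
    for rank, record in enumerate(newest_first):
        # An unset rule places no constraint on the record.
        beyond_recent = (policy.recent_items_to_keep is None
                         or rank >= policy.recent_items_to_keep)
        beyond_age = (policy.days_to_keep is None
                      or now - record.created_at > timedelta(days=policy.days_to_keep))
        if beyond_recent and beyond_age:
            purge.append(record)
    return purge


# Six runs created 0, 100, 200, 300, 400, and 500 days before "now".
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
runs = [Record(f"run-{i}", now - timedelta(days=100 * i)) for i in range(6)]
policy = RetentionPolicy(recent_items_to_keep=2, days_to_keep=365)
print([r.id for r in select_for_purge(runs, policy, now)])  # ['run-4', 'run-5']
```

In this example `run-2` and `run-3` fall outside the two most recent runs but are younger than a year, so they are kept; only the runs that fail both conditions are selected. The length of the returned list is the kind of count that could be logged or reported as a metric.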