feat: add top-level encoding field to the data contract #263

@zorender

Summary

The Open Data Contract Standard (ODCS) currently describes the logical and physical types of schema properties (logicalType, physicalType), but it provides no standardized way to declare the character encoding expected for the data (e.g. UTF-8, ISO-8859-1, ASCII, UTF-16).

Motivation

Encoding mismatches are a common and painful source of data pipeline failures, especially with flat files (CSV, TXT) or legacy systems that produce non-UTF-8 data. Today, teams work around this by documenting the encoding in free-text descriptions or custom properties, which is inconsistent and not machine-readable.

Adding a top-level encoding field would allow:

  • Data producers to declare the expected encoding explicitly
  • Data consumers to validate and configure their readers accordingly
  • Tooling to automate encoding checks as part of data quality
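
To make the consumer-side use case concrete, here is a minimal sketch of how a reader could use the proposed field. The helper name `check_encoding`, the UTF-8 fallback, and the sample payloads are assumptions for illustration only; none of this is part of ODCS or any existing tool.

```python
import codecs


def check_encoding(contract: dict, raw: bytes, default: str = "utf-8") -> bool:
    """Return True if `raw` decodes cleanly under the contract's declared encoding.

    `contract` is a parsed data contract (hypothetical shape: a dict with an
    optional top-level "encoding" key, as proposed in this issue).
    """
    declared = contract.get("encoding", default)
    try:
        codecs.lookup(declared)  # raises LookupError for unknown encoding names
    except LookupError:
        raise ValueError(f"unknown encoding in contract: {declared!r}") from None
    try:
        raw.decode(declared)
    except UnicodeDecodeError:
        return False
    return True


contract = {
    "apiVersion": "v3.1.0",
    "kind": "DataContract",
    "name": "my_dataset",
    "encoding": "UTF-8",
}
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("iso-8859-1")

print(check_encoding(contract, utf8_bytes))
print(check_encoding(contract, latin1_bytes))
```

One caveat with this approach: single-byte encodings such as ISO-8859-1 accept any byte sequence, so a decode check like this only catches mismatches when the declared encoding is strict (UTF-8, ASCII, UTF-16).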

Proposed Change

Add an optional encoding field at the contract level (alongside name, domain, status, etc.) with a free-form string value to remain flexible and future-proof:

{
  "apiVersion": "v3.1.0",
  "kind": "DataContract",
  "name": "my_dataset",
  "encoding": "UTF-8",
  ...
}

Schema addition (JSON Schema)

"encoding": {
  "type": "string",
  "description": "The expected character encoding of the data (e.g. UTF-8, ISO-8859-1, ASCII, UTF-16). Free-form string to remain flexible across use cases."
}
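
Since the value is free-form, tooling would probably need to normalize aliases ("UTF-8", "utf8", "latin-1", ...) before comparing them. A rough sketch using Python's standard `codecs` registry; `normalize_encoding` is a hypothetical helper, and the canonical names it returns are Python's, not something ODCS defines.

```python
import codecs
from typing import Optional


def normalize_encoding(value: str) -> Optional[str]:
    """Map a free-form encoding label to Python's canonical codec name.

    Returns None when the label is not a codec Python knows about.
    """
    try:
        return codecs.lookup(value.strip()).name
    except LookupError:
        return None


# Aliases of the same encoding normalize to the same canonical name.
print(normalize_encoding("UTF-8") == normalize_encoding("utf8"))
print(normalize_encoding("ISO-8859-1") == normalize_encoding("latin-1"))
print(normalize_encoding("not-a-charset"))
```

Other runtimes have their own registries and alias tables, so a shared list of documented example values would still help cross-language tooling agree.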

Alternatives Considered

  • At the column level (SchemaBaseProperty): More granular but arguably over-engineered for most use cases where encoding is consistent across the dataset. Could be a follow-up if the community needs it.
  • Via customProperties: Already possible today, but not standardized, so no tooling can rely on it.

Open Questions

  • Should common values be documented as examples (UTF-8, ISO-8859-1...) even if the field stays free-form?
  • Should this be scoped per server instead of (or in addition to) the contract level?
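
On the per-server question: if both levels were allowed, one possible rule is "server value overrides contract value". The sketch below assumes a hypothetical `encoding` key on each `servers` entry; that key and the `resolve_encoding` helper do not exist in ODCS today and are only here to illustrate the precedence.

```python
from typing import Optional


def resolve_encoding(contract: dict, server_name: str) -> Optional[str]:
    """Return the encoding for a server, falling back to the contract level.

    Assumes (hypothetically) that each entry in contract["servers"] may carry
    its own "encoding" key next to its "server" identifier.
    """
    for server in contract.get("servers", []):
        if server.get("server") == server_name and "encoding" in server:
            return server["encoding"]
    return contract.get("encoding")


contract = {
    "name": "my_dataset",
    "encoding": "UTF-8",
    "servers": [
        {"server": "legacy-ftp", "encoding": "ISO-8859-1"},
        {"server": "warehouse"},
    ],
}

print(resolve_encoding(contract, "legacy-ftp"))
print(resolve_encoding(contract, "warehouse"))
```

With this rule the contract-level field stays the default and only the odd legacy source needs an override.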

Looking forward to the community's thoughts! 🙌
