Summary
The ODCS currently describes the logical and physical types of schema properties (`logicalType`, `physicalType`) but does not provide a standardized way to declare the character encoding expected for the data (e.g. UTF-8, ISO-8859-1, ASCII, UTF-16).
Motivation
Encoding mismatches are a common and painful source of data pipeline failures — especially with flat files (CSV, TXT) or legacy systems that produce non-UTF-8 data. Today, teams work around this by documenting encoding in free-text descriptions or custom properties, which is inconsistent and not machine-readable.
Adding a top-level `encoding` field would allow:
- Data producers to declare the expected encoding explicitly
- Data consumers to validate and configure their readers accordingly
- Tooling to automate encoding checks as part of data quality
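As a sketch of what the consumer side could look like, assuming the proposed `encoding` key and a hypothetical in-memory data source (not part of the spec today):

```python
import io
import json

# Hypothetical contract document; the "encoding" field follows this proposal.
contract = json.loads("""
{
  "apiVersion": "v3.1.0",
  "kind": "DataContract",
  "name": "my_dataset",
  "encoding": "ISO-8859-1"
}
""")

# The field is optional, so a consumer would pick a sensible fallback.
encoding = contract.get("encoding", "UTF-8")

# Configure the reader from the contract instead of guessing: these bytes
# decode correctly as ISO-8859-1 but would fail or mangle under UTF-8.
raw = "café;münchen\n".encode("ISO-8859-1")
with io.TextIOWrapper(io.BytesIO(raw), encoding=encoding) as reader:
    line = reader.readline()
```

The same lookup would let a quality check compare the declared encoding against what the file actually decodes as, instead of relying on a free-text description.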
Proposed Change
Add an optional `encoding` field at the contract level (alongside `name`, `domain`, `status`, etc.) with a free-form string value to remain flexible and future-proof:

```json
{
  "apiVersion": "v3.1.0",
  "kind": "DataContract",
  "name": "my_dataset",
  "encoding": "UTF-8",
  ...
}
```
Schema addition (JSON Schema)
```json
"encoding": {
  "type": "string",
  "description": "The expected character encoding of the data (e.g. UTF-8, ISO-8859-1, ASCII, UTF-16). Free-form string to remain flexible across use cases."
}
```
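Even with a free-form string, tooling can do a basic sanity check on the declared value. A minimal sketch (not part of the proposal) using Python's codec registry, which already recognizes the common names and their aliases:

```python
import codecs

def is_known_encoding(name: str) -> bool:
    """Return True if the free-form encoding string maps to a codec
    the runtime knows, e.g. 'UTF-8', 'utf8', or 'ISO-8859-1'."""
    try:
        codecs.lookup(name)  # resolves aliases, case-insensitive
        return True
    except LookupError:
        return False

# Typical declared values resolve; typos and unknown names do not.
checks = {
    "UTF-8": is_known_encoding("UTF-8"),
    "ISO-8859-1": is_known_encoding("ISO-8859-1"),
    "not-an-encoding": is_known_encoding("not-an-encoding"),
}
```

A validator built into a data-contract CLI could warn (rather than fail) on unrecognized values, preserving the flexibility of the free-form string.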
Alternatives Considered
- At the column level (`SchemaBaseProperty`): More granular, but arguably over-engineered for most use cases, where encoding is consistent across the dataset. Could be a follow-up if the community needs it.
- Via `customProperties`: Already possible today, but not standardized — no tooling can rely on it.
Open Questions
- Should common values be documented as examples (UTF-8, ISO-8859-1...) even if the field stays free-form?
- Should this be scoped per `server` instead of (or in addition to) the contract level?
Looking forward to the community's thoughts! 🙌