
Toward versioned profiles for shared serialization formats #8

@fingolfin


(Disclaimer: I wrote a rough draft of this issue text and then Codex helped me to smooth it out)

We want this project to do more than document a list of data types, especially since many such types are inherently application- or domain-specific. The longer-term goal should be to identify at least some common ground across systems and to make that common ground explicit.

A possible way to structure this is through profiles. A profile would describe a versioned subset of the serialization format. Profiles could then be combined into larger, application-level profiles.

Some examples of what this might look like:

  • A base profile could define common primitive types such as bool, int, and string, along with basic containers such as arrays/vectors, dictionaries, and sets.
  • If the serialization of booleans changes, that could lead to a new version of the base profile, for example base v2.
  • At a higher level, we could also define application profiles for specific releases, such as OSCAR 1.5 or OSCAR 1.8.
  • These application profiles could depend on lower-level profiles. For example, OSCAR 1.5 through OSCAR 1.7 might use base v1, while OSCAR 1.8 uses base v2.
  • Other systems such as Sage, Magma, or CoCoA could define their own application profiles as well.
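To make the layering above concrete, here is a minimal sketch of how profiles and their dependencies might be modelled. The `Profile` class, its field names, and the `all_formats` helper are hypothetical and only illustrate the idea; nothing like this exists in the project yet, and the format lists are taken from the examples above rather than from any agreed specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical model of a versioned profile."""
    name: str
    version: str
    formats: tuple = ()   # formats defined directly by this profile
    requires: tuple = ()  # lower-level profiles this one builds on

    @property
    def label(self):
        return f"{self.name} {self.version}"


def all_formats(profile):
    """Collect the formats of a profile and of everything it depends on."""
    found = set(profile.formats)
    for dependency in profile.requires:
        found |= all_formats(dependency)
    return found


BASE_FORMATS = ("bool", "int", "string", "array", "dict", "set")

base_v1 = Profile("base", "v1", formats=BASE_FORMATS)
base_v2 = Profile("base", "v2", formats=BASE_FORMATS)

# Application profiles pick the base version they were released with.
oscar_1_5 = Profile("OSCAR", "1.5", formats=("permutation-group",), requires=(base_v1,))
oscar_1_8 = Profile("OSCAR", "1.8", formats=("permutation-group",), requires=(base_v2,))

for app in (oscar_1_5, oscar_1_8):
    print(app.label, "builds on", [dep.label for dep in app.requires])
```

Keeping the dependency explicit means a reader of an OSCAR 1.8 file only needs the application profile name to find out which base version applies.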

This becomes especially relevant for more complex structures. For example, serialization of permutation groups may initially be defined only in an OSCAR-specific profile. In fact, we already have such a format in practice, and we could retroactively describe it as part of an OSCAR 1.5+ profile (or whichever release introduced it; I have not checked). If that format later changes (we recently discussed this possibility in Berlin), we could then refer to the revised definition as the permutation-group format in OSCAR 1.8, or similar.

Over time, some of these application-specific formats may mature into more general profiles shared across systems. For instance, if OSCAR and Magma eventually agree on a common serialization for permutation groups, that could become part of a shared group theory profile.

This suggests two broad classes of profiles:

  • Application-specific profiles, which may initially be defined mostly by implementation and may evolve quickly.
  • General/shared profiles, where multiple implementations agree on the format and which therefore deserve a more careful and explicit specification.

The intended workflow is pragmatic: people should be able to start producing data as soon as they have a need, without waiting for a perfect specification. But over time, those formats should be able to evolve into clearer, more formal, and more reusable specifications.

Examples are important here as well. The website is already a good starting point, but we will likely need examples not only for the current format, but also for older and newer variants over time. For example, users should be able to inspect both the old and new boolean formats and switch between them.

On the technical side, one possible next step would be to add metadata to each description.md file indicating the profiles in which that format appears. For example, data/basics/bool/description.md might gain a profiles key with entries such as base-v1.
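As a rough illustration, the sketch below reads a `profiles` key from a description file. It assumes the metadata is stored as a YAML-style front-matter block delimited by `---` lines; that layout, the `title` key, and the `read_profiles` helper are assumptions made for this example, not something the project has settled on.

```python
def read_profiles(text):
    """Return the entries listed under `profiles:` in a leading '---' block.

    Assumes a hypothetical front-matter layout; returns [] if none is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    profiles, in_profiles = [], False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # end of the front-matter block
            break
        if stripped.startswith("profiles:"):
            in_profiles = True
        elif in_profiles and stripped.startswith("- "):
            profiles.append(stripped[2:].strip())
        elif stripped:  # any other key ends the profiles list
            in_profiles = False
    return profiles


# Hypothetical content for data/basics/bool/description.md
EXAMPLE = """\
---
title: bool
profiles:
  - base-v1
---
(description text)
"""

print(read_profiles(EXAMPLE))
```

A real implementation would more likely use an existing YAML parser; the hand-written loop only keeps the example free of dependencies.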

That immediately raises another question: how should we represent revised versions of a format like bool, while still making it clear that they are revisions of the same conceptual type?

My view is that each format variant should also have its own identifier or version. I understand the concern (strongly expressed by @micjoswig) that this can become confusing if one starts tracking questions like “does permutation-group format X use bool format Y or Z?”. That is precisely why profiles seem useful: they provide the higher-level compatibility story. In many cases, these concerns are partly decoupled anyway. For example, the current permutation-group format serializes booleans, but it may work equally well with both the old and new boolean encodings.
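One way to picture this decoupling: each format variant records which variants of its dependencies it accepts, and a profile pins exactly one variant per format. The sketch below is purely illustrative; the `ACCEPTS` table, the profile identifiers such as `oscar-1.5`, and the variant labels are all hypothetical.

```python
# Hypothetical: the current permutation-group variant works with either bool encoding.
ACCEPTS = {
    ("permutation-group", "v1"): {"bool": {"v1", "v2"}},
}

# Hypothetical profiles, each pinning one variant per format.
PROFILES = {
    "oscar-1.5": {"bool": "v1", "permutation-group": "v1"},
    "oscar-1.8": {"bool": "v2", "permutation-group": "v1"},
}


def is_consistent(pins):
    """Check that every pinned variant accepts the pinned variants of its dependencies."""
    for fmt, variant in pins.items():
        for dependency, allowed in ACCEPTS.get((fmt, variant), {}).items():
            if pins.get(dependency) not in allowed:
                return False
    return True


for name, pins in PROFILES.items():
    print(name, "consistent:", is_consistent(pins))
```

With this split, users only ever name a profile, while the question of which bool variant a permutation-group variant tolerates is answered once, in the variant's own metadata.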

Still, we need a precise way to distinguish different variants of a format. Referring informally to “old bool” and “new bool” is not sustainable once there are several revisions. I do not have a strong preference for the naming scheme itself, but we should choose something explicit. Possible options include:

  1. Sequential labels such as v1, v2, v3, or perhaps A, B, C.
    • Easy to explain and implement.
    • Gives an obvious ordering.
    • Makes it straightforward to determine predecessor/successor variants programmatically.
  2. Date-based labels such as 2026-03.
    • Retains most of the advantages of sequential labels.
    • Gives a rough sense of when a variant was introduced.
    • May also make relationships between revisions easier to guess at a glance.
  3. Some other explicit variant identifier.
    • If there is a better idea, especially one that avoids confusion while remaining precise, we should consider it.

For the prototype, I think it is reasonable to use simple sequential version labels (v1, v2, ...) just to get started. That would let us build and test the overall structure now, while leaving room to revise the naming scheme later once we reach broader agreement.
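To show why sequential labels are easy to handle programmatically, here is a small sketch of ordering and predecessor/successor lookup. The helper names are made up for this example, and it assumes labels have exactly the form `v` followed by a positive integer.

```python
import re

# Assumed label shape: "v" followed by a positive integer without leading zeros.
_LABEL = re.compile(r"v([1-9][0-9]*)")


def parse_label(label):
    """Return the integer behind a label such as 'v3'."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a sequential version label: {label!r}")
    return int(match.group(1))


def successor(label):
    return f"v{parse_label(label) + 1}"


def predecessor(label):
    """Return the previous label, or None for the first variant."""
    number = parse_label(label)
    return f"v{number - 1}" if number > 1 else None


# Sorting by the parsed number avoids the string-order trap where "v10" < "v2".
print(sorted(["v10", "v2", "v1"], key=parse_label))
```

Date-based labels such as 2026-03 would sort correctly as plain strings, but finding the direct predecessor would require a list of all existing variants.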

Open Questions

  1. Should profiles be the main compatibility mechanism, with individual format variants treated as lower-level building blocks?
  2. Do we want to distinguish explicitly between application-specific and shared/general profiles?
  3. What naming scheme should we use for format variants in the prototype: sequential, date-based, or something else?
  4. How should the website present historical examples and switches between profile/variant versions?
