Format descriptions

Piotr Banski edited this page Nov 29, 2025 · 8 revisions

The SIS does not aspire to become an encyclopaedia of formats: several sources of this kind of information, maintained by respectable institutions, already exist on the net. What we aim for is to provide (relatively) unique reference points in the formatsphere, in the context of the data that matter to the centres served by the SIS. This means, primarily, linguistic and language-resource-oriented contexts, but we do not exclude the option of serving neighbouring disciplines, because linguistics and text technology are not firmly circumscribed and overlap with numerous other fields.

Format description files reside in clarin/data/formats/.

Overall structure of format descriptions

The content of a format description file is wrapped in a <format> element, and the first thing to do is to set the ID of the format. The ID has to be unique, as a schema-aware XML editor will immediately let you know, especially if you begin creating your new format description by using another as your template. In fact, starting a new format by selecting a description of a similar one and "saving as..." is the recommended way to go: this way, you keep all the infrastructural extras and at least some of the keywords.

The overall internal structure of the <format/> element is as follows:

  • naming section: <titleStmt>
  • keyword section: <keyword>
  • external references: <extId> and <extDoc>
  • (mostly) prose information: <info>
  • standards references: <relation>
  • format-related labels: <mimeType> and <fileExt>
  • schema location: <schemaLoc/>
  • formal family anchoring: <formatFamily>

This is also how the information in this document is organised, starting with the envelope, i.e. the <format> element and its @id attribute.
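
As a rough orientation, the section order listed above could be sketched as the skeleton below. It reuses values from the XML examples elsewhere on this page; the title text and the keyword are illustrative, and the real fXML.xml will also carry whatever namespace and schema association its template provides, which are left out here.

```xml
<!-- Sketch only: shows the order of the sections, not the exact content of fXML.xml. -->
<format id="fXML">
  <!-- naming section -->
  <titleStmt>
    <title>Extensible Markup Language</title>
    <abbr>XML</abbr>
  </titleStmt>
  <!-- keyword section (illustrative keyword, not copied from the real file) -->
  <keyword>markup language</keyword>
  <!-- external references (any <extDoc> elements would also go here) -->
  <extId type="LOC">fdd000075</extId>
  <extId type="PRONOM">fmt/101</extId>
  <extId type="Wikidata">Q2115</extId>
  <!-- (mostly) prose information -->
  <info type="description">
    <p>Placeholder description.</p>
  </info>
  <!-- standards references -->
  <relation target="SpecXML" type="isDefinedBy"/>
  <!-- format-related labels -->
  <mimeType>application/xml</mimeType>
  <fileExt>.xml</fileExt>
  <!-- <schemaLoc/> and <formatFamily> follow here; their content is not sketched -->
</format>
```

Starting from an existing file, as recommended above, remains the safer route, because it preserves the parts this sketch omits.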

Top level: @id

The Schematron check for the uniqueness of @id works on the contents of the disk, so the warning it raises will only go away once the file has been saved after a modification. Also, the ID has to begin with the character 'f'.

There is no strict identity between the ID and the file name on the one hand, and the abbreviated name on the other. However, there is a strong preference to keep the ID identical to the file name, and very similar to the abbreviated name (minus the leading 'f', of course; a notorious counterexample to the latter is fTextPlain vs. "PlainText"; but is that really a counterexample?).
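
To make the relationship concrete, here is a minimal sketch of how the fTextPlain case mentioned above might look, assuming the file is named fTextPlain.xml. The <title> text is a placeholder and all other sections are omitted.

```xml
<!-- Hypothetical content of fTextPlain.xml: the ID matches the file name,
     while the abbreviated name differs from the ID minus the leading 'f'. -->
<format id="fTextPlain">
  <titleStmt>
    <title>Plain text</title> <!-- placeholder title -->
    <abbr>PlainText</abbr>
  </titleStmt>
  <!-- remaining sections omitted -->
</format>
```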

Do not worry about the vestigial @display attribute; it is currently not supported and just dangles there.

Naming section: <titleStmt>

  <titleStmt>
    <title>Structured Query Language</title>
    <abbr>SQL</abbr>
  </titleStmt>

<title> can be a fairly precise long name for the format.

<abbr> has two main roles: it is used for table-based display, where we mind the length of the text field (to an extent), and it may also be used as an identifier of a node in the formal format family. For example, "XML" is the root of an entire subtree (or subgraph) of the formal family tree (see below), and that's only possible because the file fXML.xml has <abbr>XML</abbr>. In other words, it's a relatively sensitive element.

Keyword section: <keyword>

Keywords are how we would like to ensure a measure of useful cross-format navigation, but the untyped version of this element still awaits some kind of effort that targets unification/coherence.

<keyword>database</keyword>

Keywords can be free, as above, or constrained, as in the following fragment, where the values of keywords typed "gdfr" come from a predetermined set and are auto-completed:

  <keyword>annotation format</keyword>
  <keyword>corpus encoding</keyword>
  <keyword>format family</keyword>
  <keyword type="gdfr">form:text</keyword>
  <keyword type="gdfr">genre:text</keyword>
  <keyword type="gdfr">role:family</keyword>

The constraint layer is based on GDFR (Global Digital Format Registry) recommendations, or rather a pertinent subset of them, extracted from GDFR Classification v. 1.0.5, mirrored in our vault.

GDFR-constrained keywords await a dedicated project. Please let us know if you feel like participating in it.

Another value used for the @type attribute is "SSO", standing for "Standards-Setting Organisation". Hence, for example:

  <keyword type="SSO">ISO</keyword>

Keywords make up the cloud on the home page.

External references: <extId> and <extDoc>

<extDoc> ("external documentation") is just a way to provide a uniform link to Wikipedia (enwiki) or to the "format wiki".

<extId> ("external ID") is a way to provide a measure of authoritative control, by anchoring the given format in well-known external collections.

  <extId type="LOC">fdd000075</extId>
  <extId type="PRONOM">fmt/101</extId>
  <extId type="Wikidata">Q2115</extId>

There is an optional attribute @label there, which may be used to increase the granularity of the external references, by e.g. explicitly stating which version the given external ID refers to, if the info file describes a format with several variants.

  <extId type="LOC" label="v.1.2">fdd000247</extId>
  <extId type="PRONOM" label="v.1.2">fmt/291</extId>
  <extId type="PRONOM" label="v.1.3">fmt/1756</extId>
  <extId type="Wikidata" label="v.1.2">Q27203601</extId>
  <extId type="Wikidata" label="v.1.3">Q114074169</extId>

(Mostly) prose information: <info>

@umbrella

One piece of information that may be present in a file that is "saved as..." for the purpose of creating a new format info file is the @umbrella attribute, which indicates whether the format described by the info file is an 'umbrella format'. It makes sense to disregard this attribute in most cases (so, when you open, e.g., fXML.xml in order to "save as...", please delete that attribute). For the sake of completeness, here is the stock message displayed at the top of umbrella formats:

... is considered an umbrella format, which means that it is impossible to create a meaningful recommendation for or against using it, because in the context of research data, it is actually shorthand for many, sometimes drastically differing formats. Centres are advised to discern between the various subformats that are grouped under the general umbrella of this one, for the purpose of creating recommendations.

Umbrella formats do not appear in the list of popular formats (unconditionally) and are indicated as out-of-place (improperly used) in the recommendations – unless they are qualified by a comment.

Internal syntax of <info>

The outer element is expected to begin as <info type="description" ..., although the @type attribute doesn't play any role at the moment -- it is a vestige of the model for the description of standards (as opposed to formats).

What comes afterwards is a sequence of <p>, <ul> and <ol> elements. Do not expect any kind of sophistication there, such as nested lists, although you should be able to use paragraphs inside list items. As usual: let us know if you're missing something here (or, better yet: submit a PR...).

Inside the above-mentioned block elements, the following are possible: <a> for hyperlinks, <i> for italics, <code> for monospaced text. What should also be possible is <formatRef>, so please see the linked issue for whether it has been handled by the time you read this. Otherwise, use the following to reference other formats:

<a href="../views/view-format.xq?id=fEXB">EXMARaLDA</a>

(All of those above are going to get turned into <formatRef ref="fEXB"/> one day...)
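
Putting the pieces of this section together, an <info> block might look roughly like the sketch below. The prose is placeholder text, and <li> is an assumed name for list items, since only the list containers are named above; please check an existing file for the actual element.

```xml
<!-- Sketch only: <li> is an assumption, and the text is placeholder content. -->
<info type="description">
  <p>Placeholder paragraph with <i>italics</i> and a <code>code</code> span.</p>
  <ul>
    <li>
      <p>A list item holding a paragraph.</p>
    </li>
    <li>
      <p>A reference to another format:
        <a href="../views/view-format.xq?id=fEXB">EXMARaLDA</a>.</p>
    </li>
  </ul>
</info>
```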

Standards references: <relation>

This piece awaits the update and development of the Watchtower part of the SIS. Would you like to participate in that?

  <relation target="SpecXML" type="isDefinedBy"/>

Format-related labels: <mimeType> and <fileExt>

In both cases, where several possible options exist, use the @recommended attribute to mark the recommended ones, e.g. from the point of view of the Switchboard.

  <mimeType>application/xml</mimeType>
  <fileExt>.xml</fileExt>
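
Where a format has more than one label in use, the marking could look something like the sketch below. The value "yes" is only a guess at how @recommended is filled in; the schema or an existing file will show the actual permitted values.

```xml
<!-- Assumption: recommended="yes" is an illustrative value, not confirmed by the schema. -->
<mimeType recommended="yes">application/xml</mimeType>
<mimeType>text/xml</mimeType>
<fileExt>.xml</fileExt>
```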

Schema location: <schemaLoc/>

This is for formats with a specific schema available publicly on the net. The @type attribute specifies the kind of schema.

Formal family anchoring: <formatFamily>

Please treat this element as largely for internal use, for now. This area should be a target of a subproject.
