Multilingual Datasets, the Government of Canada approach

wardi edited this page Mar 4, 2013 · 6 revisions

Problem We're Solving

The Government of Canada Official Languages Act requires that both official languages have equal status in communications to the public.

CKAN's multilingual extensions allow registering translations for complete field values and displaying them to the user, but not associating them with specific datasets/resources or exposing them in the API. There is no simple way to ensure translations are updated when a dataset or resource is modified, which could result in text being displayed in the wrong language. There is also no way to provide different translations for the same string in different contexts, which may be required.

Our Approach

We add a new dataset field: language.

This field contains a list of pairs, each made of an ISO 639-2/T three-letter language code and an ISO 3166-1 three-letter country code. Within a pair, the language and country codes are joined with a semicolon and a space; the pairs are joined with a vertical bar.
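
As an illustration only, the format can be parsed and rebuilt with a few lines of Python. The function names `parse_language_field` and `format_language_field` are hypothetical and not part of CKAN or of our extension; the sketch also assumes that whitespace around the vertical bar is optional on input and written as a single space on each side on output.

```python
def parse_language_field(value):
    """Split a language field into (language, country) pairs.

    Hypothetical helper: tolerates extra whitespace around ';' and '|'.
    """
    pairs = []
    for pair in value.split("|"):
        parts = [part.strip() for part in pair.split(";")]
        if len(parts) != 2 or not all(parts):
            raise ValueError("expected 'language; COUNTRY', got %r" % pair)
        pairs.append((parts[0], parts[1]))
    return pairs


def format_language_field(pairs):
    """Join (language, country) pairs back into a single field value."""
    return " | ".join(
        "%s; %s" % (language, country) for language, country in pairs
    )
```

For example, `parse_language_field("eng; CAN | fra; CAN")` yields the pairs `("eng", "CAN")` and `("fra", "CAN")`, and `format_language_field` turns those pairs back into the same string.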

For our datasets this field will always contain the value "eng; CAN | fra; CAN". This is interpreted as:

  1. the language found in all translated fields is Canadian English, e.g. title, notes... contain Canadian English text
  2. Canadian French versions of translated fields are stored as original field name + _fra, e.g. title_fra, notes_fra... contain Canadian French text
  3. Very few special characters are allowed in tags so we use a different approach. Tags associated with this dataset will contain Canadian English tag name + (two spaces) + Canadian French tag name, e.g.
    "Economics and Industry  Économie et industrie"
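
The three rules above could be applied by client code roughly as follows. This is a minimal sketch, assuming a dataset is available as a plain dictionary of field values; the helper names (`translated_field_name`, `get_translated`, `split_tag`, `join_tag`) are made up for illustration and are not CKAN API calls.

```python
PRIMARY_LANGUAGE = "eng"  # assumption: the first language in the language field


def translated_field_name(field, language):
    """Return the field name holding `field` in `language`.

    The primary language uses the plain name; others append '_' + code.
    """
    if language == PRIMARY_LANGUAGE:
        return field
    return "%s_%s" % (field, language)


def get_translated(dataset, field, language):
    """Look up a translated field in a dataset dictionary (None if absent)."""
    return dataset.get(translated_field_name(field, language))


def split_tag(tag):
    """Split a combined tag on the first run of two spaces."""
    parts = tag.split("  ", 1)
    if len(parts) != 2:
        raise ValueError("tag has no two-space separator: %r" % tag)
    return {"eng": parts[0], "fra": parts[1]}


def join_tag(english, french):
    """Build a combined tag from its English and French names."""
    return "%s  %s" % (english, french)
```

With a dataset such as `{"title": "Economics", "title_fra": "Économie"}`, `get_translated(dataset, "title", "fra")` reads the `title_fra` value, and `split_tag` separates the example tag above into its English and French halves.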
    

Some of our "translated fields" are actually URLs, where the different language versions are URLs pointing to information in the correct language. These are used for linking to the web site for the program responsible for the dataset or for supporting human-readable materials, not for the actual resource URLs.

Limitations

This approach would allow aggregation of datasets in multiple languages, but tags are problematic because multiple languages are wedged into the same string. Storing each language in its own tag vocabulary instead, however, would not allow free-form tags, nor would it associate the different language versions of the same tag with each other.
