Language Specifications

Aswin Pradeep edited this page Dec 29, 2022 · 3 revisions

The language codes used within ULCA datasets follow ISO 639, the recommended and standardised nomenclature for specifying languages. ISO 639 assigns languages lowercase abbreviations: two-letter codes in ISO 639-1 and three-letter codes in ISO 639-2 and ISO 639-3.

References

Refer to these links for further details :

The table below lists the ISO 639 codes used within ULCA, with the language label for reference.
The source of truth for this list is https://github.com/ULCA-IN/ulca/blob/master/specs/common-schemas.yml#SupportedLanguages
"mixed" and "unknown" are the only exceptions: they are not ISO 639 codes and were added to simplify contribution and search.

📔 Support for a new language (to be added to the spec) can be requested from the ULCA team.

| Code | Label | ISO Standard |
|------|-------|--------------|
| en | English | ISO 639-1 |
| hi | Hindi | ISO 639-1 |
| mr | Marathi | ISO 639-1 |
| ta | Tamil | ISO 639-1 |
| te | Telugu | ISO 639-1 |
| kn | Kannada | ISO 639-1 |
| gu | Gujarati | ISO 639-1 |
| pa | Punjabi | ISO 639-1 |
| bn | Bengali | ISO 639-1 |
| ml | Malayalam | ISO 639-1 |
| as | Assamese | ISO 639-1 |
| ks | Kashmiri | ISO 639-1 |
| ne | Nepali | ISO 639-1 |
| or | Odia | ISO 639-1 |
| sd | Sindhi | ISO 639-1 |
| si | Sinhala | ISO 639-1 |
| ur | Urdu | ISO 639-1 |
| sa | Sanskrit | ISO 639-1 |
| brx | Bodo | ISO 639-3 |
| doi | Dogri | ISO 639-3 |
| kok | Konkani | ISO 639-3 |
| mai | Maithili | ISO 639-3 |
| mni | Manipuri | ISO 639-3 |
| sat | Santali | ISO 639-3 |
| lus | Lushai | ISO 639-3 |
| njz | Ngungwel | ISO 639-3 |
| pnr | Panim | ISO 639-3 |
| kha | Khasi | ISO 639-3 |
| grt | Garo | ISO 639-3 |
| bho | Bhojpuri | ISO 639-3 |
| raj | Rajasthani | ISO 639-3 |
| gom | Goan | ISO 639-3 |
| awa | Awadhi | ISO 639-3 |
| hne | Chhattisgarhi | ISO 639-3 |
| mag | Magahi | ISO 639-3 |
| mwr | Marwari | ISO 639-3 |
| sjp | Surjapuri | ISO 639-3 |
| anp | Angika | ISO 639-3 |
| gbm | Garhwali | ISO 639-3 |
| tcy | Tulu | ISO 639-3 |
| hlb | Halbi | ISO 639-3 |
| bih | Bihari | ISO 639-2/5 |
| bns | Bundeli | ISO 639-3 |
| unknown | Unknown | NA |
| mixed | Mixed | NA |
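
When preparing a dataset, it can help to check language codes against this list before submission. The following is a minimal Python sketch of such a check, not part of ULCA itself: the names `SUPPORTED_LANGUAGES`, `NON_ISO_CODES`, `normalize_language_code` and `is_iso_code` are hypothetical, the codes and labels are copied by hand from the list above, and trimming and lowercasing the input is an assumption made for convenience. The spec linked above remains the source of truth.

```python
# Codes and labels copied from the list above. This is an illustrative
# snapshot; the ULCA spec (common-schemas.yml) is the source of truth.
SUPPORTED_LANGUAGES = {
    # ISO 639-1 (two-letter) codes
    "en": "English", "hi": "Hindi", "mr": "Marathi", "ta": "Tamil",
    "te": "Telugu", "kn": "Kannada", "gu": "Gujarati", "pa": "Punjabi",
    "bn": "Bengali", "ml": "Malayalam", "as": "Assamese", "ks": "Kashmiri",
    "ne": "Nepali", "or": "Odia", "sd": "Sindhi", "si": "Sinhala",
    "ur": "Urdu", "sa": "Sanskrit",
    # Three-letter codes (ISO 639-3, plus "bih" from ISO 639-2/5)
    "brx": "Bodo", "doi": "Dogri", "kok": "Konkani", "mai": "Maithili",
    "mni": "Manipuri", "sat": "Santali", "lus": "Lushai", "njz": "Ngungwel",
    "pnr": "Panim", "kha": "Khasi", "grt": "Garo", "bho": "Bhojpuri",
    "raj": "Rajasthani", "gom": "Goan", "awa": "Awadhi",
    "hne": "Chhattisgarhi", "mag": "Magahi", "mwr": "Marwari",
    "sjp": "Surjapuri", "anp": "Angika", "gbm": "Garhwali", "tcy": "Tulu",
    "hlb": "Halbi", "bih": "Bihari", "bns": "Bundeli",
    # Non-ISO placeholders
    "unknown": "Unknown", "mixed": "Mixed",
}

# The two entries that are not ISO 639 codes.
NON_ISO_CODES = {"unknown", "mixed"}


def normalize_language_code(code: str) -> str:
    """Return the trimmed, lowercase code if it is in the list above.

    Raises ValueError for anything not listed. Lowercasing the input is
    an assumption of this sketch, not documented ULCA behaviour.
    """
    cleaned = code.strip().lower()
    if cleaned not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return cleaned


def is_iso_code(code: str) -> bool:
    """True for listed ISO 639 codes, False for "unknown" and "mixed"."""
    return normalize_language_code(code) not in NON_ISO_CODES
```

For example, `normalize_language_code(" HI ")` returns `"hi"`, `SUPPORTED_LANGUAGES["brx"]` gives `"Bodo"`, and `is_iso_code("mixed")` is `False`. A code that is not in the list, such as `"xx"`, raises `ValueError`.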
