Awesome Georgian NLP Resources 🇬🇪
A comprehensive, curated list of Natural Language Processing resources for the Georgian language (ქართული ენა, ISO 639-1: ka, ISO 639-3: kat). Georgian is a Kartvelian language spoken by ~4 million people, written in the unique Mkhedruli script. It remains classified as a low-resource language for NLP, though resources have expanded significantly since 2022.
Current version generated with Claude Deep Research.
| Name | Description | Size | License | Link |
| --- | --- | --- | --- | --- |
| Georgian National Corpus (GNC) | Diachronic corpus spanning ~1,600 years (5th c.–present), covering Modern/Middle/Old Georgian, Mingrelian, and Svan. MSD-tagged and lemmatized via Constraint Grammar. Developed by Goethe University Frankfurt & University of Bergen. | ~217M words (~20M morphologically annotated) | CC-BY-NC | gnc.gov.ge · CLARINO |
| Georgian Language Corpus (GLC) | Monolingual and bilingual corpus from Ilia State University (2009–2016). Includes Old/Middle/New Georgian texts (1832–2012). Each word tagged with lemma and morphosyntactic description. | >100M word-forms | Academic | corpora.iliauni.edu.ge |
| Georgian Wikipedia Dumps | Full Georgian Wikipedia articles, regularly updated. ~188K articles as of late 2025. | ~188K articles | CC-BY-SA 3.0 | dumps.wikimedia.org/kawiki · HuggingFace |
| OSCAR Corpus (ka) | Web-crawled multilingual corpus from Common Crawl with language identification. Georgian subset available across multiple versions. | Varies by version | CC0 annotations / CC ToU for text | HuggingFace OSCAR-2301 · oscar-project.org |
| CC-100 (ka) | Monolingual data from Common Crawl (2018) extracted via CCNet pipeline. Used for XLM-R training. | 1.1 GB | Common Crawl ToU | data.statmt.org/cc-100 · HuggingFace |
| mC4 (ka) | Multilingual C4 corpus used for mT5 pre-training. Georgian subset included. Quality may vary for low-resource languages. | Varies | Common Crawl ToU | HuggingFace |
| CulturaX (ka) | Cleaned combination of mC4 + OSCAR. 6.3 trillion tokens across 167 languages including Georgian. | Part of 6.3T tokens | Research | HuggingFace |
| GeoWordsDatabase | Database of ~310,000 unique Georgian words in MySQL format. | ~310K words | Open | GitHub · Web |
| RichNachos Georgian Corpus | Community-contributed Georgian text corpus on HuggingFace. | - | - | HuggingFace |
| Georgian Dialect Corpus | Dialectal data integrated into the GNC. Covers geographical varieties of Georgian. | - | Academic | Integrated into GNC |
| geo-words | Georgian words database (txt, dic, sql) + CLI web crawler. | - | Open | GitHub (akalongman) |
| Name | Description | Size | License | Link |
| --- | --- | --- | --- | --- |
| OPUS Collection (ka) | Largest open collection of parallel corpora. Georgian available in OpenSubtitles, WikiMatrix, CCAligned, GNOME/KDE/Ubuntu, Tanzil, QED, Tatoeba, and more. | Multiple sub-corpora | Varies | opus.nlpl.eu |
| OPUS-100 (en-ka) | English-Georgian parallel pairs from OPUS. | - | Open | HuggingFace |
| FLoRes-200 / FLORES+ | Meta's n-way parallel MT benchmark including Georgian (kat_Geor). | ~2,000 sentences | CC-BY-SA 4.0 | GitHub · HuggingFace |
| GLC Bilingual Sub-corpora | Georgian-English parallel "Vepkhistkaosani" (The Knight in the Panther's Skin) and Georgian-Armenian "Kartlis Tskhovreba" (Georgian Chronicles). | - | Academic | corpora.iliauni.edu.ge |
| tbilisi-ai-lab/en-ka-human-translated | Human-translated EN↔KA parallel pairs from Tbilisi AI Lab. | 5K pairs | - | HuggingFace |
| Name | Description | Size | License | Link |
| --- | --- | --- | --- | --- |
| UD_Georgian-GLC | First Georgian treebank in Universal Dependencies framework. Based on GLC sentences and 3,013 Wikipedia sentences across 131 scientific domains. CoNLL-U format. | ~60K tokens (3,013 sentences) | CC BY-SA | GitHub · UD page |
| UD_Georgian-GNC | Treebank from the Georgian National Corpus texts (novels and news). Uses finite-state morphological analyzer + Constraint Grammar, manually corrected. | ~22K tokens | UD license | GitHub |
| GRUG Parallel Treebank | Georgian-Russian-Ukrainian-German parallel treebank. Syntactically annotated using TIGER guidelines. Viewable via Stockholm TreeAligner. | 4 monolingual + 4 parallel treebanks | CC-BY 3.0 | CLARIN-D |
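The UD treebanks above are distributed in CoNLL-U, a plain-text format with ten tab-separated columns per token, blank lines between sentences, and `#` comment lines. The following is a minimal reading sketch using only the standard library. The sample sentence is hand-written for illustration ("This is a book."); it is not taken from either treebank, and its lemma and relation choices are assumptions that may not match the treebanks' actual annotation conventions.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):      # comment / metadata line, skipped
            continue
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):  # ignore malformed lines
                current.append(dict(zip(FIELDS, cols)))
    if current:                         # file may lack a trailing blank line
        sentences.append(current)
    return sentences


# Hand-written toy sentence, NOT taken from UD_Georgian-GLC or UD_Georgian-GNC.
SAMPLE = (
    "# text = ეს არის წიგნი.\n"
    "1\tეს\tეს\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tარის\tყოფნა\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tწიგნი\tწიგნი\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

sentences = parse_conllu(SAMPLE)
forms = [tok["form"] for tok in sentences[0]]
```

For real work, a maintained CoNLL-U library is likely a better choice; the sketch only shows how little structure the format has.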
| Name | Description | License | Link |
| --- | --- | --- | --- |
| WikiANN / PAN-X (ka) | Multilingual NER dataset from Wikipedia. Georgian is one of 176 languages. Tags: LOC, PER, ORG in IOB2 format. | Research use | HuggingFace |
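In IOB2 tagging, every entity starts with a `B-` tag, continues with `I-` tags of the same type, and `O` marks tokens outside any entity. The sketch below turns a tagged token sequence into entity spans. The function name `iob2_spans` and the example sentence are made up for illustration and are not drawn from WikiANN.

```python
from typing import List, Tuple


def iob2_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_type, entity_text) pairs from IOB2-tagged tokens."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = None

    def flush() -> None:
        # Close the entity being built, if any.
        if current:
            spans.append((current_type, " ".join(current)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                    # a new entity begins
            flush()
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)                   # entity continues
        else:                                       # "O" or a stray I- tag
            flush()
            current, current_type = [], None
    flush()
    return spans


# Hand-made example: "Ilia Chavchavadze was born in Kvareli".
tokens = ["ილია", "ჭავჭავაძე", "დაიბადა", "ყვარელში"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
entities = iob2_spans(tokens, tags)
```

A stray `I-` tag with no matching `B-` is treated as outside any entity here; stricter or more lenient handling is a design choice.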
Sentiment & Classification Datasets
| Name | Description | License | Link |
| --- | --- | --- | --- |
| JRC Georgian Sentiment Dataset | First publicly available annotated dataset for Georgian sentiment classification + semantic polarity dictionary. 3-label and 4-label settings. From the European Commission Joint Research Centre. | Open | JRC Data Catalogue |
| senti_lex (ka) | Sentiment lexicons for 81 languages including Georgian. | - | HuggingFace |
| Name | Description | Size | License | Link |
| --- | --- | --- | --- | --- |
| Mozilla Common Voice (ka) | Crowd-sourced read speech with transcriptions. Primary resource for Georgian ASR. | ~76h validated | CC-0 | commonvoice.mozilla.org · HuggingFace (v17) |
| FLEURS (ka_ge) | Google's Few-shot Learning Evaluation of Universal Representations of Speech. N-way parallel speech benchmark in 102 languages. | ~12h per language | CC-BY 4.0 | HuggingFace |
| IARPA Babel Georgian (LDC2016S12) | Conversational and scripted telephone speech (Eastern/Western dialects). Equal gender distribution, ages 16–73. | ~190 hours | LDC license | LDC Catalog |
| MATERIAL Georgian-English (LDC2025S01) | Georgian-English ASR and MT resources for cross-lingual information retrieval. IARPA MATERIAL program. | - | LDC license | LDC2025S01 |
| OpenSLR 153 | Georgian crowd-sourced speech data, released as part of the effort that reached 5.73% WER for Georgian ASR. | - | CC-BY-SA 4.0 | OpenSLR |
| CommonLanguage (SpeechBrain) | Speech recordings from CommonVoice for 45 languages including Georgian, curated for language identification. | ~1h Georgian | CC-0 | HuggingFace |
Tbilisi AI Lab Datasets (Instruction-Tuning & Benchmarks)
The Tbilisi AI Lab released 19+ datasets for training and evaluating Georgian LLMs (October 2025):
| Name | Size | Description | Link |
| --- | --- | --- | --- |
| kona-sft-mix-2.6M | 2.61M pairs | Instruction/SFT training mix | HuggingFace |
| kona-dpo-mix-387k | 387K pairs | DPO preference alignment data | HuggingFace |
| kona-sft-function-calling-ka-93k | 93K | Function-calling SFT data (Georgian) | HuggingFace |
| kona-sft-function-calling-115k | 115K | Function-calling SFT data (English) | HuggingFace |
| wiki-ka-QA | 42.6K | Wikipedia-based QA in Georgian | HuggingFace |
| code-instruct-ka | 61.3K | Code instruction in Georgian | HuggingFace |
| math-instruct-ka | 32.4K | Math instruction in Georgian | HuggingFace |
| learnlm-chat-ka | 5.86K | Educational chat data in Georgian | HuggingFace |
| ai2_arc-ka | 1.68K | ARC benchmark translated to Georgian | HuggingFace |
| boolq-ka | 3.27K | BoolQ benchmark in Georgian | HuggingFace |
| commonsense_qa-ka | 1.22K | CommonsenseQA in Georgian | HuggingFace |
Browse all: huggingface.co/tbilisi-ai-lab
Evaluation & Benchmark Datasets
| Name | Description | Link |
| --- | --- | --- |
| Georgian Case-Alignment Syntactic Tests | 370 syntactic tests for evaluating LMs on Georgian split-ergative case system (nominative-dative, ergative-nominative, dative-nominative). Generated from GLC UD treebank. | HuggingFace · GitHub |
| GeoLogicQA | 100-question benchmark for evaluating LLM logical reasoning in Georgian. From TSU. | ACL Anthology |
Georgian-Specific Language Models
| Name | Architecture | Params | Training Data | Tasks | Link |
| --- | --- | --- | --- | --- | --- |
| Kona2-12B | Causal LM | 12B | Georgian-first training data | Text generation | HuggingFace |
| Kona2-12B-Instruct | Causal LM | 12B | SFT + DPO alignment | Instruction following, function calling | HuggingFace |
| Kona2-12B-Base | Causal LM | 12B | Pre-instruct base model | Base model | HuggingFace |
| Kona2-small-3.8B | Causal LM | 3.8B | Georgian-first training data | Text generation | HuggingFace |
| mGPT-1.3B-Georgian | GPT-2 | 1.3B | Wikipedia + C4, fine-tuned 10K steps on Georgian | Text generation | HuggingFace |
| electra-ka | ELECTRA | BERT-base | 33GB Georgian text from ~4.85M CommonCrawl pages | Feature extraction, fine-tuning base | HuggingFace · GitHub |
| georgian-distilbert-mlm | DistilBERT | base | mC4 Georgian subset | Fill-mask, feature extraction | HuggingFace |
| gpt2-ka-wiki | GPT-2 | small | Georgian Wikipedia | Text generation | HuggingFace |
| gpt2-geo | GPT-2 | small | Georgian text (limited training) | Text generation | HuggingFace |
Note: The Kona2 family from Tbilisi AI Lab (released October 2025) represents the most comprehensive Georgian-first LLM effort to date.
Fine-Tuned Georgian Task Models
| Name | Base Model | Task | Link |
| --- | --- | --- | --- |
| electra-ka-discrediting | electra-ka | Discrediting text detection | HuggingFace |
| electra-ka-fake-news-tagging | electra-ka | Fake news classification | HuggingFace |
| stefan-it Georgian NER Models | XLM-R Large + Flair | Named entity recognition (LOC, PER, ORG) | GitHub · HuggingFace collection |
Georgian Translation Models
| Name | Architecture | Direction | Link |
| --- | --- | --- | --- |
| opus-mt-ka-en | Marian/Transformer | Georgian → English | HuggingFace |
| opus-mt-en-ka | Marian/Transformer | English → Georgian | HuggingFace |
| english-georgian | T5-small (fine-tuned) | English → Georgian | HuggingFace |
Georgian Speech Recognition Models
| Name | Architecture | Training Data | WER | Link |
| --- | --- | --- | --- | --- |
| NVIDIA stt_ka_fastconformer | FastConformer Hybrid CTC-Transducer (~115M) | Common Voice + FLEURS (~163h) | 5.73% | HuggingFace |
| whisper-large-v2-ka | Whisper Large V2 | Common Voice 11.0 (ka) | 31.85% | HuggingFace |
| wav2vec2-xlsr-georgian (sammy786) | Wav2Vec2-XLS-R-1B | Common Voice 8.0 (ka) | - | HuggingFace |
| wav2vec2-large-xlsr-georgian (m3hrdadfi) | Wav2Vec2-XLSR-53 | Common Voice (ka) | - | HuggingFace |
| wav2vec2-large-xlsr-georgian (xsway) | Wav2Vec2-XLSR-53 | Common Voice (ka) | - | HuggingFace |
Note: The NVIDIA FastConformer model (5.73% WER) is the current state of the art for Georgian ASR, significantly outperforming Whisper Large V3 and Meta Seamless.
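The WER figures above are word error rates: the word-level edit distance between the reference transcript and the model output (substitutions + deletions + insertions), divided by the number of reference words. A minimal sketch of that computation follows. It splits on whitespace only; the text normalization behind the reported numbers (punctuation, casing) is not stated in this list, so results from this sketch are not directly comparable to them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
example = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.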
| Name | Type | Dimensions | Training Data | Link |
| --- | --- | --- | --- | --- |
| fastText Common Crawl+Wikipedia (ka) | CBOW with character n-grams | 300 | Common Crawl + Wikipedia | fasttext.cc → cc.ka.300.bin |
| fastText Wikipedia (ka) | Skip-gram | 300 | Wikipedia | fasttext.cc → wiki.ka |
| georgian-word2vec | Word2Vec (Gensim) | - | Georgian Wikipedia dump | GitHub |
| Georgian_Word_Embedding | FastText + Word2Vec | - | Georgian text | GitHub |
| ConceptNet Numberbatch | Hybrid (word2vec + GloVe + ConceptNet) | 300 | Multilingual, includes ka | GitHub |
Multilingual Models with Georgian Support
These major multilingual models include Georgian in their training data and can be used for Georgian NLP tasks directly or via fine-tuning:
Encoders (BERT-family):
| Name | Languages | Link |
| --- | --- | --- |
| mBERT (bert-base-multilingual-cased) | 104 languages (incl. ka) | HuggingFace |
| XLM-RoBERTa Base | 100 languages (incl. ka) | HuggingFace |
| XLM-RoBERTa Large | 100 languages (incl. ka) | HuggingFace |
| XLM-RoBERTa XL | 100 languages (incl. ka), 3.5B params | HuggingFace |
Generative:
| Name | Languages | Link |
| --- | --- | --- |
| mGPT (ai-forever) | 61 languages (incl. ka), 1.3B | HuggingFace |
| mGPT-13B | 60+ languages (incl. ka), 13B | HuggingFace |
Translation:
| Name | Languages | Link |
| --- | --- | --- |
| NLLB-200 (Meta) | 200 languages (Georgian: kat_Geor) | 600M · 1.3B · 3.3B |
| M2M-100 (Meta) | 100 languages (incl. ka) | 418M · 1.2B |
| SMaLL-100 | 100 languages (incl. ka) | HuggingFace |
Speech:
| Name | Languages | Link |
| --- | --- | --- |
| Whisper (OpenAI) | 96+ languages (incl. ka) | Large V2 · Large V3 |
| XLS-R (Meta) | 128 languages, base for fine-tuning | 1B |
| Name | Description | Language | Status | Link |
| --- | --- | --- | --- | --- |
| Stanza (Stanford NLP) | Full NLP pipeline for Georgian: tokenization, POS tagging, lemmatization, dependency parsing. Uses UD 2.15 models. `stanza.download("ka")` | Python | ✅ Active (2025) | GitHub |
| spaCy (via spacy-stanza) | Georgian support via spacy-stanza bridge. `spacy_stanza.load_pipeline("xx", lang="ka")` | Python | ✅ Active | spacy-stanza |
| Anbani.py | Georgian toolkit: script conversion (Mkhedruli, Asomtavruli, Nuskhuri), Latin↔Georgian transliteration, text classification. `pip install anbani` | Python | ✅ Active | GitHub |
| Anbani.js | Script conversion, transliteration, Lorem Ipsum, letter frequency analysis, Friedman index. | JavaScript | ✅ Active | GitHub |
| QartNLPWebService | Georgian NLP Toolset (Flask web service). Developed at Ilia State University / Unilab. | Python | ⚠️ Last updated Aug 2022 | GitHub |
| Georgian Language Toolkit | Latinize/Georgianize strings, language detection (ka/en), morphological operations, Django slug generation. | Ruby, Python | ⚠️ Last updated Mar 2021 | GitHub |
| georgian-linguistics-tools | UTF-8 Georgian text handling, Latin transcription for C++ applications. | C++ | ⚠️ Likely unmaintained | GitHub |
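The Stanza entry above can be exercised roughly as follows. This is a sketch, not a verified recipe: it assumes the `stanza` package is installed, that the installed version ships Georgian (`ka`) models as this list states, and that network access is available for the one-time model download. The helper names `rows_from_doc` and `analyze` are hypothetical; `stanza` is imported lazily so that the helper can be used without it.

```python
def rows_from_doc(doc):
    """Flatten a Stanza Document into (text, lemma, upos, head, deprel) rows."""
    return [
        (word.text, word.lemma, word.upos, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def analyze(text, lang="ka"):
    """Download the models for `lang` if needed, then annotate `text`.

    Assumption: Georgian models exist for the installed Stanza version.
    """
    import stanza  # lazy import: only needed when actually annotating

    stanza.download(lang)         # fetches models on first use
    nlp = stanza.Pipeline(lang)   # default processors for the language
    return rows_from_doc(nlp(text))
```

Calling `analyze("ეს არის წიგნი.")` would then return one row per word; the exact tags depend on the downloaded model, so no expected output is shown.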
| Name | Description | Link |
| --- | --- | --- |
| FST Morphological Analyzer (Lobzhanidze) | Comprehensive finite-state analyzer/generator for Modern Georgian using XEROX tools (xfst, lexc). Covers all POS and verb paradigms. Used in GLC annotation. | Springer Book (2022) |
| GNC Morphological Analyzer (Meurer) | FST analyzer + Constraint Grammar disambiguation for Old/Middle/Modern Georgian in the GNC. | Documentation |
| FST + FCFG Parser (Kapanadze) | Finite-state morphological transducer/POS-tagger combined with Feature-Based CFG parser for syntactic chunking. | TbiLLC 2023 Paper |
| UniMorph (Georgian) | Morphological paradigm tables for Georgian, including polypersonal verb agreement. | unimorph.github.io |
| Name | Description | Link |
| --- | --- | --- |
| ka_GE.spell (Hunspell) | Georgian orthographic spell-checking dictionary for Firefox, LibreOffice, Chrome. Auto-generated word lists from web crawling. MIT License. | GitHub |
| Georgian Seq2Seq Spellchecker | Character-level spellchecker using GRU Seq2Seq model with synthetic typo dataset. | GitHub (Dec 2025) |
| gegram-class | Library for replacing barbarisms in Georgian sentences. | Java |
Transliteration & Script Tools
| Name | Description | Link |
| --- | --- | --- |
| translitit-latin-to-mkhedruli-georgian | Latin → ქართული (Mkhedruli) transliteration function. | JavaScript |
| translitit-mkhedruli-georgian-to-ipa | Mkhedruli Georgian → IPA transliteration function. | JavaScript |
| KartuliChromeExtension | Chrome extension converting English letters to Georgian equivalents. | Chrome Web Store |
| kautilities | Convert Georgian ↔ Latin letters. | PHP |
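Because Mkhedruli maps one letter to one sound, Georgian → Latin conversion reduces to a per-character lookup. The sketch below follows the Georgian national romanization system as best understood here (ejectives marked with an apostrophe); it is not the mapping used by any of the tools listed above, and the use of a plain ASCII apostrophe is a simplifying assumption.

```python
# Mkhedruli → Latin, following the national romanization system.
# Assumption: ejectives are written with an ASCII apostrophe.
ROMANIZATION = {
    "ა": "a", "ბ": "b", "გ": "g", "დ": "d", "ე": "e", "ვ": "v",
    "ზ": "z", "თ": "t", "ი": "i", "კ": "k'", "ლ": "l", "მ": "m",
    "ნ": "n", "ო": "o", "პ": "p'", "ჟ": "zh", "რ": "r", "ს": "s",
    "ტ": "t'", "უ": "u", "ფ": "p", "ქ": "k", "ღ": "gh", "ყ": "q'",
    "შ": "sh", "ჩ": "ch", "ც": "ts", "ძ": "dz", "წ": "ts'",
    "ჭ": "ch'", "ხ": "kh", "ჯ": "j", "ჰ": "h",
}


def romanize(text: str) -> str:
    """Replace each modern Mkhedruli letter; leave other characters as-is."""
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)


country = romanize("საქართველო")
capital = romanize("თბილისი")
```

The reverse direction (Latin → Mkhedruli) is harder because digraphs such as `ts'` and `ch` must be matched longest-first, which is one reason dedicated tools exist for it.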
| Name | Description | Link |
| --- | --- | --- |
| Tesseract OCR (kat) | Google's Tesseract supports Georgian via trained data (kat.traineddata). | tessdata |
| tesseract-georgian | Training data for Tesseract on Georgian, derived from Wikipedia dumps. Includes wordlists and bigrams. | GitHub |
| Name | Type | Description | Link |
| --- | --- | --- | --- |
| NVIDIA FastConformer Georgian | STT | State-of-the-art Georgian ASR (5.73% WER). NeMo toolkit. CC-BY-4.0. | HuggingFace |
| ElevenLabs Georgian TTS | TTS | Neural TTS using Eleven Multilingual v2/v3 models. Supports voice cloning. | elevenlabs.io |
| ElevenLabs Scribe Georgian | STT | Georgian transcription (5–10% WER). Speaker diarization support. | elevenlabs.io |
| Google Cloud STT (ka-GE) | STT | Georgian speech-to-text via Google Cloud API. | cloud.google.com |
| eSpeak NG | TTS | Open-source formant-based TTS. Georgian supported but at early stage with limited quality. | GitHub |
| Georgian eSpeak Chrome Extension | TTS | Browser extension reading Georgian text aloud using eSpeak.js. | GitHub |
| Georgian-TTS | TTS | Georgian text-to-speech synthesis system (research). | GitHub |
| KaRead | TTS | Experimental TTS for Georgian using Fourier transform letter frequency analysis. | GitHub |
Note: Google Cloud TTS does not currently offer Georgian voice synthesis (only STT). For neural TTS, ElevenLabs is currently the primary commercial option.
| Name | Description | Link |
| --- | --- | --- |
| Google Translate | Georgian fully supported (GNMT). | translate.google.com |
| Google Cloud Translation API | Georgian supported (NMT model). | cloud.google.com |
| Meta NLLB-200 | 200 languages including Georgian (kat_Geor). Open-sourced. | ai.meta.com |
| Microsoft Translator | Georgian supported. | microsoft.com |
| Lingvanex | Commercial Georgian NLP services (tokenization, NER, sentiment, MT). | lingvanex.com |
| Name | Description | Language | Link |
| --- | --- | --- | --- |
| GeoParaphrase (Gadawere) | First Georgian paraphrasing/summarization tool (~10K+ users). | React + Express | GitHub |
| num2geotext | Convert numbers to Georgian text and currency. | Python | GitHub |
| dimakura/ka | Common functionality for Georgian projects. | Ruby | |
| dimakura/ka.js | Georgian language support for Node.js. | JavaScript | |
| Stichoza/money-num-to-string | Convert numbers/money to localized Georgian strings. | PHP, JS | |
| Declensions for Georgian | Generate declensions for Georgian words. | - | |
Morphology & Morphological Analysis
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| A Finite-State Model of Georgian Verbal Morphology | Gurevich | 2006 | NAACL 2006 | ACL Anthology |
| Describing Georgian Morphology with a Finite-State System | Kapanadze | 2010 | FSMNLP 2009, Springer LNCS | Springer |
| Morphological Reinflection with Multiple Arguments: An Extended Annotation Schema and a Georgian Case Study | Guriel, Goldman, Tsarfaty | 2022 | ACL 2022 (Short Papers) | ACL Anthology |
| Universal Morphologies for the Caucasus Region | Chiarcos, Donandt, Ionov, Rind-Pawlowski et al. | 2018 | LREC 2018 | ACL Anthology |
| Automatic Morphological Analysis and Syntactic Parsing for the Georgian Language | Kapanadze, Kapanadze | 2026 | TbiLLC 2023, Springer | Springer |
| Finite-State Computational Morphology: An Analyzer and Generator for Georgian | Lobzhanidze | 2022 | Springer (book) | Springer |
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| Building a Universal Dependencies Treebank for Georgian | Lobzhanidze, Magradze, Berikashvili et al. | 2024 | TLT 2024 | ACL Anthology |
| Building Resources for Georgian Treebanking-Based NLP | Kapanadze, Kotzé, Hanneforth | 2022 | TbiLLC 2019, Springer LNCS | ACM/Springer |
| A Computational Grammar for Georgian | Kapanadze | 2009 | Logic, Language, and Computation, Springer | Springer |
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| A Comparison of Different Tokenization Methods for the Georgian Language | Mikaberidze, Saghinadze, Mikaberidze, Kalandadze, Pkhakadze, van Genabith, Ostermann, van der Plas, Müller | 2024 | ICNLSP 2024 | ACL Anthology |
Sentiment Analysis & Text Classification
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| Resources and Experiments on Sentiment Classification for Georgian | Stefanovitch, Piskorski, Kharazi | 2022 | LREC 2022 | ACL Anthology |
| Toxicity Detection in Online Georgian Discussions | Lashkarashvili, Tsintsadze | 2022 | Int'l Journal of Information Management Data Insights | Elsevier |
Lemmatization & POS Tagging
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| Lemmatization and POS-tagging process by using joint learning approach (Classical Armenian, Old Georgian, Syriac) | Vidal-Gorène, Kindt | 2020 | LT4HALA (LREC 2020) | ACL Anthology |
| Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac | Vidal-Gorène et al. | 2026 | arXiv | arXiv:2602.15753 |
| A theory for words in Georgian: traditional constructs versus corpus annotation | Daraselia et al. | 2024 | Corpus Linguistics and Linguistic Theory | De Gruyter |
Word Sense Disambiguation
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| Homonym Sense Disambiguation in the Georgian Language | - | 2024 | arXiv | arXiv:2405.00710 |
LLM Evaluation & Syntactic Evaluation
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment | - | 2026 | arXiv | arXiv:2602.10661 |
| GeoLogicQA – A Benchmark for Evaluating Logical Reasoning in Georgian for LLMs | Koberidze, Elizbarashvili, Tsintsadze | 2025 | LowResNLP @ RANLP 2025 | ACL Anthology |
| Evaluating and Mitigating Linguistic Discrimination in Large Language Models | - | 2024 | arXiv | arXiv:2404.18534 |
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| Developing Robust Georgian ASR with FastConformer Hybrid Transducer CTC BPE | - | 2024 | NVIDIA Blog + arXiv | NVIDIA Blog · arXiv:2501.14788 |
| Fast Multi-language LSTM-based Online Handwriting Recognition | Carbune et al. | 2020 | IJDAR / arXiv | arXiv:1902.10525 |
OCR & Handwriting Recognition
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| On Georgian Handwritten Character Recognition | Tsintsadze et al. | 2018 | IFAC-PapersOnLine | ScienceDirect |
| Optical Character Recognition Tool for Georgian Handwritten Text Recognition Based on YOLOv8 | - | ~2024 | ResearchGate | ResearchGate |
Corpus & Language Resources
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| Structuring a Diachronic Corpus: The Georgian National Corpus Project | Gippert et al. | 2012–2016 | EURALEX 2016 and others | CLARINO |
| Enhancement Possibilities for the Georgian National Corpus | Kamarauli | 2024 | Caucasus Journal of Social Sciences | Journal |
| Creating Corpus for Georgian Language Modelling | - | - | OpenReview | OpenReview |
Multilingual Works Including Georgian
| Title | Authors | Year | Venue | Link |
| --- | --- | --- | --- | --- |
| mGPT: Few-shot learners go multilingual | Shliazhko et al. | 2024 | Computational Linguistics | - |
| UniMorph 3.0: Universal Morphology | McCarthy, Kirov et al. | 2020 | LREC 2020 | - |
| Functional and Cognitive Analysis of Grammar in Georgian Using the BERT Model | - | ~2024 | Language and Culture journal | 4science.ge |
| Name | Type | Description | Link |
| --- | --- | --- | --- |
| Georgian Language Model (BiLSTM) | GitHub Project | Compares n-gram (perplexity 415K), Transformer (729), and BiLSTM (24) for Georgian text generation. Includes trained word2vec models. | GitHub |
| electra-ka Training Code | GitHub Repo | Code and instructions for training ELECTRA on 33GB Georgian text. Includes fine-tuning examples for sequence classification. | GitHub |
| Tokenization Comparison Code | GitHub Repo | Code for comparing BPE, WordPiece, SentencePiece tokenizers on Georgian downstream tasks. 30 stars, active (Jan 2026). | ACL Paper |
| NLP Text Classification for Georgian Medical Records | Research Paper | SVM/KNN text classification for Georgian medical records; includes Georgian stemming and stop-word removal. | PMC |
| NVIDIA Georgian ASR Blog | Tutorial/Blog | Step-by-step development of Georgian ASR with FastConformer. Covers data preparation, tokenizer creation, training. | NVIDIA Blog |
Conferences with Georgian NLP Content
| Name | Description | Link |
| --- | --- | --- |
| TbiLLC | International Tbilisi Symposium on Logic, Language, and Computation. Biennial. Proceedings in Springer LNCS. | Events page · Springer |
| LMT Tbilisi 2022 | Computational Modeling of Language conference at TSU Arnold Chikobava Institute of Linguistics. | LinguistList |
| LowResNLP (RANLP) | Workshop on low-resource NLP, featuring Georgian benchmarks. | ACL Anthology |
Georgian uses three Unicode blocks covering four script styles:
| Block | Range | Assigned characters | Contents |
| --- | --- | --- | --- |
| Georgian | U+10A0–U+10FF | 88 | Asomtavruli (capitals, U+10A0–U+10CF) + Mkhedruli (modern lowercase, U+10D0–U+10FF) |
| Georgian Supplement | U+2D00–U+2D2F | 40 | Nuskhuri (ecclesiastical lowercase) |
| Georgian Extended | U+1C90–U+1CBF | 46 | Mtavruli (modern capitals, added Unicode 11.0, June 2018) |
Modern Georgian uses 33 Mkhedruli letters (5 archaic letters are obsolete)
Georgian is primarily unicameral (no case distinction in standard usage), which simplifies text normalization for NLP
Official Unicode chart: U+10A0 PDF · U+1C90 PDF
Detailed orthography notes: r12a.github.io
Interactive codepoint explorer: symbl.cc · codepoints.net
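The block ranges above are enough to detect which script style a character belongs to and to fold Mtavruli capitals back to Mkhedruli before further processing. The sketch below does both with plain code point arithmetic; the function names are illustrative. It classifies by range only, so punctuation and unassigned code points inside a range (for example U+10FB, the Georgian paragraph separator) are reported under that range's style.

```python
# (first code point, last code point, script style), from the table above.
RANGES = [
    (0x10A0, 0x10CF, "Asomtavruli"),
    (0x10D0, 0x10FF, "Mkhedruli"),
    (0x2D00, 0x2D2F, "Nuskhuri"),
    (0x1C90, 0x1CBF, "Mtavruli"),
]

# Each Mtavruli capital sits at a fixed distance from its Mkhedruli letter.
MTAVRULI_OFFSET = 0x1C90 - 0x10D0


def script_style(ch: str):
    """Return the Georgian script style of one character, or None."""
    code_point = ord(ch)
    for first, last, name in RANGES:
        if first <= code_point <= last:
            return name
    return None


def fold_mtavruli(text: str) -> str:
    """Map Mtavruli capitals to Mkhedruli; leave everything else untouched."""
    return "".join(
        chr(ord(ch) - MTAVRULI_OFFSET) if script_style(ch) == "Mtavruli" else ch
        for ch in text
    )


styles = [script_style(ch) for ch in "\u10d0\u1c90\u2d00\u10a0a"]
folded = fold_mtavruli("\u1c90\u1c91 ok")
```

Folding Mtavruli early keeps vocabulary counts from being split between two renderings of the same word, which matters for tokenizers trained on small corpora.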
| Name | Description | Link |
| --- | --- | --- |
| Georgian QWERTY (most popular) | Standard keyboard layout for Georgian. | kbdlayout.info |
| Georgian Standard (JCUKEN-based) | Government standard layout. | Wikipedia |
| Keyman Georgian QWERTY | Cross-platform input method (Win/Mac/Linux/iOS/Android). | keyman.com |
| GeorgianCapital (Anbani) | Full keyboard including Mtavruli capitals for Windows. | GitHub |
| Branah Online Georgian Keyboard | Virtual keyboard with transliteration. | branah.com |
| Setup Guide (Wikibooks) | Georgian input on Windows, Mac, Linux. | Wikibooks |
| Name | Description | Link |
| --- | --- | --- |
| Noto Sans Georgian | Google's comprehensive sans-serif font for Georgian. Variable weights. | Google Fonts |
| Noto Serif Georgian | Google's serif font for Georgian. | Google Fonts |
| Noto Georgian (Variable Font) | Multiple widths and weights. | notofonts.github.io |
| FONTS.GE | "All Georgian fonts in one place": a comprehensive font repository. | fonts.ge |
| BPG InfoTech Fonts | Widely-used Unicode Georgian fonts (serif, sans-serif, monospace). | Referenced in multiple projects |
| georgian-webfonts (npm) | CSS package for Georgian web fonts. | GitHub (thecotne) |
Wikipedia & Web Data Availability
| Source | Size | Notes | Link |
| --- | --- | --- | --- |
| Georgian Wikipedia (ka.wikipedia) | ~188K articles | Founded Nov 2003; ~150K registered users | ka.wikipedia.org · Stats |
| CC-100 Georgian | 1.1 GB | From 2018 Common Crawl; used for XLM-R | data.statmt.org |
| OSCAR Georgian | Multiple versions | Available in OSCAR 19, 21.09, 22.01, 23.01 | oscar-project.org |
| mC4 Georgian | Part of mC4 | Quality may vary; audit recommended | HuggingFace |
| CulturaX Georgian | Part of 6.3T tokens | Cleaned mC4 + OSCAR | HuggingFace |
| Name | Description | Link |
| --- | --- | --- |
| Anbani.db | Georgian datasets: "Vepkhistkaosani" full text, aphorisms, poet/writer names, baby names, alphabet data. ⚠️ Last updated 2019. | GitHub |
| Gadatsqvetilebebi | Web spider and corpora importer for public legal decisions in Georgian. | Referenced in low-resource-languages |
| loremtyaosani | Georgian Lorem Ipsum: random lines from Vepkhistkaosani. | GitHub (safareli) |
| Epigraphic Corpus of Georgia | EpiDoc-standard digital epigraphy (Georgian, Urartian, Aramaic, Greek inscriptions). Ilia State University. | epigraphy.iliauni.edu.ge |
| Online Dictionary of Georgian Idioms | Digital idioms dictionary from Ilia State University. | idioms.iliauni.edu.ge |
| Megrelian Language Corpus | Corpus for the endangered Megrelian (Kartvelian family) with morpheme-level annotation. | xmf.iliauni.edu.ge |
| TITUS | Frankfurt-based digital archive of South Caucasian language materials (Georgian, Megrelian, Svan, Laz). | Goethe University Frankfurt |
| awesome-georgia | Curated list of Georgian libraries and packages (payments, i18n, fonts, NLP). 91 stars. | GitHub |
| low-resource-languages | Meta-list including Georgian tools section. | GitHub |
| awesome-georgian-datasets | Collection of datasets specific to Georgia. | GitHub |
Key Research Groups & Institutions
| Institution | Focus | Key People | Links |
| --- | --- | --- | --- |
| Tbilisi AI Lab | Georgian-first LLMs (Kona2 family), datasets, benchmarks | - | HuggingFace · ailab.ge |
| Ilia State University | GLC corpus, UD treebank, idiom dictionaries, epigraphy | Irina Lobzhanidze, Nino Doborjginidze | corpora.iliauni.edu.ge |
| TSU (Tbilisi State University) | LLM benchmarks (GeoLogicQA), toxicity detection, handwriting recognition | Magda Tsintsadze, Irakli Koberidze | - |
| TSU Arnold Chikobava Institute of Linguistics | Computational linguistics conferences | - | - |
| Goethe University Frankfurt | GNC project, TITUS archive, Caucasus languages | Jost Gippert | - |
| University of Bergen / CLARINO | GNC infrastructure, morphosyntactic analysis | Paul Meurer | clarino.uib.no |
| OK'OMPLEX (Tbilisi) | FST morphology, GRUG treebank, computational grammar | Oleg Kapanadze | - |
| Bar-Ilan University | Morphological reinflection for Georgian | David Guriel, Reut Tsarfaty | - |
| DFKI / Saarland University | Tokenization methods for Georgian | Beso Mikaberidze, Josef van Genabith | - |
| JRC (European Commission) | Georgian sentiment analysis, NER | Jakub Piskorski, Sopho Kharazi | - |
| UCLouvain | GREgORI Project for Old Georgian | Chahan Vidal-Gorène | - |
| Anbani | Open-source Georgian language tools | - | GitHub Org · anbani.ge |
Notable Gaps in Georgian NLP (as of early 2026)
No dedicated Georgian BERT model — the closest are electra-ka and georgian-distilbert-mlm; otherwise multilingual models (mBERT, XLM-R) must be used
No dedicated Georgian spaCy pipeline: Georgian must be processed through the spacy-stanza bridge
No Georgian WordNet exists
No Google Cloud TTS for Georgian (only STT is available)
Limited NER resources — WikiANN is the primary dataset; stefan-it provides fine-tuned models
Morphological analyzers remain largely academic/closed-source tools (FST-based, require XEROX tools)
Small UD treebanks — both GLC (~60K tokens) and GNC (~22K tokens) are relatively small by UD standards
Key NLP challenges: agglutinative morphology, split-ergative case system, polypersonal verb agreement, free word order, unique Mkhedruli script
This list aims to be comprehensive as of February 2026. Items marked with ⚠️ may be outdated or unmaintained. Contributions and corrections welcome.