Awesome Georgian NLP Resources 🇬🇪

A comprehensive, curated list of Natural Language Processing resources for the Georgian language (ქართული ენა, ISO 639-1: ka, ISO 639-3: kat). Georgian is a Kartvelian language spoken by ~4 million people, written in the unique Mkhedruli script. It remains classified as a low-resource language for NLP, though resources have expanded significantly since 2022.

Current version generated with Claude Deep Research.

Datasets & Corpora

Text Corpora

Name	Description	Size	License	Link
Georgian National Corpus (GNC)	Diachronic corpus spanning ~1,600 years (5th c.–present), covering Modern/Middle/Old Georgian, Mingrelian, and Svan. MSD-tagged and lemmatized via Constraint Grammar. Developed by Goethe University Frankfurt & University of Bergen.	~217M words (~20M morphologically annotated)	CC-BY-NC	gnc.gov.ge · CLARINO
Georgian Language Corpus (GLC)	Monolingual and bilingual corpus from Ilia State University (2009–2016). Includes Old/Middle/New Georgian texts (1832–2012). Each word tagged with lemma and morphosyntactic description.	>100M word-forms	Academic	corpora.iliauni.edu.ge
Georgian Wikipedia Dumps	Full Georgian Wikipedia articles, regularly updated. ~188K articles as of late 2025.	~188K articles	CC-BY-SA 3.0	dumps.wikimedia.org/kawiki · HuggingFace
OSCAR Corpus (ka)	Web-crawled multilingual corpus from Common Crawl with language identification. Georgian subset available across multiple versions.	Varies by version	CC0 annotations / CC ToU for text	HuggingFace OSCAR-2301 · oscar-project.org
CC-100 (ka)	Monolingual data from Common Crawl (2018) extracted via CCNet pipeline. Used for XLM-R training.	1.1 GB	Common Crawl ToU	data.statmt.org/cc-100 · HuggingFace
mC4 (ka)	Multilingual C4 corpus used for mT5 pre-training. Georgian subset included. Quality may vary for low-resource languages.	Varies	Common Crawl ToU	HuggingFace
CulturaX (ka)	Cleaned combination of mC4 + OSCAR. 6.3 trillion tokens across 167 languages including Georgian.	Part of 6.3T tokens	Research	HuggingFace
GeoWordsDatabase	Database of ~310,000 unique Georgian words in MySQL format.	~310K words	Open	GitHub · Web
RichNachos Georgian Corpus	Community-contributed Georgian text corpus on HuggingFace.	—	—	HuggingFace
Georgian Dialect Corpus	Dialectal data integrated into the GNC. Covers geographical varieties of Georgian.	—	Academic	Integrated into GNC
geo-words	Georgian words database (txt, dic, sql) + CLI web crawler.	—	Open	GitHub (akalongman)

Parallel Corpora

Name	Description	Size	License	Link
OPUS Collection (ka)	Largest open collection of parallel corpora. Georgian available in OpenSubtitles, WikiMatrix, CCAligned, GNOME/KDE/Ubuntu, Tanzil, QED, Tatoeba, and more.	Multiple sub-corpora	Varies	opus.nlpl.eu
OPUS-100 (en-ka)	English-Georgian parallel pairs from OPUS.	—	Open	HuggingFace
FLoRes-200 / FLORES+	Meta's n-way parallel MT benchmark including Georgian (`kat_Geor`).	~2,000 sentences	CC-BY-SA 4.0	GitHub · HuggingFace
GLC Bilingual Sub-corpora	Georgian-English parallel "Vepkhistkaosani" (The Knight in the Panther's Skin) and Georgian-Armenian "Kartlis Tskhovreba" (Georgian Chronicles).	—	Academic	corpora.iliauni.edu.ge
tbilisi-ai-lab/en-ka-human-translated	Human-translated EN↔KA parallel pairs from Tbilisi AI Lab.	5K pairs	—	HuggingFace

Treebanks

Name	Description	Size	License	Link
UD_Georgian-GLC	First Georgian treebank in Universal Dependencies framework. Based on GLC sentences and 3,013 Wikipedia sentences across 131 scientific domains. CoNLL-U format.	~60K tokens (3,013 sentences)	CC BY-SA	GitHub · UD page
UD_Georgian-GNC	Treebank from the Georgian National Corpus texts (novels and news). Uses finite-state morphological analyzer + Constraint Grammar, manually corrected.	~22K tokens	UD license	GitHub
GRUG Parallel Treebank	Georgian-Russian-Ukrainian-German parallel treebank. Syntactically annotated using TIGER guidelines. Viewable via Stockholm TreeAligner.	4 monolingual + 4 parallel treebanks	CC-BY 3.0	CLARIN-D
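
Both UD treebanks above are distributed in CoNLL-U, a tab-separated format with ten columns per token. The following is a minimal reading sketch using only the standard library. The function name `parse_conllu` and the `SAMPLE` sentence are illustrative assumptions: the sentence is hand-written and its annotation is not taken from either treebank.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():              # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):          # sentence-level comment (sent_id, text, ...)
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:                           # file may not end with a blank line
        sentences.append(current)
    return sentences


# Hand-written toy sentence ("This is a book"), for illustration only.
SAMPLE = (
    "# text = ეს არის წიგნი\n"
    "1\tეს\t_\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tარის\t_\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tწიგნი\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
)

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

For real work, a dedicated CoNLL-U library is likely a better choice; this sketch only shows the shape of the data.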

NER Datasets

Name	Description	License	Link
WikiANN / PAN-X (ka)	Multilingual NER dataset from Wikipedia. Georgian is one of 176 languages. Tags: LOC, PER, ORG in IOB2 format.	Research use	HuggingFace
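
The IOB2 scheme mentioned above marks the first token of each entity with `B-` and continuation tokens with `I-`, with `O` for everything else. A small sketch of how such tags can be turned back into entity spans follows. The function name `iob2_to_spans` and the example sentence are assumptions made for illustration, and the example is hand-written rather than drawn from WikiANN.

```python
from typing import List, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) pairs from parallel token and IOB2 tag lists."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Tuple[str, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and kind == label
        if continues:
            continue
        # Anything other than a matching I- tag closes the open entity.
        if label is not None:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        # A stray I- tag without a preceding B- is treated leniently as a start.
        if prefix in ("B", "I"):
            start, label = i, kind
    if label is not None:                 # entity running to the end of the sentence
        spans.append((label, " ".join(tokens[start:])))
    return spans


if __name__ == "__main__":
    tokens = ["ილია", "ჭავჭავაძე", "დაიბადა", "ყვარელში"]
    tags = ["B-PER", "I-PER", "O", "B-LOC"]
    print(iob2_to_spans(tokens, tags))
```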

Sentiment & Classification Datasets

Name	Description	License	Link
JRC Georgian Sentiment Dataset	First publicly available annotated dataset for Georgian sentiment classification + semantic polarity dictionary. 3-label and 4-label settings. From the European Commission Joint Research Centre.	Open	JRC Data Catalogue
senti_lex (ka)	Sentiment lexicons for 81 languages including Georgian.	—	HuggingFace

Speech Corpora

Name	Description	Size	License	Link
Mozilla Common Voice (ka)	Crowd-sourced read speech with transcriptions. Primary resource for Georgian ASR.	~76h validated	CC-0	commonvoice.mozilla.org · HuggingFace (v17)
FLEURS (ka_ge)	Google's Few-shot Learning Evaluation of Universal Representations of Speech. N-way parallel speech benchmark in 102 languages.	~12h per language	CC-BY 4.0	HuggingFace
IARPA Babel Georgian (LDC2016S12)	Conversational and scripted telephone speech (Eastern/Western dialects). Equal gender distribution, ages 16–73.	~190 hours	LDC license	LDC Catalog
MATERIAL Georgian-English (LDC2025S01)	Georgian-English ASR and MT resources for cross-lingual information retrieval. IARPA MATERIAL program.	—	LDC license	LDC2025S01
OpenSLR 153	Georgian crowd-sourced speech data. Part of effort achieving 5.73% WER for Georgian.	—	CC-BY-SA 4.0	OpenSLR
CommonLanguage (SpeechBrain)	Speech recordings from CommonVoice for 45 languages including Georgian, curated for language identification.	~1h Georgian	CC-0	HuggingFace

Tbilisi AI Lab Datasets (Instruction-Tuning & Benchmarks)

The Tbilisi AI Lab released 19+ datasets for training and evaluating Georgian LLMs (October 2025):

Name	Size	Description	Link
kona-sft-mix-2.6M	2.61M pairs	Instruction/SFT training mix	HuggingFace
kona-dpo-mix-387k	387K pairs	DPO preference alignment data	HuggingFace
kona-sft-function-calling-ka-93k	93K	Function-calling SFT data (Georgian)	HuggingFace
kona-sft-function-calling-115k	115K	Function-calling SFT data (English)	HuggingFace
wiki-ka-QA	42.6K	Wikipedia-based QA in Georgian	HuggingFace
code-instruct-ka	61.3K	Code instruction in Georgian	HuggingFace
math-instruct-ka	32.4K	Math instruction in Georgian	HuggingFace
learnlm-chat-ka	5.86K	Educational chat data in Georgian	HuggingFace
ai2_arc-ka	1.68K	ARC benchmark translated to Georgian	HuggingFace
boolq-ka	3.27K	BoolQ benchmark in Georgian	HuggingFace
commonsense_qa-ka	1.22K	CommonsenseQA in Georgian	HuggingFace

Browse all: huggingface.co/tbilisi-ai-lab

Evaluation & Benchmark Datasets

Name	Description	Link
Georgian Case-Alignment Syntactic Tests	370 syntactic tests for evaluating LMs on Georgian split-ergative case system (nominative-dative, ergative-nominative, dative-nominative). Generated from GLC UD treebank.	HuggingFace · GitHub
GeoLogicQA	100-question benchmark for evaluating LLM logical reasoning in Georgian. From TSU.	ACL Anthology

Pretrained Models

Georgian-Specific Language Models

Name	Architecture	Params	Training Data	Tasks	Link
Kona2-12B	Causal LM	12B	Georgian-first training data	Text generation	HuggingFace
Kona2-12B-Instruct	Causal LM	12B	SFT + DPO alignment	Instruction following, function calling	HuggingFace
Kona2-12B-Base	Causal LM	12B	Pre-instruct base model	Base model	HuggingFace
Kona2-small-3.8B	Causal LM	3.8B	Georgian-first training data	Text generation	HuggingFace
mGPT-1.3B-Georgian	GPT-2	1.3B	Wikipedia + C4, fine-tuned 10K steps on Georgian	Text generation	HuggingFace
electra-ka	ELECTRA	BERT-base	33GB Georgian text from ~4.85M CommonCrawl pages	Feature extraction, fine-tuning base	HuggingFace · GitHub
georgian-distilbert-mlm	DistilBERT	base	mC4 Georgian subset	Fill-mask, feature extraction	HuggingFace
gpt2-ka-wiki	GPT-2	small	Georgian Wikipedia	Text generation	HuggingFace
gpt2-geo	GPT-2	small	Georgian text (limited training)	Text generation	HuggingFace

Note: The Kona2 family from Tbilisi AI Lab (released October 2025) represents the most comprehensive Georgian-first LLM effort to date.

Fine-Tuned Georgian Task Models

Name	Base Model	Task	Link
electra-ka-discrediting	electra-ka	Discrediting text detection	HuggingFace
electra-ka-fake-news-tagging	electra-ka	Fake news classification	HuggingFace
stefan-it Georgian NER Models	XLM-R Large + Flair	Named entity recognition (LOC, PER, ORG)	GitHub · HuggingFace collection

Georgian Translation Models

Name	Architecture	Direction	Link
opus-mt-ka-en	Marian/Transformer	Georgian → English	HuggingFace
opus-mt-en-ka	Marian/Transformer	English → Georgian	HuggingFace
english-georgian	T5-small (fine-tuned)	English → Georgian	HuggingFace

Georgian Speech Recognition Models

Name	Architecture	Training Data	WER	Link
NVIDIA stt_ka_fastconformer	FastConformer Hybrid CTC-Transducer (~115M)	Common Voice + FLEURS (~163h)	5.73%	HuggingFace
whisper-large-v2-ka	Whisper Large V2	Common Voice 11.0 (ka)	31.85%	HuggingFace
wav2vec2-xlsr-georgian (sammy786)	Wav2Vec2-XLS-R-1B	Common Voice 8.0 (ka)	—	HuggingFace
wav2vec2-large-xlsr-georgian (m3hrdadfi)	Wav2Vec2-XLSR-53	Common Voice (ka)	—	HuggingFace
wav2vec2-large-xlsr-georgian (xsway)	Wav2Vec2-XLSR-53	Common Voice (ka)	—	HuggingFace

Note: The NVIDIA FastConformer model (5.73% WER) is the current state of the art for Georgian ASR, significantly outperforming Whisper Large V3 and Meta Seamless.
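
For readers comparing the WER figures in the table, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration under that standard definition. It is not the evaluation script behind any of the numbers above, and published figures also depend on text normalization choices (punctuation, digits) that are not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("ა ბ გ დ", "ა ბ ხ დ"))
```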

Word Embeddings

Name	Type	Dimensions	Training Data	Link
fastText Common Crawl+Wikipedia (ka)	CBOW with character n-grams	300	Common Crawl + Wikipedia	fasttext.cc → `cc.ka.300.bin`
fastText Wikipedia (ka)	Skip-gram	300	Wikipedia	fasttext.cc → `wiki.ka`
georgian-word2vec	Word2Vec (Gensim)	—	Georgian Wikipedia dump	GitHub
Georgian_Word_Embedding	FastText + Word2Vec	—	Georgian text	GitHub
ConceptNet Numberbatch	Hybrid (word2vec + GloVe + ConceptNet)	300	Multilingual, includes `ka`	GitHub
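
fastText vectors are also commonly distributed in a plain-text `.vec` format: a header line with the vocabulary size and dimension, followed by one word and its values per line. The sketch below reads that format with the standard library and compares two words by cosine similarity. The file name in the usage note is an assumption (the table above lists the `.bin` variant), and the function names are illustrative.

```python
import math
from typing import Dict, Iterable, List, Optional


def load_vec(lines: Iterable[str], limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from an iterable of lines in fastText .vec text format."""
    iterator = iter(lines)
    header = next(iterator).split()
    dim = int(header[1])                  # header is "<vocab_size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:            # skip malformed lines
            continue
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real file, with made-up values.
    toy = ["2 3", "ა 1 0 0", "ბ 0 1 0"]
    vecs = load_vec(toy)
    print(cosine(vecs["ა"], vecs["ბ"]))
```

With a downloaded text file, the same function could be called as `load_vec(open("cc.ka.300.vec", encoding="utf-8"), limit=50000)`; the `limit` argument keeps memory use manageable by reading only the most frequent words.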

Multilingual Models with Georgian Support

These major multilingual models include Georgian in their training data and can be used for Georgian NLP tasks directly or via fine-tuning:

Encoders (BERT-family):

Name	Languages	Link
mBERT (bert-base-multilingual-cased)	104 languages (incl. ka)	HuggingFace
XLM-RoBERTa Base	100 languages (incl. ka)	HuggingFace
XLM-RoBERTa Large	100 languages (incl. ka)	HuggingFace
XLM-RoBERTa XL	100 languages (incl. ka), 3.5B params	HuggingFace

Generative:

Name	Languages	Link
mGPT (ai-forever)	61 languages (incl. ka), 1.3B	HuggingFace
mGPT-13B	60+ languages (incl. ka), 13B	HuggingFace

Translation:

Name	Languages	Link
NLLB-200 (Meta)	200 languages (Georgian: `kat_Geor`)	600M · 1.3B · 3.3B
M2M-100 (Meta)	100 languages (incl. ka)	418M · 1.2B
SMaLL-100	100 languages (incl. ka)	HuggingFace

Speech:

Name	Languages	Link
Whisper (OpenAI)	96+ languages (incl. ka)	Large V2 · Large V3
XLS-R (Meta)	128 languages, base for fine-tuning	1B
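
The resources in this list refer to Georgian under several different identifiers, which is an easy source of mistakes when switching between libraries. The lookup below simply collects the codes that appear in this document. The dictionary keys and the function name are hypothetical labels chosen for this sketch, not names used by any of the libraries.

```python
# Identifiers for Georgian as they appear in the resources listed in this document.
GEORGIAN_CODES = {
    "iso639_1": "ka",            # e.g. Stanza, Wikipedia, OPUS-MT model names
    "iso639_3": "kat",           # e.g. Tesseract traineddata
    "flores_nllb": "kat_Geor",   # FLoRes-200 and NLLB-200
    "fleurs": "ka_ge",           # FLEURS configuration name
    "google_cloud_stt": "ka-GE",
}


def georgian_code(scheme: str) -> str:
    """Return the Georgian identifier for a named scheme."""
    try:
        return GEORGIAN_CODES[scheme]
    except KeyError:
        known = ", ".join(sorted(GEORGIAN_CODES))
        raise ValueError(f"unknown scheme {scheme!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(georgian_code("flores_nllb"))
```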

Tools & Libraries

NLP Toolkits

Name	Description	Language	Status	Link
Stanza (Stanford NLP)	Full NLP pipeline for Georgian: tokenization, POS tagging, lemmatization, dependency parsing. Uses UD 2.15 models. `stanza.download("ka")`	Python	✅ Active (2025)	GitHub
spaCy (via spacy-stanza)	Georgian support via spacy-stanza bridge. `spacy_stanza.load_pipeline("xx", lang="ka")`	Python	✅ Active	spacy-stanza
Anbani.py	Georgian toolkit: script conversion (Mkhedruli, Asomtavruli, Nuskhuri), Latin↔Georgian transliteration, text classification. `pip install anbani`	Python	✅ Active	GitHub
Anbani.js	Script conversion, transliteration, Lorem Ipsum, letter frequency analysis, Friedman index.	JavaScript	✅ Active	GitHub
QartNLPWebService	Georgian NLP Toolset (Flask web service). Developed at Ilia State University / Unilab.	Python	⚠️ Last updated Aug 2022	GitHub
Georgian Language Toolkit	Latinize/Georgianize strings, language detection (ka/en), morphological operations, Django slug generation.	Ruby, Python	⚠️ Last updated Mar 2021	GitHub
georgian-linguistics-tools	UTF-8 Georgian text handling, Latin transcription for C++ applications.	C++	⚠️ Likely unmaintained	GitHub

Morphological Analysis

Name	Description	Link
FST Morphological Analyzer (Lobzhanidze)	Comprehensive finite-state analyzer/generator for Modern Georgian using XEROX tools (xfst, lexc). Covers all POS and verb paradigms. Used in GLC annotation.	Springer Book (2022)
GNC Morphological Analyzer (Meurer)	FST analyzer + Constraint Grammar disambiguation for Old/Middle/Modern Georgian in the GNC.	Documentation
FST + FCFG Parser (Kapanadze)	Finite-state morphological transducer/POS-tagger combined with Feature-Based CFG parser for syntactic chunking.	TbiLLC 2023 Paper
UniMorph (Georgian)	Morphological paradigm tables for Georgian, including polypersonal verb agreement.	unimorph.github.io

Spell Checkers

Name	Description	Link
ka_GE.spell (Hunspell)	Georgian orthographic spell-checking dictionary for Firefox, LibreOffice, Chrome. Auto-generated word lists from web crawling. MIT License.	GitHub
Georgian Seq2Seq Spellchecker	Character-level spellchecker using GRU Seq2Seq model with synthetic typo dataset.	GitHub (Dec 2025)
gegram-class	Library for replacing barbarisms in Georgian sentences.	Java

Transliteration & Script Tools

Name	Description	Link
translitit-latin-to-mkhedruli-georgian	Latin → ქართული (Mkhedruli) transliteration function.	JavaScript
translitit-mkhedruli-georgian-to-ipa	Mkhedruli Georgian → IPA transliteration function.	JavaScript
KartuliChromeExtension	Chrome extension converting English letters to Georgian equivalents.	Chrome Web Store
kautilities	Convert Georgian ↔ Latin letters.	PHP
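
To illustrate what the Georgian-to-Latin direction involves, here is a table-driven sketch for the 33 modern Mkhedruli letters. It assumes the national system of romanization, in which ejectives are marked with an apostrophe. The tools listed above may follow different schemes, so this is not a description of any of their APIs.

```python
# Mkhedruli to Latin, assuming the Georgian national romanization system.
ROMANIZATION = {
    "ა": "a", "ბ": "b", "გ": "g", "დ": "d", "ე": "e", "ვ": "v", "ზ": "z",
    "თ": "t", "ი": "i", "კ": "k'", "ლ": "l", "მ": "m", "ნ": "n", "ო": "o",
    "პ": "p'", "ჟ": "zh", "რ": "r", "ს": "s", "ტ": "t'", "უ": "u", "ფ": "p",
    "ქ": "k", "ღ": "gh", "ყ": "q'", "შ": "sh", "ჩ": "ch", "ც": "ts",
    "ძ": "dz", "წ": "ts'", "ჭ": "ch'", "ხ": "kh", "ჯ": "j", "ჰ": "h",
}


def romanize(text: str) -> str:
    """Replace each modern Mkhedruli letter; leave all other characters unchanged."""
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(romanize("საქართველო"))
    print(romanize("წყალი"))
```

The reverse direction is harder: digraphs such as `sh` or `ts` are ambiguous without the apostrophes and without knowing where letter boundaries fall, which is one reason dedicated Latin-to-Georgian tools exist.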

OCR Tools

Name	Description	Link
Tesseract OCR (kat)	Google's Tesseract supports Georgian via trained data (`kat.traineddata`).	tessdata
tesseract-georgian	Training data for Tesseract on Georgian, derived from Wikipedia dumps. Includes wordlists and bigrams.	GitHub

TTS / STT Systems

Name	Type	Description	Link
NVIDIA FastConformer Georgian	STT	State-of-the-art Georgian ASR (5.73% WER). NeMo toolkit. CC-BY-4.0.	HuggingFace
ElevenLabs Georgian TTS	TTS	Neural TTS using Eleven Multilingual v2/v3 models. Supports voice cloning.	elevenlabs.io
ElevenLabs Scribe Georgian	STT	Georgian transcription (5–10% WER). Speaker diarization support.	elevenlabs.io
Google Cloud STT (ka-GE)	STT	Georgian speech-to-text via Google Cloud API.	cloud.google.com
eSpeak NG	TTS	Open-source formant-based TTS. Georgian supported but at early stage with limited quality.	GitHub
Georgian eSpeak Chrome Extension	TTS	Browser extension reading Georgian text aloud using eSpeak.js.	GitHub
Georgian-TTS	TTS	Georgian text-to-speech synthesis system (research).	GitHub
KaRead	TTS	Experimental TTS for Georgian using Fourier transform letter frequency analysis.	GitHub

Note: Google Cloud TTS does not currently offer Georgian voice synthesis (only STT). For neural TTS, ElevenLabs is currently the primary commercial option.

Machine Translation

Name	Description	Link
Google Translate	Georgian fully supported (GNMT).	translate.google.com
Google Cloud Translation API	Georgian supported (NMT model).	cloud.google.com
Meta NLLB-200	200 languages including Georgian (`kat_Geor`). Open-sourced.	ai.meta.com
Microsoft Translator	Georgian supported.	microsoft.com
Lingvanex	Commercial Georgian NLP services (tokenization, NER, sentiment, MT).	lingvanex.com

Utility Libraries

Name	Description	Language	Link
GeoParaphrase (Gadawere)	First Georgian paraphrasing/summarization tool (~10K+ users).	React + Express	GitHub
num2geotext	Convert numbers to Georgian text and currency.	Python	GitHub
dimakura/ka	Common functionality for Georgian projects.	Ruby
dimakura/ka.js	Georgian language support for Node.js.	JavaScript
Stichoza/money-num-to-string	Convert numbers/money to localized Georgian strings.	PHP, JS
Declensions for Georgian	Generate declensions for Georgian words.	—

Papers & Research

Morphology & Morphological Analysis

Title	Authors	Year	Venue	Link
A Finite-State Model of Georgian Verbal Morphology	Gurevich	2006	NAACL 2006	ACL Anthology
Describing Georgian Morphology with a Finite-State System	Kapanadze	2010	FSMNLP 2009, Springer LNCS	Springer
Morphological Reinflection with Multiple Arguments: An Extended Annotation Schema and a Georgian Case Study	Guriel, Goldman, Tsarfaty	2022	ACL 2022 (Short Papers)	ACL Anthology
Universal Morphologies for the Caucasus Region	Chiarcos, Donandt, Ionov, Rind-Pawlowski et al.	2018	LREC 2018	ACL Anthology
Automatic Morphological Analysis and Syntactic Parsing for the Georgian Language	Kapanadze, Kapanadze	2026	TbiLLC 2023, Springer	Springer
Finite-State Computational Morphology: An Analyzer and Generator for Georgian	Lobzhanidze	2022	Springer (book)	Springer

Treebanks & Syntax

Title	Authors	Year	Venue	Link
Building a Universal Dependencies Treebank for Georgian	Lobzhanidze, Magradze, Berikashvili et al.	2024	TLT 2024	ACL Anthology
Building Resources for Georgian Treebanking-Based NLP	Kapanadze, Kotzé, Hanneforth	2022	TbiLLC 2019, Springer LNCS	ACM/Springer
A Computational Grammar for Georgian	Kapanadze	2009	Logic, Language, and Computation, Springer	Springer

Tokenization

Title	Authors	Year	Venue	Link
A Comparison of Different Tokenization Methods for the Georgian Language	Mikaberidze, Saghinadze, Mikaberidze, Kalandadze, Pkhakadze, van Genabith, Ostermann, van der Plas, Müller	2024	ICNLSP 2024	ACL Anthology

Sentiment Analysis & Text Classification

Title	Authors	Year	Venue	Link
Resources and Experiments on Sentiment Classification for Georgian	Stefanovitch, Piskorski, Kharazi	2022	LREC 2022	ACL Anthology
Toxicity Detection in Online Georgian Discussions	Lashkarashvili, Tsintsadze	2022	Int'l Journal of Information Management Data Insights	Elsevier

Lemmatization & POS Tagging

Title	Authors	Year	Venue	Link
Lemmatization and POS-tagging process by using joint learning approach (Classical Armenian, Old Georgian, Syriac)	Vidal-Gorène, Kindt	2020	LT4HALA (LREC 2020)	ACL Anthology
Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac	Vidal-Gorène et al.	2026	arXiv	arXiv:2602.15753
A theory for words in Georgian: traditional constructs versus corpus annotation	Daraselia et al.	2024	Corpus Linguistics and Linguistic Theory	De Gruyter

Word Sense Disambiguation

Title	Authors	Year	Venue	Link
Homonym Sense Disambiguation in the Georgian Language	—	2024	arXiv	arXiv:2405.00710

LLM Evaluation & Syntactic Evaluation

Title	Authors	Year	Venue	Link
Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment	—	2026	arXiv	arXiv:2602.10661
GeoLogicQA – A Benchmark for Evaluating Logical Reasoning in Georgian for LLMs	Koberidze, Elizbarashvili, Tsintsadze	2025	LowResNLP @ RANLP 2025	ACL Anthology
Evaluating and Mitigating Linguistic Discrimination in Large Language Models	—	2024	arXiv	arXiv:2404.18534

Speech Recognition

Title	Authors	Year	Venue	Link
Developing Robust Georgian ASR with FastConformer Hybrid Transducer CTC BPE	—	2024	NVIDIA Blog + arXiv	NVIDIA Blog · arXiv:2501.14788

OCR & Handwriting Recognition

Title	Authors	Year	Venue	Link
Fast Multi-language LSTM-based Online Handwriting Recognition	Carbune et al.	2020	IJDAR / arXiv	arXiv:1902.10525
On Georgian Handwritten Character Recognition	Tsintsadze et al.	2018	IFAC-PapersOnLine	ScienceDirect
Optical Character Recognition Tool for Georgian Handwritten Text Recognition Based on YOLOv8	—	~2024	ResearchGate	ResearchGate

Corpus & Language Resources

Title	Authors	Year	Venue	Link
Structuring a Diachronic Corpus: The Georgian National Corpus Project	Gippert et al.	2012–2016	EURALEX 2016 and others	CLARINO
Enhancement Possibilities for the Georgian National Corpus	Kamarauli	2024	Caucasus Journal of Social Sciences	Journal
Creating Corpus for Georgian Language Modelling	—	—	OpenReview	OpenReview

Multilingual Works Including Georgian

Title	Authors	Year	Venue	Link
mGPT: Few-shot learners go multilingual	Shliazhko et al.	2024	Transactions of the ACL (TACL)	—
UniMorph 3.0: Universal Morphology	McCarthy, Kirov et al.	2020	LREC 2020	—
Functional and Cognitive Analysis of Grammar in Georgian Using the BERT Model	—	~2024	Language and Culture journal	4science.ge

Tutorials & Courses

Tutorials & Projects

Name	Type	Description	Link
Georgian Language Model (BiLSTM)	GitHub Project	Compares n-gram (perplexity 415K), Transformer (729), and BiLSTM (24) for Georgian text generation. Includes trained word2vec models.	GitHub
electra-ka Training Code	GitHub Repo	Code and instructions for training ELECTRA on 33GB Georgian text. Includes fine-tuning examples for sequence classification.	GitHub
Tokenization Comparison Code	GitHub Repo	Code for comparing BPE, WordPiece, SentencePiece tokenizers on Georgian downstream tasks. 30 stars, active (Jan 2026).	ACL Paper
NLP Text Classification for Georgian Medical Records	Research Paper	SVM/KNN text classification for Georgian medical records; includes Georgian stemming and stop-word removal.	PMC
NVIDIA Georgian ASR Blog	Tutorial/Blog	Step-by-step development of Georgian ASR with FastConformer. Covers data preparation, tokenizer creation, training.	NVIDIA Blog
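
The perplexity figures quoted for the BiLSTM project are easier to interpret with the definition at hand: perplexity is the exponential of the average negative log-probability a model assigns to each token. The sketch below shows that calculation on a list of per-token probabilities. It is a generic illustration and not the evaluation code of any project in the table.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability over the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)


if __name__ == "__main__":
    # A model that spreads probability evenly over four choices at every step
    # has a perplexity of about 4.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower is better, but values are only comparable between models that share the same tokenization and test set.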

Conferences with Georgian NLP Content

Name	Description	Link
TbiLLC	International Tbilisi Symposium on Logic, Language, and Computation. Biennial. Proceedings in Springer LNCS.	Events page · Springer
LMT Tbilisi 2022	Computational Modeling of Language conference at TSU Arnold Chikobava Institute of Linguistics.	LinguistList
LowResNLP (RANLP)	Workshop on low-resource NLP, featuring Georgian benchmarks.	ACL Anthology

Miscellaneous

Unicode Information

Georgian uses three Unicode blocks covering four script styles:

Block	Range	Assigned characters	Contents
Georgian	U+10A0–U+10FF	88	Asomtavruli (capitals, U+10A0–U+10CF) + Mkhedruli (modern lowercase, U+10D0–U+10FF)
Georgian Supplement	U+2D00–U+2D2F	40	Nuskhuri (ecclesiastical lowercase)
Georgian Extended	U+1C90–U+1CBF	46	Mtavruli (modern capitals, added Unicode 11.0, June 2018)

Modern Georgian uses 33 Mkhedruli letters (5 archaic letters are obsolete)
Georgian is primarily unicameral (no case distinction in standard usage), which simplifies text normalization for NLP
Official Unicode chart: U+10A0 PDF · U+1C90 PDF
Detailed orthography notes: r12a.github.io
Interactive codepoint explorer: symbl.cc · codepoints.net
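
The block ranges above translate directly into a small normalization helper. The sketch below identifies which Georgian block a character belongs to and folds Mtavruli capitals down to Mkhedruli, which can be useful before tokenization. The function names are illustrative, and the folding relies on the fixed code point offset between the two ranges rather than on any library's case mapping.

```python
from typing import Optional

# Block boundaries as listed in the table above.
GEORGIAN_BLOCKS = {
    "Georgian": (0x10A0, 0x10FF),
    "Georgian Supplement": (0x2D00, 0x2D2F),
    "Georgian Extended": (0x1C90, 0x1CBF),
}

# Mtavruli U+1C90 corresponds to Mkhedruli U+10D0, and so on.
MTAVRULI_OFFSET = 0x1C90 - 0x10D0


def georgian_block(ch: str) -> Optional[str]:
    """Return the name of the Georgian Unicode block containing ch, or None."""
    cp = ord(ch)
    for name, (low, high) in GEORGIAN_BLOCKS.items():
        if low <= cp <= high:
            return name
    return None


def fold_mtavruli(text: str) -> str:
    """Map Mtavruli capitals to Mkhedruli; leave everything else unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Assigned Mtavruli letters are U+1C90..U+1CBA and U+1CBD..U+1CBF.
        if 0x1C90 <= cp <= 0x1CBA or 0x1CBD <= cp <= 0x1CBF:
            out.append(chr(cp - MTAVRULI_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(georgian_block("ქ"))
    print(fold_mtavruli("\u1c90\u1c91\u1c92"))
```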

Keyboard Layouts

Name	Description	Link
Georgian QWERTY (most popular)	Standard keyboard layout for Georgian.	kbdlayout.info
Georgian Standard (JCUKEN-based)	Government standard layout.	Wikipedia
Keyman Georgian QWERTY	Cross-platform input method (Win/Mac/Linux/iOS/Android).	keyman.com
GeorgianCapital (Anbani)	Full keyboard including Mtavruli capitals for Windows.	GitHub
Branah Online Georgian Keyboard	Virtual keyboard with transliteration.	branah.com
Setup Guide (Wikibooks)	Georgian input on Windows, Mac, Linux.	Wikibooks

Font Resources

Name	Description	Link
Noto Sans Georgian	Google's comprehensive sans-serif font for Georgian. Variable weights.	Google Fonts
Noto Serif Georgian	Google's serif font for Georgian.	Google Fonts
Noto Georgian (Variable Font)	Multiple widths and weights.	notofonts.github.io
FONTS.GE	"All Georgian fonts in one place" — comprehensive font repository.	fonts.ge
BPG InfoTech Fonts	Widely-used Unicode Georgian fonts (serif, sans-serif, monospace).	Referenced in multiple projects
georgian-webfonts (npm)	CSS package for Georgian web fonts.	GitHub (thecotne)

Wikipedia & Web Data Availability

Source	Size	Notes	Link
Georgian Wikipedia (ka.wikipedia)	~188K articles	Founded Nov 2003; ~150K registered users	ka.wikipedia.org · Stats
CC-100 Georgian	1.1 GB	From 2018 Common Crawl; used for XLM-R	data.statmt.org
OSCAR Georgian	Multiple versions	Available in OSCAR 19, 21.09, 22.01, 23.01	oscar-project.org
mC4 Georgian	Part of mC4	Quality may vary; audit recommended	HuggingFace
CulturaX Georgian	Part of 6.3T tokens	Cleaned mC4 + OSCAR	HuggingFace
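
Because web-crawled subsets can contain mislabeled or mixed-language pages, a cheap first audit is to measure how much of each document is actually written in Georgian script. The sketch below computes that share over alphabetic characters. The function names and the 0.8 threshold are arbitrary assumptions for illustration, and a real pipeline would likely combine this with a language identifier.

```python
def georgian_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Mkhedruli or Mtavruli."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    georgian = sum(
        1 for ch in letters
        if 0x10D0 <= ord(ch) <= 0x10FF or 0x1C90 <= ord(ch) <= 0x1CBF
    )
    return georgian / len(letters)


def keep_document(text: str, threshold: float = 0.8) -> bool:
    """Keep a document when enough of its letters are in Georgian script."""
    return georgian_ratio(text) >= threshold


if __name__ == "__main__":
    for sample in ["გამარჯობა", "გამარჯობა world", "hello"]:
        print(sample, round(georgian_ratio(sample), 2), keep_document(sample))
```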

Other Resources

Name	Description	Link
Anbani.db	Georgian datasets: "Vepkhistkaosani" full text, aphorisms, poet/writer names, baby names, alphabet data. ⚠️ Last updated 2019.	GitHub
Gadatsqvetilebebi	Web spider and corpora importer for public legal decisions in Georgian.	Referenced in low-resource-languages
loremtyaosani	Georgian Lorem Ipsum — random lines from Vepkhistkaosani.	GitHub (safareli)
Epigraphic Corpus of Georgia	EpiDoc-standard digital epigraphy (Georgian, Urartian, Aramaic, Greek inscriptions). Ilia State University.	epigraphy.iliauni.edu.ge
Online Dictionary of Georgian Idioms	Digital idioms dictionary from Ilia State University.	idioms.iliauni.edu.ge
Megrelian Language Corpus	Corpus for the endangered Megrelian (Kartvelian family) with morpheme-level annotation.	xmf.iliauni.edu.ge
TITUS	Frankfurt-based digital archive of South Caucasian language materials (Georgian, Megrelian, Svan, Laz).	Goethe University Frankfurt
awesome-georgia	Curated list of Georgian libraries and packages (payments, i18n, fonts, NLP). 91 stars.	GitHub
low-resource-languages	Meta-list including Georgian tools section.	GitHub
awesome-georgian-datasets	Collection of datasets specific to Georgia.	GitHub

Key Research Groups & Institutions

Institution	Focus	Key People	Links
Tbilisi AI Lab	Georgian-first LLMs (Kona2 family), datasets, benchmarks	—	HuggingFace · ailab.ge
Ilia State University	GLC corpus, UD treebank, idiom dictionaries, epigraphy	Irina Lobzhanidze, Nino Doborjginidze	corpora.iliauni.edu.ge
TSU (Tbilisi State University)	LLM benchmarks (GeoLogicQA), toxicity detection, handwriting recognition	Magda Tsintsadze, Irakli Koberidze	—
TSU Arnold Chikobava Institute of Linguistics	Computational linguistics conferences	—	—
Goethe University Frankfurt	GNC project, TITUS archive, Caucasus languages	Jost Gippert	—
University of Bergen / CLARINO	GNC infrastructure, morphosyntactic analysis	Paul Meurer	clarino.uib.no
OK'OMPLEX (Tbilisi)	FST morphology, GRUG treebank, computational grammar	Oleg Kapanadze	—
Bar-Ilan University	Morphological reinflection for Georgian	David Guriel, Reut Tsarfaty	—
DFKI / Saarland University	Tokenization methods for Georgian	Beso Mikaberidze, Josef van Genabith	—
JRC (European Commission)	Georgian sentiment analysis, NER	Jakub Piskorski, Sopho Kharazi	—
UCLouvain	GREgORI Project for Old Georgian	Chahan Vidal-Gorène	—
Anbani	Open-source Georgian language tools	—	GitHub Org · anbani.ge

Notable Gaps in Georgian NLP (as of early 2026)

No dedicated Georgian BERT model — the closest are electra-ka and georgian-distilbert-mlm; otherwise multilingual models (mBERT, XLM-R) must be used
No dedicated Georgian spaCy pipeline — must use via spacy-stanza bridge
No Georgian WordNet exists
No Google Cloud TTS for Georgian (only STT is available)
Limited NER resources — WikiANN is the primary dataset; stefan-it provides fine-tuned models
Morphological analyzers remain largely academic/closed-source tools (FST-based, require XEROX tools)
Small UD treebanks — both GLC (~60K tokens) and GNC (~22K tokens) are relatively small by UD standards
Key NLP challenges: agglutinative morphology, split-ergative case system, polypersonal verb agreement, free word order, unique Mkhedruli script

This list aims to be comprehensive as of February 2026. Items marked with ⚠️ may be outdated or unmaintained. Contributions and corrections welcome.
