This repository is about task-agnostic multilingual evaluation and benchmark flexibility across languages with diverse scripts.
These scripts evaluate multilingual LLMs on low-resource languages across different tasks, using different datasets: OPUS-100, XLSum, and Belebele.
- Machine Translation (OPUS-100)
- Text Summarization (XLSum)
- Question Answering (Belebele)
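For orientation, all three benchmarks are available on the Hugging Face Hub. The snippet below is a minimal sketch of how they could be loaded with the `datasets` library; the configs (`en-mr`, `amharic`, `amh_Ethi`) and the field names shown are illustrative assumptions and may differ from what the repo's scripts actually pin:

```python
from datasets import load_dataset

# Illustrative dataset IDs/configs (assumptions, not pinned by this repo).
# Older `datasets` versions may need trust_remote_code=True for XLSum.
opus = load_dataset("Helsinki-NLP/opus-100", "en-mr", split="test")      # machine translation
xlsum = load_dataset("csebuetnlp/xlsum", "amharic", split="test")        # text summarization
belebele = load_dataset("facebook/belebele", "amh_Ethi", split="test")   # multiple-choice QA

print(opus[0])      # e.g. {'translation': {'en': ..., 'mr': ...}}
print(xlsum[0])     # e.g. {'text': ..., 'summary': ...}
print(belebele[0])  # e.g. {'flores_passage': ..., 'question': ..., 'mc_answer1': ...}
```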
| ISO code | Language | Script | Resource class |
|---|---|---|---|
| am | Amharic | Ge'ez | 2 |
| te | Telugu | Telugu | 1 |
| my | Burmese | Burmese | 1 |
| ne | Nepali | Devanagari | 1 |
| kn | Kannada | Kannada | 1 |
| ps | Pashto | Arabic | 1 |
| tg | Tajik | Cyrillic | 1 |
| sw | Swahili | Latin | 2 |
| yo | Yoruba | Latin | 2 |
| so | Somali | Latin | 1 |
| si | Sinhala | Sinhala | 0 |
| mr | Marathi | Devanagari | 2 |
| pa | Punjabi | Gurmukhi | 2 |
| ky | Kyrgyz | Cyrillic | 2 |
| Models | Tokenizer type | Task |
|---|---|---|
| Llama 2 | SentencePiece (BPE) | Translation, Summarization, QA |
| Mistral | SentencePiece (BPE) | Translation, Summarization, QA |
| XGLM | Byte-Pair Encoding (BPE) | QA |
| BLOOM | Byte-level BPE | Translation, Summarization, QA |
| Qwen | tiktoken or SentencePiece | QA |
| NLLB | SentencePiece (BPE) | Translation |
| mBART | SentencePiece (BPE) | Translation |
| mT5 | SentencePiece (Unigram) | Translation, Summarization, QA |
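Because these tokenizer types handle non-Latin scripts very differently, a quick sanity check on segmentation behaviour is often informative before running a full evaluation. Below is a minimal sketch using `transformers`; the checkpoints and sample sentences are illustrative choices, not mandated by the repo:

```python
from transformers import AutoTokenizer

# Hypothetical samples: a short greeting in two different scripts.
samples = {
    "am": "ሰላም ለዓለም",      # Amharic (Ge'ez script)
    "sw": "Habari dunia",    # Swahili (Latin script)
}

# Model IDs are illustrative; gated checkpoints (e.g. Llama 2) need an HF access token.
for model_id in ["meta-llama/Llama-2-7b-hf", "bigscience/bloom-560m", "google/mt5-small"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    for lang, text in samples.items():
        ids = tok.encode(text, add_special_tokens=False)
        avg_len = len(text.replace(" ", "")) / max(len(ids), 1)  # characters per token
        print(f"{model_id:32s} {lang}: {len(ids):3d} tokens, ~{avg_len:.2f} chars/token")
```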
Usage:

```bash
python scripts/eval_opus.py \
    --model meta-llama/Llama-2-7b-hf \
    --source_lang en \
    --target_lang mr
```
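Conceptually, the translation evaluation reduces to prompting the model sentence by sentence and scoring the outputs with corpus-level metrics. The loop below is a minimal sketch of that idea, assuming `sacrebleu` for BLEU/chrF; the prompt format, sample size, and generation settings are illustrative assumptions and not necessarily what `eval_opus.py` does:

```python
import sacrebleu
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # gated checkpoint: requires an HF access token
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

data = load_dataset("Helsinki-NLP/opus-100", "en-mr", split="test[:50]")  # small slice for speed
hyps, refs = [], []
for ex in data:
    prompt = f"Translate English to Marathi.\nEnglish: {ex['translation']['en']}\nMarathi:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Keep only the newly generated tokens (everything after the prompt).
    hyps.append(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip())
    refs.append(ex["translation"]["mr"])

print("BLEU:", sacrebleu.corpus_bleu(hyps, [refs]).score)
print("chrF:", sacrebleu.corpus_chrf(hyps, [refs]).score)
```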
This script evaluates the token coverage of a tokenizer across multiple languages.

Usage:
```bash
python scripts/eval_tokenizer_coverage.py \
    --tokenizer meta-llama/Llama-2-7b-hf \
    --dataset Helsinki-NLP/opus-100 \
    --text_column text \
    --samples 1000 \
    --lang mr am kn my \
    --output Llama2_tokenizer_coverage.csv
```

After running, you'll get a file like `tokenizer_coverage.csv` with columns:
- Language
- Tokenizer
- Samples
- Total Tokens
- UNK Tokens
- Token Coverage (%): 100% does not mean good tokenization
- Avg Token Length: ≈ 1.0–1.5 suggests suboptimal handling of the script (i.e., character-level fallback)
- Performance Flag
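To make these columns concrete, here is a rough sketch of how the metrics can be computed, assuming coverage is the share of non-UNK tokens and average token length is characters per token; the `coverage_stats` helper and sample sentences are hypothetical, so the actual script may differ in detail:

```python
from transformers import AutoTokenizer

def coverage_stats(tokenizer, texts):
    """Hypothetical helper: share of non-UNK tokens and average characters per token."""
    total_tokens, unk_tokens, total_chars = 0, 0, 0
    unk_id = tokenizer.unk_token_id  # may be None for byte-level BPE tokenizers
    for text in texts:
        ids = tokenizer.encode(text, add_special_tokens=False)
        total_tokens += len(ids)
        unk_tokens += sum(1 for i in ids if i == unk_id)
        total_chars += len(text)
    coverage = 100.0 * (total_tokens - unk_tokens) / max(total_tokens, 1)
    avg_token_len = total_chars / max(total_tokens, 1)
    # Byte-level BPEs report ~100% coverage by construction, yet an average token
    # length near 1 character still signals character/byte fallback on the script.
    return coverage, avg_token_len

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(coverage_stats(tok, ["ሰላም ለዓለም", "नमस्ते दुनिया"]))
```

In other words, 100% coverage only says the tokenizer never falls back to UNK; the Avg Token Length and Performance Flag columns indicate whether a script is actually tokenized efficiently.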