
Benchmarking-LLM

This repository is about task-agnostic multilingual evaluation and benchmark flexibility across languages with diverse scripts.

Multilingual LLM Evaluation

These scripts evaluate multilingual LLMs on low-resource languages across different tasks, using three datasets: OPUS-100, XLSum, and Belebele.

Tasks and Datasets

  • Machine Translation (Opus100)
  • Text Summarization (XLSum)
  • Question Answering (Belebele)
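As a sketch of how the tasks above map to public Hugging Face Hub dataset IDs (the `TASK_DATASETS` mapping and `dataset_for_task` helper are illustrative names, not identifiers from this repository):

```python
# Illustrative task -> dataset-ID mapping; the names TASK_DATASETS and
# dataset_for_task are hypothetical, not part of this repository's code.
TASK_DATASETS = {
    "translation": "Helsinki-NLP/opus-100",
    "summarization": "csebuetnlp/xlsum",
    "qa": "facebook/belebele",
}

def dataset_for_task(task: str) -> str:
    """Return the Hugging Face dataset ID used for a given task."""
    try:
        return TASK_DATASETS[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}") from None
```

Each ID can then be passed to `datasets.load_dataset` with the appropriate language configuration.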

Selected Languages

| ISO code | Language | Script | Language class |
|----------|----------|--------|----------------|
| am | Amharic | Ge'ez | 2 |
| te | Telugu | Telugu | 1 |
| my | Burmese | Burmese | 1 |
| ne | Nepali | Devanagari | 1 |
| kn | Kannada | Kannada | 1 |
| ps | Pashto | Arabic | 1 |
| tg | Tajik | Cyrillic | 1 |
| sw | Swahili | Latin | 2 |
| yo | Yoruba | Latin | 2 |
| so | Somali | Latin | 1 |
| si | Sinhala | Sinhala | 0 |
| mr | Marathi | Devanagari | 2 |
| pa | Punjabi | Gurmukhi | 2 |
| ky | Kyrgyz | Cyrillic | 2 |

Selected Multilingual LLMs

| Model | Tokenizer type | Tasks |
|-------|----------------|-------|
| LLaMA 2 | SentencePiece (BPE) | Translation, Summarization, QA |
| Mistral | SentencePiece (BPE) | Translation, Summarization, QA |
| XGLM | Byte-Pair Encoding (BPE) | QA |
| BLOOM | Byte-level BPE | Translation, Summarization, QA |
| Qwen | tiktoken or SentencePiece | QA |
| NLLB | SentencePiece (BPE) | Translation |
| mBART | SentencePiece (BPE) | Translation |
| mT5 | SentencePiece (Unigram) | Translation, Summarization, QA |

Usage:

python scripts/eval_opus.py \
  --model meta-llama/Llama-2-7b-hf \
  --source_lang en \
  --target_lang mr
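A minimal sketch of the command-line interface the usage example implies, assuming `eval_opus.py` uses `argparse` (only the three flags shown above are taken from the source; everything else, including the parser structure, is an assumption):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the usage example above; the help strings and
    # required=True settings are assumptions, not the script's actual code.
    parser = argparse.ArgumentParser(
        description="Evaluate an LLM on OPUS-100 machine translation."
    )
    parser.add_argument("--model", required=True,
                        help="Hugging Face model ID, e.g. meta-llama/Llama-2-7b-hf")
    parser.add_argument("--source_lang", required=True,
                        help="Source language ISO code")
    parser.add_argument("--target_lang", required=True,
                        help="Target language ISO code")
    return parser

# Parse the same arguments shown in the usage example.
args = build_parser().parse_args(
    ["--model", "meta-llama/Llama-2-7b-hf",
     "--source_lang", "en", "--target_lang", "mr"]
)
```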

Tokenizer Evaluation

This script evaluates the token coverage of a tokenizer across multiple languages.
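Token coverage here can be read as the share of produced tokens that are not the unknown token, and average token length as a proxy for how well the script is segmented. A self-contained sketch of those two statistics for one tokenized sample (the real script works with Hugging Face tokenizers; `coverage_stats` and its argument format are illustrative):

```python
def coverage_stats(tokens: list[str], unk_token: str = "<unk>") -> dict:
    """Compute UNK-based coverage and average token length for one sample."""
    total = len(tokens)
    unk = sum(1 for t in tokens if t == unk_token)
    known = [t for t in tokens if t != unk_token]
    return {
        "total_tokens": total,
        "unk_tokens": unk,
        # Fraction of tokens the vocabulary actually covered.
        "coverage_pct": 100.0 * (total - unk) / total if total else 0.0,
        # Very short averages hint at character-level fallback.
        "avg_token_len": (sum(len(t) for t in known) / len(known)) if known else 0.0,
    }

# Toy example: 5 tokens, one of which the vocabulary could not cover.
stats = coverage_stats(["hel", "lo", "<unk>", "wor", "ld"])
```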

Usage:

python scripts/eval_tokenizer_coverage.py \
  --tokenizer meta-llama/Llama-2-7b-hf \
  --dataset Helsinki-NLP/opus-100 \
  --text_column text \
  --samples 1000 \
  --lang mr am kn my \
  --output Llama2_tokenizer_coverage.csv

✅ Output

After running, you’ll get a file like tokenizer_coverage.csv with columns:

  • Language
  • Tokenizer
  • Samples
  • Total Tokens
  • UNK Tokens
  • Token Coverage (%) — note that 100% coverage does not by itself mean good tokenization
  • Avg Token Length — values around 1.0–1.5 suggest suboptimal handling of the script (i.e., character-level fallback)
  • Performance Flag
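The CSV described above can be assembled roughly as follows. This is a sketch only: the flagging heuristic, its thresholds, and the example row are assumptions based on the notes above, not the script's actual logic.

```python
import csv
import io

def performance_flag(coverage_pct: float, avg_token_len: float) -> str:
    # Assumed heuristic: perfect coverage alone is not enough; very short
    # average tokens (~1.0-1.5) suggest character-level fallback.
    if avg_token_len < 1.5:
        return "suspect: character-level fallback"
    if coverage_pct < 99.0:
        return "suspect: UNK tokens present"
    return "ok"

COLUMNS = ["Language", "Tokenizer", "Samples", "Total Tokens",
           "UNK Tokens", "Token Coverage (%)", "Avg Token Length",
           "Performance Flag"]

# Write one illustrative row (values are made up for the example).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow({
    "Language": "mr", "Tokenizer": "meta-llama/Llama-2-7b-hf",
    "Samples": 1000, "Total Tokens": 52000, "UNK Tokens": 0,
    "Token Coverage (%)": 100.0, "Avg Token Length": 1.2,
    "Performance Flag": performance_flag(100.0, 1.2),
})
csv_text = buf.getvalue()
```

Note how the example row is flagged despite 100% coverage, because its short average token length points at character-level fallback.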
