GitHub - Mehrdadghassabi/Gaokerena-V: Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

📃 Paper ｜🤗 huggingface repository | 🚀 quick start

📒 Table of Contents

📍 Overview
🌱 Our contribution
🕵🏼‍♀️ Features
📚 Base model
🏃 Training process
📊 Results
⚠️ Risks and Limitations
⛔️ License
🤝 Collaborators
🙏🏼 Acknowledgement

📍 Overview

Welcome to the Gaokerena Project! We’re excited to share an innovative initiative aimed at advancing natural language processing for the Persian-speaking medical community.
Gaokerena is designed to run even on home devices while keeping privacy and security—essential for medical use—at the forefront. We trained it on a new Persian medical dataset, including free-form Q&A, to make healthcare information more accessible and interactions safer.
AI has huge potential to improve medicine, and with Gaokerena, we’re working to bring that potential to the Persian-speaking world.

🌱 Our contribution

Introducing the first open-source Persian medical language model.
Introducing high-quality Persian medical resources, including:
1. A 90M-token Persian medical corpus (crawled from diverse sources).
2. MF3QA: about 186k medical free-form QA pairs (crawled from diverse sources) and 20k cleaned QA pairs.
3. A Persian translation of the K-QA benchmark.
4. A Persian translation of the medical portion of the MMLU benchmark.

🕵🏼‍♀️ Features

First Open-Source Persian Medical Model: The only publicly available Persian language model fine-tuned specifically for medical applications, freely available for research and other applications.
The first small (sub-8-billion-parameter) language model to pass the Iranian Basic Medical Sciences Entrance Exam (کنکور علوم پایه پزشکی) under real-world conditions.
Great Results: Delivers better results than related models, including pipelines that pair English medical models with translation systems. On the Persian translation of the medical MMLU subsets it averages 49.31, against 20.11 and 30.99 for the two pipeline alternatives.
Focus on Privacy and Ease: Because it is built on a small language model, it can be deployed locally, so sensitive medical data remains secure and confidential.

📚 Base model

Gaokerena is built on aya-expanse-8b, a robust and efficient language model selected for its proven performance and adaptability. This base model was fine-tuned to address the specific requirements of Persian medical applications, ensuring optimal accuracy and performance.

🏃 Training process

The Gaokerena model was trained through a process that involved fine-tuning the aya-expanse-8b base model on 60% of our Persian medical corpus, using the LoRA method for efficiency. This was followed by instruction tuning on our free-form question-answering dataset MF3QA, optimizing it for Persian medical queries. The training was conducted on A100 PCIe 40G hardware via the Google Cloud Platform in the asia-east1 region, operating for 19 hours and resulting in a carbon footprint of 2.66 kg CO2 equivalent emissions.
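
The reported footprint can be sanity-checked with the usual energy-times-intensity estimate. The sketch below is not the authors' calculation: the 250 W board power for the A100 PCIe 40GB and the 0.56 kg CO2eq/kWh grid intensity for asia-east1 are assumptions that do not appear in this README. Only the 19 training hours and the 2.66 kg result come from the text above.

```python
# Rough reproduction of the reported training footprint.
# ASSUMPTIONS (not stated in this README):
#   - 250 W board power for one A100 PCIe 40GB
#   - 0.56 kg CO2eq per kWh grid intensity for asia-east1
GPU_POWER_KW = 0.250
CARBON_INTENSITY_KG_PER_KWH = 0.56

# Stated in the README.
TRAINING_HOURS = 19


def training_emissions_kg(power_kw, hours, intensity_kg_per_kwh):
    """Energy drawn (kWh) multiplied by grid carbon intensity (kg CO2eq/kWh)."""
    energy_kwh = power_kw * hours
    return energy_kwh * intensity_kg_per_kwh


energy_kwh = GPU_POWER_KW * TRAINING_HOURS
emissions_kg = training_emissions_kg(
    GPU_POWER_KW, TRAINING_HOURS, CARBON_INTENSITY_KG_PER_KWH
)
print(f"energy: {energy_kwh:.2f} kWh, emissions: {emissions_kg:.2f} kg CO2eq")
```

Under these assumptions the estimate comes to 4.75 kWh and 2.66 kg CO2eq, which agrees with the figure quoted above. A different power draw or grid intensity would change the result proportionally.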

📊 Results

We have fully published the results here. Our model correctly answered about half (49.31%) of the questions in the medical portion of the MMLU dataset and passed the Iranian Basic Medical Sciences Entrance Exam of Sept 2023 (کنکور علوم پایه پزشکی شهریور ۱۴۰۲), while the other alternatives failed.

multiple choice qa

Results against pipeline alternatives (accuracy, %):

| Benchmark | Gaokerena-V (ours) | MedMobile + gemma2b-it | MedMobile + parsinlu |
|---|---|---|---|
| MMLU-anatomy(fa) | 48.14 | 14.07 | 25.18 |
| MMLU-medicalgenetics(fa) | 53.0 | 20.0 | 35.0 |
| MMLU-collegemedicine(fa) | 43.93 | 19.08 | 27.17 |
| MMLU-clinicalknowledge(fa) | 55.47 | 27.54 | 31.70 |
| MMLU-professionalmedicine(fa) | 47.05 | 17.27 | 33.82 |
| MMLU-collegebiology(fa) | 47.22 | 18.75 | 31.25 |
| MMLU(avg) | 49.31 | 20.11 | 30.99 |
| IBMSEE Sept 2023 | 38.69 | 24.40 | 32.73 |

Results against general-purpose language models (accuracy, %):

| Benchmark | Gaokerena-V (ours) | aya_expanse8b (baseline) | Qwen2.5 | PersianMind |
|---|---|---|---|---|
| MMLU-anatomy(fa) | 48.14 | 40.74 | 41.48 | 25.18 |
| MMLU-medicalgenetics(fa) | 53.0 | 49.0 | 52.0 | 34.0 |
| MMLU-collegemedicine(fa) | 43.93 | 44.51 | 43.35 | 20.23 |
| MMLU-clinicalknowledge(fa) | 55.47 | 52.07 | 47.92 | 25.28 |
| MMLU-professionalmedicine(fa) | 47.05 | 45.58 | 43.01 | 23.89 |
| MMLU-collegebiology(fa) | 47.22 | 45.14 | 42.36 | 32.63 |
| MMLU(avg) | 49.31 | 46.64 | 45.17 | 25.89 |
| IBMSEE Sept 2023 | 38.69 | 34.52 | 33.33 | 19.64 |
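
The MMLU(avg) rows do not equal the plain mean of the six subject scores, so they appear to be weighted by the number of questions per subject. The sketch below checks this for the Gaokerena-V column. The question counts are an assumption: they are the sizes of the standard English MMLU test split, and this README does not state the sizes of the translated subsets.

```python
# Per-subject accuracy (%) for Gaokerena-V, copied from the results tables.
ACCURACY = {
    "anatomy": 48.14,
    "medical_genetics": 53.0,
    "college_medicine": 43.93,
    "clinical_knowledge": 55.47,
    "professional_medicine": 47.05,
    "college_biology": 47.22,
}

# ASSUMPTION: question counts of the standard English MMLU test split.
# The README does not give the sizes of the Persian subsets.
NUM_QUESTIONS = {
    "anatomy": 135,
    "medical_genetics": 100,
    "college_medicine": 173,
    "clinical_knowledge": 265,
    "professional_medicine": 272,
    "college_biology": 144,
}


def weighted_average(accuracy, counts):
    """Average accuracy with each subject weighted by its question count."""
    total_questions = sum(counts[subject] for subject in accuracy)
    weighted_sum = sum(accuracy[subject] * counts[subject] for subject in accuracy)
    return weighted_sum / total_questions


def simple_average(accuracy):
    """Unweighted mean over subjects, for comparison."""
    return sum(accuracy.values()) / len(accuracy)


weighted = weighted_average(ACCURACY, NUM_QUESTIONS)
unweighted = simple_average(ACCURACY)
print(f"question-weighted: {weighted:.2f}, unweighted: {unweighted:.2f}")
```

With these counts the question-weighted average rounds to the reported 49.31, while the unweighted mean is roughly 49.1. If the translated subsets differ in size from the English originals, the weights would need to be adjusted.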

free form qa

win rate against pipeline alternatives:

win rate against general purpose language models:

⚠️ Risks and Limitations

While Gaokerena aims to provide relatively accurate information, it is not a substitute for professional medical advice. The model may have limitations in:

Handling medical emergencies.
Addressing highly specialized or rare medical conditions.
Offering region-specific guidance, as the training data does not include localized Persian medical practices.

⛔️ License

CC BY-NC-SA 4.0 (non-commercial use only)

🤝 Collaborators

Mehrdad Ghassabi
Pedram Rostami
Dr. Hamid Reza Baradaran Kashani
Amirhossein Poursina
Zahra Kazemi
Milad Tavakoli

🙏🏼 Acknowledgement

We would like to thank

Amir Jahani for his help with the data cleaning process.
journeyfree.ai and Skype for logo creation.
Mohammad Ghafghazian for crawling a small portion of the dryab site and publishing it here; we used his data in MF3QA.
