
Mehrdadghassabi/Gaokerena-V


📃 Paper | 🤗 Hugging Face repository | 🚀 Quick start


📍 Overview

Welcome to the Gaokerena Project! We’re excited to share an innovative initiative aimed at advancing natural language processing for the Persian-speaking medical community.
Gaokerena is designed to run even on home devices while keeping privacy and security—essential for medical use—at the forefront. We trained it on a new Persian medical dataset, including free-form Q&A, to make healthcare information more accessible and interactions safer.
AI has huge potential to improve medicine, and with Gaokerena, we’re working to bring that potential to the Persian-speaking world.

🌱 Our contribution

  • Introducing the first open-source Persian medical language model

  • Introducing high-quality Persian medical resources, including:

    1. A 90M-token Persian medical corpus (crawled from diverse sources).

    2. MF3QA: about 186k medical free-form QA pairs (crawled from diverse sources) and 20k cleaned QA pairs.

    3. A Persian translation of the K-QA benchmark.

    4. A Persian translation of the medical portion of the MMLU benchmark.

🕵🏼‍♀️ Features

  • First Open-Source Persian Medical Model: The only publicly available Persian language model fine-tuned specifically for medical applications, freely available for research and other applications.

  • The first small (sub-8-billion-parameter) language model to pass the Iranian Basic Medical Sciences Entrance Exam (کنکور علوم پایه پزشکی) under real-world conditions.

  • Great Results: Delivers better results than related models, including pipelines that pair English medical models with translation systems. It interprets medical questions accurately and gives clear answers in Persian.

  • Focus on Privacy and Ease: Because it is built on a small language model, it can be deployed locally, so sensitive medical data stays secure and confidential.
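To make the local-deployment point concrete, here is a rough sketch of how much memory the weights alone would need at different precisions. It assumes a nominal 8 billion parameters (taken from the base model's name, not a measured count) and ignores activations, the KV cache, and runtime overhead, so treat the numbers as lower bounds rather than hardware requirements.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: nominal parameter count implied by the name "aya-expanse-8b".
NUM_PARAMS = 8e9

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, half-precision weights need roughly 15 GiB, while 4-bit quantization brings that below 4 GiB, which is within reach of many consumer GPUs and laptops.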

📚 Base model

Gaokerena is built on aya-expanse-8b, a robust and efficient language model selected for its proven performance and adaptability. This base model was fine-tuned to address the specific requirements of Persian medical applications, ensuring optimal accuracy and performance.

🏃 Training process

Gaokerena was trained in two stages. First, the aya-expanse-8b base model was fine-tuned on 60% of our Persian medical corpus, using LoRA for efficiency. Second, it was instruction-tuned on our free-form question-answering dataset, MF3QA, to optimize it for Persian medical queries. Training ran for 19 hours on A100 PCIe 40G hardware on Google Cloud Platform (asia-east1 region), with an estimated carbon footprint of 2.66 kg CO2 equivalent.
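The two efficiency figures in this section can be checked with back-of-the-envelope arithmetic. The sketch below is not the project's training code. It assumes a hypothetical 4096×4096 weight matrix and LoRA rank 16 (neither is stated in this README), a 250 W power draw for the A100 PCIe 40G, and a grid intensity of 0.56 kg CO2eq/kWh for asia-east1. The last two values are assumptions chosen because they reproduce the reported 2.66 kg.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def training_emissions_kg(power_watts: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Energy used (kWh) multiplied by the grid's carbon intensity."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


# Hypothetical layer shape and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 16
full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
print(f"LoRA trains {lora:,} of {full:,} weights ({100 * lora / full:.2f}%)")

# Assumed 250 W draw and 0.56 kg CO2eq/kWh; 19 hours is from the README.
print(f"Estimated emissions: {training_emissions_kg(250, 19, 0.56):.2f} kg CO2eq")
```

With these assumed values, one adapted layer trains under 1% of the weights it would under full fine-tuning, and the emissions estimate matches the figure reported above.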

📊 Results

We have fully published the results here. Our model correctly answered about half of the questions in the medical portion of the MMLU dataset and passed the Iranian Basic Medical Sciences Entrance Exam (IBMSEE) of Sept 2023 (کنکور علوم پایه پزشکی شهریور ۱۴۰۲), while the alternatives we tested failed to.

Multiple-choice QA

Results against pipeline alternatives:

| Benchmark (accuracy, %) | Gaokerena-V (ours) | MedMobile + gemma2b-it | MedMobile + parsinlu |
|---|---|---|---|
| MMLU-anatomy(fa) | 48.14 | 14.07 | 25.18 |
| MMLU-medicalgenetics(fa) | 53.0 | 20.0 | 35.0 |
| MMLU-collegemedicine(fa) | 43.93 | 19.08 | 27.17 |
| MMLU-clinicalknowledge(fa) | 55.47 | 27.54 | 31.70 |
| MMLU-professionalmedicine(fa) | 47.05 | 17.27 | 33.82 |
| MMLU-collegebiology(fa) | 47.22 | 18.75 | 31.25 |
| MMLU(avg) | 49.31 | 20.11 | 30.99 |
| IBMSEE Sept 2023 | 38.69 | 24.40 | 32.73 |

Results against general-purpose language models:

| Benchmark (accuracy, %) | Gaokerena-V (ours) | aya_expanse8b (baseline) | Qwen2.5 | PersianMind |
|---|---|---|---|---|
| MMLU-anatomy(fa) | 48.14 | 40.74 | 41.48 | 25.18 |
| MMLU-medicalgenetics(fa) | 53.0 | 49.0 | 52.0 | 34.0 |
| MMLU-collegemedicine(fa) | 43.93 | 44.51 | 43.35 | 20.23 |
| MMLU-clinicalknowledge(fa) | 55.47 | 52.07 | 47.92 | 25.28 |
| MMLU-professionalmedicine(fa) | 47.05 | 45.58 | 43.01 | 23.89 |
| MMLU-collegebiology(fa) | 47.22 | 45.14 | 42.36 | 32.63 |
| MMLU(avg) | 49.31 | 46.64 | 45.17 | 25.89 |
| IBMSEE Sept 2023 | 38.69 | 34.52 | 33.33 | 19.64 |
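The README does not say how the MMLU(avg) row is computed. A question-weighted (micro) average appears to reproduce the Gaokerena-V value, whereas a plain mean of the six subset scores does not. The sketch below checks this. The subset sizes are an assumption: they are the question counts of the original English MMLU test split, taken to be unchanged by the Persian translation.

```python
# Assumption: question counts from the original English MMLU test split.
SUBSET_SIZES = {
    "anatomy": 135,
    "medical_genetics": 100,
    "college_medicine": 173,
    "clinical_knowledge": 265,
    "professional_medicine": 272,
    "college_biology": 144,
}

# Gaokerena-V accuracies (%) copied from the tables above.
GAOKERENA_ACC = {
    "anatomy": 48.14,
    "medical_genetics": 53.0,
    "college_medicine": 43.93,
    "clinical_knowledge": 55.47,
    "professional_medicine": 47.05,
    "college_biology": 47.22,
}


def micro_average(acc: dict, sizes: dict) -> float:
    """Question-weighted accuracy: total correct answers over total questions."""
    correct = sum(round(acc[name] / 100 * sizes[name]) for name in acc)
    total = sum(sizes[name] for name in acc)
    return 100 * correct / total


def macro_average(acc: dict) -> float:
    """Unweighted mean of the per-subset accuracies."""
    return sum(acc.values()) / len(acc)


print(f"micro: {micro_average(GAOKERENA_ACC, SUBSET_SIZES):.2f}")
print(f"macro: {macro_average(GAOKERENA_ACC):.3f}")
```

Under the assumed subset sizes, the micro average rounds to the reported 49.31, so larger subsets such as professional medicine and clinical knowledge weigh more heavily in the headline number.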

Free-form QA

Win rate against pipeline alternatives:

[Figure: win rate against pipeline alternatives]

Win rate against general-purpose language models:

[Figure: win rate against general-purpose language models]

⚠️ Risks and Limitations

While Gaokerena aims to provide relatively accurate information, it is not a substitute for professional medical advice. The model may have limitations in:

  • Handling medical emergencies.
  • Addressing highly specialized or rare medical conditions.
  • Offering region-specific guidance, as the training data does not include localized Persian medical practices.

⛔️ License

CC BY-NC-SA 4.0 (non-commercial use only)

🤝 Collaborators

  1. Mehrdad Ghassabi
  2. Pedram Rostami
  3. Dr. Hamid Reza Baradaran Kashani
  4. Amirhossein Poursina
  5. Zahra Kazemi
  6. Milad Tavakoli

🙏🏼 Acknowledgement

We would like to thank

  • Amir Jahani for his help with the data cleaning process.
  • journeyfree.ai and Skype for logo creation.
  • Mohammad Ghafghazian for crawling a small portion of the dryab site and publishing it here; we used his data in MF3QA.

About

Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model
