Aarohan AI — Fine-Tuning Guide

🎯 Goal

Fine-tune Gemma 4 E4B-it on NCERT Class 11-12 Science content to make it an expert tutor that:

  • Explains concepts step-by-step from the textbook
  • Answers in Gujarati/Hindi/English
  • References specific NCERT chapters and page numbers
  • Generates NCERT-aligned quiz questions

📋 Prerequisites

| Item | Status |
|---|---|
| Gemma 4 E4B-it model (HF format) | ✅ In models/ folder |
| NCERT PDFs (20 books, EN + GJ) | ✅ In Data/NCERT Books/ |
| JEE Papers (30 PDFs) | ✅ In Data/JEE Previous Year Papers/ |
| NEET Papers (6 PDFs) | ✅ In Data/NEET Previous Year Papers/ |
| Unsloth Studio | ✅ Installed locally |
| Kaggle/Colab GPU | Required for training |

Step 1: Prepare the Dataset

1.1 Extract text from NCERT PDFs

Run this Python script from the Data/ folder (its paths are relative to that folder) to extract text from your PDFs page by page:

# File: Data/extract_ncert.py
import os
import json

# pip install pymupdf
import fitz  # PyMuPDF

def extract_pdf_text(pdf_path):
    """Extract text from a PDF file, page by page."""
    doc = fitz.open(pdf_path)
    pages = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text("text").strip()
        if text and len(text) > 50:  # Skip mostly-empty pages
            pages.append({
                "page_number": page_num + 1,
                "text": text,
            })
    doc.close()
    return pages

def main():
    books_dir = "NCERT Books"
    output = []
    
    for filename in sorted(os.listdir(books_dir)):
        if not filename.endswith(".pdf"):
            continue
        
        filepath = os.path.join(books_dir, filename)
        print(f"Extracting: {filename}")
        
        # Parse metadata from filename
        parts = filename.replace(".pdf", "").split(" - ")
        language = parts[-1] if len(parts) > 1 else "Eng"
        
        pages = extract_pdf_text(filepath)
        for page in pages:
            output.append({
                "source": filename,
                "language": language,
                "page": page["page_number"],
                "content": page["text"],
            })
    
    with open("ncert_extracted.json", "w", encoding="utf-8") as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    
    pdf_count = len({item["source"] for item in output})  # PDFs that yielded at least one page
    print(f"\n✅ Extracted {len(output)} pages from {pdf_count} PDFs")

if __name__ == "__main__":
    main()
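
The language tag in `main()` is taken from whatever follows the last " - " in the filename. A minimal sketch of that rule in isolation makes it easy to check before a full extraction run. The filenames below are hypothetical (the actual names in `NCERT Books/` may differ), and `parse_language` is an illustrative helper, not part of the original script.

```python
import os

def parse_language(filename: str, default: str = "Eng") -> str:
    """Return the text after the last ' - ' in the filename stem, else the default."""
    stem, _ext = os.path.splitext(filename)
    parts = stem.split(" - ")
    return parts[-1].strip() if len(parts) > 1 else default

# Hypothetical filenames, for illustration only.
for name in ["Physics Part 1 - Guj.pdf", "Chemistry Part 2 - Eng.pdf", "Biology.pdf"]:
    print(f"{name!r} -> {parse_language(name)}")
```

If a book's filename has no " - " separator, it is silently tagged as English, so it may be worth scanning the printed tags once before trusting the `language` field.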

1.2 Create Training Conversations

Convert extracted text into the Gemma 4 chat format:

# File: Data/create_dataset.py
import json
import random

def load_extracted():
    with open("ncert_extracted.json", "r", encoding="utf-8") as f:
        return json.load(f)

def create_training_examples(pages):
    """Create Q&A training examples from NCERT content."""
    examples = []
    
    for page in pages:
        content = page["content"]
        source = page["source"]
        page_num = page["page"]
        lang = page["language"]
        
        if len(content) < 200:
            continue
        
        # Template 1: Explain a concept
        examples.append({
            "messages": [
                {
                    "role": "system",
                    "content": "You are Aarohan AI, an expert NCERT tutor for Class 11-12 Science students in India. "
                               "Always explain step-by-step, reference the textbook, and be encouraging."
                },
                {
                    "role": "user",
                    "content": f"Explain the concept from {source}, page {page_num}"
                },
                {
                    "role": "model",  # NOTE: Gemma uses "model", not "assistant"
                    "content": f"**From {source}, Page {page_num}:**\n\n{content[:1500]}\n\n"
                               f"**Key Takeaway:** Study this section carefully — it's important for your exams! 📚"
                }
            ]
        })
        
        # Template 2: Summarize for quick revision
        if len(content) > 500:
            summary = content[:800]
            examples.append({
                "messages": [
                    {
                        "role": "system", 
                        "content": "You are Aarohan AI. Provide concise summaries for exam revision."
                    },
                    {
                        "role": "user",
                        "content": f"Give me a quick summary of the content on page {page_num} of {source}"
                    },
                    {
                        "role": "model",
                        "content": f"**Quick Revision — {source}, Page {page_num}:**\n\n{summary}\n\n"
                                   f"📝 **Remember:** Focus on understanding the concepts, not memorizing!"
                    }
                ]
            })
    
    return examples

def main():
    pages = load_extracted()
    examples = create_training_examples(pages)
    
    # Shuffle and split
    random.shuffle(examples)
    split = int(len(examples) * 0.9)
    train = examples[:split]
    val = examples[split:]
    
    with open("train_dataset.json", "w", encoding="utf-8") as f:
        json.dump(train, f, ensure_ascii=False, indent=2)
    
    with open("val_dataset.json", "w", encoding="utf-8") as f:
        json.dump(val, f, ensure_ascii=False, indent=2)
    
    print(f"✅ Created {len(train)} training + {len(val)} validation examples")

if __name__ == "__main__":
    main()
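
Both templates cut the page text at a fixed character count (`content[:1500]` and `content[:800]`), which can end a training answer mid-sentence. One possible refinement, offered here as an assumption rather than part of the original pipeline, is to back up to the last sentence terminator inside the limit. The helper name `truncate_at_sentence` is hypothetical; the danda (।) is included because Hindi text uses it as a full stop.

```python
def truncate_at_sentence(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters, preferring a sentence boundary."""
    if len(text) <= limit:
        return text
    window = text[:limit]
    # Position of the last sentence terminator inside the window (-1 if none).
    cut = max(window.rfind(ch) for ch in (".", "?", "!", "।"))
    # Fall back to a hard cut if no terminator appears in the second half.
    if cut < limit // 2:
        return window
    return window[: cut + 1]

sample = "Force equals mass times acceleration. This is Newton's second law. It applies to"
print(truncate_at_sentence(sample, 70))
```

In `create_training_examples`, this would stand in for the slices, for example `truncate_at_sentence(content, 1500)`.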

Important

The dataset format uses "role": "model" (NOT "role": "assistant"). This is specific to Gemma.
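
Because a wrong role name only surfaces later, at training time, it may be worth checking the generated examples before uploading them. A minimal sketch, assuming the `messages` layout produced by `create_dataset.py` (system, user, model, in that order); `validate_example` is a hypothetical helper name.

```python
EXPECTED_ROLES = ["system", "user", "model"]

def validate_example(example: dict) -> list:
    """Return a list of problems found in one training example (empty if none)."""
    problems = []
    messages = example.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        if not str(m.get("content", "")).strip():
            problems.append(f"empty content for role {m.get('role')!r}")
    return problems

good = {"messages": [
    {"role": "system", "content": "You are Aarohan AI."},
    {"role": "user", "content": "Explain the concept from page 12"},
    {"role": "model", "content": "From page 12: ..."},
]}
bad = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": ""},
]}
print(validate_example(good))
print(validate_example(bad))
```

To check a whole file, load `train_dataset.json` with `json.load` and report every example for which the returned list is non-empty.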


Step 2: Fine-Tune with Unsloth

Option A: Using Unsloth Studio (Your Local Setup)

  1. Start Unsloth Studio:

    unsloth studio -H 0.0.0.0 -p 8888
  2. Open http://localhost:8888 in your browser

  3. In the Studio UI:

    • Model: Select unsloth/gemma-4-e4b-it (or point to your local models/gemma-4-transformers-gemma-4-e4b-it-v1/)
    • Dataset: Upload train_dataset.json
    • LoRA Rank: 16 (good balance of quality vs speed)
    • Learning Rate: 2e-4
    • Epochs: 3
    • Max Seq Length: 2048
    • Click Start Training

Option B: Using Kaggle GPU (Free T4/P100)

Tip

Kaggle's free tier gives you roughly 30 hours/week of GPU time, which you can use for training. If available, select 2x T4 (about 30GB VRAM in total) instead of a single T4 for more headroom and faster training.

If Kaggle gives you a P100 and you see the sm_60 / PyTorch incompatibility warning, install a CUDA 11.8 build of PyTorch before importing Unsloth. That wheel still supports the P100, while the newer CUDA 12.8 wheel does not.

# Cell 0: Fix P100-compatible PyTorch if needed
import torch
print(torch.__version__)

# If you see a CUDA 12.8 wheel or the sm_60 warning, run this cell first.
!pip uninstall -y torch torchvision torchaudio triton xformers
!pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu118 \
    torch torchvision torchaudio
!pip install --no-cache-dir unsloth datasets trl accelerate peft bitsandbytes
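
The `sm_60` in the warning is the P100's CUDA compute capability (6.0); the T4 reports 7.5. To decide whether the reinstall above is needed, the check could be expressed as the small sketch below. It is kept free of any `torch` import so it runs anywhere; in a notebook the tuple would come from `torch.cuda.get_device_capability(0)`. The function name and messages are illustrative assumptions.

```python
def gpu_advice(capability: tuple) -> str:
    """Map a CUDA compute capability (major, minor) to the advice given in this guide."""
    major, minor = capability
    if (major, minor) == (6, 0):
        # P100 (sm_60): the guide recommends a CUDA 11.8 PyTorch build or a T4.
        return "sm_60 detected: install the cu118 PyTorch build or switch to a T4"
    return f"sm_{major}{minor}: no PyTorch downgrade suggested by this guide"

print(gpu_advice((6, 0)))  # P100
print(gpu_advice((7, 5)))  # T4
```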

Create a new Kaggle notebook and run:

# Cell 1: Load Model (Multi-GPU Support)
from unsloth import FastLanguageModel
import torch

# For dual T4 GPUs, use device_map="auto" to distribute across both GPUs
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-e4b-it",  # Auto-downloads from HF
    max_seq_length=2048,
    dtype=None,              # Auto-detect (float16 on T4)
    load_in_4bit=True,       # QLoRA — 4-bit quantization
    device_map="auto",       # CRITICAL: Distribute model across available GPUs
)

print(f"✅ Model loaded!")
print(f"🖥️  Using {torch.cuda.device_count()} GPUs:")
for i in range(torch.cuda.device_count()):
    print(f"   GPU {i}: {torch.cuda.get_device_name(i)} ({torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB)")

# Cell 2: Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,           # 2x rank is standard
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # 60% less VRAM!
)

print("✅ LoRA adapters added.")
model.print_trainable_parameters()  # Prints the counts itself and returns None

# Cell 3: Load YOUR Dataset
from datasets import load_dataset

# Upload train_dataset.json to Kaggle first!
dataset = load_dataset("json", data_files="/kaggle/input/datasets/mananmonani/gemma-4-fine-tunning-curated-dataset/train_dataset.json", split="train")

print(f"✅ Dataset loaded: {len(dataset)} examples")

# Cell 4: Format for Training
def format_chat(example):
    """Apply Gemma 4 chat template to each example."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat)

# Cell 5: Training (Optimized for Dual T4 GPUs ~30GB VRAM)
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

# With 2x T4 (30GB total), we can use much larger batch and sequence length
# Use ~80% of dataset for training, rest for validation
train_size = int(len(dataset) * 0.8)
train_data = dataset.select(range(train_size))
val_data = dataset.select(range(train_size, len(dataset)))

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_data,
    eval_dataset=val_data,  # Validation set
    dataset_text_field="text",
    max_seq_length=2048,  # Full sequence length (30GB allows this)
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=4,  # Increased from 1 → 4 (dual T4s can handle it)
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=2,   # Effective batch = 8
        warmup_steps=20,
        num_train_epochs=3,  # Full epochs
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=50,
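        eval_strategy="steps",  # Needed for eval_steps to take effect (older transformers releases call this evaluation_strategy)
        eval_steps=200,  # Evaluate every 200 steps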
        save_strategy="steps",
        save_steps=200,
        optim="adamw_8bit",
        seed=42,
        dataloader_num_workers=4,  # Faster data loading
        dataloader_pin_memory=True,
    ),
)

print(f"🚀 Training on {len(train_data)} examples with validation on {len(val_data)} examples...")
print(f"🖥️  Using {torch.cuda.device_count()} GPUs with {torch.cuda.get_device_name(0)}")
trainer.train()
print("✅ Training complete!")
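
The evaluation and checkpoint intervals in Cell 5 only matter if the run is long enough to reach them. A back-of-the-envelope sketch of the step count, assuming a single training process (so the effective batch is `per_device_train_batch_size × gradient_accumulation_steps`) and the 80/20 split from Cell 5. The exact count the trainer reports may differ slightly between library versions.

```python
import math

def estimate_steps(num_examples, train_fraction=0.8, per_device_batch=4,
                   grad_accum=2, epochs=3):
    """Estimate optimizer steps per epoch and in total for the Cell 5 settings."""
    train_size = int(num_examples * train_fraction)
    batches_per_epoch = math.ceil(train_size / per_device_batch)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch, steps_per_epoch * epochs

for n in (500, 5000):
    per_epoch, total = estimate_steps(n)
    print(f"{n} examples -> {per_epoch} steps/epoch, {total} total")
```

Under these assumptions a 500-example pilot finishes at about step 150, before the first `save_steps=200` checkpoint or `eval_steps=200` evaluation, so for small runs those intervals would need to be lowered (for example to 50).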

# Cell 6: Save & Export
# Save LoRA adapter
model.save_pretrained("aarohan-ai-lora")
tokenizer.save_pretrained("aarohan-ai-lora")

# Merge and save full model (for Ollama)
model.save_pretrained_merged(
    "aarohan-ai-merged",
    tokenizer,
    save_method="merged_16bit",  # Merge LoRA into the base model and save 16-bit weights
)

print("✅ Model saved!")

# Cell 7: Export to GGUF (for Ollama)
model.save_pretrained_gguf(
    "aarohan-ai-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good quality-to-size ratio
)

print("✅ GGUF exported! Download aarohan-ai-gguf/ for Ollama")

If you do not want to downgrade PyTorch, the other fix is to switch the Kaggle accelerator from P100 to T4. That avoids the sm_60 compatibility issue entirely.

Option C: Using Google Colab (Free T4)

Same code as Kaggle, but:

  1. Upload train_dataset.json to your Google Drive
  2. Mount Drive: from google.colab import drive; drive.mount('/content/drive')
  3. Change dataset path to /content/drive/MyDrive/train_dataset.json

Step 3: Deploy the Fine-Tuned Model

3.1 For Ollama (Local Development)

# Create Modelfile
cat > Modelfile <<EOF
FROM ./aarohan-ai-gguf/unsloth.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<start_of_turn>user
{{ .System }}
<end_of_turn>
{{ end }}<start_of_turn>user
{{ .Prompt }}
<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>"""
PARAMETER temperature 0.7
PARAMETER top_p 0.95
PARAMETER stop "<end_of_turn>"
SYSTEM "You are Aarohan AI, an expert NCERT tutor for Indian students."
EOF

# Import into Ollama
ollama create aarohan-ai -f Modelfile

# Test
ollama run aarohan-ai "Explain Newton's second law"
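
Once the model is registered, it can also be queried from Python through Ollama's local HTTP API, which is convenient for scripted spot checks. A minimal sketch using only the standard library, assuming Ollama is listening on its default port 11434; `build_payload` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt: str, model: str = "aarohan-ai") -> bytes:
    """Encode a non-streaming generate request for the given model."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")

def ask(prompt: str, model: str = "aarohan-ai", timeout: float = 120.0) -> str:
    """Send one prompt to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]

# Requires the Ollama server to be running with the model created above:
# print(ask("Explain Newton's second law"))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the parsing to a single `json.loads` call.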

3.2 For On-Device (LiteRT-LM)

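Convert the merged model to LiteRT-LM format using the ai-edge-torch tool. Treat the snippet as an outline of the intended flow, not a verified command: check the convert call shown (with a tokenizer and output_path) against the ai-edge-torch documentation for your installed version, since LLM conversion there normally goes through the model-specific generative conversion scripts.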

pip install ai-edge-torch

python -c "
import ai_edge_torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('aarohan-ai-merged')
tokenizer = AutoTokenizer.from_pretrained('aarohan-ai-merged')

# Convert to LiteRT-LM format
ai_edge_torch.convert(model, tokenizer, output_path='aarohan-ai.task')
print('✅ LiteRT-LM model exported!')
"

Then install in the Flutter app:

await FlutterGemma.installModel(
  modelType: ModelType.gemma4,
).fromFile('/path/to/aarohan-ai.task').install();

⚠️ Common Errors & Fixes

| Error | Fix |
|---|---|
| OutOfMemoryError during training | Reduce max_seq_length to 512–1024, reduce per_device_train_batch_size to 1–2 (raise gradient_accumulation_steps to keep the effective batch size), and validate the pipeline on a smaller subset first (start with 1000 examples) |
| ValueError: Some modules are dispatched on the CPU or the disk | Use device_map="auto" in from_pretrained() to distribute the model across all GPUs. Ensure you're using dual T4 on Kaggle. |
| use_cache=True gibberish output | Unsloth handles this automatically. Don't set use_cache manually. |
| tokenizer.apply_chat_template fails | Ensure your messages use "role": "model", not "role": "assistant" |
| VRAM insufficient for E4B | Use load_in_4bit=True (QLoRA), which needs ~10GB VRAM. Reduce max_seq_length to 512 if still OOM. |
| sm_60 / P100 compatibility warning | Use the CUDA 11.8 PyTorch install cell above, or switch Kaggle to a T4 GPU |
| Kaggle session timeout | Save checkpoints regularly (save_steps in Cell 5) and resume with trainer.train(resume_from_checkpoint=True) |
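
For the session-timeout case, the Hugging Face trainer writes `checkpoint-<step>` folders under `output_dir`, and `trainer.train(resume_from_checkpoint=...)` accepts such a path. A stdlib-only sketch for locating the newest one follows; the helper name `latest_checkpoint` is an assumption, and `transformers.trainer_utils.get_last_checkpoint` offers similar behaviour.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir: str):
    """Return the path of the highest-numbered checkpoint-N folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demonstration with throwaway folders standing in for real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (200, 400, 1000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(latest_checkpoint(tmp)))

# In the notebook: trainer.train(resume_from_checkpoint=latest_checkpoint("outputs"))
```

Comparing the step numbers as integers matters here: a plain string sort would rank `checkpoint-400` above `checkpoint-1000`.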

📊 Expected Results

| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| NCERT Accuracy | ~60-70% | ~85-95% |
| Gujarati Quality | Generic | NCERT-specific terminology |
| Exam Relevance | Low | High (JEE/NEET aligned) |
| Response Style | Generic AI | Step-by-step tutor format |

🔄 Iteration Plan

  1. v1 — Extract → Train (500 examples) → Test manually → Fix issues
  2. v2 — Add JEE/NEET questions → Train (1000+ examples) → Benchmark
  3. v3 — Add Gujarati examples → Train with bilingual data → Ship

Tip

Start with a small dataset (500 examples) to validate the pipeline works, then scale up.
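
For the v1 pilot run, a fixed-seed subsample keeps experiments repeatable. A minimal sketch, assuming the list-of-examples layout of `train_dataset.json` from Step 1.2; the function name and default seed are illustrative.

```python
import random

def subsample(examples: list, n: int = 500, seed: int = 42) -> list:
    """Return up to n examples chosen reproducibly with a private RNG."""
    if len(examples) <= n:
        return list(examples)
    return random.Random(seed).sample(examples, n)

# Stand-in data; in practice load the list from train_dataset.json.
fake_examples = [{"id": i} for i in range(2000)]
pilot = subsample(fake_examples)
print(len(pilot))
```

Using a private `random.Random(seed)` instance leaves the global random state untouched, so the shuffle in `create_dataset.py` is unaffected.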