A GPT-style transformer model trained on the Thirukkural, inspired by Andrej Karpathy's "Let's build GPT" tutorial. This project demonstrates building a character-level language model for Tamil text, using Thiruvalluvar's ancient verses as the training corpus.
Thirukkural is a classical Tamil text of 1,330 couplets covering ethics, politics, and love. Composed by Thiruvalluvar over 2,000 years ago, each kural follows a strict seven-word meter (four cirs in the first line, three in the second), making it a rich corpus for studying Tamil poetic patterns.
- Source: Web-scraped from thirukkural.gokulnath.com
- Size: 1,330 couplets (~7,840 Tamil words)
- Processing: Character-level tokenization with Tamil Unicode normalization
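The tokenization step can be sketched as follows. This is a minimal illustration rather than the project's exact code: the file name `thirukkural.txt` is a hypothetical placeholder, and NFC normalization via Python's `unicodedata` module stands in for whatever normalization the actual pipeline applies.

```python
import unicodedata

# Read the corpus and normalize Tamil code points to NFC so that
# visually identical characters share one code-point sequence.
# `thirukkural.txt` is a hypothetical file name for the scraped corpus.
with open("thirukkural.txt", encoding="utf-8") as f:
    text = unicodedata.normalize("NFC", f.read())

chars = sorted(set(text))                  # vocabulary = every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]    # string -> list of token ids
decode = lambda ids: "".join(itos[i] for i in ids)

print(f"vocab size: {len(chars)}")
print(decode(encode("அறம்")))              # round-trips to the same string
```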
- Transformer-based with multi-head self-attention
- Character-level tokenization optimized for Tamil
- 5 layers, 4 attention heads, and a 124-dimensional embedding (see the model sketch after this list)
- Generates Tamil text in Thirukkural's philosophical style
- Preserves classical Tamil meter and vocabulary
- Handles Tamil Unicode complexities such as combining vowel signs and multi-code-point grapheme clusters
- Domain-specific training on ethical and moral themes
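A minimal PyTorch sketch of that architecture, in the spirit of the Karpathy tutorial. The hyperparameters match the list above; everything else (the class names, the `block_size` context length, GELU in the feed-forward, the use of `nn.MultiheadAttention` instead of hand-rolled heads) is an illustrative assumption, not the project's exact implementation.

```python
import torch
import torch.nn as nn

n_layer, n_head, n_embd = 5, 4, 124   # 124 dims split into 4 heads of 31
block_size = 128                      # context length (assumed, not from the repo)

class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        # Upper-triangular boolean mask: True = position may not be attended to.
        mask = torch.triu(torch.ones(block_size, block_size), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        T = x.size(1)
        out, _ = self.attn(x, x, x, attn_mask=self.mask[:T, :T], need_weights=False)
        return out

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)
        self.sa = CausalSelfAttention()
        self.ff = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                                nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        x = x + self.sa(self.ln1(x))   # pre-norm residual attention
        x = x + self.ff(self.ln2(x))   # pre-norm residual feed-forward
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        T = idx.size(1)
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        return self.head(self.ln_f(self.blocks(x)))

# Shape check: batch of 2 sequences -> per-position logits over the vocabulary.
model = TinyGPT(vocab_size=100)
print(model(torch.randint(0, 100, (2, block_size))).shape)  # [2, 128, 100]
```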
The model successfully learns:
- Thirukkural's poetic structure and rhythm
- Classical Tamil vocabulary patterns
Training: ~10 minutes on a single T4 GPU, reaching a final loss of 2.9296
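For reference, a character-level training loop and sampler in this style typically look like the sketch below. It reuses `text`, `chars`, `encode`/`decode`, `TinyGPT`, and `block_size` from the earlier sketches; the batch size, step count, and learning rate are assumptions, not the settings that produced the loss above.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyGPT(vocab_size=len(chars)).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = torch.tensor(encode(text), dtype=torch.long)

def get_batch(batch_size=64):
    # Sample random contiguous windows; targets are inputs shifted by one char.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

for step in range(5000):                  # step count is an assumption
    xb, yb = get_batch()
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Autoregressive sampling: feed the model's own output back one character at a time.
idx = torch.zeros((1, 1), dtype=torch.long, device=device)
for _ in range(200):
    logits = model(idx[:, -block_size:])          # crop to the context window
    probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution over next char
    idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
print(decode(idx[0].tolist()))
```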
- Andrej Karpathy - GPT tutorial that inspired this project
- Thiruvalluvar - Original author of Thirukkural
- thirukkural.gokulnath.com - Digital Thirukkural resource (dataset source)
- "Attention Is All You Need" - Original Transformer paper
⭐ Star if you found this helpful!
🔀 Fork to experiment with other classical texts!
Contact: santhoshrao95.2@gmail.com