This chapter covers attention mechanisms, which are the core engine of Large Language Models (LLMs).
Three versions of this chapter are available:

- Original Jupyter notebook
  - Source: Original Jupyter notebook from the LLMs-from-scratch repository by Sebastian Raschka
  - Language: English
  - Format: Jupyter Notebook (.ipynb)
  - Description: The original implementation covering all aspects of attention mechanisms for LLMs
- marimo_ch03_zh_tw.py
  - Source: Converted and translated version
  - Language: Traditional Chinese (繁體中文)
  - Format: Marimo notebook (.py)
  - Description: Interactive marimo notebook version with Traditional Chinese translations
  - How to run: `marimo edit marimo_ch03_zh_tw.py`
- marimo_ch03.py
  - Language: English
  - Format: Marimo notebook (.py)
  - Description: English version in marimo format
  - How to run: `marimo edit marimo_ch03.py`
This chapter implements the following attention mechanisms:
- Simple Self-Attention (3.3.1): A basic self-attention mechanism without trainable weights, for illustration purposes
- Self-Attention with Trainable Weights (3.4): Scaled dot-product attention with Query, Key, and Value weight matrices
- Causal Attention (3.5): Self-attention with causal masking to prevent the model from accessing future tokens
  - Implements causal masking using attention masks
  - Adds dropout for regularization
- Multi-Head Attention (3.6): Extends single-head attention to multiple attention heads
  - MultiHeadAttentionWrapper: Stacks multiple single-head attention modules
  - MultiHeadAttention: Efficient implementation with weight splitting
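The simplest of these mechanisms (3.3.1) can be sketched in a few lines of PyTorch. This is a minimal illustration, not the notebook's exact code; the toy embedding values are arbitrary:

```python
import torch

# Toy embeddings: 6 tokens, embedding dimension 3 (values are illustrative)
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

# Attention scores: dot product between every pair of token embeddings
attn_scores = inputs @ inputs.T

# Attention weights: softmax-normalize each row so it sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# Context vectors: weighted sum of all token embeddings
context_vecs = attn_weights @ inputs
print(context_vecs.shape)  # torch.Size([6, 3])
```

Because there are no trainable parameters, this variant only demonstrates the score → weight → context-vector pipeline that the later versions build on.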
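The trainable version (3.4) adds Query, Key, and Value projection matrices. A minimal sketch, with illustrative class name and dimensions:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention with trainable Q, K, V projections."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.transpose(-2, -1)
        # Scale by sqrt(d_k) before softmax for training stability
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values

torch.manual_seed(123)
sa = SelfAttention(d_in=3, d_out=2)
x = torch.rand(6, 3)  # 6 tokens, embedding dimension 3
print(sa(x).shape)  # torch.Size([6, 2])
```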
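Causal attention (3.5) adds the masking and dropout steps listed above. The sketch below shows one common way to implement them, masking future positions with `-inf` before the softmax:

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    """Self-attention with a causal mask and dropout on the attention weights."""
    def __init__(self, d_in, d_out, context_length, dropout=0.1):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask: 1 where a token would attend to a future token
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        num_tokens = x.shape[1]
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        attn_scores = queries @ keys.transpose(-2, -1)
        # Set future positions to -inf so softmax assigns them zero weight
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)  # regularization during training
        return attn_weights @ values

torch.manual_seed(123)
ca = CausalAttention(d_in=3, d_out=2, context_length=6)
batch = torch.rand(2, 6, 3)  # (batch, tokens, embedding dimension)
print(ca(batch).shape)  # torch.Size([2, 6, 2])
```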
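The weight-splitting idea behind the efficient multi-head implementation (3.6) can be sketched as follows: project Q, K, V once, reshape the output dimension into `(num_heads, head_dim)`, attend per head in parallel, then merge the heads. Hyperparameters here are illustrative:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head causal attention that splits one Q/K/V projection into heads."""
    def __init__(self, d_in, d_out, context_length, num_heads, dropout=0.1):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape

        def split(t):
            # (b, tokens, d_out) -> (b, heads, tokens, head_dim)
            return t.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        queries, keys, values = map(
            split, (self.W_query(x), self.W_key(x), self.W_value(x))
        )
        attn_scores = queries @ keys.transpose(-2, -1)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # Merge heads back into a single d_out-dimensional vector per token
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)

torch.manual_seed(123)
mha = MultiHeadAttention(d_in=3, d_out=4, context_length=6, num_heads=2)
batch = torch.rand(2, 6, 3)
print(mha(batch).shape)  # torch.Size([2, 6, 4])
```

In contrast to stacking separate single-head modules, this version performs one large matrix multiplication per projection, which is why the chapter calls it the efficient implementation.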
Key concepts:

- Attention Scores vs Attention Weights: Unnormalized scores vs normalized weights (each row sums to 1)
- Query, Key, Value (Q, K, V): Three trainable projection matrices for computing attention
- Scaled Dot-Product: Scaling attention scores by √d_k for training stability
- Causal Masking: Masking future tokens to maintain autoregressive property
- Multi-Head Attention: Running multiple attention mechanisms in parallel to capture different aspects of relationships
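The first two concepts can be seen in a few lines; the score values below are made up for illustration:

```python
import torch

# Unnormalized attention scores for one query against 4 keys
scores = torch.tensor([0.9, 1.5, -0.3, 0.2])

d_k = 4  # key dimension, used for the sqrt(d_k) scaling
weights = torch.softmax(scores / d_k ** 0.5, dim=-1)

print(weights.sum())  # weights sum to 1 (up to floating-point error); raw scores do not
```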
pip install "torch>=2.4.0"
pip install "marimo>=0.19.2"  # For marimo notebooks

- Chapter 2: Data preparation and tokenization (prerequisite)
- Chapter 4: Implementing the GPT architecture (uses attention mechanisms from this chapter)
- Book: Build a Large Language Model From Scratch by Sebastian Raschka
- Original Repository: https://github.com/rasbt/LLMs-from-scratch