Chapter 3: Coding Attention Mechanisms

This chapter covers attention mechanisms, which are the core engine of Large Language Models (LLMs).

Files in this Directory

  • Original Jupyter notebook
    • Source: Original Jupyter notebook from the LLMs-from-scratch repository by Sebastian Raschka
    • Language: English
    • Format: Jupyter Notebook (.ipynb)
    • Description: The original implementation covering all aspects of attention mechanisms for LLMs
  • marimo_ch03_zh_tw.py
    • Source: Converted and translated version of the original notebook
    • Language: Traditional Chinese (繁體中文)
    • Format: Marimo notebook (.py)
    • Description: Interactive marimo notebook version with Traditional Chinese translations
    • How to run: marimo edit marimo_ch03_zh_tw.py
  • marimo_ch03.py
    • Language: English
    • Format: Marimo notebook (.py)
    • Description: English version in marimo format
    • How to run: marimo edit marimo_ch03.py

Chapter Content Overview

This chapter implements the following attention mechanisms:

  1. Simple Self-Attention (3.3.1): A basic self-attention mechanism without trainable weights, included for illustration purposes (a minimal sketch appears after this list)

  2. Self-Attention with Trainable Weights (3.4): Implementation of scaled dot-product attention with Query, Key, and Value weight matrices (sketched together with causal masking after this list)

  3. Causal Attention (3.5): Self-attention with causal masking to prevent the model from attending to future tokens

    • Implements causal masking with an upper-triangular attention mask
    • Adds dropout to the attention weights for regularization
  4. Multi-Head Attention (3.6): Extends single-head attention to multiple attention heads (see the sketch after this list)

    • MultiHeadAttentionWrapper: Stacks multiple single-head attention modules
    • MultiHeadAttention: Efficient implementation with weight splitting
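
For orientation, here is a minimal sketch of the simplest variant (item 1): self-attention with no trainable weights at all. The tensor values and variable names are illustrative only and are not taken verbatim from the notebooks.

```python
import torch

# Six token embeddings of dimension 3 (toy values for illustration)
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

# Attention scores: dot product of every token embedding with every other one
attn_scores = inputs @ inputs.T            # shape (6, 6)

# Attention weights: softmax-normalize each row so it sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# Context vectors: weighted sum of the input embeddings
context_vecs = attn_weights @ inputs       # shape (6, 3)
print(context_vecs)
```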
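
Items 2 and 3 combine naturally into a single module once a causal mask and dropout are added on top of the trainable Q/K/V projections. The following is a condensed sketch of that idea in PyTorch; the class and argument names are placeholders and may differ from the notebook's exact code.

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    """Scaled dot-product self-attention with trainable Q/K/V weights,
    a causal mask, and dropout (a condensed sketch, not the notebook's exact code)."""

    def __init__(self, d_in, d_out, context_length, dropout=0.1, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask: position i may not attend to positions j > i
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):                      # x: (batch, num_tokens, d_in)
        b, num_tokens, _ = x.shape
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        # Dot-product scores, with future positions masked out before the softmax
        attn_scores = queries @ keys.transpose(1, 2)   # (batch, num_tokens, num_tokens)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)

        # Scale by sqrt(d_k), normalize, and apply dropout to the weights
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values               # context vectors: (batch, num_tokens, d_out)
```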
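
Item 4's weight-splitting variant projects the queries, keys, and values once and then reshapes the result into separate heads, rather than stacking independent single-head modules. A sketch under the same assumptions as above:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head causal attention via weight splitting (a sketch of the idea;
    the notebook's implementation may differ in details)."""

    def __init__(self, d_in, d_out, context_length, num_heads, dropout=0.1, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)   # combines the heads' outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):                      # x: (batch, num_tokens, d_in)
        b, num_tokens, _ = x.shape

        # Project once, then split the last dimension into (num_heads, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Per-head scaled dot-product attention with the causal mask
        attn_scores = queries @ keys.transpose(2, 3)   # (batch, heads, tokens, tokens)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Merge the heads back into one d_out-dimensional context vector per token
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)
```

Splitting one d_out-sized projection across heads keeps the parameter count comparable to the wrapper approach while avoiding a Python loop over separate attention modules.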

Key Concepts

  • Attention Scores vs Attention Weights: Raw, unnormalized dot-product scores vs softmax-normalized weights (each row sums to 1)
  • Query, Key, Value (Q, K, V): Three trainable projection matrices for computing attention
  • Scaled Dot-Product: Dividing attention scores by √d_k (the key dimension) for training stability (see the formula after this list)
  • Causal Masking: Masking future tokens to maintain autoregressive property
  • Multi-Head Attention: Running multiple attention mechanisms in parallel to capture different aspects of relationships
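
In formula form, the scaled dot-product attention used throughout this chapter is (for causal attention, masked positions are set to −∞ before the softmax):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

where Q, K, and V are the projected queries, keys, and values, and d_k is the key dimension.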

Requirements

pip install "torch>=2.4.0"
pip install "marimo>=0.19.2"  # For marimo notebooks

Quoting the version specifiers prevents the shell from interpreting >= as a redirect.

Related Chapters

  • Chapter 2: Data preparation and tokenization (prerequisite)
  • Chapter 4: Implementing the GPT architecture (uses attention mechanisms from this chapter)

References