[
{
"question": "Why can't you build a useful deep network using only linear layers (no activation functions)?",
"options": ["Linear layers are too slow", "Stacking linear layers collapses to a single linear transformation, giving no benefit from depth", "Linear layers don't have biases", "Linear layers can't handle batched data"],
"correct": 1,
    "explanation": "y = W2(W1*x + b1) + b2 simplifies to y = Ax + c. No matter how many linear layers you stack, the result is equivalent to one linear layer. Nonlinear activation functions break this collapse, so depth actually adds representational power.",
"stage": "pre"
},
{
"question": "What is the output range of the ReLU activation function?",
"options": ["(-1, 1)", "(0, 1)", "[0, infinity)", "(-infinity, infinity)"],
"correct": 2,
"explanation": "ReLU(x) = max(0, x). It outputs 0 for all negative inputs and passes positive inputs unchanged, giving a range of [0, infinity).",
"stage": "pre"
},
{
"question": "What is the 'dead neuron' problem in ReLU networks?",
"options": ["Neurons that compute too slowly", "Neurons whose input is permanently negative, producing zero output and zero gradient forever", "Neurons that produce NaN values", "Neurons with zero bias"],
"correct": 1,
"explanation": "If a ReLU neuron's weighted input is always negative (due to bad initialization or large negative bias), it outputs 0 with gradient 0. It can never recover because zero gradient means zero update.",
"stage": "post"
},
{
"question": "Which activation function is the default for hidden layers in modern transformers like GPT and BERT?",
"options": ["Sigmoid", "ReLU", "GELU", "Tanh"],
"correct": 2,
"explanation": "GELU (Gaussian Error Linear Unit) is the default activation in transformers. It provides smooth gradient flow, avoids dead neurons, and has been shown to outperform ReLU in language models.",
"stage": "post"
},
{
"question": "Why is softmax used only in the output layer and never in hidden layers?",
"options": ["It's too slow for hidden layers", "It converts a vector into a probability distribution (values sum to 1), which is needed for classification output but not for intermediate representations", "It causes exploding gradients", "It only works with two classes"],
"correct": 1,
"explanation": "Softmax normalizes a vector so all values are in (0,1) and sum to 1, creating a probability distribution. Hidden layers need to preserve and transform information, not compress it into probabilities.",
"stage": "post"
}
]