[
{
"question": "Why can't you build a useful deep network using only linear layers (no activation functions)?",
"options": ["Linear layers are too slow", "Stacking linear layers collapses to a single linear transformation, giving no benefit from depth", "Linear layers don't have biases", "Linear layers can't handle batched data"],
"correct": 1,
    "explanation": "y = W2(W1*x + b1) + b2 simplifies to y = Ax + c. No matter how many linear layers you stack, the result is equivalent to one linear layer. Nonlinear activation functions break this collapse, so depth actually adds representational power.",
"stage": "pre"
},
{
"question": "What is the output range of the ReLU activation function?",
"options": ["(-1, 1)", "(0, 1)", "[0, infinity)", "(-infinity, infinity)"],
"correct": 2,
"explanation": "ReLU(x) = max(0, x). It outputs 0 for all negative inputs and passes positive inputs unchanged, giving a range of [0, infinity).",
"stage": "pre"
},
{
"question": "What is the 'dead neuron' problem in ReLU networks?",
"options": ["Neurons that compute too slowly", "Neurons whose input is permanently negative, producing zero output and zero gradient forever", "Neurons that produce NaN values", "Neurons with zero bias"],
"correct": 1,
"explanation": "If a ReLU neuron's weighted input is always negative (due to bad initialization or large negative bias), it outputs 0 with gradient 0. It can never recover because zero gradient means zero update.",
"stage": "post"
},
{
"question": "Which activation function is the default for hidden layers in modern transformers like GPT and BERT?",
"options": ["Sigmoid", "ReLU", "GELU", "Tanh"],
"correct": 2,
"explanation": "GELU (Gaussian Error Linear Unit) is the default activation in transformers. It provides smooth gradient flow, avoids dead neurons, and has been shown to outperform ReLU in language models.",
"stage": "post"
},
{
"question": "Why is softmax used only in the output layer and never in hidden layers?",
"options": ["It's too slow for hidden layers", "It converts a vector into a probability distribution (values sum to 1), which is needed for classification output but not for intermediate representations", "It causes exploding gradients", "It only works with two classes"],
"correct": 1,
"explanation": "Softmax normalizes a vector so all values are in (0,1) and sum to 1, creating a probability distribution. Hidden layers need to preserve and transform information, not compress it into probabilities.",
"stage": "post"
}
]