
The Weighted Sum

From the previous sections we now have two things:

  • values: the projected Value vector for every word in the sentence
  • attention_weights: a matrix of softmax weights, one row per word, each row summing to 1

The final step is to use those weights to blend the Value vectors. Each word gets a new vector that has pulled in information from the rest of the sentence, in proportion to how relevant each of those other words was.


Step 1: A weighted sum

A weighted sum takes a set of vectors and a set of weights and blends them:

output = w₁ × v₁ + w₂ × v₂ + w₃ × v₃ + ...

Work through a simple example by hand:

v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 1.0, 0.0]
v3 = [0.0, 0.0, 1.0]

weights = [0.7, 0.2, 0.1]

output = 0.7 × [1, 0, 0]
       + 0.2 × [0, 1, 0]
       + 0.1 × [0, 0, 1]
       = [0.7, 0.0, 0.0]
       + [0.0, 0.2, 0.0]
       + [0.0, 0.0, 0.1]
       = [0.7, 0.2, 0.1]

The output is pulled towards v1 because it has the highest weight. v2 and v3 contribute, but less.

The same example in NumPy (importing numpy here in case it is not already in scope from the earlier sections):

import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([0.0, 0.0, 1.0])

w = np.array([0.7, 0.2, 0.1])

# np.dot(w, V) computes 0.7·v1 + 0.2·v2 + 0.1·v3 in one step
output_example = np.dot(w, np.array([v1, v2, v3]))
print("Weighted sum:", output_example)  # [0.7, 0.2, 0.1]

Step 2: The √d scaling

Before computing the real output, there is one ingredient missing from the scores we computed in file 03: dividing by √d before applying softmax.

When you compute the dot product of two vectors, you are summing d independent multiplications, one per dimension. If each component is drawn from a distribution with mean 0 and variance 1, each product also has variance 1. Summing d independent random variables adds their variances, so the dot product has variance d and standard deviation √d.

As d grows, scores don't just get larger on average. They get larger and more spread out. The gap between the highest and lowest scores grows with √d. Feed that into softmax and you get the collapse we saw in experiment 2 of the previous section: the highest score overwhelms everything else, the distribution peaks on a single word, and the model loses the ability to blend information from multiple sources.

Dividing by √d brings the variance back to 1 regardless of how long the vectors are. Scores stay in the same range whatever the embedding dimension, and softmax stays useful.

The following simulation takes a moment to run at higher dimensions:

# Demonstrate how variance of dot products grows with dimension
trials = 10000

for d_test in [3, 64, 256, 768]:
    sample_scores = []
    for _ in range(trials):
        a = np.random.randn(d_test)
        b = np.random.randn(d_test)
        sample_scores.append(np.dot(a, b))
    sample_scores = np.array(sample_scores)
    print(f"d={d_test:>4}  std of raw dot product: {sample_scores.std():.2f}  "
          f"std after /√d: {(sample_scores / np.sqrt(d_test)).std():.2f}")

You should see the standard deviation of the raw dot product tracking √d closely, while dividing by √d holds it near 1 across all dimensions.
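
To see the consequence for softmax directly, here is a small follow-up sketch (illustrative, not one of the experiments from the previous section): it scores one random Query against five random Keys and compares softmax on the raw scores with softmax on the scaled scores.

def softmax(x):
    e = np.exp(x - x.max())               # subtract the max for numerical stability
    return e / e.sum()

d_test = 768
query       = np.random.randn(d_test)
random_keys = np.random.randn(5, d_test)  # five made-up Key vectors

raw    = random_keys @ query              # unscaled scores, spread around √d ≈ 28
scaled = raw / np.sqrt(d_test)            # scaled scores, spread around 1

print("softmax(raw):   ", np.round(softmax(raw), 3))     # typically collapses onto one Key
print("softmax(scaled):", np.round(softmax(scaled), 3))  # a usable blend across several Keys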


Step 3: The complete attention formula

Scaled dot-product attention is:

Attention(Q, K, V) = softmax(QKᵀ / √d) V

Where:

  • Q is the matrix of Query vectors (one per word)
  • K is the matrix of Key vectors (one per word)
  • V is the matrix of Value vectors (one per word)
  • d is the vector dimension
  • QKᵀ produces the full score matrix, every Query dotted against every Key

In code, with softmax_rows applying softmax to each row of the score matrix:

def attention(queries, keys, values):
    d = queries.shape[1]                      # vector dimension
    scores  = queries @ keys.T / np.sqrt(d)   # every Query against every Key, scaled
    weights = softmax_rows(scores)            # one probability distribution per word
    return weights @ values                   # weighted sum of the Value vectors

output = attention(queries, keys, values)

print("Complete attention output (contextualised vectors):")
print(f"{'Word':<8} {'Value vector':>25} {'Contextualised':>25}")
print("-" * 62)
for i, word in enumerate(words):
    val = np.round(values[i], 3)
    ctx = np.round(output[i], 3)
    print(f"{word:<8} {str(val):>25} {str(ctx):>25}")

Every word now has a new vector that reflects not just what it meant in isolation, but what it means in this sentence, next to these words. With random matrices the shifts won't be semantically interpretable. In a trained model, "bank" would shift towards the geographical and liquid dimensions because "river" scored highly and contributed its Value vector in proportion to its weight.
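
If you want to see which words each output actually drew from, the weight matrix has to be recomputed, since attention() above only returns the blended output. A small sketch, assuming queries, keys, words and softmax_rows from the earlier sections are still in scope:

d = queries.shape[1]
inspect_weights = softmax_rows(queries @ keys.T / np.sqrt(d))

for i, word in enumerate(words):
    top = np.argsort(inspect_weights[i])[::-1][:2]   # indices of the two largest weights
    contributors = ", ".join(f"{words[j]} ({inspect_weights[i, j]:.2f})" for j in top)
    print(f"{word:<8} draws most from: {contributors}")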


Step 4: Have a play

Experiment 1: Attend to self only

Set the attention weights to the identity matrix — every word attends only to itself. The output for each word should be exactly its own Value vector. Verify this. What does it tell you about what attention is actually doing when it moves weight off the diagonal?

self_weights = np.zeros((len(words), len(words)))
np.fill_diagonal(self_weights, 1.0)
self_output = self_weights @ values

print("Self-attention only:")
for i, word in enumerate(words):
    match = np.allclose(self_output[i], values[i])
    print(f"  {word:<8} output == Value vector: {match}")

Experiment 2: Uniform attention

What if every word attends to every other word equally? Use uniform weights of 1/n for every row. Every word's output will be identical: the average of all Value vectors. What does this tell you about why selective attention matters?

uniform_weights = np.full((len(words), len(words)), 1.0 / len(words))
uniform_output  = uniform_weights @ values
print("Uniform attention output (same for every word):")
print(np.round(uniform_output[0], 3))

Experiment 3: Run attention twice

Take the output vectors and run them through attention again using the same W_Q, W_K, W_V. This is what a second transformer layer does. Do the vectors change? What does that suggest about why stacking layers adds value?

queries2 = output @ W_Q
keys2    = output @ W_K
values2  = output @ W_V
output2  = attention(queries2, keys2, values2)

print("After second attention pass:")
for i, word in enumerate(words):
    print(f"  {word:<8} {np.round(output2[i], 3)}")

Quiz Questions

1. In plain English, what does the weighted sum compute?

A new vector for each word that blends all the Value vectors in the sentence, weighted by relevance. Words that scored highly contribute more of their Value; words with low scores contribute almost nothing. The result is a representation of each word that has pulled in contextual information from the most relevant surrounding words.

2. Why does "bank" shift towards river-like values after the weighted sum in a trained model?

Because "river" had a high attention weight. Its Value vector encodes geographical and liquid meaning, and was scaled by a large weight before being added to the output. The financial components of "bank"'s Value vector had no reinforcement from other words in this sentence, so they ended up diluted in the blend.

3. Why do we divide by √d before applying softmax?

The dot product of two vectors sums d independent terms. When d is large, those terms add up and the dot product has standard deviation √d. That means scores get larger and more spread out as the vectors get longer, which pushes softmax towards collapsing onto a single word. Dividing by √d normalises the variance back to 1, keeping softmax in a useful range regardless of the embedding dimension.

4. If attention weights are the identity matrix, what is the output?

Each word's output is exactly its own Value vector. No information flows between words. This is the degenerate case: attention is doing nothing useful. The whole purpose of learned attention weights is to move probability mass off the diagonal and onto contextually relevant words.

5. What would happen if you ran attention again on the output vectors?

A transformer does exactly this with multiple layers. The second layer runs attention over the contextualised vectors produced by the first, which are richer than the original embeddings. Each layer can pick up on different relationships. Early layers tend to resolve local ambiguities like word sense. Deeper layers build up structure about longer-range meaning and relationships between concepts.


Every operation in this formula is something you have now built from scratch: the projection matrices (W_Q, W_K, W_V), dot products for scoring, softmax to convert scores to weights, and a weighted sum to produce contextualised output. The complete mechanism fits in five lines of Python. Everything else in this module was the reasoning behind why those five lines look the way they do.