
Words as Vectors

In the previous section we established the problem: to understand what "bank" means in a sentence, a model needs to understand how it relates to every other word. Before we can compute those relationships, we need a way to represent words as numbers. A model can't do maths on the string "bank". It needs something it can measure, compare, and manipulate.

The answer is a vector: a list of numbers. Each word gets its own vector, learned during training. The vectors aren't hand-crafted. They fall out of the data.


Prerequisites

  • Python 3.10+
  • numpy (pip install numpy)

Step 1: A word is a point in space

A vector is a list of numbers. For a word embedding, that list might be 512 or 768 numbers long. Each number is a coordinate in a high-dimensional space, and the position of the word in that space encodes something about its meaning.

To build intuition, we'll start with just three dimensions: financial, geographical, and liquid. We'll use the sentence "The bank by the river was steep" throughout this module, so let's define vectors for its two key words, "bank" and "river", plus a couple of related words, "money" and "water".

import numpy as np

# Each word represented as [financial, geographical, liquid]
bank   = np.array([0.5, 0.5, 0.3])   # ambiguous: sits between both senses
river  = np.array([0.0, 0.9, 0.8])
money  = np.array([1.0, 0.0, 0.0])
water  = np.array([0.0, 0.1, 1.0])

print("bank:  ", bank)
print("river: ", river)
print("money: ", money)
print("water: ", water)

Notice that "bank" sits in the middle of the space. It has moderate scores across all three dimensions because, without context, it genuinely could go either way. "River" is strongly geographical and liquid. "Money" is purely financial. This is the problem attention is designed to solve: the starting vector for "bank" is ambiguous, and it needs to be shifted based on what surrounds it.


Step 2: Similarity lives in the numbers

Because words are vectors, you can measure how similar they are. Cosine similarity measures the angle between two vectors: 1.0 means they point the same way, 0.0 means they are perpendicular and share nothing, and values in between indicate partial overlap.

def cosine_similarity(a, b):
    # Dot product, normalised by the vectors' lengths, so only the
    # angle between them matters, not their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("bank vs river: ", cosine_similarity(bank, river))
print("bank vs money: ", cosine_similarity(bank, money))
print("river vs money:", cosine_similarity(river, money))

"Bank" should score moderately against both "river" and "money", reflecting its ambiguity. "River" and "money" should score low against each other. The geometry is doing the work.


Step 3: Real embeddings have hundreds of dimensions

In the example above, we chose 3 dimensions by hand. Real word embeddings work the same way but at a much larger scale. A typical transformer uses vectors of 512, 768, or even 4096 numbers.

You don't choose what those dimensions represent. The model learns them. Some dimensions end up encoding grammatical role. Some encode sentiment. Some encode topic. Most encode combinations of things that don't have a clean human label. The result is the same: words that behave similarly in language end up near each other in the vector space.

# A vector at the scale of a real word embedding (these values are
# random; a trained model's values are learned, not sampled)
real_scale_embedding = np.random.randn(768)

print("Shape:", real_scale_embedding.shape)
print("First 10 values:", np.round(real_scale_embedding[:10], 4))

The maths is identical to the 3-dimensional case. The vectors are just longer.
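One property worth seeing at this scale: two independent random vectors in 768 dimensions are almost always nearly perpendicular. This quick demonstration (reusing cosine_similarity from Step 2) also previews a point from the quiz below: untrained embeddings start out random, with no similarity structure at all.

# Cosine similarity of two independent random vectors: close to zero
rng = np.random.default_rng(42)
a = rng.standard_normal(768)
b = rng.standard_normal(768)
print("random 768-d pair:", round(cosine_similarity(a, b), 4))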


Step 4: The same word, multiple contexts

In a real model, every word starts with a single fixed embedding vector. "Bank" always starts as the same point in space, regardless of whether the sentence is about finance or rivers.

Attention is the mechanism that updates those vectors based on context. After attention runs, the representation of "bank" in "bank by the river" will have been pulled towards river-related vectors. The representation of "bank" in "bank account" will have been pulled towards finance-related vectors.

The starting vector is fixed. The contextualised vector is computed fresh for every sentence.
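Real attention computes its own context weights, and later sections build up to that. As a toy sketch of the idea, though, here's a hand-weighted average (the weights below are made up purely for illustration) showing how mixing in context vectors shifts the fixed "bank" embedding towards one sense or the other:

# Toy contextualisation: a weighted average of a word's vector and its
# context. Real attention learns these weights; here they are hand-picked.
def contextualise(word_vec, context_vecs, weights):
    return np.average([word_vec] + context_vecs, axis=0, weights=weights)

bank_by_river = contextualise(bank, [river, water], weights=[0.4, 0.3, 0.3])
bank_account  = contextualise(bank, [money],        weights=[0.5, 0.5])

print("bank by the river:", np.round(bank_by_river, 2))
print("bank account:     ", np.round(bank_account, 2))
print("river sense vs water:", round(cosine_similarity(bank_by_river, water), 3))
print("money sense vs money:", round(cosine_similarity(bank_account, money), 3))

Compare these scores with Step 2: the river-context version lands much closer to "water", and the finance-context version much closer to "money", even though both started from the same fixed "bank" vector.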


Have a Play

  • Change "bank"'s vector to be strongly financial: [0.9, 0.1, 0.0]. Run the cosine similarity comparisons. Does it pull closer to "money" and further from "river"?
  • Now make it strongly geographical: [0.0, 0.9, 0.8]. What happens?
  • Try adding "loan" at [0.8, 0.0, 0.0]. Where does it sit relative to "money" and "bank"? The starter code below sets up all three experiments.

bank_financial = np.array([0.9, 0.1, 0.0])
bank_river     = np.array([0.0, 0.9, 0.8])
loan           = np.array([0.8, 0.0, 0.0])

print("Financial bank vs money:", cosine_similarity(bank_financial, money))
print("River bank vs river:    ", cosine_similarity(bank_river, river))
print("Financial bank vs river:", cosine_similarity(bank_financial, river))
print("Loan vs money:          ", cosine_similarity(loan, money))

Quiz Questions

1. Why can't a model work directly with words as strings?

You can't do maths on strings. You can't measure the distance between "bank" and "river" as text. Converting words to vectors lets the model compute similarities, take weighted averages, and learn patterns through gradient descent. All of the operations that make neural networks work require numbers.

2. In our 3-dimensional space, why does "bank" sit in the middle rather than near one cluster?

Because without context, it genuinely is ambiguous. A real model stores one vector per word type, not per sense. That single vector ends up positioned somewhere that reflects the average of all the contexts in which the word appears across the training data. Attention then shifts it towards the appropriate region based on what surrounds it in each specific sentence.

3. What does it mean for two word vectors to be "close" in the embedding space?

It means the model has learned that those words appear in similar contexts across the training data. "River" and "stream" will be close because they appear near the same kinds of words: "bank", "water", "flow", "current". Closeness in vector space reflects statistical patterns in language, not any hand-coded rule about meaning.
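As a toy check in our 3-dimensional space (the "stream" vector below is made up for illustration, not learned from data):

stream = np.array([0.0, 0.85, 0.75])   # deliberately placed near "river"
print("river vs stream:", round(cosine_similarity(river, stream), 3))   # ~1.0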

4. If word embeddings are learned during training, what are they at the start?

Random numbers. The embeddings are initialised randomly and then updated through backpropagation as the model trains. Over millions of examples, the vectors get nudged into positions where they help the model make accurate predictions. By the end of training, the structure of the space reflects the structure of language.
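To make that concrete, here's a sketch of an embedding table at initialisation. The vocabulary size, dimension, and 0.02 scale are illustrative choices, not values from any particular model:

# A freshly initialised embedding table: one random row per word
vocab_size, d_model = 10_000, 768
embedding_table = np.random.randn(vocab_size, d_model) * 0.02   # small random start

word_id = 1234                          # hypothetical index for "bank"
print(embedding_table[word_id][:5])     # meaningless values until training shapes them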

The next section takes a closer look at the dot product: the operation at the heart of the cosine_similarity function above, and the one we'll keep using to compute how similar two word vectors are.