Embeddings
In the RNN modules, we represented each token as a one-hot vector — a sparse vector of all zeros with a single 1. This works, but it has serious problems. Embeddings are the elegant fix.
The Problem with One-Hot
In earlier modules, we represented tokens as one-hot vectors — each word gets a unique position:
| Token | Index | One-Hot Vector |
|---|---|---|
| “cat” | 0 | [1, 0, 0, 0] |
| “dog” | 1 | [0, 1, 0, 0] |
| “bird” | 2 | [0, 0, 1, 0] |
| “fish” | 3 | [0, 0, 0, 1] |
This has three problems:
1. Dimension Explosion
With a real vocabulary of 50,000 tokens, each one-hot vector has 50,000 dimensions — that’s 200KB per token. A 1,000-token sequence would need 200MB just for inputs.
| Representation | Dimensions | Memory per Token |
|---|---|---|
| One-hot (50K vocab) | 50,000 | 200 KB |
| Embedding (256-dim) | 256 | 1 KB |
| Savings | 195× smaller | 200× less memory |
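The arithmetic behind this table is easy to check directly — a quick sketch using the sizes above:

```python
# Sizes taken from the table above.
vocab_size, seq_len = 50_000, 1_000
embed_dim = 256
bytes_per_float = 4  # float32

one_hot_per_token = vocab_size * bytes_per_float  # 200,000 bytes = 200 KB
embed_per_token = embed_dim * bytes_per_float     # 1,024 bytes ≈ 1 KB

print(one_hot_per_token, "bytes per one-hot token")
print(one_hot_per_token * seq_len / 1e6, "MB for a 1,000-token sequence")  # 200.0 MB
print(embed_per_token, "bytes per embedding")
```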
2. No Semantic Information
One-hot treats every pair of words as equally different. The Euclidean distance between any two distinct one-hot vectors is always $\sqrt{2} \approx 1.41$:
| Pair | Semantic Relationship | Distance |
|---|---|---|
| cat ↔ dog | Both mammals, pets | ≈ 1.41 |
| cat ↔ fish | Less related | ≈ 1.41 |
| cat ↔ quantum | Completely unrelated | ≈ 1.41 |
Every word is equally “far” from every other word — there’s no concept of similarity.
3. Wasteful Computation
When you multiply a weight matrix $W$ by a one-hot vector, most of the computation is wasted — multiplying by zeros. The result is just one column of $W$:

$$W x_{\text{one-hot}(i)} = W_{:,\,i}$$
Why do 50,000 multiplications when you could just look up column 2?
The Solution: Embedding Lookup
An embedding matrix $E \in \mathbb{R}^{V \times d}$ stores a dense vector for each token. Instead of multiplying by a one-hot vector, we simply look up the row:

$$e_i = E[i, :]$$

where $V$ is the vocabulary size and $d$ is the embedding dimension (typically 256–1024).
The key insight: the one-hot vector selects a row from the embedding matrix. So we skip the one-hot entirely and just index directly into the matrix.
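A minimal NumPy sketch (toy sizes assumed) confirming that the one-hot product and the direct row lookup give the same vector:

```python
import numpy as np

# Toy sizes for illustration: vocab of 4, embedding dim of 3.
rng = np.random.default_rng(0)
E = rng.standard_normal((4, 3))  # embedding matrix, one row per token

token_id = 2
one_hot = np.zeros(4)
one_hot[token_id] = 1.0

# Multiplying by the one-hot vector selects row 2 of E...
via_matmul = one_hot @ E
# ...which is exactly what direct indexing returns, with no multiplications.
via_lookup = E[token_id]

assert np.allclose(via_matmul, via_lookup)
```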
How Embeddings Learn
The embedding matrix is not hand-designed — it’s initialized randomly and learned during training, just like any other weight matrix.
Gradient Flow
During backpropagation, gradients flow back through the network into the embedding vectors. But there’s a key difference: only the tokens that appeared in the current batch get updated.
| What happens | Embedding rows affected |
|---|---|
| Batch contains tokens [5, 10, 15] | Only rows 5, 10, 15 get gradient updates |
| Token 5 appears 3 times in batch | Row 5 accumulates 3 gradient contributions |
| Token 42 not in batch | Row 42 stays unchanged this step |
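The sparse-update pattern in this table can be sketched in a few lines of NumPy (toy sizes and dummy gradients assumed):

```python
import numpy as np

vocab_size, embed_dim = 20, 4
E = np.zeros((vocab_size, embed_dim))

token_ids = np.array([5, 10, 5, 15, 5])       # token 5 appears 3 times
grads = np.ones((len(token_ids), embed_dim))  # pretend upstream gradient of all 1s

dE = np.zeros_like(E)
np.add.at(dE, token_ids, grads)  # accumulates correctly for repeated indices

print(dE[5])   # [3. 3. 3. 3.]  -- three accumulated contributions
print(dE[10])  # [1. 1. 1. 1.]  -- one contribution
print(dE[0])   # [0. 0. 0. 0.]  -- not in batch, untouched
```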
What Emerges from Training
After seeing millions of sentences like “the cat sat”, “the dog sat”, “the cat ran”:
| Word Pair | Similarity | Why |
|---|---|---|
| cat ↔ dog | High | Both appear in “the ___ sat” |
| sat ↔ ran | High | Both appear in “the cat ___” |
| cat ↔ ran | Lower | Different grammatical roles |
Words that appear in similar contexts end up with similar embeddings — this is the distributional hypothesis: “You shall know a word by the company it keeps.”
Semantic Arithmetic
Well-trained embeddings exhibit remarkable algebraic structure. The famous analogy result:

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

This works because embeddings encode semantic relationships as directions in vector space:
- The direction from “man” to “woman” encodes gender
- The direction from “man” to “king” encodes royalty
- These directions are composable — you can add and subtract them

To visualize high-dimensional embeddings, use PCA or t-SNE to project to 2D — semantically similar words will cluster together.
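Here is a toy sketch of that arithmetic with hand-built 2-D vectors (assumed for illustration, not learned) in which the analogy holds exactly:

```python
import numpy as np

# Dimension 0 plays the role of "royalty", dimension 1 the role of "gender".
vocab = {
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
}

result = vocab["king"] - vocab["man"] + vocab["woman"]

# Nearest vocabulary word by Euclidean distance
nearest = min(vocab, key=lambda w: np.linalg.norm(vocab[w] - result))
print(nearest)  # queen
```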
The word2vec Connection
Word2vec (Mikolov et al., 2013) was the first to show this works at scale:
- Skip-gram: Given a word, predict its surrounding context
- CBOW: Given surrounding context, predict the center word
In modern RNN/Transformer training, embeddings learn the same kind of structure — but as part of the full model rather than a separate pre-training step.
Measuring Similarity
The standard way to compare embeddings is cosine similarity:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
| Value | Meaning |
|---|---|
| +1 | Identical direction (synonyms) |
| 0 | Orthogonal (unrelated) |
| −1 | Opposite direction (antonyms, sometimes) |
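A minimal NumPy implementation of this formula:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (||a|| * ||b||)"""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            # 1.0  (identical direction)
print(cosine_similarity(a, -a))                           # -1.0 (opposite direction)
print(cosine_similarity(a, np.array([3.0, 0.0, -1.0])))   # 0.0  (orthogonal)
```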
Embeddings in Transformers
In a Transformer, embeddings are the first layer — converting token IDs into the continuous vectors the model works with:
```
Token IDs:     [101,   2023,    2003,    1037,    6251]
                 ↓       ↓        ↓        ↓        ↓
Embedding:    E[101]  E[2023]  E[2003]  E[1037]  E[6251]
                 ↓       ↓        ↓        ↓        ↓
              ┌──────────────────────────────────────┐
+ Positional: │        + positional encoding         │
              └──────────────────────────────────────┘
                 ↓       ↓        ↓        ↓        ↓
Transformer:  Self-attention → FFN → ... → Output
```
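This input pipeline can be sketched in NumPy — toy sizes assumed, with sinusoidal positional encodings as in the original Transformer paper:

```python
import numpy as np

vocab_size, embed_dim, seq_len = 30_000, 8, 5
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim)) * 0.01

token_ids = np.array([101, 2023, 2003, 1037, 6251])
x = E[token_ids]  # (5, 8) token embeddings via row lookup

# Sinusoidal positional encoding: sin on even dims, cos on odd dims
pos = np.arange(seq_len)[:, None]
i = np.arange(0, embed_dim, 2)[None, :]
angles = pos / (10_000 ** (i / embed_dim))
pe = np.zeros((seq_len, embed_dim))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

x = x + pe  # what the first self-attention layer actually sees
print(x.shape)  # (5, 8)
```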
Weight Tying
A clever trick used in GPT, BERT, and many others: the same embedding matrix is used for both input and output:
| Direction | Operation | Shape |
|---|---|---|
| Input (token → vector) | Row lookup: $x = E[i]$ | $(d,)$ |
| Output (vector → logits) | Matrix multiply: $\text{logits} = h E^\top$ | $(V,)$ |
This halves the embedding parameters and ensures input/output live in the same semantic space — a token’s embedding is also its “target vector” for prediction.
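A shape-level sketch of weight tying (toy sizes assumed):

```python
import numpy as np

vocab_size, embed_dim = 1000, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim)) * 0.01  # one shared matrix

# Input side: token -> vector (row lookup)
token_id = 42
x = E[token_id]      # shape (16,)

# Output side: hidden vector -> logits over vocab (multiply by E transpose)
h = rng.standard_normal(embed_dim)
logits = E @ h       # shape (1000,): logits[i] = h . E[i]

print(x.shape, logits.shape)
```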
Key Equations
| Concept | Equation |
|---|---|
| Embedding lookup | $e_i = E[i, :]$ |
| Gradient (sparse) | $\dfrac{\partial L}{\partial E[i,:]} = \sum_{t:\, x_t = i} \dfrac{\partial L}{\partial e_t}$ (only for tokens in batch) |
| Cosine similarity | $\cos(\mathbf{a}, \mathbf{b}) = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$ |
| Weight tying output | $\text{logits} = h E^\top$ |
For those who want to see the code, here’s a complete embedding layer with forward and backward passes:

```python
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim):
        self.W = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, token_ids):
        """Lookup: just index into the matrix."""
        return self.W[token_ids]

    def backward(self, token_ids, grad):
        """Sparse gradient: only update rows that were looked up."""
        dW = np.zeros_like(self.W)
        np.add.at(dW, token_ids, grad)  # accumulate if same token appears multiple times
        return dW
```

The forward pass is O(1) per token (just an array index). The backward pass only touches the rows that were used — this is why embedding gradients are sparse.
Exercises

1. **Memory calculation:** For vocab_size = 50,000 and embed_dim = 768, how many parameters does the embedding matrix have? How much memory in float32?

   **Answer:** 50,000 × 768 = 38.4M parameters = 153.6 MB (at 4 bytes per float32).

2. **Analogy test:** If $\vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Japan}} \approx \ ?$, what word should appear?

   **Answer:** Tokyo — the “capital-of” direction is preserved: Paris is to France as Tokyo is to Japan.

3. **Weight tying intuition:** Why does sharing the embedding matrix between input and output make sense semantically?

   **Answer:** If the model is predicting “cat” as the next token, the output logit for “cat” is computed as the dot product of the hidden state with the “cat” embedding. This means the model is essentially asking: “how similar is my current representation to the concept of ‘cat’?” — using the same vector space for both input representation and output prediction.
Next Steps
Now that we know how tokens become vectors, the next question is: how do we decide what a “token” is in the first place? Next: Tokenization →, where we’ll see how models break text into subword pieces.