Embeddings
In the RNN modules, we represented each token as a one-hot vector — a sparse vector of all zeros with a single 1. This works, but it has serious problems. Embeddings are the elegant fix.
The Problem with One-Hot
In earlier modules, we represented tokens as one-hot vectors — each word gets a unique position:
| Token | Index | One-Hot Vector |
|---|---|---|
| “cat” | 0 | [1, 0, 0, 0] |
| “dog” | 1 | [0, 1, 0, 0] |
| “bird” | 2 | [0, 0, 1, 0] |
| “fish” | 3 | [0, 0, 0, 1] |
This has three problems:
1. Dimension Explosion
With a real vocabulary of 50,000 tokens, each one-hot vector has 50,000 dimensions — that’s 200KB per token. A 1,000-token sequence would need 200MB just for inputs.
| Representation | Dimensions | Memory per Token |
|---|---|---|
| One-hot (50K vocab) | 50,000 | 200 KB |
| Embedding (256-dim) | 256 | 1 KB |
| Savings | 195× smaller | 200× less memory |
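The arithmetic behind this table is easy to check directly — a quick sketch using the sizes above:

```python
# Sizes taken from the table above.
vocab_size, seq_len = 50_000, 1_000
embed_dim = 256
bytes_per_float = 4  # float32

one_hot_per_token = vocab_size * bytes_per_float  # 200,000 bytes = 200 KB
embed_per_token = embed_dim * bytes_per_float     # 1,024 bytes ≈ 1 KB

print(one_hot_per_token, "bytes per one-hot token")
print(one_hot_per_token * seq_len / 1e6, "MB for a 1,000-token sequence")  # 200.0 MB
print(embed_per_token, "bytes per embedding")
```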
2. No Semantic Information
One-hot treats every pair of words as equally different. The Euclidean distance between any two distinct one-hot vectors is always $\sqrt{2} \approx 1.41$:
| Pair | Semantic Relationship | Distance |
|---|---|---|
| cat ↔ dog | Both mammals, pets | ≈ 1.41 |
| cat ↔ fish | Less related | ≈ 1.41 |
| cat ↔ quantum | Completely unrelated | ≈ 1.41 |
Every word is equally “far” from every other word — there’s no concept of similarity.
3. Wasteful Computation
When you multiply a weight matrix $W$ by a one-hot vector, most of the computation is wasted — multiplying by zeros. The result is just one column of $W$:

$$W x_{\text{one-hot}(i)} = W_{:,\,i}$$
Why do 50,000 multiplications when you could just look up column 2?
The Solution: Embedding Lookup
An embedding matrix $E \in \mathbb{R}^{V \times d}$ stores a dense vector for each token. Instead of multiplying by a one-hot vector, we simply look up the row:

$$e_i = E[i, :]$$

where $V$ is the vocabulary size and $d$ is the embedding dimension (typically 256–1024).
The key insight: the one-hot vector selects a row from the embedding matrix. So we skip the one-hot entirely and just index directly into the matrix.
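A minimal NumPy sketch (toy sizes assumed) confirming that the one-hot product and the direct row lookup give the same vector:

```python
import numpy as np

# Toy sizes for illustration: vocab of 4, embedding dim of 3.
rng = np.random.default_rng(0)
E = rng.standard_normal((4, 3))  # embedding matrix, one row per token

token_id = 2
one_hot = np.zeros(4)
one_hot[token_id] = 1.0

# Multiplying by the one-hot vector selects row 2 of E...
via_matmul = one_hot @ E
# ...which is exactly what direct indexing returns, with no multiplications.
via_lookup = E[token_id]

assert np.allclose(via_matmul, via_lookup)
```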
How Embeddings Learn
The embedding matrix is not hand-designed — it’s initialized randomly and learned during training, just like any other weight matrix.
Gradient Flow
During backpropagation, gradients flow back through the network into the embedding vectors. But there’s a key difference: only the tokens that appeared in the current batch get updated.
| What happens | Embedding rows affected |
|---|---|
| Batch contains tokens [5, 10, 15] | Only rows 5, 10, 15 get gradient updates |
| Token 5 appears 3 times in batch | Row 5 accumulates 3 gradient contributions |
| Token 42 not in batch | Row 42 stays unchanged this step |
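The sparse-update pattern in this table can be sketched in a few lines of NumPy (toy sizes and dummy gradients assumed):

```python
import numpy as np

vocab_size, embed_dim = 20, 4
E = np.zeros((vocab_size, embed_dim))

token_ids = np.array([5, 10, 5, 15, 5])       # token 5 appears 3 times
grads = np.ones((len(token_ids), embed_dim))  # pretend upstream gradient of all 1s

dE = np.zeros_like(E)
np.add.at(dE, token_ids, grads)  # accumulates correctly for repeated indices

print(dE[5])   # [3. 3. 3. 3.]  -- three accumulated contributions
print(dE[10])  # [1. 1. 1. 1.]  -- one contribution
print(dE[0])   # [0. 0. 0. 0.]  -- not in batch, untouched
```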
What Emerges from Training
After seeing millions of sentences like “the cat sat”, “the dog sat”, “the cat ran”:
| Word Pair | Similarity | Why |
|---|---|---|
| cat ↔ dog | High | Both appear in “the ___ sat” |
| sat ↔ ran | High | Both appear in “the cat ___” |
| cat ↔ ran | Lower | Different grammatical roles |
Words that appear in similar contexts end up with similar embeddings — this is the distributional hypothesis: “You shall know a word by the company it keeps.”
Semantic Arithmetic
Well-trained embeddings exhibit remarkable algebraic structure. The famous analogy result:

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

This works because embeddings encode semantic relationships as directions in vector space:
- The direction from “man” to “woman” encodes gender
- The direction from “man” to “king” encodes royalty
- These directions are composable — you can add and subtract them

To visualize high-dimensional embeddings, use PCA or t-SNE to project to 2D — semantically similar words will cluster together.
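Here is a toy sketch of that arithmetic with hand-built 2-D vectors (assumed for illustration, not learned) in which the analogy holds exactly:

```python
import numpy as np

# Dimension 0 plays the role of "royalty", dimension 1 the role of "gender".
vocab = {
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
}

result = vocab["king"] - vocab["man"] + vocab["woman"]

# Nearest vocabulary word by Euclidean distance
nearest = min(vocab, key=lambda w: np.linalg.norm(vocab[w] - result))
print(nearest)  # queen
```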
The word2vec Connection
Word2vec (Mikolov et al., 2013) was the first to show this works at scale:
- Skip-gram: Given a word, predict its surrounding context
- CBOW: Given surrounding context, predict the center word
In modern RNN/Transformer training, embeddings learn the same kind of structure — but as part of the full model rather than a separate pre-training step.
Measuring Similarity
The standard way to compare embeddings is cosine similarity:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
| Value | Meaning |
|---|---|
| +1 | Identical direction (synonyms) |
| 0 | Orthogonal (unrelated) |
| −1 | Opposite direction (antonyms, sometimes) |
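A minimal NumPy implementation of this formula:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (||a|| * ||b||)"""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            # 1.0  (identical direction)
print(cosine_similarity(a, -a))                           # -1.0 (opposite direction)
print(cosine_similarity(a, np.array([3.0, 0.0, -1.0])))   # 0.0  (orthogonal)
```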
Embeddings in Transformers
In a Transformer, embeddings are the first layer — converting token IDs into the continuous vectors the model works with:
```
Token IDs:     [101,   2023,    2003,    1037,    6251]
                 ↓       ↓        ↓        ↓        ↓
Embedding:    E[101]  E[2023]  E[2003]  E[1037]  E[6251]
                 ↓       ↓        ↓        ↓        ↓
              ┌──────────────────────────────────────┐
+ Positional: │        + positional encoding         │
              └──────────────────────────────────────┘
                 ↓       ↓        ↓        ↓        ↓
Transformer:  Self-attention → FFN → ... → Output
```
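This input pipeline can be sketched in NumPy — toy sizes assumed, with sinusoidal positional encodings as in the original Transformer paper:

```python
import numpy as np

vocab_size, embed_dim, seq_len = 30_000, 8, 5
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim)) * 0.01

token_ids = np.array([101, 2023, 2003, 1037, 6251])
x = E[token_ids]  # (5, 8) token embeddings via row lookup

# Sinusoidal positional encoding: sin on even dims, cos on odd dims
pos = np.arange(seq_len)[:, None]
i = np.arange(0, embed_dim, 2)[None, :]
angles = pos / (10_000 ** (i / embed_dim))
pe = np.zeros((seq_len, embed_dim))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

x = x + pe  # what the first self-attention layer actually sees
print(x.shape)  # (5, 8)
```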
Weight Tying
A clever trick used in GPT, BERT, and many others: the same embedding matrix is used for both input and output:
| Direction | Operation | Shape |
|---|---|---|
| Input (token → vector) | Row lookup: $x = E[i]$ | $(d,)$ |
| Output (vector → logits) | Matrix multiply: $\text{logits} = h E^\top$ | $(V,)$ |
This halves the embedding parameters and ensures input/output live in the same semantic space — a token’s embedding is also its “target vector” for prediction.
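A shape-level sketch of weight tying (toy sizes assumed):

```python
import numpy as np

vocab_size, embed_dim = 1000, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim)) * 0.01  # one shared matrix

# Input side: token -> vector (row lookup)
token_id = 42
x = E[token_id]      # shape (16,)

# Output side: hidden vector -> logits over vocab (multiply by E transpose)
h = rng.standard_normal(embed_dim)
logits = E @ h       # shape (1000,): logits[i] = h . E[i]

print(x.shape, logits.shape)
```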
Key Equations
| Concept | Equation |
|---|---|
| Embedding lookup | $e_i = E[i, :]$ |
| Gradient (sparse) | $\dfrac{\partial L}{\partial E[i,:]} = \sum_{t:\, x_t = i} \dfrac{\partial L}{\partial e_t}$ (only for tokens in batch) |
| Cosine similarity | $\cos(\mathbf{a}, \mathbf{b}) = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$ |
| Weight tying output | $\text{logits} = h E^\top$ |
For those who want to see the code, here’s a complete embedding layer with forward and backward passes:

```python
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim):
        self.W = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, token_ids):
        """Lookup: just index into the matrix."""
        return self.W[token_ids]

    def backward(self, token_ids, grad):
        """Sparse gradient: only update rows that were looked up."""
        dW = np.zeros_like(self.W)
        np.add.at(dW, token_ids, grad)  # accumulate if same token appears multiple times
        return dW
```

The forward pass is O(1) per token (just an array index). The backward pass only touches the rows that were used — this is why embedding gradients are sparse.
Exercises

1. **Memory calculation:** For vocab_size = 50,000 and embed_dim = 768, how many parameters does the embedding matrix have? How much memory in float32?

   **Answer:** 50,000 × 768 = 38.4M parameters = 153.6 MB (at 4 bytes per float32).

2. **Analogy test:** If $\vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Japan}} \approx \ ?$, what word should appear?

   **Answer:** Tokyo — the “capital-of” direction is preserved: Paris is to France as Tokyo is to Japan.

3. **Weight tying intuition:** Why does sharing the embedding matrix between input and output make sense semantically?

   **Answer:** If the model is predicting “cat” as the next token, the output logit for “cat” is computed as the dot product of the hidden state with the “cat” embedding. This means the model is essentially asking: “how similar is my current representation to the concept of ‘cat’?” — using the same vector space for both input representation and output prediction.
Next Steps
Now that we know how tokens become vectors, the next question is: how do we decide what a “token” is in the first place? Next: Tokenization →, where we’ll see how models break text into subword pieces.