BERT — Bidirectional Encoder Representations from Transformers
The Key Insight: Looking Both Ways
GPT reads text left-to-right — when predicting a word, it can only see what came before. But for understanding tasks (not generation), wouldn’t it help to see the whole sentence?
“The cat sat on the [___] because it was dirty.”
With only left context (“The cat sat on the”), you might guess “mat”, “floor”, or “chair”. But with the right context too (“because it was dirty”), “mat” becomes much more likely — you need both sides!
BERT’s innovation: bidirectional attention. Every token sees every other token — no causal mask.
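The difference is literally just the attention mask. A minimal sketch (using numpy; the token positions and sequence length are illustrative, not from any real model):

```python
import numpy as np

seq_len = 5  # toy sequence length

# Bidirectional (BERT): every token may attend to every other token.
bert_mask = np.ones((seq_len, seq_len), dtype=bool)

# Causal (GPT): token i may only attend to positions <= i
# (lower-triangular mask).
gpt_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(bert_mask.astype(int))
print(gpt_mask.astype(int))
```

Row `i` of each matrix marks which positions token `i` can see: all of them for BERT, only the left context for GPT.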
Architecture
BERT is simply the encoder stack from the original Transformer — no decoder, no causal masking:
| Stage | What Happens |
|---|---|
| Input | [CLS] + tokens + [SEP] |
| Embedding | Token + Position + Segment embeddings |
| Encoder | 12 (or 24) bidirectional self-attention layers |
| Output | Contextualized representation per token |

| Model | Layers | Hidden | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |
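The parameter counts in the table follow directly from the layer and hidden sizes. A rough back-of-the-envelope estimate (the vocabulary size, position limit, and FFN ratio below are standard BERT values, assumed here rather than stated in this article):

```python
# Rough parameter-count estimate for a BERT-style encoder.
# Assumed config (not from the table): vocab 30522, max positions 512,
# 2 segment types, FFN dim = 4 * hidden; biases and LayerNorms included.

def estimate_params(layers: int, hidden: int, vocab: int = 30522,
                    max_pos: int = 512, segments: int = 2) -> int:
    embed = (vocab + max_pos + segments) * hidden + 2 * hidden  # + LayerNorm
    attn = 4 * (hidden * hidden + hidden)            # Q, K, V, output proj
    ffn = (hidden * 4 * hidden + 4 * hidden) \
        + (4 * hidden * hidden + hidden)             # up- and down-projection
    norms = 2 * 2 * hidden                           # two LayerNorms per layer
    return embed + layers * (attn + ffn + norms)

print(f"BERT-base  ~ {estimate_params(12, 768) / 1e6:.0f}M params")
print(f"BERT-large ~ {estimate_params(24, 1024) / 1e6:.0f}M params")
```

The estimate lands within a few percent of the published 110M/340M figures; most of the budget sits in the per-layer attention and feed-forward matrices.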
Pre-training: Masked Language Modeling (MLM)
BERT’s training objective: randomly mask 15% of tokens, then predict the originals.
The 80/10/10 Strategy
Of the 15% selected tokens:
| Treatment | Percentage | Example |
|---|---|---|
| Replace with [MASK] | 80% | “The [MASK] sat on the mat” |
| Replace with random word | 10% | “The banana sat on the mat” |
| Leave unchanged | 10% | “The cat sat on the mat” |
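Why not always use [MASK]? Because [MASK] never appears at fine-tuning or inference time; mixing in random and unchanged tokens forces the model to maintain a useful representation of every position. The corruption scheme fits in a few lines. A simplified sketch (real BERT samples exactly 15% of positions and works on subword tokens; the vocabulary and helper names here are illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["banana", "river", "blue", "quickly"]  # toy stand-in vocabulary

def mlm_corrupt(tokens, mask_prob=0.15, rng=random):
    """Apply BERT-style MLM corruption.

    Returns (corrupted tokens, indices the model must predict).
    """
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() >= mask_prob:
            continue                        # token not selected (85%)
        targets.append(i)                   # loss is computed at this position
        roll = rng.random()
        if roll < 0.8:
            out[i] = MASK                   # 80%: replace with [MASK]
        elif roll < 0.9:
            out[i] = rng.choice(VOCAB)      # 10%: replace with random word
        # else: 10% leave the token unchanged

    return out, targets

corrupted, targets = mlm_corrupt("the cat sat on the mat".split(),
                                 rng=random.Random(1))
print(corrupted, targets)
```

Re-running this each epoch with a fresh random state gives the “dynamic masking” that RoBERTa later adopted.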
The [CLS] Token
BERT prepends a special [CLS] token to every input. Its final representation acts as a sentence-level summary.
Because [CLS] attends to all other tokens via self-attention, it aggregates information from the entire sentence — perfect for classification tasks.
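In practice, a classifier is just one linear layer on top of the [CLS] vector. A toy sketch with numpy (the hidden states here are random stand-ins for BERT's actual output; shapes match BERT-base):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, num_classes, seq_len = 768, 2, 10

# Stand-in for BERT's final layer output: one vector per token, [CLS] first.
hidden_states = rng.normal(size=(seq_len, hidden))

# Classification head: a single linear layer applied to the [CLS] vector.
W = rng.normal(scale=0.02, size=(hidden, num_classes))
b = np.zeros(num_classes)

cls_vec = hidden_states[0]                 # position 0 is [CLS]
logits = cls_vec @ W + b
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)
```

Everything up to `cls_vec` is frozen or fine-tuned BERT; only `W` and `b` are new parameters learned for the downstream task.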
Fine-tuning BERT
BERT follows a two-stage paradigm:
Stage 1: Pre-train on massive unlabeled text (MLM objective)
Stage 2: Fine-tune on a small labeled dataset for your specific task
| Task | Input | Output From | Example |
|---|---|---|---|
| Sentiment | Single sentence | [CLS] | “Great movie!” → Positive |
| NER | Single sentence | Each token | “Alice went to Paris” → PER, O, O, LOC |
| Q&A | Question + Passage | Each token | Find start/end of answer span |
| Similarity | Two sentences | [CLS] | “I like cats” ≈ “Cats are great” |
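The Q&A row deserves a closer look: two per-token heads score each position as a possible answer start or end, and the prediction is the highest-scoring valid span. A sketch with hypothetical logits (in a fine-tuned model these come from learned linear heads):

```python
import numpy as np

# Hypothetical per-token logits from the start and end heads over an
# 8-token passage; real values come from fine-tuned weights.
start_logits = np.array([0.1, 0.2, 3.0, 0.1, 0.3, 0.2, 0.1, 0.0])
end_logits   = np.array([0.0, 0.1, 0.2, 0.4, 2.8, 0.1, 0.2, 0.1])

def best_span(start, end, max_len=5):
    """Pick (i, j) maximizing start[i] + end[j] with i <= j < i + max_len."""
    best, best_score = (0, 0), -np.inf
    for i in range(len(start)):
        for j in range(i, min(i + max_len, len(end))):
            score = start[i] + end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

print(best_span(start_logits, end_logits))  # → (2, 4): tokens 2..4 answer
```

The `i <= j` constraint and length cap rule out degenerate spans that the raw argmaxes could otherwise produce.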
BERT vs GPT
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Attention | Bidirectional (sees all) | Causal (sees only left) |
| Pre-training | Masked LM (fill in blanks) | Next token prediction |
| Strength | Understanding tasks | Generation tasks |
| Can generate? | Not directly | Yes, autoregressively |
The Practical Impact
BERT dominated NLP benchmarks from 2018 to 2020. But the tide then shifted:
- GPT-3 (2020) showed that large decoder-only models can do understanding tasks too — via in-context learning
- Modern LLMs (GPT-4, Claude, LLaMA) are all decoder-only but handle both understanding and generation
BERT-style models remain useful for:
- Efficient classification and search (smaller, faster)
- Sentence embeddings (e.g., all-MiniLM, E5)
- Token-level tasks like NER
RoBERTa (2019)
- Removed the next sentence prediction (NSP) objective, an auxiliary task original BERT trained alongside MLM
- Trained longer with more data
- Dynamic masking (re-mask each epoch)
- Consistently outperforms BERT
DistilBERT (2019)
- 40% smaller, 60% faster
- Retains 97% of BERT’s performance
- Uses knowledge distillation
DeBERTa (2020)
- Disentangled attention (separate content and position)
- Enhanced mask decoder
- State-of-the-art on many benchmarks
Modern Embedding Models
- E5 and GTE: BERT-style models optimized for embeddings
- all-MiniLM: Compact model for semantic search
- These power most RAG (retrieval-augmented generation) systems today
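At retrieval time, these embedding models reduce search to nearest-neighbor lookup by cosine similarity. A toy sketch (the 3-d vectors stand in for real sentence embeddings, which a model like all-MiniLM would produce in 384 dimensions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for sentence embeddings; similar sentences -> nearby vectors.
query = np.array([0.9, 0.1, 0.2])          # embedding of "I like cats"
docs = {
    "cats are great":    np.array([0.8, 0.2, 0.1]),
    "stock prices fell": np.array([0.1, 0.9, 0.3]),
}

# Rank documents by similarity to the query embedding.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # → "cats are great"
```

Real RAG systems do exactly this at scale, swapping the `sorted` call for an approximate nearest-neighbor index.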
Next: GPT
While BERT excels at understanding, GPT excels at generation. Next: GPT — Decoder-Only Transformers →