BERT — Bidirectional Encoder Representations from Transformers
The Key Insight: Looking Both Ways
GPT reads text left-to-right — when predicting a word, it can only see what came before. But for understanding tasks (not generation), wouldn’t it help to see the whole sentence?
“The cat sat on the [___] because it was dirty.”
With only left context (“The cat sat on the”), you might guess “mat”, “floor”, or “chair”. But with the right context too (“because it was dirty”), “mat” becomes much more likely — you need both sides!
BERT’s innovation: bidirectional attention. Every token sees every other token — no causal mask.
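The difference is literally just the attention mask. A minimal sketch (using numpy; the token positions and sequence length are illustrative, not from any real model):

```python
import numpy as np

seq_len = 5  # toy sequence length

# Bidirectional (BERT): every token may attend to every other token.
bert_mask = np.ones((seq_len, seq_len), dtype=bool)

# Causal (GPT): token i may only attend to positions <= i
# (lower-triangular mask).
gpt_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(bert_mask.astype(int))
print(gpt_mask.astype(int))
```

Row `i` of each matrix marks which positions token `i` can see: all of them for BERT, only the left context for GPT.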
Architecture
BERT is simply the encoder stack from the original Transformer — no decoder, no causal masking:
| Stage | What Happens |
|---|---|
| Input | [CLS] + tokens + [SEP] |
| Embedding | Token + Position + Segment embeddings |
| Encoder | 12 (or 24) bidirectional self-attention layers |
| Output | Contextualized representation per token |

| Model | Layers | Hidden | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |
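The parameter counts in the table follow directly from the layer and hidden sizes. A rough back-of-the-envelope estimate (the vocabulary size, position limit, and FFN ratio below are standard BERT values, assumed here rather than stated in this article):

```python
# Rough parameter-count estimate for a BERT-style encoder.
# Assumed config (not from the table): vocab 30522, max positions 512,
# 2 segment types, FFN dim = 4 * hidden; biases and LayerNorms included.

def estimate_params(layers: int, hidden: int, vocab: int = 30522,
                    max_pos: int = 512, segments: int = 2) -> int:
    embed = (vocab + max_pos + segments) * hidden + 2 * hidden  # + LayerNorm
    attn = 4 * (hidden * hidden + hidden)            # Q, K, V, output proj
    ffn = (hidden * 4 * hidden + 4 * hidden) \
        + (4 * hidden * hidden + hidden)             # up- and down-projection
    norms = 2 * 2 * hidden                           # two LayerNorms per layer
    return embed + layers * (attn + ffn + norms)

print(f"BERT-base  ~ {estimate_params(12, 768) / 1e6:.0f}M params")
print(f"BERT-large ~ {estimate_params(24, 1024) / 1e6:.0f}M params")
```

The estimate lands within a few percent of the published 110M/340M figures; most of the budget sits in the per-layer attention and feed-forward matrices.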
Pre-training: Masked Language Modeling (MLM)
BERT’s training objective: randomly mask 15% of tokens, then predict the originals.
The 80/10/10 Strategy
Of the 15% selected tokens:
| Treatment | Percentage | Example |
|---|---|---|
| Replace with [MASK] | 80% | “The [MASK] sat on the mat” |
| Replace with random word | 10% | “The banana sat on the mat” |
| Leave unchanged | 10% | “The cat sat on the mat” |
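Why not always use [MASK]? Because [MASK] never appears at fine-tuning or inference time; mixing in random and unchanged tokens forces the model to maintain a useful representation of every position. The corruption scheme fits in a few lines. A simplified sketch (real BERT samples exactly 15% of positions and works on subword tokens; the vocabulary and helper names here are illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["banana", "river", "blue", "quickly"]  # toy stand-in vocabulary

def mlm_corrupt(tokens, mask_prob=0.15, rng=random):
    """Apply BERT-style MLM corruption.

    Returns (corrupted tokens, indices the model must predict).
    """
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() >= mask_prob:
            continue                        # token not selected (85%)
        targets.append(i)                   # loss is computed at this position
        roll = rng.random()
        if roll < 0.8:
            out[i] = MASK                   # 80%: replace with [MASK]
        elif roll < 0.9:
            out[i] = rng.choice(VOCAB)      # 10%: replace with random word
        # else: 10% leave the token unchanged

    return out, targets

corrupted, targets = mlm_corrupt("the cat sat on the mat".split(),
                                 rng=random.Random(1))
print(corrupted, targets)
```

Re-running this each epoch with a fresh random state gives the “dynamic masking” that RoBERTa later adopted.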
The [CLS] Token
BERT prepends a special [CLS] token to every input. Its final representation acts as a sentence-level summary.
Because [CLS] attends to all other tokens via self-attention, it aggregates information from the entire sentence — perfect for classification tasks.
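In practice, a classifier is just one linear layer on top of the [CLS] vector. A toy sketch with numpy (the hidden states here are random stand-ins for BERT's actual output; shapes match BERT-base):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, num_classes, seq_len = 768, 2, 10

# Stand-in for BERT's final layer output: one vector per token, [CLS] first.
hidden_states = rng.normal(size=(seq_len, hidden))

# Classification head: a single linear layer applied to the [CLS] vector.
W = rng.normal(scale=0.02, size=(hidden, num_classes))
b = np.zeros(num_classes)

cls_vec = hidden_states[0]                 # position 0 is [CLS]
logits = cls_vec @ W + b
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)
```

Everything up to `cls_vec` is frozen or fine-tuned BERT; only `W` and `b` are new parameters learned for the downstream task.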
Fine-tuning BERT
BERT follows a two-stage paradigm:
Stage 1: Pre-train on massive unlabeled text (MLM objective)
Stage 2: Fine-tune on a small labeled dataset for your specific task
| Task | Input | Output From | Example |
|---|---|---|---|
| Sentiment | Single sentence | [CLS] | “Great movie!” → Positive |
| NER | Single sentence | Each token | “Alice went to Paris” → PER, O, O, LOC |
| Q&A | Question + Passage | Each token | Find start/end of answer span |
| Similarity | Two sentences | [CLS] | “I like cats” ≈ “Cats are great” |
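The Q&A row deserves a closer look: two per-token heads score each position as a possible answer start or end, and the prediction is the highest-scoring valid span. A sketch with hypothetical logits (in a fine-tuned model these come from learned linear heads):

```python
import numpy as np

# Hypothetical per-token logits from the start and end heads over an
# 8-token passage; real values come from fine-tuned weights.
start_logits = np.array([0.1, 0.2, 3.0, 0.1, 0.3, 0.2, 0.1, 0.0])
end_logits   = np.array([0.0, 0.1, 0.2, 0.4, 2.8, 0.1, 0.2, 0.1])

def best_span(start, end, max_len=5):
    """Pick (i, j) maximizing start[i] + end[j] with i <= j < i + max_len."""
    best, best_score = (0, 0), -np.inf
    for i in range(len(start)):
        for j in range(i, min(i + max_len, len(end))):
            score = start[i] + end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

print(best_span(start_logits, end_logits))  # → (2, 4): tokens 2..4 answer
```

The `i <= j` constraint and length cap rule out degenerate spans that the raw argmaxes could otherwise produce.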
BERT vs GPT
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Attention | Bidirectional (sees all) | Causal (sees only left) |
| Pre-training | Masked LM (fill in blanks) | Next token prediction |
| Strength | Understanding tasks | Generation tasks |
| Can generate? | Not directly | Yes, autoregressively |
The Practical Impact
BERT dominated NLP benchmarks from 2018 to 2020. But the tide then shifted:
- GPT-3 (2020) showed that large decoder-only models can do understanding tasks too — via in-context learning
- Modern LLMs (GPT-4, Claude, LLaMA) are all decoder-only but handle both understanding and generation
BERT-style models remain useful for:
- Efficient classification and search (smaller, faster)
- Sentence embeddings (e.g., all-MiniLM, E5)
- Token-level tasks like NER
RoBERTa (2019)
- Removed the next sentence prediction (NSP) objective, an auxiliary task original BERT trained alongside MLM
- Trained longer with more data
- Dynamic masking (re-mask each epoch)
- Consistently outperforms BERT
DistilBERT (2019)
- 40% smaller, 60% faster
- Retains 97% of BERT’s performance
- Uses knowledge distillation
DeBERTa (2020)
- Disentangled attention (separate content and position)
- Enhanced mask decoder
- State-of-the-art on many benchmarks
Modern Embedding Models
- E5 and GTE: BERT-style models optimized for embeddings
- all-MiniLM: Compact model for semantic search
- These power most RAG (retrieval-augmented generation) systems today
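At retrieval time, these embedding models reduce search to nearest-neighbor lookup by cosine similarity. A toy sketch (the 3-d vectors stand in for real sentence embeddings, which a model like all-MiniLM would produce in 384 dimensions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for sentence embeddings; similar sentences -> nearby vectors.
query = np.array([0.9, 0.1, 0.2])          # embedding of "I like cats"
docs = {
    "cats are great":    np.array([0.8, 0.2, 0.1]),
    "stock prices fell": np.array([0.1, 0.9, 0.3]),
}

# Rank documents by similarity to the query embedding.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # → "cats are great"
```

Real RAG systems do exactly this at scale, swapping the `sorted` call for an approximate nearest-neighbor index.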
Next: GPT
While BERT excels at understanding, GPT excels at generation. Next: GPT — Decoder-Only Transformers →