BERT — Bidirectional Encoder Representations from Transformers

The Key Insight: Looking Both Ways

GPT reads text left-to-right — when predicting a word, it can only see what came before. But for understanding tasks (not generation), wouldn’t it help to see the whole sentence?

“The cat sat on the [___] because it was dirty.”

With only left context (“The cat sat on the”), you might guess “mat”, “floor”, or “chair”. But with the right context too (“because it was dirty”), “mat” becomes much more likely — you need both sides!

BERT’s innovation: bidirectional attention. Every token sees every other token — no causal mask.
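The difference comes down to the attention mask. A minimal sketch with boolean masks (True = "may attend to"):

```python
import numpy as np

seq_len = 5

# BERT: bidirectional — every token attends to every other token.
bert_mask = np.ones((seq_len, seq_len), dtype=bool)

# GPT: causal — token i attends only to positions <= i.
gpt_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Token 2 sees the whole sequence under BERT,
# but only itself and its left context under GPT.
print(bert_mask[2])  # all positions visible
print(gpt_mask[2])   # positions 0..2 visible, 3..4 hidden
```

In a real implementation the mask is added to the attention scores before the softmax (masked positions get -inf); the boolean matrices above capture which entries survive.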

Interactive: BERT vs GPT Predictions

(Interactive demo: click any word to mask it, then compare how BERT — which sees everything — and GPT — which sees only left context — predict the masked word.)

Architecture

BERT is simply the encoder stack from the original Transformer — no decoder, no causal masking:

| Stage     | What Happens                                    |
|-----------|-------------------------------------------------|
| Input     | [CLS] + tokens + [SEP]                          |
| Embedding | Token + Position + Segment embeddings           |
| Encoder   | 12 (or 24) bidirectional self-attention layers  |
| Output    | Contextualized representation per token         |

| Model      | Layers | Hidden | Heads | Parameters |
|------------|--------|--------|-------|------------|
| BERT-base  | 12     | 768    | 12    | 110M       |
| BERT-large | 24     | 1024   | 16    | 340M       |
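The parameter counts in the table can be roughly reproduced from the other columns. A back-of-the-envelope tally for BERT-base (vocab size 30,522 and max position 512 are the published bert-base-uncased values; biases and LayerNorms included):

```python
V, P, S = 30_522, 512, 2        # vocab, max positions, segment types
H, L, FF = 768, 12, 4 * 768     # hidden size, layers, FFN width

embeddings = (V + P + S) * H + 2 * H          # token/pos/segment + LayerNorm
attention  = 4 * (H * H + H)                  # Q, K, V, output projections
ffn        = (H * FF + FF) + (FF * H + H)     # two dense layers
layer      = attention + ffn + 2 * (2 * H)    # + two LayerNorms
pooler     = H * H + H                        # [CLS] pooler head

total = embeddings + L * layer + pooler
print(f"{total / 1e6:.1f}M parameters")       # ≈ 109.5M, i.e. the table's ~110M
```

Note that the embedding table alone is ~24M parameters, roughly a fifth of the model.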

Pre-training: Masked Language Modeling (MLM)

BERT’s pre-training objective: randomly select 15% of the tokens, corrupt them, and train the model to predict the originals at those positions.

The 80/10/10 Strategy

Of the 15% selected tokens:

| Treatment                | Percentage | Example                      |
|--------------------------|------------|------------------------------|
| Replace with [MASK]      | 80%        | “The [MASK] sat on the mat”  |
| Replace with random word | 10%        | “The banana sat on the mat”  |
| Leave unchanged          | 10%        | “The cat sat on the mat”     |
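The selection-then-80/10/10 procedure can be sketched in a few lines (the toy `vocab` list stands in for sampling a random vocabulary token):

```python
import random

def mlm_mask(tokens, vocab, p_select=0.15, seed=0):
    """BERT-style MLM corruption: pick ~15% of tokens, then apply 80/10/10."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= p_select:
            continue                          # token not selected
        targets[i] = tok                      # model must predict the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with random word
        # else: 10% leave the token unchanged
    return corrupted, targets

tokens = "the cat sat on the mat because it was dirty".split()
vocab = ["banana", "river", "blue", "ran"]
corrupted, targets = mlm_mask(tokens, vocab)
print(corrupted, targets)
```

The loss is computed only at the selected positions (`targets`), not over the whole sequence.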
🤔 Quick Check
Why does BERT replace some masked tokens with random words instead of always using [MASK]?

The [CLS] Token

BERT prepends a special [CLS] token to every input. Its final-layer representation acts as a sentence-level summary.

Because [CLS] attends to all other tokens via self-attention, it aggregates information from the entire sentence — perfect for classification tasks.
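For classification, a single linear layer is attached on top of the [CLS] vector. A minimal sketch, using random numbers as a stand-in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768

# Encoder output for a 6-token input: position 0 is [CLS].
# (Random stand-in — in practice this comes from the encoder stack.)
encoder_out = rng.normal(size=(6, hidden))
cls_vec = encoder_out[0]                 # sentence-level summary

# Classification head: one linear layer + softmax over 2 labels.
W = rng.normal(size=(hidden, 2)) * 0.02  # learned during fine-tuning
b = np.zeros(2)
logits = cls_vec @ W + b
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)                             # e.g. P(negative), P(positive)
```

Only `W` and `b` are new at fine-tuning time; everything below them is the pre-trained encoder.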


Fine-tuning BERT

BERT follows a two-stage paradigm:

Stage 1: Pre-train on massive unlabeled text (MLM objective)
Stage 2: Fine-tune on a small labeled dataset for your specific task

| Task       | Input              | Output From | Example                                |
|------------|--------------------|-------------|----------------------------------------|
| Sentiment  | Single sentence    | [CLS]       | “Great movie!” → Positive              |
| NER        | Single sentence    | Each token  | “Alice went to Paris” → PER, O, O, LOC |
| Q&A        | Question + Passage | Each token  | Find start/end of answer span          |
| Similarity | Two sentences      | [CLS]       | “I like cats” ≈ “Cats are great”       |
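The Q&A row deserves a closer look: the fine-tuned head scores every token as a possible answer start and answer end, then picks the best valid span. A sketch with random stand-ins for the per-token encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 10, 768

# Per-token encoder outputs for "[CLS] question [SEP] passage [SEP]".
# (Random stand-in for the real encoder stack.)
token_states = rng.normal(size=(seq_len, hidden))

# QA head: two learned vectors score each token as start / end.
w_start = rng.normal(size=hidden)
w_end = rng.normal(size=hidden)

start_logits = token_states @ w_start
end_logits = token_states @ w_end

# Pick the best valid span (start <= end).
start = int(np.argmax(start_logits))
end = start + int(np.argmax(end_logits[start:]))
print(f"answer span: tokens {start}..{end}")
```

Real systems also restrict the span to passage tokens and cap its length; this sketch keeps only the core start/end idea.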

BERT vs GPT

| Aspect        | BERT                       | GPT                     |
|---------------|----------------------------|-------------------------|
| Architecture  | Encoder-only               | Decoder-only            |
| Attention     | Bidirectional (sees all)   | Causal (sees only left) |
| Pre-training  | Masked LM (fill in blanks) | Next token prediction   |
| Strength      | Understanding tasks        | Generation tasks        |
| Can generate? | Not directly               | Yes, autoregressively   |

The Practical Impact

BERT dominated NLP benchmarks from 2018 to 2020. But the tide shifted:

  • GPT-3 (2020) showed that large decoder-only models can do understanding tasks too — via in-context learning
  • Modern LLMs (GPT-4, Claude, LLaMA) are all decoder-only but handle both understanding and generation

BERT-style models remain useful for:

  • Efficient classification and search (smaller, faster)
  • Sentence embeddings (e.g., all-MiniLM, E5)
  • Token-level tasks like NER
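Sentence-embedding use looks like this in practice: pool the per-token outputs into one vector per sentence, then compare vectors with cosine similarity. A sketch (mean pooling, as used by many sentence-embedding models; random stand-ins for encoder outputs):

```python
import numpy as np

def mean_pool(token_embs):
    """Average token vectors into a single sentence embedding."""
    return token_embs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Stand-ins for encoder outputs of two sentences
# (real models produce these from a BERT-style encoder).
sent_a = mean_pool(rng.normal(size=(5, 384)))
sent_b = mean_pool(rng.normal(size=(7, 384)))
print(cosine(sent_a, sent_b))  # similarity in [-1, 1]
```

Because each sentence is embedded independently, similarity search over millions of documents reduces to fast vector comparisons — the backbone of semantic search and RAG retrieval.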

RoBERTa (2019)

  • Removed NSP (next sentence prediction) objective
  • Trained longer with more data
  • Dynamic masking (re-mask each epoch)
  • Consistently outperforms BERT
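Dynamic masking is a small change: instead of fixing the masked positions once at preprocessing time, re-sample them on every pass over the data. A sketch of the idea (seeding on the epoch is just an illustration of "fresh mask each pass"):

```python
import random

def sample_mask(tokens, epoch, p=0.15):
    """Pick which positions to mask this epoch."""
    rng = random.Random(epoch)
    return {i for i in range(len(tokens)) if rng.random() < p}

tokens = "the quick brown fox jumps over the lazy dog".split()

static = sample_mask(tokens, epoch=0)                       # BERT: fixed once
dynamic = [sample_mask(tokens, epoch=e) for e in range(3)]  # RoBERTa: per epoch
print(static, dynamic)
```

Over many epochs the model therefore sees each sentence with many different masks, which acts as extra data augmentation.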

DistilBERT (2019)

  • 40% smaller, 60% faster
  • Retains 97% of BERT’s performance
  • Uses knowledge distillation
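The distillation part can be sketched as a loss: the student is trained to match the teacher's temperature-softened output distribution (numbers below are illustrative logits, not real model outputs):

```python
import numpy as np

def softmax(x, T=1.0):
    z = np.exp((x - x.max()) / T)   # shift by max for numerical stability
    return z / z.sum()

# Teacher (BERT) and student (DistilBERT) logits for one example.
teacher_logits = np.array([4.0, 1.0, 0.5])
student_logits = np.array([3.0, 1.5, 0.2])

# Cross-entropy of the student against the teacher's softened
# distribution; T > 1 softens the targets, exposing the teacher's
# relative preferences among wrong answers.
T = 2.0
p_teacher = softmax(teacher_logits, T)
log_p_student = np.log(softmax(student_logits, T))
distill_loss = -(p_teacher * log_p_student).sum()
print(distill_loss)
```

DistilBERT's full objective also includes the usual MLM loss and a hidden-state alignment term; the soft-target loss above is the distillation core.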

DeBERTa (2020)

  • Disentangled attention (separate content and position)
  • Enhanced mask decoder
  • State-of-the-art on many benchmarks

Modern Embedding Models

  • E5 and GTE: BERT-style models optimized for embeddings
  • all-MiniLM: Compact model for semantic search
  • These power most RAG (retrieval-augmented generation) systems today

Next: GPT

While BERT excels at understanding, GPT excels at generation. Continue to GPT — Decoder-Only Transformers →