GPT — Generative Pre-trained Transformer
The Big Idea: Just Predict the Next Word
GPT’s training objective is the simplest possible:
Given all previous tokens, predict the next one.
That’s it. No masked-token objectives, no fill-in-the-blank, no auxiliary tasks. Just relentlessly predict the next token, billions of times, over massive text corpora.
The remarkable discovery: this simple objective, at sufficient scale, produces strikingly general capabilities.
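In code, that objective is plain cross-entropy on shifted tokens. A minimal numpy sketch, with random logits standing in for a real model’s output:

```python
import numpy as np

# Toy "document" of token ids over a tiny vocabulary.
tokens = np.array([5, 2, 7, 2, 9])          # e.g. "The cat sat on the"
vocab_size = 10

# The model reads tokens[:-1] and must predict tokens[1:].
inputs, targets = tokens[:-1], tokens[1:]

# Stand-in for model output: one row of logits per input position.
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), vocab_size))

# Softmax over the vocabulary at each position.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Cross-entropy: negative log-probability of the true next token.
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(f"next-token loss: {loss:.3f}")
```

Training is nothing more than driving this loss down over billions of such positions.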
Architecture
GPT is the decoder stack from the Transformer — with causal (masked) self-attention:
| Stage | What Happens |
|---|---|
| Input | Token sequence (e.g., “The cat sat on the”) |
| Embedding | Token + Position embeddings |
| Decoder layers | N layers of causal self-attention + FFN |
| Output projection | Hidden state → vocabulary probabilities |
| Prediction | Sample next token from distribution |
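The causal masking in the decoder layers can be sketched in a few lines of numpy. A minimal single-head version with hypothetical random weights (real implementations add multiple heads, dropout, and learned projections per layer):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention with a causal mask (minimal sketch)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                  # 4 token positions
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = causal_self_attention(x, *W)
print(out.shape)  # (4, 8)
```

Because of the mask, position 0 can attend only to itself, so the model can be trained on every position of a sequence in parallel without leaking future tokens.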
Successive GPT generations scaled this recipe up:
| Model | Year | Layers | d_model | Params | Context |
|---|---|---|---|---|---|
| GPT-1 | 2018 | 12 | 768 | 117M | 512 |
| GPT-2 | 2019 | 48 | 1600 | 1.5B | 1024 |
| GPT-3 | 2020 | 96 | 12288 | 175B | 2048 |
| GPT-4 | 2023 | ? | ? | ~1T? | 128K |
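A common rule of thumb, roughly 12·L·d² non-embedding parameters (attention projections plus a 4×-wide FFN), reproduces these counts. A sketch, assuming GPT-2/3’s vocabulary size of 50,257:

```python
def approx_params(n_layers, d_model, vocab=50257):
    """Back-of-envelope GPT parameter count."""
    # Per layer: ~4*d^2 for attention (Q, K, V, output)
    # plus ~8*d^2 for the 4x-wide feed-forward network.
    block = 12 * n_layers * d_model ** 2
    embed = vocab * d_model  # token embedding (often tied to output proj.)
    return block + embed

print(f"GPT-2: {approx_params(48, 1600) / 1e9:.2f}B")    # ~1.5B
print(f"GPT-3: {approx_params(96, 12288) / 1e9:.0f}B")   # ~175B
```

The estimate ignores layer norms, biases, and position embeddings, which contribute comparatively little at these scales.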
Generation in Action
Autoregressive generation builds text one token at a time: sample the next token from the model’s output distribution, append it to the sequence, and feed the extended sequence back in. Sampling settings such as temperature and top-k change how each draw is made.
Understanding the Controls
Temperature scales the logits before softmax:
| Temperature | Distribution | Effect |
|---|---|---|
| T = 0.1 | Very peaked | Almost always picks the top token (near-deterministic) |
| T = 1.0 | Standard | Samples naturally from learned distribution |
| T = 2.0 | Flattened | More diverse, surprising choices |
Top-k limits sampling to the top k most likely tokens, filtering out low-probability noise.
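Both controls combine naturally in one sampling step. A minimal numpy sketch (`sample_next` is a hypothetical helper, not a library function):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from logits with temperature and top-k filtering."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature           # T < 1 sharpens, T > 1 flattens
    if top_k is not None:
        # Mask everything outside the k highest logits.
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled < cutoff, -np.inf, scaled)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_next(logits, temperature=0.1))          # almost always 0
print(sample_next(logits, temperature=2.0, top_k=3)) # more varied
```

Top-p (nucleus) sampling works the same way, except the cutoff keeps the smallest set of tokens whose cumulative probability exceeds p.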
Scaling Laws: More Compute = Better Model
OpenAI’s scaling-law studies (Kaplan et al., 2020), which guided GPT-3’s design, showed that model performance follows predictable power laws:
| Factor | Scaling Relationship | Practical Meaning |
|---|---|---|
| Parameters (N) | L ∝ N^(−α_N) | 10× more params → ~20% loss reduction |
| Data (D) | L ∝ D^(−α_D) | 10× more data → ~25% loss reduction |
| Compute (C) | L ∝ C^(−α_C) | 10× more compute → ~15% loss reduction |
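The power-law form behind these figures can be evaluated directly. A minimal sketch assuming the Kaplan et al. (2020) parameter fit, L(N) = (N_c / N)^α_N; the constants come from that paper and are illustrative, and the exact per-decade percentages vary with the fit used:

```python
# Parameter scaling fit from Kaplan et al. (2020); constants illustrative.
ALPHA_N = 0.076     # parameter exponent
N_C = 8.8e13        # scale constant (in parameters)

def loss_from_params(n_params):
    """Predicted test loss as a function of non-embedding parameter count."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}: predicted loss = {loss_from_params(n):.3f}")

# Under this fit, each 10x in parameters multiplies loss by
# 10 ** -ALPHA_N, roughly 0.84.
print(10 ** -ALPHA_N)
```

Analogous fits exist for data and compute, which is what makes compute budgeting for a target loss possible at all.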
This means scaling is predictable: if you want a better model, you can estimate in advance roughly how much more compute, data, and parameters you need.
Emergent Abilities
At certain scales, new capabilities appear suddenly — they’re nearly absent in smaller models. The thresholds below are rough observations, not hard cutoffs:
| Ability | Appears Around | Example |
|---|---|---|
| Few-shot learning | ~1B params | Learning from examples in the prompt |
| Arithmetic | ~10B params | “What is 47 × 83?” |
| Code generation | ~50B params | Writing working Python functions |
| Chain-of-thought | ~100B params | “Let me think step by step…” |
| Self-correction | ~100B+ params | “Wait, that’s wrong. Let me reconsider…” |
In-Context Learning: GPT’s Superpower
GPT-3’s most surprising ability: learning from examples in the prompt without updating weights.
| Prompting Style | Example |
|---|---|
| Zero-shot | “Translate to French: Hello world” → “Bonjour le monde” |
| One-shot | “Good morning → Bonjour” then “Hello world →” → “Bonjour le monde” |
| Few-shot | Multiple examples then query → model identifies the pattern |
The model doesn’t update its parameters — it uses the attention mechanism to identify the pattern in the prompt and apply it. This is a form of meta-learning that emerges at scale.
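On the user’s side, few-shot prompting is nothing more than string assembly; the learning happens inside the forward pass. A toy sketch (the `->` separator and `few_shot_prompt` helper are arbitrary illustrative choices):

```python
def few_shot_prompt(examples, query):
    """Build a few-shot prompt: the pattern is conveyed entirely in-context."""
    lines = [f"{src} -> {tgt}" for src, tgt in examples]
    lines.append(f"{query} ->")   # model completes after the final arrow
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("Good morning", "Bonjour"), ("Thank you", "Merci")],
    "Hello world",
)
print(prompt)
# Good morning -> Bonjour
# Thank you -> Merci
# Hello world ->
```

Given such a prompt, the model infers the translation pattern from the two examples and continues with the French for the query, with no gradient updates anywhere.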
From GPT to ChatGPT: RLHF
Pre-trained GPT is a great text predictor but not a great assistant. Reinforcement Learning from Human Feedback (RLHF) bridges this gap:
| Step | Process | Result |
|---|---|---|
| 1 | Pre-train on text (next-token prediction) | Raw language model |
| 2 | Supervised fine-tuning (SFT) on conversations | Follows instructions |
| 3 | Train reward model from human preferences | Knows what “good” looks like |
| 4 | Optimize the policy with RL (e.g., PPO) against the reward model; DPO instead learns directly from preference pairs, skipping the explicit reward model | Helpful, harmless, honest |
This process transforms a raw text predictor into ChatGPT, Claude, or other chat models.
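Step 3 is typically trained with a Bradley–Terry pairwise preference loss: given two responses to the same prompt, the reward model should score the human-preferred one higher. A minimal numpy sketch:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred response higher."""
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Reward model already ranks the preferred answer higher -> small loss.
print(preference_loss(2.0, -1.0))   # ~0.049
# Ranks it lower -> large loss pushes the model to flip the ordering.
print(preference_loss(-1.0, 2.0))   # ~3.049
```

The scalar rewards this model produces are what PPO then maximizes in step 4 (minus a KL penalty that keeps the policy close to the SFT model).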
The Open-Source Revolution (2023+)
| Model | Organization | Params | Notable Feature |
|---|---|---|---|
| LLaMA | Meta | 7-70B | RoPE, RMSNorm, SwiGLU |
| LLaMA 2 | Meta | 7-70B | GQA, longer context |
| Mistral 7B | Mistral AI | 7B | Sliding window attention |
| Mixtral 8x7B | Mistral AI | 47B (~13B active) | Mixture of Experts |
| Gemma | Google | 2-7B | RMSNorm, GeGLU |
| Qwen 2.5 | Alibaba | 0.5-72B | Multilingual, long context |
These models approach or match GPT-3.5 quality while being freely available. The combination of architectural innovations (RoPE, GQA, SwiGLU, MoE) with efficient training recipes has democratized large language models.
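As one example of those innovations, RoPE (rotary position embedding) encodes position by rotating pairs of query/key dimensions by a position-dependent angle, so attention scores depend only on relative offsets. A minimal numpy sketch; the dimension-pairing convention varies between implementations:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding sketch: rotate dimension pairs
    (x[i], x[i + d/2]) by an angle proportional to the position."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)    # per-pair frequency
    angles = positions[:, None] * freqs[None, :]    # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))        # 4 positions, head dimension 8
q_rot = rope(q, np.arange(4))
print(q_rot.shape)                 # (4, 8)
```

The useful property: the dot product between a rotated query at position p1 and a rotated key at position p2 depends only on p2 − p1, which is part of why RoPE-based models extend to longer contexts more gracefully than learned absolute positions.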
Summary
| Concept | Details |
|---|---|
| Architecture | Decoder-only transformer with causal masking |
| Training | Next-token prediction on massive text |
| Key innovation | Scale + a simple objective yields broadly general capability |
| Sampling | Temperature, top-k, top-p control generation diversity |
| Emergent abilities | Few-shot learning, reasoning, code generation |
| RLHF | Aligns raw LLM with human preferences |
Next: LLM Agents
LLMs can do more than generate text — they can use tools, plan, and take actions. Next: LLM Agents →