GPT — Generative Pre-trained Transformer

The Big Idea: Just Predict the Next Word

GPT’s training objective is the simplest possible:

Given all previous tokens, predict the next one.

P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})

That’s it. No masked-token prediction, no fill-in-the-blank, no special objectives. Just relentlessly predict the next token, billions of times, on massive text corpora.
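
The chain-rule factorization is easy to make concrete. A toy sketch (the per-step probabilities are invented): multiply the per-step conditionals to get the joint probability, and note why real implementations sum log-probabilities instead.

```python
import math

# Chain rule for a sequence: joint probability = product of per-step
# next-token probabilities. The numbers below are made up for illustration.
next_token_probs = [0.5, 0.3, 0.2, 0.4]  # P(x_i | x_1, ..., x_{i-1}) at each step

joint = math.prod(next_token_probs)

# Real implementations sum log-probabilities: a product of thousands of
# values < 1 underflows to zero in floating point.
log_joint = sum(math.log(p) for p in next_token_probs)

print(f"{joint:.3f}")                # 0.012
print(f"{math.exp(log_joint):.3f}")  # 0.012 — same value, computed stably
```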

The remarkable discovery: this simple objective, at sufficient scale, produces general intelligence.


Architecture

GPT is the decoder stack from the Transformer — with causal (masked) self-attention:

| Stage | What Happens |
|---|---|
| Input | Token sequence (e.g., “The cat sat on the”) |
| Embedding | Token + position embeddings |
| Decoder layers | N layers of causal self-attention + FFN |
| Output projection | Hidden state → vocabulary probabilities |
| Prediction | Sample next token from the distribution |
| Model | Year | Layers | d_model | Params | Context |
|---|---|---|---|---|---|
| GPT-1 | 2018 | 12 | 768 | 117M | 512 |
| GPT-2 | 2019 | 48 | 1600 | 1.5B | 1024 |
| GPT-3 | 2020 | 96 | 12288 | 175B | 2048 |
| GPT-4 | 2023 | ? | ? | ~1T? | 128K |
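
The causal masking at the heart of every decoder layer can be grounded in a few lines. A minimal NumPy sketch using illustrative shapes and random values, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                    # sequence length, head dimension
Q = rng.normal(size=(T, d))    # queries, keys, and values would normally
K = rng.normal(size=(T, d))    # come from learned projections of the
V = rng.normal(size=(T, d))    # token + position embeddings

scores = Q @ K.T / np.sqrt(d)  # (T, T) attention logits
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf         # position i cannot attend to positions > i

# Row-wise softmax; masked positions get exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V              # (T, d) contextualized representations

assert np.allclose(np.triu(weights, k=1), 0.0)  # strictly causal
```

The mask is what makes training efficient: every position’s next-token prediction is computed in parallel, yet no position can see its own future.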

Interactive: Generation in Action

Experiment with autoregressive generation. Adjust temperature and top-k to see how sampling strategies change the output:

Autoregressive Generation

Given the prompt “The cat”, the model’s next-token distribution might look like this:

| Token | Probability |
|---|---|
| “was” | 30.3% |
| “sat” | 21.1% |
| “ran” | 13.5% |
| “is” | 9.8% |
| “and” | 5.2% |
| “small” | 4.8% |
| “on” | 4.3% |
| “your” | 4.0% |
| “hat” | 3.5% |
| “dog” | 3.4% |
| “the” | 0.0% |
| “a” | 0.0% |

Try: Set temperature to 0.1 (near-deterministic) vs 2.0 (highly random). Set top-k to 1 (greedy) vs 30 (diverse).

Understanding the Controls

Temperature scales the logits before softmax:

P(x_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}

| Temperature | Distribution | Effect |
|---|---|---|
| T = 0.1 | Very peaked | Almost always picks the top token (near-deterministic) |
| T = 1.0 | Standard | Samples from the learned distribution as-is |
| T = 2.0 | Flattened | More diverse, surprising choices |

Top-k limits sampling to the top k most likely tokens, filtering out low-probability noise.
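
Both controls fit in a few lines. A sketch with an invented five-token vocabulary and made-up logits (not values from any real model):

```python
import numpy as np

tokens = ["was", "sat", "ran", "is", "and"]   # toy vocabulary
logits = np.array([2.0, 1.6, 1.2, 0.9, 0.3])  # invented scores

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng()
    z = logits / temperature               # temperature scales the logits
    if top_k is not None:                  # keep only the k largest logits
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)
    p = np.exp(z - z.max())                # softmax over surviving logits
    p /= p.sum()
    return rng.choice(len(tokens), p=p), p

# Low temperature with top_k=1 collapses to greedy decoding:
idx, p = sample_next(logits, temperature=0.1, top_k=1)
print(tokens[idx], p.round(2))  # was [1. 0. 0. 0. 0.]
```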

🤔 Quick Check
If you set temperature to 0.1 and top-k to 1, what type of generation do you get?

Scaling Laws: More Compute = Better Model

The scaling-law work behind GPT-3 (Kaplan et al., 2020) revealed that model performance follows predictable power laws:

| Factor | Scaling Relationship | Practical Meaning |
|---|---|---|
| Parameters ($N$) | $L(N) \propto N^{-0.076}$ | 10× more params → ~16% loss reduction |
| Data ($D$) | $L(D) \propto D^{-0.095}$ | 10× more data → ~20% loss reduction |
| Compute ($C$) | $L(C) \propto C^{-0.050}$ | 10× more compute → ~11% loss reduction |

This means scaling works. If you want a better model, you can estimate how much more compute, data, and parameters you need.
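
A quick sketch of the arithmetic these exponents imply, asking how far the loss falls when one factor is scaled 10× (a rough extrapolation, not a guarantee):

```python
# Loss ratio when one factor is scaled 10x, holding the power law fixed.
exponents = {"parameters": -0.076, "data": -0.095, "compute": -0.050}

for factor, alpha in exponents.items():
    ratio = 10 ** alpha  # L(10x) / L(x)
    print(f"10x {factor}: loss falls to {ratio:.2f}x ({1 - ratio:.0%} reduction)")
# 10x parameters: loss falls to 0.84x (16% reduction)
# 10x data: loss falls to 0.80x (20% reduction)
# 10x compute: loss falls to 0.89x (11% reduction)
```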


Emergent Abilities

At certain scales, new capabilities appear suddenly — they’re nearly absent in smaller models:

| Ability | Appears Around | Example |
|---|---|---|
| Few-shot learning | ~1B params | Learning from examples in the prompt |
| Arithmetic | ~10B params | “What is 47 × 83?” |
| Code generation | ~50B params | Writing working Python functions |
| Chain-of-thought | ~100B params | “Let me think step by step…” |
| Self-correction | ~100B+ params | “Wait, that’s wrong. Let me reconsider…” |

In-Context Learning: GPT’s Superpower

GPT-3’s most surprising ability: learning from examples in the prompt without updating weights.

| Prompting Style | Example |
|---|---|
| Zero-shot | “Translate to French: Hello world” → “Bonjour le monde” |
| One-shot | “Good morning → Bonjour”, then “Hello world →” → “Bonjour le monde” |
| Few-shot | Multiple examples, then a query → model identifies the pattern |

The model doesn’t update its parameters — it uses the attention mechanism to identify the pattern in the prompt and apply it. This is a form of meta-learning that emerges at scale.
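
Mechanically, a few-shot prompt is nothing but concatenated text that the model conditions on. A sketch with invented translation pairs:

```python
# A few-shot prompt is plain text; the "learning" is entirely in-context.
# The example pairs below are invented for illustration.
examples = [
    ("Good morning", "Bonjour"),
    ("Thank you", "Merci"),
    ("Goodbye", "Au revoir"),
]
query = "Hello world"

prompt = "\n".join(f"{en} -> {fr}" for en, fr in examples)
prompt += f"\n{query} ->"

print(prompt)
# Good morning -> Bonjour
# Thank you -> Merci
# Goodbye -> Au revoir
# Hello world ->
```

The model is then asked to continue this string; its weights are identical before and after.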

🤔 Quick Check
During in-context learning (few-shot prompting), what happens to the model's weights?

From GPT to ChatGPT: RLHF

Pre-trained GPT is a great text predictor but not a great assistant. Reinforcement Learning from Human Feedback (RLHF) bridges this gap:

| Step | Process | Result |
|---|---|---|
| 1 | Pre-train on text (next-token prediction) | Raw language model |
| 2 | Supervised fine-tuning (SFT) on conversations | Follows instructions |
| 3 | Train a reward model from human preferences | Knows what “good” looks like |
| 4 | Optimize with PPO against the reward model (DPO instead learns directly from the preference pairs) | Helpful, harmless, honest |

This process transforms a raw text predictor into ChatGPT, Claude, or other chat models.
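
Step 3 is the most distinctive piece: the reward model is typically trained with a pairwise (Bradley–Terry) preference loss that pushes the reward of the human-preferred response above the rejected one. A minimal sketch with scalar stand-in rewards:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when the ranking is right."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, 0.5), 3))  # 0.201 — ranking already correct
print(round(preference_loss(0.5, 2.0), 3))  # 1.701 — reward model is penalized
```

The RL stage then fine-tunes the policy to maximize this learned reward, usually with a KL penalty that keeps it close to the SFT model.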

The Open-Source Revolution (2023+)

| Model | Organization | Params | Notable Feature |
|---|---|---|---|
| LLaMA | Meta | 7–65B | RoPE, RMSNorm, SwiGLU |
| LLaMA 2 | Meta | 7–70B | GQA, longer context |
| Mistral 7B | Mistral AI | 7B | Sliding window attention |
| Mixtral 8x7B | Mistral AI | 47B (~13B active) | Mixture of Experts |
| Gemma | Google | 2–7B | RMSNorm, GeGLU |
| Qwen 2.5 | Alibaba | 0.5–72B | Multilingual, long context |

These models approach or match GPT-3.5 quality while being freely available. The combination of architectural innovations (RoPE, GQA, SwiGLU, MoE) with efficient training recipes has democratized large language models.


Summary

| Concept | Details |
|---|---|
| Architecture | Decoder-only transformer with causal masking |
| Training | Next-token prediction on massive text |
| Key innovation | Scale + simple objective = general intelligence |
| Sampling | Temperature, top-k, top-p control generation diversity |
| Emergent abilities | Few-shot learning, reasoning, code generation |
| RLHF | Aligns raw LLM with human preferences |

Next: LLM Agents

LLMs can do more than generate text — they can use tools, plan, and take actions. Next: LLM Agents →