GPT — Generative Pre-trained Transformer

The Big Idea: Just Predict the Next Word

GPT’s training objective is the simplest possible:

Given all previous tokens, predict the next one.

P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})

That’s it. No masked-token prediction, no fill-in-the-blank, no special objectives. Just relentlessly predict the next token, billions of times, on massive text corpora.
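
The chain-rule factorization is easy to make concrete. A toy sketch (the per-step probabilities are invented): multiply the per-step conditionals to get the joint probability, and note why real implementations sum log-probabilities instead.

```python
import math

# Chain rule for a sequence: joint probability = product of per-step
# next-token probabilities. The numbers below are made up for illustration.
next_token_probs = [0.5, 0.3, 0.2, 0.4]  # P(x_i | x_1, ..., x_{i-1}) at each step

joint = math.prod(next_token_probs)

# Real implementations sum log-probabilities: a product of thousands of
# values < 1 underflows to zero in floating point.
log_joint = sum(math.log(p) for p in next_token_probs)

print(f"{joint:.3f}")                # 0.012
print(f"{math.exp(log_joint):.3f}")  # 0.012 — same value, computed stably
```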

The remarkable discovery: this simple objective, at sufficient scale, produces general intelligence.


Architecture

GPT is the decoder stack from the Transformer — with causal (masked) self-attention:

| Stage | What Happens |
|---|---|
| Input | Token sequence (e.g., “The cat sat on the”) |
| Embedding | Token + position embeddings |
| Decoder layers | N layers of causal self-attention + FFN |
| Output projection | Hidden state → vocabulary probabilities |
| Prediction | Sample next token from the distribution |
| Model | Year | Layers | d_model | Params | Context |
|---|---|---|---|---|---|
| GPT-1 | 2018 | 12 | 768 | 117M | 512 |
| GPT-2 | 2019 | 48 | 1600 | 1.5B | 1024 |
| GPT-3 | 2020 | 96 | 12288 | 175B | 2048 |
| GPT-4 | 2023 | ? | ? | ~1T? | 128K |
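
The causal masking at the heart of every decoder layer can be grounded in a few lines. A minimal NumPy sketch using illustrative shapes and random values, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                    # sequence length, head dimension
Q = rng.normal(size=(T, d))    # queries, keys, and values would normally
K = rng.normal(size=(T, d))    # come from learned projections of the
V = rng.normal(size=(T, d))    # token + position embeddings

scores = Q @ K.T / np.sqrt(d)  # (T, T) attention logits
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf         # position i cannot attend to positions > i

# Row-wise softmax; masked positions get exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V              # (T, d) contextualized representations

assert np.allclose(np.triu(weights, k=1), 0.0)  # strictly causal
```

The mask is what makes training efficient: every position’s next-token prediction is computed in parallel, yet no position can see its own future.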

Interactive: Generation in Action

Experiment with autoregressive generation. Adjust temperature and top-k to see how sampling strategies change the output:

Autoregressive Generation

Given the prompt “The cat”, the model’s next-token distribution might look like this:

| Token | Probability |
|---|---|
| “was” | 30.3% |
| “sat” | 21.1% |
| “ran” | 13.5% |
| “is” | 9.8% |
| “and” | 5.2% |
| “small” | 4.8% |
| “on” | 4.3% |
| “your” | 4.0% |
| “hat” | 3.5% |
| “dog” | 3.4% |
| “the” | 0.0% |
| “a” | 0.0% |

Try: Set temperature to 0.1 (near-deterministic) vs 2.0 (highly random). Set top-k to 1 (greedy) vs 30 (diverse).

Understanding the Controls

Temperature scales the logits before softmax:

P(x_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}

| Temperature | Distribution | Effect |
|---|---|---|
| T = 0.1 | Very peaked | Almost always picks the top token (near-deterministic) |
| T = 1.0 | Standard | Samples from the learned distribution as-is |
| T = 2.0 | Flattened | More diverse, surprising choices |

Top-k limits sampling to the top k most likely tokens, filtering out low-probability noise.
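
Both controls fit in a few lines. A sketch with an invented five-token vocabulary and made-up logits (not values from any real model):

```python
import numpy as np

tokens = ["was", "sat", "ran", "is", "and"]   # toy vocabulary
logits = np.array([2.0, 1.6, 1.2, 0.9, 0.3])  # invented scores

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng()
    z = logits / temperature               # temperature scales the logits
    if top_k is not None:                  # keep only the k largest logits
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)
    p = np.exp(z - z.max())                # softmax over surviving logits
    p /= p.sum()
    return rng.choice(len(tokens), p=p), p

# Low temperature with top_k=1 collapses to greedy decoding:
idx, p = sample_next(logits, temperature=0.1, top_k=1)
print(tokens[idx], p.round(2))  # was [1. 0. 0. 0. 0.]
```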

🤔 Quick Check
If you set temperature to 0.1 and top-k to 1, what type of generation do you get?

Scaling Laws: More Compute = Better Model

The scaling-law work behind GPT-3 (Kaplan et al., 2020) revealed that model performance follows predictable power laws:

| Factor | Scaling Relationship | Practical Meaning |
|---|---|---|
| Parameters ($N$) | $L(N) \propto N^{-0.076}$ | 10× more params → ~16% loss reduction |
| Data ($D$) | $L(D) \propto D^{-0.095}$ | 10× more data → ~20% loss reduction |
| Compute ($C$) | $L(C) \propto C^{-0.050}$ | 10× more compute → ~11% loss reduction |

This means scaling works. If you want a better model, you can estimate how much more compute, data, and parameters you need.
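
A quick sketch of the arithmetic these exponents imply, asking how far the loss falls when one factor is scaled 10× (a rough extrapolation, not a guarantee):

```python
# Loss ratio when one factor is scaled 10x, holding the power law fixed.
exponents = {"parameters": -0.076, "data": -0.095, "compute": -0.050}

for factor, alpha in exponents.items():
    ratio = 10 ** alpha  # L(10x) / L(x)
    print(f"10x {factor}: loss falls to {ratio:.2f}x ({1 - ratio:.0%} reduction)")
# 10x parameters: loss falls to 0.84x (16% reduction)
# 10x data: loss falls to 0.80x (20% reduction)
# 10x compute: loss falls to 0.89x (11% reduction)
```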


Emergent Abilities

At certain scales, new capabilities appear suddenly — they’re nearly absent in smaller models:

| Ability | Appears Around | Example |
|---|---|---|
| Few-shot learning | ~1B params | Learning from examples in the prompt |
| Arithmetic | ~10B params | “What is 47 × 83?” |
| Code generation | ~50B params | Writing working Python functions |
| Chain-of-thought | ~100B params | “Let me think step by step…” |
| Self-correction | ~100B+ params | “Wait, that’s wrong. Let me reconsider…” |

In-Context Learning: GPT’s Superpower

GPT-3’s most surprising ability: learning from examples in the prompt without updating weights.

| Prompting Style | Example |
|---|---|
| Zero-shot | “Translate to French: Hello world” → “Bonjour le monde” |
| One-shot | “Good morning → Bonjour”, then “Hello world →” → “Bonjour le monde” |
| Few-shot | Multiple examples, then a query → model identifies the pattern |

The model doesn’t update its parameters — it uses the attention mechanism to identify the pattern in the prompt and apply it. This is a form of meta-learning that emerges at scale.
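
Mechanically, a few-shot prompt is nothing but concatenated text that the model conditions on. A sketch with invented translation pairs:

```python
# A few-shot prompt is plain text; the "learning" is entirely in-context.
# The example pairs below are invented for illustration.
examples = [
    ("Good morning", "Bonjour"),
    ("Thank you", "Merci"),
    ("Goodbye", "Au revoir"),
]
query = "Hello world"

prompt = "\n".join(f"{en} -> {fr}" for en, fr in examples)
prompt += f"\n{query} ->"

print(prompt)
# Good morning -> Bonjour
# Thank you -> Merci
# Goodbye -> Au revoir
# Hello world ->
```

The model is then asked to continue this string; its weights are identical before and after.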

🤔 Quick Check
During in-context learning (few-shot prompting), what happens to the model's weights?

From GPT to ChatGPT: RLHF

Pre-trained GPT is a great text predictor but not a great assistant. Reinforcement Learning from Human Feedback (RLHF) bridges this gap:

| Step | Process | Result |
|---|---|---|
| 1 | Pre-train on text (next-token prediction) | Raw language model |
| 2 | Supervised fine-tuning (SFT) on conversations | Follows instructions |
| 3 | Train a reward model from human preferences | Knows what “good” looks like |
| 4 | Optimize with PPO against the reward model (DPO instead learns directly from the preference pairs) | Helpful, harmless, honest |

This process transforms a raw text predictor into ChatGPT, Claude, or other chat models.
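
Step 3 is the most distinctive piece: the reward model is typically trained with a pairwise (Bradley–Terry) preference loss that pushes the reward of the human-preferred response above the rejected one. A minimal sketch with scalar stand-in rewards:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when the ranking is right."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, 0.5), 3))  # 0.201 — ranking already correct
print(round(preference_loss(0.5, 2.0), 3))  # 1.701 — reward model is penalized
```

The RL stage then fine-tunes the policy to maximize this learned reward, usually with a KL penalty that keeps it close to the SFT model.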

The Open-Source Revolution (2023+)

| Model | Organization | Params | Notable Feature |
|---|---|---|---|
| LLaMA | Meta | 7–65B | RoPE, RMSNorm, SwiGLU |
| LLaMA 2 | Meta | 7–70B | GQA, longer context |
| Mistral 7B | Mistral AI | 7B | Sliding window attention |
| Mixtral 8x7B | Mistral AI | 47B (~13B active) | Mixture of Experts |
| Gemma | Google | 2–7B | RMSNorm, GeGLU |
| Qwen 2.5 | Alibaba | 0.5–72B | Multilingual, long context |

These models approach or match GPT-3.5 quality while being freely available. The combination of architectural innovations (RoPE, GQA, SwiGLU, MoE) with efficient training recipes has democratized large language models.


Summary

| Concept | Details |
|---|---|
| Architecture | Decoder-only transformer with causal masking |
| Training | Next-token prediction on massive text |
| Key innovation | Scale + simple objective = general intelligence |
| Sampling | Temperature, top-k, top-p control generation diversity |
| Emergent abilities | Few-shot learning, reasoning, code generation |
| RLHF | Aligns raw LLM with human preferences |

Next: LLM Agents

LLMs can do more than generate text — they can use tools, plan, and take actions. Next: LLM Agents →