Transformer

The neural network architecture behind every major large language model, which processes input in parallel using a mechanism called self-attention rather than reading sequentially.

A transformer is a neural network architecture that processes all input tokens simultaneously rather than one at a time. Its core mechanism, self-attention, lets the model weigh how every token relates to every other token in parallel. This is the architecture behind GPT, Claude, Gemini, and essentially every widely used LLM today. It was introduced in the 2017 paper "Attention Is All You Need," and it quietly changed the economics of machine intelligence.
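To make the self-attention idea concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is illustrative only: the helper names (matmul, softmax, self_attention), the tiny dimensions, and the random weights are all assumptions made for this example, and a real transformer adds multiple heads, masking, residual connections, and feed-forward layers on top.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn a row of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head attention over token vectors x (n_tokens x d_model)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    # One score per (token, token) pair: this is the n x n comparison.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mix of every token's value vector.
    return matmul(weights, v), weights

# Hypothetical toy sizes: 4 tokens, 8-dimensional embeddings.
random.seed(0)
n_tokens, d_model = 4, 8

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(n_tokens, d_model)
w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Note that no row of the result depends on a previous row having been computed first. Every token's output comes from the same few matrix products, which is what lets the work run in parallel.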

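Before transformers, language models such as recurrent networks processed sequences step by step, with each step waiting on the previous one, so adding more GPUs didn't help much: you were bottlenecked by the sequence. Transformers parallelize the workload across the sequence during training, mapping cleanly onto modern GPU hardware. That's why training scales predictably with compute investment and why scaling laws, the empirical finding that model loss falls smoothly as compute, data, and parameters grow, have held up. AI capability is now largely a capital allocation problem: a foundation model's quality tracks the compute, data, and engineering talent behind it, though with diminishing returns rather than in strict proportion.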

The transformer architecture also explains the context window: self-attention lets the model see all input at once, but its cost grows quadratically with input length, because every token is compared with every other token. Doubling the context doesn't double the attention compute; it quadruples it. That's why larger windows are both valuable and expensive, and why the industry has spent years trying to make long-context attention cheaper without breaking what makes it work.
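The quadratic growth is easy to check with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: the function name attention_flops and the head size of 64 are hypothetical, and it counts only the two large matrix products in a single attention head (scores and the weighted sum of values), ignoring projections, softmax, and the feed-forward layers, which grow linearly with length.

```python
def attention_flops(n_tokens, d_head=64):
    """Rough FLOP count for one attention head (illustrative estimate)."""
    # Scores: n x n dot products, each about 2 * d_head operations.
    score_flops = 2 * n_tokens * n_tokens * d_head
    # Weighted sum of values: another n x n x d_head multiply-accumulate.
    mix_flops = 2 * n_tokens * n_tokens * d_head
    return score_flops + mix_flops

base = attention_flops(1024)
for n in (1024, 2048, 4096):
    ratio = attention_flops(n) / base
    print(f"{n} tokens: {ratio:.0f}x the attention compute of 1024 tokens")
```

Under these assumptions each doubling of the context multiplies the attention cost by four, so 4096 tokens costs 16 times what 1024 tokens does.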

You don't need to understand transformers to build on top of them. But understanding why they scale the way they do helps you reason about model costs, capability curves, and why every frontier lab is building on the same architecture with slightly different training recipes.

Related Terms