Inference Cost Curve

The rapid decline in the per-token cost of running AI models, roughly 10x per year since 2023, driven by hardware improvements, distillation, quantization, and competition.

The inference cost curve is the rapid, sustained decline in the per-token cost of running AI models. Since 2023, frontier-class inference pricing has dropped roughly 10x per year, a pace that makes last year's cost estimates unreliable for planning next year's product.
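To make the arithmetic concrete, here is a minimal sketch of what a constant 10x-per-year decline implies for a projected price. The function name `projected_cost` and the assumption of a smooth exponential decline are illustrative only; real pricing tends to move in steps, and the 10x figure is the rough estimate quoted above, not a guarantee.

```python
def projected_cost(cost_today: float, months_ahead: float,
                   annual_decline: float = 10.0) -> float:
    """Project a per-token (or per-million-token) price forward in time.

    Assumes a smooth exponential decline by a factor of `annual_decline`
    every 12 months. This is a simplification for planning purposes.
    """
    if annual_decline <= 0:
        raise ValueError("annual_decline must be positive")
    return cost_today / (annual_decline ** (months_ahead / 12.0))


# A hypothetical price of $10 per million tokens today:
for months in (6, 12, 24):
    print(months, "months:", round(projected_cost(10.0, months), 4))
```

Under these assumptions, a price that looks prohibitive today falls to about a third of its level within six months and to a tenth within a year, which is why estimates age so quickly.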

Several forces compound the drop at once: faster hardware (Nvidia's Blackwell generation over Hopper), distillation producing smaller models that match larger ones on specific tasks, quantization reducing memory footprint, and fierce competition among model providers compressing margins. None of these forces shows signs of slowing.

This matters for architecture decisions. Features that were uneconomic to build in 2023 (deep reasoning chains, large context windows, per-request embeddings at scale) are routine now and will be cheaper still in 12 months. The right response is to design systems where reasoning depth, context length, and model quality can be dialed up as prices fall, rather than locking in constraints based on current pricing. Treat inference cost as a moving input, not a fixed one.
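One way to treat cost as a moving input is to derive token limits from a per-request budget and the current price, instead of hard-coding them. The sketch below is a hypothetical illustration: `InferencePlan`, `plan_for_budget`, the default caps, and the 25% reasoning share are all assumptions chosen for the example, not part of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferencePlan:
    context_tokens: int    # tokens of context to send with the request
    reasoning_tokens: int  # tokens allowed for intermediate reasoning


def plan_for_budget(budget_usd: float,
                    price_per_million_tokens_usd: float,
                    max_context: int = 200_000,
                    max_reasoning: int = 32_000,
                    reasoning_share: float = 0.25) -> InferencePlan:
    """Size a request from a fixed budget and the current token price.

    As the price falls, the same budget buys more tokens, so context and
    reasoning depth grow automatically up to the configured caps.
    """
    if price_per_million_tokens_usd <= 0:
        raise ValueError("price must be positive")
    affordable = int(budget_usd * 1_000_000 / price_per_million_tokens_usd)
    reasoning = min(max_reasoning, int(affordable * reasoning_share))
    context = min(max_context, affordable - reasoning)
    return InferencePlan(context_tokens=context, reasoning_tokens=reasoning)


# Same $1 budget, before and after a 10x price drop (hypothetical prices).
print(plan_for_budget(1.0, 10.0))
print(plan_for_budget(1.0, 1.0))
```

With this shape, a price change becomes a configuration update rather than a redesign: only the price input moves, and the limits follow.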

Scaling laws say models keep improving. The cost curve says they keep getting cheaper. Building as if either trend has plateaued is a bet against a very consistent track record.

Related Terms