Inference
The process of running a trained AI model to generate outputs — as opposed to training. This is what you're paying for every time you call an AI API.
Inference is running a trained model on an input to produce an output. Every API call, every classification, every image generation request is inference. Training is a largely upfront cost, repeated only when a model is rebuilt or updated; inference happens continuously, at scale, on your bill.
Unlike training, inference cost is a recurring line item that scales with usage. Four variables drive it: model size, context window consumption, output length, and request volume. Their effects multiply: the cost of one request grows with model size and with the number of tokens read and written, and the total is that figure times request volume. A large context, verbose output, and high traffic together can dominate a product's operating costs before anyone notices.
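A rough way to see how these drivers combine is to write the arithmetic down. The sketch below assumes per-token pricing with separate input and output rates, which is a common API billing scheme but not the only one. The `Pricing` class, the function names, and the dollar figures in the usage note are hypothetical placeholders, not any provider's actual rates. Model size enters only indirectly, through the per-token price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in dollars per million tokens (assumed billing scheme)."""
    input_per_million: float
    output_per_million: float


def cost_per_request(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: tokens read plus tokens written, each at its own rate."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def monthly_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    """Per-request cost multiplied by request volume over the billing period."""
    return cost_per_request(pricing, input_tokens, output_tokens) * requests_per_day * days
```

With placeholder rates of $3 per million input tokens and $15 per million output tokens, a request that reads 8,000 tokens and writes 500 costs about $0.03. At 100,000 requests a day, that is roughly $94,500 a month. Halving the context or the traffic moves the total far more than the per-request figure suggests.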
Model inference spend deserves the same rigor as cloud infrastructure costs: model the unit economics early, not after you've shipped. If costs climb, fine-tuning a smaller model on your specific task can often reduce spend with little or no loss of quality on that task, though the result should be measured rather than assumed. Paying frontier model prices for a task a smaller model handles just as well is an expensive habit.
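Whether moving to a smaller model pays off depends on how quickly the per-request savings recover the one-time cost of fine-tuning and evaluating it. The sketch below is a simple break-even calculation under that assumption. The function name and all inputs are hypothetical, and it ignores quality differences, which need to be checked separately.

```python
import math
from typing import Optional


def break_even_requests(upfront_cost: float,
                        large_cost_per_request: float,
                        small_cost_per_request: float) -> Optional[int]:
    """Number of requests after which the smaller model's savings cover the upfront cost.

    Returns None when the smaller model is not cheaper per request,
    in which case the upfront cost is never recovered.
    """
    savings = large_cost_per_request - small_cost_per_request
    if savings <= 0:
        return None
    return math.ceil(upfront_cost / savings)
```

Dividing the result by daily request volume gives the payback period in days, which is usually the number a budget discussion needs.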