Post-Training

The suite of techniques applied after a model's initial training to make it useful, safe, and aligned with human preferences — including RLHF, instruction tuning, and safety training.

Post-training is everything that happens after pre-training: instruction tuning, RLHF, safety training, and alignment methods like constitutional AI and DPO. It transforms a raw next-token predictor into something deployable — and something people actually want to use.

This is where frontier labs differentiate. Anthropic, OpenAI, and Google build on broadly similar Transformer architectures and comparable data scale, so their post-training choices account for much of the difference in production behavior. The evidence is stark: the original InstructGPT paper (Ouyang et al., 2022) found that human evaluators preferred outputs from a 1.3B-parameter model trained with RLHF over those from the raw 175B GPT-3, a model more than 100 times larger. Parameters are table stakes; post-training quality is the product.

It also explains a failure mode teams hit regularly: a model update changes how your application behaves without any change in model size or architecture. In most cases you're seeing the downstream effect of a post-training decision you had no visibility into. If your application is sensitive to tone, refusal behavior, or instruction-following style, pin model versions explicitly and treat updates as you would a dependency upgrade: test before you ship.
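
One way to make "test before you ship" concrete is a small behavioral regression suite that runs against a candidate model version before you move off the pinned one. The sketch below is a minimal illustration, not a reference implementation. The model identifiers, the `BehaviorCheck` type, the `run_checks` helper, and the `fake_generate` stub are all hypothetical names introduced here. In a real setup, `generate` would wrap whatever client your provider gives you, and the checks would encode the tone, refusal, and formatting behavior your application actually depends on.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical version identifiers. Substitute the dated or versioned
# model IDs your provider exposes; avoid floating "latest" aliases.
PINNED_MODEL = "example-model-2024-01-01"
CANDIDATE_MODEL = "example-model-2024-06-01"


@dataclass
class BehaviorCheck:
    """One behavioral expectation: a prompt plus a predicate on the output."""
    name: str
    prompt: str
    passes: Callable[[str], bool]


def run_checks(
    generate: Callable[[str, str], str],
    model_id: str,
    checks: List[BehaviorCheck],
) -> List[str]:
    """Return the names of the checks that fail for the given model version."""
    failures = []
    for check in checks:
        output = generate(model_id, check.prompt)
        if not check.passes(output):
            failures.append(check.name)
    return failures


def fake_generate(model_id: str, prompt: str) -> str:
    """Stand-in for a real API call (an assumption for this sketch).

    The candidate version simulates a post-training change that makes
    the model refuse a request the pinned version handled.
    """
    if model_id == CANDIDATE_MODEL:
        return "I'm sorry, I can't help with that."
    return "Summary: revenue grew 4% quarter over quarter."


CHECKS = [
    BehaviorCheck(
        name="no_refusal",
        prompt="Summarize this report: revenue grew 4%.",
        passes=lambda out: "can't help" not in out.lower(),
    ),
    BehaviorCheck(
        name="summary_prefix",
        prompt="Summarize this report: revenue grew 4%.",
        passes=lambda out: out.startswith("Summary:"),
    ),
]

pinned_failures = run_checks(fake_generate, PINNED_MODEL, CHECKS)
candidate_failures = run_checks(fake_generate, CANDIDATE_MODEL, CHECKS)

# Only move to the candidate version if it introduces no new failures.
safe_to_upgrade = not (set(candidate_failures) - set(pinned_failures))
```

Keeping `generate` as a plain callable means the same suite runs against a stub in unit tests and against the live API in a pre-release job. String predicates like these are deliberately crude; they catch gross shifts in refusal or formatting behavior, not subtle changes in tone.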
