Glossary

A

Agentic Workflow

AI Engineering

A system design where an AI model operates in a loop — planning, executing actions, observing results, and iterating — rather than generating a single response.

An agentic workflow is a system design where a language model runs in a loop: plan, act via tools, observe results, revise, repeat — until the task is done or the system escalates to a human. It's the difference between a model that answers questions and one that gets things done.

Loops compound both capability and risk. The same property that lets a model work through a multi-step process also means errors accumulate, costs multiply with each iteration, and failure modes multiply too. A single-shot prompt that hallucinates produces one bad output. An agent that hallucinates can take a chain of irreversible actions before anyone notices.

This is why evals and guardrails stop being nice-to-haves the moment you go agentic. You need to define success criteria before you build, not after you've discovered something went wrong in production.

In practice, agentic workflows show up wherever a task can't be completed in one shot: multi-step document processing, anything requiring conditional branching based on intermediate results, and processes that need tool calls plus retry logic. Technically, they're usually built on function calling or MCP, wrapped in an explicit loop with logging and escalation paths.

The hype around agents often skips past the engineering discipline they require. An agent with broad tool access and weak guardrails isn't powerful — it's a liability. Scope the tools tightly, log every action, and build in human checkpoints for anything consequential.

Further reading:

The Shift from Models to Compound AI Systems (Berkeley AI Research) — Makes the case that the future of AI is systems, not models, with agentic workflows as a core pattern.
Agentic Design Patterns Part 1 (Andrew Ng / The Batch) — Introduces reflection, tool use, planning, and multi-agent collaboration as composable agentic patterns.

AI Agent

Industry

An AI system that can independently plan, use tools, and take multi-step actions to accomplish a goal — moving beyond single-response chatbots to autonomous task execution.

An AI agent is a system that pursues a goal across multiple steps: it plans, uses tools, checks results, and keeps going until it finishes or decides to escalate. That's the meaningful distinction from a chatbot — a chatbot answers; an agent acts.

In practice, "agent" usually means a loop around an LLM that can call tools via function calling (or a protocol like MCP), observe the outcome, and decide what to do next. A prompt chain with hardcoded steps is useful, but it's not the same thing. An agent reasons about its next action at each step.

Agents change your risk profile significantly. A bad answer from a chatbot is annoying. A bad action from an agent can move money, delete records, or send emails to the wrong people. The lethal trifecta — cost, latency, and reliability — compounds across every step in a long task. A system that's 95% reliable per step is 60% reliable across ten steps. That math matters when you're deciding what to hand off to autonomous execution.

This means guardrails aren't optional once you introduce tool use. It also means the scope of what you hand to an agent should scale with how confident you are in your evals and your fallback handling. Start narrow, instrument everything, and expand from there.

Further reading:

Andrew Ng on AI Agentic Workflows (March 2024) — Introduced the four agentic design patterns: reflection, tool use, planning, and multi-agent collaboration.
Building Effective Agents (Anthropic) — Research on composable patterns for building reliable agent systems.
What are AI Agents? (Google Cloud) — Overview covering autonomous systems that use LLMs for reasoning and action.

AI-Native

Industry

Software or organizations designed from the ground up with AI as a core capability rather than an add-on — where AI shapes the architecture, UX, and business model from day one.

AI-native describes software, products, or organizations built from the start assuming machine intelligence exists — not bolted onto existing workflows, but woven into the architecture, user experience, and business model from day one. The conceptual leap mirrors "cloud-native" a decade ago: not running old software on AWS, but rethinking what's possible when elastic compute is a given.

The distinction that actually matters: a product that uses AI versus a product that couldn't exist without it. A CRM that adds a "summarize this account" button is AI-enhanced. A system that autonomously researches prospects, drafts personalized outreach, monitors engagement signals, and surfaces the three deals worth focusing on today — that's AI-native. The AI isn't a feature; it's the product. The architecture assumes agentic workflows, the UX is built around model outputs, and the data pipeline feeds foundation models from the start.

Be skeptical with this label. "AI-native" is becoming what "cloud-native" became in the worst way — vendors slapping it on existing products that got a ChatGPT wrapper last quarter. The test is simple: could this product exist without the AI? If yes, and the experience would be roughly the same, it's AI-enhanced at best. That's not a failure, but it's a different build-vs-buy calculus and a weaker competitive moat.

The real question for leadership isn't whether your next product should be AI-native. It's whether your competitors' will be. When a machine-intelligence-first entrant rebuilds your category from scratch — no legacy UI, no retrofitted workflows — incumbents who bolted AI onto existing stacks tend to learn that "AI-enhanced" wasn't a defensible position.

Further reading:

Generative AI's Act o1: The Reasoning Era Begins (Sequoia) — Sequoia's analysis of how reasoning-capable models shift what's possible for AI-native products, moving beyond pattern matching to multi-step problem solving.
Generative AI: A Creative New World (Sequoia) — The foundational Sequoia piece mapping generative AI's impact across industries, with the application layer as the battleground for AI-native entrants.

Alignment

Industry

The challenge of ensuring AI systems behave in accordance with human values and intentions — encompassing safety research, behavioral constraints, and the question of who decides what \"aligned\" means.

Alignment is the problem of getting AI systems to do what humans actually want — not what they literally asked for, not what's statistically plausible, and not what technically satisfies the objective while violating the spirit entirely. It's a research field, a safety discipline, and increasingly a regulatory flashpoint.

If you're deploying machine intelligence in your business, alignment is already your problem whether you call it that or not. Your customer-facing AI recommends a competitor's product because "helpful" wasn't scoped correctly. Your internal agent finds a creative loophole in your business rules and approves transactions it shouldn't. Your chatbot hallucinates medical advice because nobody defined what "out of scope" means for health questions. These are alignment failures — the system optimized for something adjacent to what you intended, and nobody caught it before it shipped.

RLHF is the primary technique vendors use to steer models toward aligned behavior during training. It's a blunt instrument. It cannot anticipate every context your application creates. Production systems need application-layer guardrails — runtime constraints that enforce your specific business rules, liability boundaries, and brand voice regardless of what the underlying model wants to do.

The academic framing focuses on existential risk: as models scale, how do you keep them controllable? That's a real question. But the political dimension is just as real. Alignment encodes values, and values are contested. Who decides what "helpful" means? What counts as "harmful"? These aren't engineering questions — they're governance questions driving AI regulation worldwide. The answer your vendor chose is already embedded in your product.

Further reading:

Concrete Problems in AI Safety (Amodei et al., 2016) — foundational paper framing five practical alignment problems including reward hacking and safe exploration.
Core Views on AI Safety (Anthropic, 2023) — Anthropic's public position on why alignment research is central to their strategy and how they approach it.

C

Compute Economics

Generative AI

The study of how GPU costs, training budgets, and inference pricing shape AI strategy — the financial physics of who can build what and at what price point.

Compute economics is the study of how GPU costs, training budgets, and inference pricing shape machine intelligence strategy. It's the financial physics underneath every AI product decision — determining who can build what, at what price point, and whether any of it makes money.

The numbers set the stakes. Training GPT-4 cost an estimated $100M+. An H100 cluster capable of frontier training starts around $500M with 18-month wait times. Running inference at scale — millions of API calls per day — can run $2–5M per month. These aren't engineering footnotes. They're the constraints that determine whether your AI strategy is viable or aspirational.

Three forces define the current landscape. Training costs are concentrating power: only a handful of organizations can afford frontier model development, which is why the foundation model oligopoly exists. Inference costs are falling roughly 10x per year, which keeps expanding the universe of profitable AI features. And GPU supply remains structurally constrained, making compute access itself a competitive moat.

The decision framework this creates is straightforward. If your use case needs frontier intelligence, buy API access — don't try to train your own model. If you're running high-volume inference, watch the cost curve and evaluate smaller distilled models or edge inference to protect margin. Model your AI spend as a variable cost with a deflationary trend, not a fixed line item.

The companies getting this right treat compute economics the way sharp operators treated cloud economics a decade ago. The ones who built a discipline around it captured margin. Everyone else just paid the bill.

Further reading:

AI and Compute (OpenAI, 2018) — Original analysis showing 300,000x growth in compute since 2012.
Epoch AI Compute Trends — Authoritative tracking of ML compute trends.
AI Server Cost Analysis (SemiAnalysis, 2023) — Deep dive into AI infrastructure economics.

Context Window

Generative AI

The maximum amount of text a language model can consider at once — both your input and its output must fit within this limit.

A context window is the total amount of text an LLM can hold in memory at once. Your prompt, any retrieved documents, conversation history, and the model's output all have to fit inside the same window — measured in tokens. Exceed it and content gets cut off, or your request errors entirely.

This is a hard product constraint, not a soft limitation to engineer around later. If the data you need fits in the window, you can often ship a straightforward prompt and be done. If it doesn't, you're building retrieval (RAG), chunking logic, summarization pipelines, and caching — each one adding latency, cost, and new failure modes. The engineering surface area grows fast once you're stitching context together manually.

Context windows have expanded dramatically. Early models topped out at 4K tokens. Current frontier models support 128K to 1M tokens. That sounds like the problem is solved. It isn't. Larger windows are expensive — transformer attention cost grows quadratically with length — and models don't attend equally well to everything in a long context. Important information buried in the middle of a 200K-token prompt often gets underweighted compared to content at the start or end. Longer doesn't mean better-read.

Match your context strategy to your actual data sizes and retrieval patterns before defaulting to the largest available window. Bigger windows cost more per call. Sometimes chunking and retrieval is cheaper and more reliable than stuffing everything in.

Further reading:

Anthropic Claude Models Documentation — context window specs for Claude models
Google Gemini 1.5 Announcement — the 1M token context window introduction

Context Window Management

AI Engineering

The engineering discipline of deciding what information to feed into an AI model's limited context window to maximize output quality within token limits.

Context window management is the discipline of deciding what goes into a model's context window — and what gets left out. Every token is a choice. The right context produces precise, grounded output. The wrong context produces vague, distracted output that misses the point. This is the unglamorous plumbing that separates AI demos from AI products.

Bigger windows don't solve the problem. Research from Stanford and Berkeley ("Lost in the Middle," 2023) showed that models perform worst on information placed in the middle of long contexts — they attend to the start and end, losing signal in between. Throwing your entire knowledge base into a 200K-token window isn't a strategy. It's a hope.

The practical techniques aren't exotic. Chunking strategies control how documents get split before retrieval. Embedding-based relevance scoring ensures only the most pertinent chunks make it into the prompt. RAG pipelines retrieve on demand rather than preloading everything. Priority hierarchies decide what gets dropped when space runs out. This is standard systems engineering applied to a new constraint.

What's underappreciated is the leverage here. Two teams using the identical model can get wildly different results based purely on how carefully they curate context. Get it right and a mid-tier model outperforms a frontier model on sloppy context. Get it wrong and no amount of model spend fixes it.

Further reading:

Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — Key research showing models degrade when relevant information is buried in long contexts
Anthropic Prompt Engineering: Long Context Tips — Practical guidance on structuring long-context prompts for Claude

Copilot Pattern

Industry

A product design where AI assists a human professional in real-time rather than replacing them — the dominant go-to-market strategy for enterprise AI that preserves judgment while multiplying throughput.

The copilot pattern is a product design where machine intelligence works alongside a human professional in real-time — suggesting, drafting, accelerating — but never acting without approval. It's the dominant go-to-market motion in enterprise AI for one reason: it threads the needle between value and risk. Telling a VP of Engineering "this replaces your developers" triggers every alarm in the building. Telling them "this makes your developers 2x faster" gets a purchase order.

GitHub Copilot is the canonical example and the reason the word stuck. GitHub's research found developers completed tasks up to 55% faster and reported higher satisfaction and flow state. Microsoft then applied the brand to everything — 365 Copilot, Dynamics Copilot, Security Copilot — because the framing works. "Copilot" implies the human is still flying the plane. That's not just marketing. It's a deliberate architectural choice that sidesteps the two hardest problems in enterprise AI deployment: reliability and change management. If the model suggests something wrong, the human catches it. If the workflow changes, the human adapts.

The strategic question is knowing when the copilot pattern is the right architecture versus when it's a crutch. If a human approves 99% of suggestions without changes, you've built an expensive rubber stamp — human-in-the-loop theater, not useful oversight. Some workflows should graduate to a full AI agent that handles the task end-to-end. Others — medical diagnosis, legal review, financial compliance — keep humans in the loop permanently, and that's correct. If your shadow AI audit reveals employees already vibe coding with unvetted tools, a sanctioned copilot is the fastest path to governance.

Full autonomy gets the headlines. Copilots get the contracts. For many enterprise workflows, that's the permanent architecture, not a stepping stone.

Further reading:

Research: Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness (GitHub) — GitHub's study showing 55% faster task completion and measurable quality-of-life improvements for developers using Copilot.
Building Effective Agents (Anthropic) — Anthropic's framework for when to use augmentation patterns versus full agentic workflows.

D

Deepfake

Industry

AI-generated synthetic media — video, audio, images — convincingly depicting real people saying or doing things they never did, creating novel risks for fraud, reputation, and trust.

A deepfake is synthetic media — video, audio, or images — generated by machine intelligence to convincingly depict a real person saying or doing something they never did. The underlying technology is typically a diffusion model or generative adversarial network trained on enough examples of a target to fool the human eye and ear. What started as a research curiosity and a Reddit problem is now an active enterprise threat.

The numbers are hard to ignore. Sensity AI's threat intelligence tracks a doubling of deepfake content online roughly every six months. In 2024, an employee at a Hong Kong engineering firm wired $25 million after a video call with what appeared to be the company's CFO and several colleagues — all deepfakes. CEO voice fraud, where attackers clone an executive's voice from earnings calls and phone in wire transfer requests, has hit companies across finance, manufacturing, and tech. These aren't hypothetical scenarios. They're line items in incident reports.

The bigger strategic concern isn't any single fraud — it's trust erosion. When anyone can generate a convincing video of your CEO saying anything, "I never said that" becomes plausible deniability for statements that actually happened. The evidentiary value of recorded media is degrading in real time. Your legal and communications teams need a position on this before they need an incident response plan.

Defense is layered, not silver-bullet. Media provenance standards like C2PA embed cryptographic signatures at capture time, giving recipients a chain of custody. Detection tools flag statistical artifacts that human reviewers miss. Internal protocols — callback verification for financial requests, multi-party authorization for wire transfers — remain the most reliable backstop because they don't depend on winning a technical arms race against generation models. Treat this as a responsible AI and governance problem, not just a cybersecurity one. Your adversaries already have the tools; the question is whether your risk posture has caught up.

Further reading:

Sensity AI Threat Intelligence — Tracks deepfake proliferation trends, attack vectors, and detection capabilities across industries.
C2PA Content Provenance Standard — The cross-industry coalition (Adobe, Microsoft, Intel, others) building cryptographic provenance into media files at the point of creation.

E

Embeddings

AI Engineering

Numerical representations that capture the semantic meaning of text, images, or other data as vectors, enabling machines to measure how similar two pieces of content are.

An embedding is a list of numbers — a vector — that represents the meaning of a piece of content. Text, images, audio: anything you can feed into an embedding model comes out as a fixed-length array of floats, typically 768 to 3,072 dimensions. Semantically similar inputs land near each other in that space. "Refund policy" and "how do I get my money back" are completely different strings, but their embeddings are close neighbors. That's the trick that makes modern search and retrieval actually work.

Every RAG pipeline starts here. Embed the query, find the nearest stored vectors in a vector database, pull the relevant context, then let the model generate a response. Bad embeddings mean bad retrieval; bad retrieval means hallucinations regardless of how capable your model is. The embedding layer is the quality ceiling for your entire retrieval system.

The practical considerations: embedding models are cheap — fractions of a cent per thousand chunks — but you embed your entire corpus upfront and re-embed when content changes. More importantly, your choice of embedding model is a commitment. You can't swap models without re-embedding everything, because different models produce incompatible vector spaces. Pick one that matches your domain, benchmark it against your actual queries, and treat the embedding pipeline as infrastructure from day one.

Further reading:

OpenAI Embeddings Guide — Official documentation on embeddings, including model options and best practices
What are Embeddings? by Vicki Boykis — Deep dive into embeddings from mathematical fundamentals through production use

Emergent Capabilities

Generative AI

Abilities that appear unexpectedly in large models — like reasoning, code generation, or translation — that weren't explicitly trained for and only manifest above certain scale thresholds.

Emergent capabilities are abilities that appear in AI models at scale that nobody explicitly programmed in. A small language model can barely string a sentence together. Make it ten times larger, and it still can't do arithmetic. Make it a hundred times larger, and suddenly it can solve math problems, write working code, and translate between languages it was never specifically trained on. These abilities don't appear gradually — they're effectively absent below a certain model size and then show up abruptly, often surprising even the researchers who built the model.

The landmark paper from Wei et al. at Google documented dozens of these abilities across foundation models, including multi-step reasoning, word unscrambling, and chain-of-thought problem solving. The pattern was consistent: flat performance for smaller models, then a sharp jump once a scale threshold was crossed. There's genuine academic debate about whether this is a true phase transition in capability or a measurement artifact — Schaeffer et al. argued convincingly that some "emergent" behaviors look less dramatic when you change how you score them. The honest answer is probably both: some capabilities genuinely appear abruptly, and some we were just measuring badly.

For strategic planning, here's what matters. Emergent capabilities mean next year's model might suddenly do things no one predicted. That makes rigid three-year AI roadmaps unreliable. The company that builds flexible architecture — modular systems that can swap in new model capabilities as they appear — will outmaneuver the one that bet everything on today's limitations being permanent. Plan for surprise. Budget for iteration. Treat your AI strategy as a series of options, not a fixed blueprint.

Further reading:

Emergent Abilities of Large Language Models (Wei et al., 2022) — the foundational paper documenting dozens of abilities that appear abruptly at scale across multiple model families.
Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al., 2023) — the important counterargument that some emergent abilities may be artifacts of metric choice rather than true phase transitions.

Evals (Evaluations)

AI Engineering

Systematic tests that measure how well an AI system performs on specific tasks — the AI equivalent of a test suite, used to catch regressions and compare models.

Evals are systematic tests that measure how well an AI system performs on specific tasks. Think of them as the test suite for a non-deterministic system: you define inputs, expected behaviors, and scoring criteria, then run your model against them to get a quantified performance score. Swap a model, change a prompt, update your RAG pipeline — evals tell you whether things got better, worse, or broke entirely.

This is the most underinvested area in AI development, and it's not close. Teams will spend weeks on prompt engineering and retrieval tuning, then evaluate the results by running a few examples and eyeballing the output. That's not engineering — that's shipping on vibes. Without evals, you have no idea if Tuesday's prompt change regressed the 15% of edge cases you stopped manually testing. You find out when a customer does.

A solid eval framework has three components: a dataset of representative inputs, a task that runs those inputs through your system, and scorers that grade the outputs. Scorers can be deterministic (exact match, regex, JSON schema validation) or model-graded (using an LLM to judge quality, relevance, or factual accuracy against a rubric). Model-graded evals are particularly valuable for open-ended generation where there's no single right answer — exactly the cases where hallucination risk is highest and guardrails alone aren't sufficient.

The gap between a demo and a product is measurement. Evals are how you close it.

Further reading:

OpenAI Evals GitHub — Open-source framework for evaluating LLMs with a growing library of community-contributed benchmarks.
Braintrust Evaluation Quickstart — Comprehensive guide to building evals using the data/task/scorers pattern.
Your AI Product Needs Evals (Hamel Husain) — Practical deep dive on why unsuccessful AI products consistently fail to build robust evaluation systems.

F

Fine-Tuning

Generative AI

The process of further training a pre-built foundation model on your own data to specialize its behavior for a specific domain or task.

Fine-tuning is additional training on a pre-trained foundation model using your data to change how it behaves — shaping it for a specific domain, task, or output style.

It's also the most over-prescribed technique in the machine intelligence toolkit. Before you invest in data curation, training runs, and ongoing maintenance, prove that RAG plus solid prompt engineering can't get you close enough. In most cases, they can.

Fine-tuning makes sense when the behavior you need can't be reliably stuffed into a context window: a consistent output style across thousands of generations, stable classification that doesn't drift with prompt wording, or domain-specific patterns that retrieval alone doesn't solve. These are real use cases. They're also the minority.

The operational reality is often undersold. Fine-tuning requires clean, representative training data and rigorous evals — otherwise you're optimizing toward noise. It also creates a maintenance commitment: when the base model improves, you have to decide whether to retrain, evaluate the new base against your fine-tuned version, and absorb that cost. It's not a one-time project.

Use it when you've exhausted cheaper options and have the data quality to do it right.

Further reading:

Foundation Model

Generative AI

A large AI model trained on broad data at scale that can be adapted to a wide range of downstream tasks — GPT-4, Claude, Gemini, and Llama are all foundation models.

A foundation model is a large machine intelligence system trained on broad data at scale, then adapted to downstream tasks. GPT-4, Claude, Gemini, and Llama are all foundation models.

For most companies, the real decision is not "train a model" — it's "which foundation model do we build on, and how do we adapt it?" The adaptation stack is usually prompting, retrieval, and occasionally fine-tuning. Training from scratch is almost never the right call unless you have dataset advantages that no foundation model provider can replicate.

Foundation models also make AI a platform game. Vendor differences show up in cost, latency, capability ceiling, and policy behavior — and those differences matter at the margins. The bigger risk: if you build too tightly around one provider's quirks, switching later gets expensive. Design your system so the model is a swappable component, not a load-bearing wall.

The term was coined in Stanford's 2021 paper mapping the opportunities and risks of large, general-purpose models. It's since become the default framing for how organizations think about deploying machine intelligence.

Further reading:

On the Opportunities and Risks of Foundation Models (Stanford HAI, 2021) — the paper that coined the term and mapped the landscape of risks and opportunities.
Stanford CRFM — the research center dedicated to foundation model study, benchmarking, and transparency.

Function Calling (Tool Use)

AI Engineering

A model capability that lets language models request the execution of external functions — like querying a database, calling an API, or running code — rather than just generating text.

Function calling — also called tool use — lets a language model invoke external functions instead of only generating text. You define a set of functions with names, descriptions, and parameter schemas. The model decides when to call one, outputs a structured request, and your application handles execution. The result feeds back into the conversation. The model never runs code directly; it asks your code to do it.

This is the dividing line between a chatbot and a useful system. Without function calling, a model can only talk about doing things. With it, it can query your database, check inventory, create a support ticket, call a payment API, or trigger a deployment. Language models stop being text generators and start functioning as orchestration layers on top of existing infrastructure.

The implementation pattern is consistent across providers: register functions in your API call, receive a structured call with arguments back from the model, execute it in your code, pass the result back. That loop — reason, call, incorporate — is the core mechanic behind AI agents and is formalized in protocols like MCP.

Two things determine whether this works well in production: schema design and validation. Vague function descriptions produce unreliable calls. No parameter validation means the model can pass malformed inputs straight to your production systems. Treat function definitions the way you'd treat an API gateway contract — explicit, typed, and documented. The model will push every edge case you didn't define.

Further reading:

OpenAI Function Calling Guide — Official documentation on connecting models to external tools.
Anthropic Tool Use — Claude's tool use documentation.

G

Guardrails

AI Engineering

Programmable safety controls that constrain what an AI system can say, do, and access — preventing off-topic responses, harmful outputs, and data leakage.

Guardrails are programmable constraints that control what a machine intelligence system can say, do, and access in production. They sit at the boundary between the model and the outside world — filtering inputs before the model runs, validating outputs before users see them, and limiting which tools or data the model can touch at all.

Input guardrails catch prompt injections, off-topic requests, and toxic content. Output guardrails check for PII leakage, business-rule violations, and hallucinated claims. Behavioral guardrails govern what the model can actually do — which functions it can invoke via function calling, which records it can query, which actions it can take without human approval.

Treating guardrails as optional polish is one of the more common failure modes in production deployments. For any customer-facing system, they're table stakes — the same category as authentication and input validation, not a nice-to-have. The OWASP LLM Top 10 lays out the risk landscape clearly, and guardrails address a meaningful portion of it.

They don't replace evals. Evals tell you how the system performs across a distribution of inputs; guardrails catch individual failures before they reach users. You need both. Start with guardrails on the output layer and work backwards — it's faster to ship something safe than to retrofit safety into something that's already broken trust.

H

Hallucination

Generative AI

When a language model generates confident-sounding output that is factually wrong, fabricated, or unsupported by its training data.

Hallucination is when a language model generates output that sounds authoritative but is factually wrong or fabricated. The model has no awareness that it's wrong — it's not lying, it's just confidently mistaken.

This is a structural property of how LLMs work, not a bug that will be patched away. These models predict statistically plausible next tokens. Plausible and true overlap a lot, but not always. In the gap between those two things live invented citations, wrong dates, nonexistent APIs, and confident nonsense about your specific domain.

Mitigation is an engineering discipline, not a setting you toggle. RAG grounds model answers in real retrieved data rather than parameterized memory. Evals quantify your actual error rates so you know what you're shipping. Guardrails catch bad outputs before users see them. None of these eliminates hallucination — they bound and monitor it.

Teams that treat hallucination as a dealbreaker never ship. Teams that treat it as no big deal ship systems that erode user trust. The professional position is in the middle: measure it, bound it operationally, and design the product so the cost of an error is manageable when it happens. Because it will.

Human-in-the-Loop (HITL)

AI Engineering

A system design where AI handles the bulk of a workflow but routes edge cases, low-confidence outputs, or high-stakes decisions to a human reviewer before taking action.

Human-in-the-loop is a system design where machine intelligence handles routine work but routes edge cases, low-confidence outputs, or high-stakes decisions to a human reviewer before action is taken. It's the difference between full automation and reliable automation.

A typical pattern is an AI agent or agentic workflow that processes inputs within defined boundaries, then pauses and escalates when a risk threshold is crossed. The routing logic is where the work actually lives: too aggressive and you ship failures; too conservative and you've built an expensive notification system that nobody trusts.

HITL is often a permanent architectural decision, not a temporary phase you graduate out of. Medical diagnosis, financial compliance, and code review keep humans in the loop because removing them increases liability — and because the cost of a wrong call is asymmetric. Over time, human decisions can feed back into the system to expand automation coverage, but the human doesn't fully leave. Guardrails define what the AI cannot do; HITL defines what it should not do alone. Both matter. Conflating them leads to systems that feel safe but aren't.

I

Inference

Generative AI

The process of running a trained AI model to generate outputs — as opposed to training. This is what you're paying for every time you call an AI API.

Inference is running a trained model on an input to produce an output. Every API call, every classification, every image generation request — that's inference. Training happens once; inference happens continuously, at scale, on your bill.

Unlike training, inference cost is a recurring line item that scales with usage. The variables that drive it: model size, context window consumption, output length, and request volume. These compound quickly. A large context, a verbose output, and high traffic is a combination that can dominate a product's operating costs before anyone notices.

Model inference spend deserves the same rigor as cloud infrastructure costs — model the unit economics early, not after you've shipped. If costs climb, fine-tuning a smaller model on your specific task is a proven path to reducing spend without giving up quality. Paying frontier model prices for a task a smaller model handles just as well is an expensive habit.

Inference Cost Curve

Industry

The rapidly declining per-token cost of running AI models, driven by hardware improvements, quantization, and competition — falling roughly 10x per year since 2023.

The inference cost curve is the rapid, sustained decline in per-token costs for running machine intelligence models. Since 2023, frontier-class inference pricing has dropped roughly 10x per year — a pace that makes last year's cost estimates unreliable for planning next year's product.

Multiple forces compound the drop simultaneously: faster hardware (Nvidia's Blackwell generation over Hopper), distillation producing smaller models that match larger ones on specific tasks, quantization reducing memory footprint, and fierce competition among model providers compressing margins. None of these forces are slowing down.

This matters for architecture decisions. Features that were uneconomic to build in 2023 — deep reasoning chains, large context windows, per-request embeddings at scale — are routine now and will be cheaper still in 12 months. The right response is to design systems where reasoning depth, context length, or model quality can be dialed up as prices fall, rather than locking in constraints based on current pricing. Treat inference cost as a moving input, not a fixed one.

Scaling laws say models keep improving. The cost curve says they keep getting cheaper. Building as if either trend has plateaued is a bet against a very consistent track record.

L

LLM (Large Language Model)

Generative AI

A neural network trained on massive text datasets that can generate, summarize, and reason about language.

A large language model is a neural network — almost always a transformer — trained on massive text datasets to predict the next token in a sequence. That one objective, applied at sufficient scale, produces a system that can generate, summarize, translate, answer questions, and follow complex instructions.

The "next token prediction" framing sounds limiting. It isn't. At scale, the model has to build a useful internal representation of the world to predict text well, which is why these systems can reason about things they were never explicitly taught to reason about. The capability is real. The hype around it is also real, which means calibration matters.

For practical purposes, you rarely need to care about architecture details. What you do need to evaluate: context window size, reliability on your specific task, latency, and cost per token. A smaller, cheaper model that handles 95% of your cases reliably beats a frontier model that occasionally hallucinates on 5% of them — depending on what that 5% costs you.

The most consequential choice isn't which LLM to use. It's whether to use a general-purpose model as-is, fine-tune one for your domain, or build retrieval on top of it with RAG. Those decisions determine cost, maintainability, and how hard it is to recover when the model gets something wrong.

M

MCP (Model Context Protocol)

AI Engineering

An open standard that gives AI models a universal way to connect to external tools, data sources, and services through a single protocol — like USB-C for AI integrations.

Model Context Protocol (MCP) is an open standard that defines how machine intelligence connects to external tools, data sources, and services through a single protocol — one adapter instead of a custom wire for every pair.

Before MCP, integrating a model with a tool meant bespoke code on both sides. Add five tools and three models and you have fifteen integration points to maintain. MCP collapses that to N-plus-M: tool providers implement one server, model providers implement one client, and any combination works. For teams building AI agents or systems that need to act on real infrastructure, this is the difference between a composable stack and a pile of duct tape.

Anthropic open-sourced MCP in November 2024. Adoption was fast — OpenAI, Google, and others followed, and governance moved to the Linux Foundation's Agentic AI Foundation. That last move matters: it signals the protocol is shared infrastructure, not a proprietary moat. Betting your integration layer on MCP is now a reasonable call.

The practical upside is portability. An MCP server you build today works with whatever model is best next year. That's the real value — not the protocol itself, but the option value it preserves.

Model Collapse

Industry

The degradation that occurs when AI models are trained on AI-generated data, causing the model to lose diversity and accuracy over successive generations — like a photocopy of a photocopy.

Model collapse is what happens when AI models train on data generated by other AI models. Each generation loses fidelity: statistical outliers disappear, diversity shrinks, and the learned distribution narrows until it no longer reflects reality. A photocopy of a photocopy — useful analogy, depressing outcome.

Researchers at Oxford and Cambridge demonstrated the effect across generations of model training, published in Nature in 2024. Earlier work by Shumailov et al. (2023) described the recursive degradation mechanism. The math is unsurprising once you think about it: generative models sample from learned distributions. Train the next generation on those samples and you amplify the center while eroding the edges. Rare but real patterns vanish. The model gets more confident and less accurate.

The web is increasingly filled with slop — machine-generated text published without review. Future training corpora will contain more of it, which makes collapse a practical risk for any model trained on broad web data. Labs know this. It's why they're signing large data licensing deals with publishers, why synthetic data strategies try to generate diversity intentionally rather than inherit it from the open web, and why proprietary human-generated data — customer interactions, internal documentation, domain expertise — is becoming a genuine competitive asset.

If your organization generates a lot of AI-assisted content that feeds back into your own systems, you're running a small-scale version of this experiment. Worth knowing what you're training toward.

Multi-Agent System

AI Engineering

An architecture where multiple AI agents with distinct roles collaborate, delegate, and coordinate to accomplish tasks that exceed the capability of any single agent.

A multi-agent system is an architecture where multiple AI agents with distinct roles collaborate on a task. One agent researches, another writes, another reviews — and an orchestration layer coordinates the flow and tool access. The premise is specialization: don't ask one generalist to do everything.

In practice, the tradeoffs are significant. Multi-agent setups can run in parallel, keep narrower context windows, and assign role-specific tools. They are also expensive and fragile. Each handoff is a failure point. Context degrades between agents. Mistakes propagate fast and often quietly. For most workflows, a well-designed prompt chain is cheaper, simpler, and far easier to debug.

Multi-agent systems earn their complexity only when tasks genuinely require parallel specialization or when different safety and tooling boundaries make a single agent impossible to build safely. That's a real need — just rarer than the hype suggests. Reach for this architecture when you've already exhausted what a single well-prompted agent can do, not as a starting point.

Multimodal AI

Generative AI

AI systems that can process and generate multiple types of media — text, images, audio, video — within a single model.

Multimodal AI refers to machine intelligence systems that process and generate more than one type of media — text, images, audio, video — within a single model. One model, one interaction, multiple input types.

Most business data isn't plain text. Teams deal with screenshots, scanned documents, recorded calls, dashboards, and spreadsheets. A text-only LLM requires preprocessing pipelines that are expensive to build and lossy by nature — converting an image to a text description always throws information away. A multimodal model skips that step entirely. Feed it the image, the audio file, the chart. It handles the parsing.

The practical payoff is pipeline collapse. What previously required a computer vision model, an audio transcription service, and a text model stitched together can often be a single API call. Fewer failure points, less infrastructure, faster iteration.

The capability isn't magic — it flows from the transformer architecture. The same self-attention mechanism that works on text tokens can operate on image patches and audio representations. Multimodality is architecturally native, not bolted on as an afterthought. That's why capability has improved so quickly once labs started training on mixed-media datasets at scale.

The hype tends to outrun the reality in specific domains — video understanding and audio generation are still rougher than text and images — so test against your actual use case before committing.

O

Open-Source vs. Closed-Source Models

Industry

The strategic divide between AI models with publicly available weights (Llama, Mistral) and proprietary API-only models (GPT-4, Claude) — with implications for cost, customization, privacy, and vendor dependency.

Open models publish their weights. You can run them yourself, fine-tune on proprietary data, apply quantization to cut infrastructure costs, and avoid vendor lock-in entirely. Closed models are API-only: you get frontier capabilities with no infrastructure burden, but you're dependent on the provider's roadmap, pricing, and continued existence.

Open-weight models hand you control and cost leverage at the price of operational load. Your team manages deployment, scaling, and updates. Closed models eliminate that burden and tend to move faster on capability and safety improvements — but the dependency is real and compounds over time as you build around their APIs and data formats.

The pragmatic approach is almost always mixed. Use closed models for prototyping, capability-heavy tasks, and anything where time-to-value outweighs cost. Use open models for high-volume or regulated workloads where inference cost and data residency dominate. This is a direct application of the build-vs-buy framework — no universal right answer, but a clear set of variables to evaluate per use case.

One thing worth watching: the capability gap between open and closed models has been narrowing faster than most expected. The assumption that frontier intelligence requires a closed API is less true every six months.

P

Post-Training

Generative AI

The suite of techniques applied after a model's initial training to make it useful, safe, and aligned with human preferences — including RLHF, instruction tuning, and safety training.

Post-training is everything that happens after pre-training: instruction tuning, RLHF, safety training, and alignment methods like constitutional AI and DPO. It transforms a raw next-token predictor into something deployable — and something people actually want to use.

This is where frontier labs differentiate. Anthropic, OpenAI, and Google share similar architectures and data scale. Their post-training choices are what create meaningfully different behavior in production. The evidence is stark: the original InstructGPT paper showed a 1.3B model trained with RLHF producing outputs humans preferred over a raw 175B GPT-3. Parameters are table stakes; post-training quality is the product.

It also explains a failure mode teams hit regularly: a model update changes how your application behaves without any change in model size or architecture. You're seeing the downstream effect of a post-training decision you had no visibility into. If your application is sensitive to tone, refusal behavior, or instruction-following style, track model versions explicitly and treat updates as you would a dependency upgrade — test before you ship.

Pre-Training

Generative AI

The massive, expensive initial training phase where a foundation model learns language patterns from terabytes of text data, typically costing millions of dollars and weeks of compute.

Pre-training is the initial, industrial-scale phase where a foundation model learns language patterns by ingesting terabytes of text. It is the most expensive step in building a machine intelligence system — hundreds of millions of dollars and weeks of compute at frontier scale.

Most companies will never pre-train a model. The cost and infrastructure put it out of reach for everyone except a handful of labs. That makes your foundation model choice a genuine strategic dependency. The architectural decisions, data mix, and tradeoffs baked in during pre-training are locked in. You inherit them.

This matters more than it looks. Fine-tuning and prompt engineering feel cheap precisely because pre-training already paid the bill. And because training costs keep rising, this layer of the stack is concentrating further — fewer providers, more leverage over the ecosystem. When you pick a model vendor, you're picking a long-term dependency on decisions you had no hand in making.

Prompt Engineering

AI Engineering

The practice of designing and iterating on the instructions given to a language model to reliably produce the desired output quality, format, and behavior.

Prompt engineering is the practice of designing, testing, and refining the instructions you give an LLM to get reliable output — the right quality, format, and behavior, consistently.

The gap between a vague prompt and a structured one is often 30 percentage points of task accuracy. That's not a marginal improvement. Techniques like few-shot examples, chain-of-thought prompting, explicit output schemas, and well-scoped system prompts all reduce the model's room for ambiguity. Less ambiguity means more predictable results.

The word "engineering" earns its place here. Good prompt work isn't writing clever sentences — it's iterative: version your prompts, test against representative inputs, measure with evals, and refine based on failures. Teams that skip this step end up chasing inconsistent model behavior with no systematic way to improve.

It's also the cheapest lever available before you reach for fine-tuning. Fine-tuning takes time, money, and data. A better prompt takes an afternoon. Exhaust prompt engineering first — many problems that look like they need fine-tuning are actually prompt design problems in disguise.

One honest caveat: prompt engineering is somewhat model-specific. A prompt tuned for one model may behave differently on another. When you switch models, retest.

Prompt Injection

Industry

An attack where malicious input manipulates an LLM into ignoring its system instructions, revealing internal prompts, or performing unauthorized actions — the SQL injection of the AI era.

Prompt injection is an attack where crafted input tricks a language model into ignoring its system prompt, leaking internal instructions, or taking actions it was never supposed to take. The root cause is structural: models cannot reliably distinguish between instructions and data. Everything is text.

This makes the attack surface uncomfortably broad. Direct injection comes from the user — they simply tell the model to forget its instructions. Indirect injection is more insidious: malicious content embedded in a document, web page, or API response that the model reads and then obeys. An agent summarizing emails could be instructed by an email to exfiltrate contacts. That's not theoretical; it's happened.

The uncomfortable reality is that this problem is unsolved at the model layer. No amount of prompting makes a model reliably immune. Defense has to be architectural: sanitize inputs before they reach the model, validate outputs before they reach your systems, restrict what tools the model can call, and apply guardrails that operate independently of the model's own judgment. OWASP lists prompt injection as the top risk in their LLM Top 10 — not because it's the flashiest attack, but because it's the most reliably exploitable.

If your product accepts free-form user input and feeds it to a model with tool access, you have a prompt injection surface. Design accordingly.

R

RAG (Retrieval-Augmented Generation)

AI Engineering

A pattern that grounds language model responses in your actual data by retrieving relevant documents before generating an answer, reducing hallucination and keeping responses current.

Retrieval-Augmented Generation (RAG) is a pattern where a system retrieves relevant documents from your data before a language model generates a response. Instead of relying solely on what the model learned during training, it pulls from your content at query time.

LLMs don't know your proprietary data, and they hallucinate. RAG addresses both problems at once. It's the most practical path to a model that knows your business without retraining or a fine-tuning cycle that takes weeks and a significant budget.

The catch: retrieval quality is the actual limiting factor. If your documents are poorly chunked, badly indexed, or irrelevant to the query, the model confidently answers with garbage. Most failed RAG implementations fail at the retrieval step, not the generation step. Getting this right means investing in embeddings, chunk strategy, and reranking — not just wiring up a vector database and calling it done.

The pattern was formalized in a 2020 Meta AI paper that framed retrieval as a way to ground generation in external knowledge. It has since become the default architecture for knowledge-intensive applications: internal search, customer support, document Q&A. For most teams, it's also the right first move before considering fine-tuning.

RAG doesn't eliminate hallucination — it reduces it. Keep evals running in production so you know when it drifts.

Reasoning Model

Generative AI

A class of language models that allocate extra compute at inference time to think step by step before answering, trading speed for accuracy on complex problems.

A reasoning model is an LLM that spends extra compute at inference time to decompose problems into intermediate steps before producing a final answer. It thinks before it speaks, at measurable cost.

The tradeoff is real: reasoning tokens run up your bill and add latency. On a simple question, a reasoning model is the wrong tool — you're paying for computation you don't need. On multi-step problems where wrong answers are expensive — complex code generation, legal analysis, financial modeling, agentic planning — the accuracy improvement is often worth it.

The practical pattern is selective deployment. Use reasoning models where mistakes matter. Route everything else to faster, cheaper models. This is increasingly how agentic systems work: a reasoning model handles planning and verification while standard models execute the routine steps.

The category launched with OpenAI's o1 in late 2024 and expanded quickly to o3, Anthropic's Claude with extended thinking, and DeepSeek's R1. They approach the inference-time compute idea differently, but the core bet is the same: you can buy better answers by spending more at inference rather than only at training. Whether that's more efficient than simply training a bigger model is still an open question — and the answer probably depends on the task.

Pick the model for the job. Reasoning models are a specialized tool, not a universal upgrade.

Responsible AI

Software Strategy

The practices — bias testing, safety evaluations, transparency, data governance, human oversight — that ensure AI systems behave ethically and reduce organizational risk.

Responsible AI is the set of practices that keep machine intelligence systems from causing harm — to users, to your organization, and to third parties who never agreed to be part of the experiment. Bias testing, safety evaluations, transparency documentation, data governance, and human oversight all fall under the umbrella.

The framing matters. This isn't ethics theater — it's risk management. The cost of remediating a bias incident, a regulatory fine, or a public failure is orders of magnitude higher than building safeguards before launch. Running evals that test for harmful outputs, adding guardrails in production, and documenting where humans can intervene: these aren't optional decorations on a working system. They're part of what makes it work in the real world.

Regulation is catching up fast. The NIST AI Risk Management Framework and the EU AI Act are making these practices mandatory for high-risk use cases. If your system touches hiring, credit, healthcare, or legal decisions, you're likely already in scope. Organizations that treat responsible AI as a compliance checkbox will scramble when enforcement arrives. Organizations that build it into their development process won't notice the transition.

The companies that get this right early build something more valuable than compliance: they build alignment between what their systems do and what their customers actually trust them to do.

RLHF (Reinforcement Learning from Human Feedback)

Generative AI

A training technique where human evaluators rank model outputs to steer AI behavior toward being more helpful, harmless, and honest.

RLHF is a training technique where humans evaluate and rank model outputs, and those preferences become a training signal. It's how a raw foundation model — fluent but unpredictable — gets steered toward being helpful, harmless, and honest rather than just statistically likely.

The same underlying model can feel notably different depending on who did the RLHF and what they optimized for. Bold versus cautious. Terse versus verbose. Willing to refuse versus willing to engage. That's largely RLHF policy, not architecture. This matters for procurement: many behavioral differences between model versions — or between providers running the same base model — come from tuning decisions, not capability jumps. When a model "improves" or "gets worse" between versions, RLHF changes are often the cause.

It also explains why guardrails at the application layer remain necessary even with heavily tuned models. RLHF shapes general tendencies; it doesn't enforce application-specific constraints. A model tuned to be broadly safe can still behave badly in your specific context.

RLHF is not something most teams interact with directly — it happens at the model vendor level. But understanding it is useful context for evaluating model behavior, interpreting version changelogs, and knowing where application-layer controls actually belong.

S

Scaling Laws

Generative AI

Empirically observed relationships showing that model performance improves predictably as you increase compute, data, and parameter count.

Scaling laws are empirically observed relationships showing that machine intelligence model performance improves predictably — following a power law — as you increase compute, data, and parameter count. More training compute buys more capability, and you can model the curve in advance.

This is the thesis behind the current AI arms race. The landmark 2020 paper from Kaplan et al. at OpenAI plotted these curves across seven orders of magnitude with no visible plateau. If the curves hold, the path to better AI is straightforward: spend more on bigger training runs, and the results compound. Labs have largely acted on this belief, with training runs now costing hundreds of millions of dollars.

The 2022 Chinchilla paper (Hoffmann et al.) complicated the picture usefully. It showed the original scaling approach was wasteful — labs were making models too large relative to their training data. Compute-optimal training scales data and parameters together. The result: smaller, better-trained models can match or beat oversized ones at a fraction of the cost. That's why model size stopped being a reliable proxy for quality.

The unresolved question is whether the curves continue to hold, or whether we're approaching diminishing returns that require qualitatively different approaches. If they flatten, the game shifts from "who spends the most" to "who engineers the best" — where data quality, evaluation rigor, and domain expertise matter more than raw compute budget.

Either way, you don't need to pick sides. Build on the foundation models that exist today and stay architecture-flexible enough to adopt what comes next.

Shadow AI

Software Strategy

The unauthorized use of AI tools by employees — pasting company data into ChatGPT, using unvetted coding assistants, building personal automations — outside IT and security oversight.

Shadow AI is the unauthorized use of machine intelligence tools inside your organization — employees pasting customer data into ChatGPT, engineers using unvetted coding assistants, ops teams building personal automations on free-tier platforms — all outside the view of IT, security, and compliance. It's the AI version of shadow IT, with a bigger blast radius: every interaction can send proprietary data to a third-party model.

It's already happening at scale. Microsoft's 2024 Work Trend Index found 78% of AI users bring their own tools to work, and most aren't telling their managers. The gap between "official AI strategy" and what people actually do is enormous. Employees aren't being malicious — they're being productive. That's what makes blanket bans so ineffective.

Prohibition drives the behavior underground, where you have zero visibility and zero control. The better response: provide sanctioned alternatives with proper guardrails, acceptable-use policies people will actually follow, and enterprise platforms with data-loss prevention, approved providers, and audit trails. The real risk isn't that employees use machine intelligence. It's that they use it through channels where customer data trains someone else's model, and your compliance posture is one screenshot away from a problem.

AI-native organizations treat shadow AI as a signal to channel, not a fire to smother. If your people are racing ahead of your policies, the policies are the problem.

Slop

Industry

Low-quality, AI-generated content published without meaningful human review — the AI equivalent of spam, now flooding search results, social media, and inboxes.

Slop is low-quality, machine-generated content published without meaningful human review. Technically coherent, contextually useless — the AI equivalent of spam.

If Google results feel worse, if LinkedIn reads like a single author with a thousand accounts, if customer support emails answer questions nobody asked — you've been swimming in it. The term was popularized by Simon Willison in 2024 and named Merriam-Webster's Word of the Year in 2025, drawing a deliberate parallel to spam: mass-produced, unwanted, and corrosive to every channel it floods.

Machine intelligence made content generation nearly free. When production cost hits zero with no quality gate, volume explodes and average quality collapses. The output isn't wrong the way a hallucination is wrong — it's just empty. SEO articles, social posts, product descriptions, support emails: all slop candidates.

For your organization, slop is a spectrum. Every AI-generated output sits somewhere between "machine-assisted and human-refined" and "shipped without review." The difference is the gate: editorial standards, evals, and the discipline not to publish everything the model produces. Companies that treat generation as the start of the workflow build trust. Companies that treat it as the end train their audience to ignore them.

At the ecosystem level, slop feeds model collapse — future models training on low-quality AI output and degrading as a result. Vibe coding creates the same risk in engineering. Generation without judgment is just expensive noise.

Stochastic Parrot

Industry

A critical framing of LLMs as systems that produce statistically plausible text without genuine understanding, coined in a 2021 paper arguing the risks of large language models were being underestimated.

A stochastic parrot is a language model that generates fluent, convincing text by stitching together statistical patterns from its training data — without understanding any of it. The output can sound expert while being nothing more than high-probability word sequences.

The term comes from Bender et al.'s 2021 paper "On the Dangers of Stochastic Parrots," which argued that LLMs are pattern-matching engines and that the risks of scaling them were being underestimated. The paper became a flashpoint partly because Google fired two of its co-authors, igniting debate over machine intelligence safety and corporate research independence.

Whether you accept the framing or not, it's a useful corrective against treating model output as inherently trustworthy. LLMs produce text that looks right. Looking right and being right are different problems, and the gap is where production failures live — hallucinations, confidently wrong recommendations, fabricated citations. Teams that deploy machine intelligence successfully build verification into the pipeline and treat model output as a draft, not a source of truth.

Critics of the framing, including many working on emergent capabilities, argue it undersells what large models can demonstrably do. That's fair. The philosophy isn't settled; the engineering lesson is. Build systems that assume the model will be wrong sometimes, and you'll be fine either way.

Structured Output

AI Engineering

Constraining an AI model to return responses in a specific format — JSON, XML, or a predefined schema — making outputs reliably parseable by downstream systems.

Structured output is constraining a model to return machine-readable data — JSON, XML, a predefined schema — instead of freeform prose. It's the difference between an impressive demo and a working integration.

Models are trained to produce plausible text, not valid data structures. Ask for JSON and you'll usually get something close. "Usually" and "close" are what break production pipelines. A missing comma or a hallucinated field is just as fatal as a 500 error. In practice, "mostly works" is the same as "doesn't work."

Modern providers enforce structure at the generation layer. OpenAI's Structured Outputs use constrained decoding to guarantee schema compliance. Anthropic and others expose similar guarantees through tool use and function calling, where the schema is the contract. Libraries like Instructor add a validation layer on top, returning typed objects rather than raw strings.

For any machine intelligence feature that writes to a downstream system — CRM updates, database writes, workflow triggers, report generation — structured output is not optional. It's what makes guardrails enforceable and prompt engineering testable. Without it, you're shipping a system whose correctness you cannot verify programmatically. That's a bad place to be.

Synthetic Data

Generative AI

Training data generated by AI models rather than collected from real-world sources — used to augment datasets, fill gaps, and reduce reliance on expensive human-labeled data.

Synthetic data is training data generated by machine intelligence rather than collected from the real world. Instead of paying humans to label examples or scraping the web, you use an existing model to produce new training samples — question-answer pairs, code snippets, classified documents — shaped for the downstream task.

The upside is real. Microsoft's Phi series is the proof of concept: "Textbooks Are All You Need" showed that a small model trained primarily on high-quality synthetic textbook content could match or beat models 10x its size on coding and reasoning benchmarks. The key wasn't volume — it was curation. GPT-4 generated structured educational content, and the dataset was aggressively filtered for quality. Size lost to signal.

The downside is model collapse. Train on synthetic data generated by a model trained on synthetic data, and each generation loses fidelity. The distribution narrows, rare patterns disappear, and you end up with a model that sounds fluent but has forgotten the edges of the real world. This isn't theoretical — it's been demonstrated across multiple model families.

When a vendor says their model was trained on "proprietary data," ask whether that data is synthetic. It's not inherently disqualifying — Phi proves the approach can work — but it demands scrutiny: which source model generated it, how was quality filtered, and what evals validate the result? Synthetic data without rigorous evaluation is just noise with a higher price tag.

System Prompt

Generative AI

A set of instructions provided to a language model before the user's message that defines the model's persona, constraints, and behavioral rules for the entire conversation.

A system prompt is the set of instructions a model receives before it ever sees a user message. It defines persona, tone, formatting rules, constraint boundaries, and how the model should handle things it doesn't know. It's the product's constitution — the document that governs every interaction.

This is the cheapest, most powerful lever for shaping model behavior in production. Before you fine-tune or build retrieval pipelines, get the system prompt right. A strong prompt enforces formatting, reduces hallucination outside defined scope, establishes brand voice, and cuts the need for downstream guardrails. A weak one gives you a generic chatbot that drifts and makes things up.

Length is a genuine tradeoff. Longer prompts consume context window budget, increase latency, and add cost per request. The best prompts are as short as possible while remaining unambiguous — with explicit edge cases handled and minimal examples that actually carry weight.

System prompts are also the primary target of prompt injection attacks. They are necessary but not sufficient for safety. You still need architectural defenses that validate inputs and outputs independently of what the model was told, because a model can be instructed to ignore its own instructions. The best teams treat system prompts like code: versioned, tested with evals, reviewed in PRs, and iterated continuously. If yours lives in a spreadsheet or someone's head, that's the problem to fix first.

T

Temperature

Generative AI

A parameter that controls how random or deterministic a language model's output is — lower values produce more predictable responses, higher values produce more creative ones.

Temperature is a single number — typically between 0 and 2 — that controls how random an LLM's output is. At temperature 0, the model always picks the highest-probability next token: deterministic, consistent, repeatable. Increase it and the model samples more broadly across possible tokens, producing more varied and less predictable responses.

The rule of thumb is simple. Low temperature for tasks where there's a right answer: extraction, classification, structured data, code generation, contract summaries. Higher temperature for tasks where variety is the point: brainstorming, creative copy, generating a list of diverse options. Above 1.0 is rarely useful in production — output starts getting incoherent in ways that are hard to predict and harder to debug.

This is one of the easiest parameters to get wrong. Teams refine a solid prompt, test it manually at the API default, ship it, then file bugs when outputs are inconsistent across identical inputs. The fix is almost always the same: set temperature explicitly. Don't leave it at whatever the provider defaults to. That default is a compromise designed for no one in particular.

If your task is deterministic, set temperature to 0 and stop guessing. If your task benefits from variety, experiment in a range of 0.7–1.0 and eval the outputs rather than eyeballing a few examples. Temperature isn't a magic creativity dial — it's a sampling parameter, and like every parameter, it should be set deliberately.

The Lethal Trifecta

Industry

The dangerous combination of an AI agent that has access to private data, processes untrusted external content, and can communicate with the outside world — coined by Simon Willison in 2025.

The Lethal Trifecta is the combination of access to private data, processing of untrusted external content, and the ability to communicate with the outside world. Any two of these are manageable. All three together create a serious prompt injection risk.

The term was coined by Simon Willison in 2025 as a framework for reasoning about AI agent security — and it’s a good one, because it makes the danger concrete. If an agent reads sensitive data, processes untrusted inputs, and can send outputs externally, a single malicious prompt embedded in external content can exfiltrate information. The model cannot reliably distinguish your instructions from the attacker’s. That’s not a model failure; it’s a category property of LLMs. The mitigation must be architectural, not prompt-based.

In practice: limit what data each agent can access, sandbox external actions, and separate the agent that reads untrusted input from the agent that touches sensitive systems. If you genuinely need all three capabilities combined, treat the system like high-risk production code — audit logs, output filtering, rate limits, and human approval for any sensitive operation. Guardrails aren’t optional here. The trifecta is a useful checklist for any agent design review: if you’re checking all three boxes, slow down.

The Scaling Hypothesis

Industry

The belief that continuing to increase model size, training data, and compute will be sufficient to achieve artificial general intelligence — the thesis underpinning the current AI investment boom.

The scaling hypothesis is the belief that intelligence is fundamentally a function of scale. Train a bigger model on more data with more compute, and you don't just get better autocomplete — you get qualitatively new capabilities: reasoning, planning, and eventually something that looks like artificial general intelligence. OpenAI, Anthropic, and Google have collectively raised over a hundred billion dollars on some version of this thesis.

The idea crystallized around 2020 when scaling laws showed smooth, predictable performance gains as compute increased. Then emergent capabilities showed up — multi-step reasoning, code generation — appearing abruptly at certain size thresholds. Predictable curves plus surprise capability jumps made the case compelling. It still drives the bulk of frontier model investment today.

The hypothesis is genuinely contested. If it holds, capabilities keep accelerating and the cost of waiting is enormous. If it stalls — and there are serious researchers who think current architectures are approaching a ceiling — then the current generation of foundation models is close to the best you'll get, and execution quality matters more than raw model access.

You don't need to pick a side to act well. The right move is systems that work either way: modular architectures that can swap in better models as they arrive, clean data pipelines that give any model better inputs, and evaluation frameworks that tell you when a new release actually moves the needle for your use case. Bet on flexibility, not prophecy. The hypothesis may be right. The companies that succeed won't be the ones who believed hardest — they'll be the ones who built for optionality.

Token

Generative AI

The basic unit of text that language models read and generate — roughly three-quarters of an English word on average.

A token is the smallest chunk of text a large language model processes. Not a word — a piece of one. Common words like "the" are a single token. Longer words get split: "understanding" becomes "under" + "standing." One token is roughly 0.75 English words, or about four characters. A typical business email runs around 200 tokens. A 200-page document is about 75,000.

Tokens are how you pay. Every major provider — OpenAI, Anthropic, Google — prices on input and output token count separately. A support system processing 10,000 conversations a day at 500 tokens each burns 5 million tokens daily. Whether that's noise or a real budget line depends entirely on the model tier you picked.

Tokens also define what the model can work with at once. A model's context window is measured in tokens. Exceed it and you need to chunk the input, which introduces engineering complexity and new failure modes — losing context at chunk boundaries, inconsistent retrieval, harder debugging. The transformer architecture processes all tokens in the window simultaneously, which is why larger windows cost more compute and more money.

Treat token usage like any other cloud resource. Know your per-transaction counts, right-size models to your workload, and watch for runaway costs from verbose system prompts or bloated context loads. A 10x cost difference between models is common. The model that's "good enough" for your use case is almost always the right choice.

Training Data

Generative AI

The massive corpus of text, code, images, and other content used to teach a foundation model its capabilities — the raw material that determines what the model knows and how it thinks.

Training data is the raw material fed into a foundation model during pre-training — terabytes of text, code, images, and other content that determine what the model knows and, critically, what it doesn't.

The problem is opacity. OpenAI, Anthropic, and Google don't publish full training corpora. Audits of public datasets like C4 and The Pile found real gaps: underrepresented languages, shallow coverage of specialized domains, and a heavy skew toward the English-language web. When a model fumbles your industry's terminology or misses a regulatory nuance, you're usually looking at a training data gap, not a reasoning failure. The model isn't confused — it was never taught.

This reframes model selection. "Which model is smartest?" matters less than "which model was trained on data most relevant to my problem?" A model with deep legal text in its training mix can outperform a nominally stronger general model on contract analysis. Domain fit beats benchmark scores more often than teams expect.

There's also legal exposure. The New York Times, Getty Images, and thousands of authors have filed suits arguing that training on copyrighted material without permission is infringement. These cases remain unresolved, but the direction of travel is clear: if licensing mandates follow, training costs rise and flow downstream. Your vendor's data sourcing practices are now part of your risk profile — even if you'll never see the receipts.

Transformer

Generative AI

The neural network architecture behind every major large language model, which processes input in parallel using a mechanism called self-attention rather than reading sequentially.

A transformer is a neural network architecture that processes all input tokens simultaneously rather than one at a time. Its core mechanism — self-attention — lets the model weigh how every token relates to every other token in parallel. This is the architecture behind GPT, Claude, Gemini, and essentially every LLM that matters today. It was introduced in the 2017 paper "Attention Is All You Need," and it quietly changed the economics of machine intelligence.

Before transformers, language models processed sequences step by step, which meant more GPUs didn't help much — you were bottlenecked by the sequence. Transformers parallelize the workload, mapping cleanly onto modern GPU hardware. That's why training scales predictably with compute investment and why scaling laws hold. AI capability is now largely a capital allocation problem: a foundation model's quality is roughly proportional to the compute, data, and engineering talent behind it.

The transformer architecture also explains the context window: self-attention lets the model see all input at once, but the cost grows quadratically with input length. Doubling the context doesn't double the compute — it quadruples it. That's why larger windows are both valuable and expensive, and why the industry has spent years trying to make long-context attention cheaper without breaking what makes it work.

You don't need to understand transformers to build on top of them. But understanding why they scale the way they do helps you reason about model costs, capability curves, and why every frontier lab is doing the same thing with slightly different training recipes.

V

Vector Database

AI Engineering

A database optimized for storing and querying high-dimensional vectors (embeddings), enabling fast similarity search across millions of documents, images, or other data.

A vector database stores and searches high-dimensional vectors — the numerical embeddings that encode meaning. Where a traditional database answers "give me the row where id = 4827," a vector database answers "give me the 10 items most similar to this query." That shift in what a database does is the foundation of modern machine intelligence retrieval.

The core operation is approximate nearest neighbor (ANN) search. Documents, product descriptions, support tickets — all become embeddings at index time. At query time, the database compares your query vector against millions of stored vectors and returns the closest matches, typically in single-digit milliseconds. This is the retrieval layer that powers RAG.

The vendor landscape is noisy but the decision is usually simpler than the marketing suggests. Pinecone, Weaviate, Qdrant, and Chroma are purpose-built options. For most teams, the pragmatic starting point is pgvector — a Postgres extension that handles millions of vectors on infrastructure you already run. Purpose-built databases earn their keep at scale: hundreds of millions of vectors, sub-10ms latency requirements, complex hybrid filtering. For an MVP or a corpus under a few million records, Postgres is almost always enough.

The real decision isn't which vector database to pick. It's whether your embeddings and chunking strategy are good enough that retrieval returns the right context. The database is fast plumbing. Bad inputs just mean you're finding the wrong documents very quickly.

Vibe Coding

Industry

A style of AI-assisted programming where the developer describes intent in natural language and accepts the AI-generated code without deeply reviewing it — coined by Andrej Karpathy in February 2025.

Vibe coding is a style of machine intelligence-assisted programming where the developer describes what they want in natural language, accepts the generated code, and moves on without deeply reviewing it. Coined by Andrej Karpathy in February 2025, describing his own workflow: see a problem, ask the model, run the result, copy-paste what works.

It has a legitimate place. Prototypes, internal scripts, throwaway tools, and personal projects where the cost of a subtle bug is low and speed is what matters — fine. The problem starts when vibe-coded output reaches production without anyone understanding what it does. You accumulate technical debt you can't describe, bugs become opaque because no human wrote the logic, and security vulnerabilities hide in code nobody has actually read.

Simon Willison drew a useful line with "vibe engineering": machine intelligence-assisted development where you still review, understand, and take ownership of the output. Same tools, different discipline, very different risk profile. The leadership question is straightforward: which one is your team actually doing? If you don't know, assume the riskier one.

Responsible AI-assisted coding is a genuine competitive advantage. Rubber-stamping output is a liability that only surfaces when production breaks — or when something worse happens first.

W

Wrapper Discourse

Industry

The ongoing industry debate about whether applications built on top of foundation model APIs are 'just wrappers' with no defensibility, or whether integration, UX, and domain expertise constitute real value.

"It's just a wrapper" is a common dismissal in machine intelligence circles, implying that any product built on a foundation model API has no defensibility and is one upstream feature launch away from extinction. The critique misunderstands how software value is created.

By that logic, every SaaS product is "just a wrapper" around a database. Salesforce wraps PostgreSQL. Figma wraps a rendering engine. Stripe wraps bank APIs. The entire history of software is layers of abstraction that turn raw capability into something people actually use. The model is the engine, not the car.

That said, critics aren't entirely wrong. Some products truly are thin wrappers: a prompt template, a text box, and a prayer. No proprietary data. No workflow integration. No feedback loops. If a competitor can recreate your product in a weekend using the same API, you don't have a product — you have a demo. The build-vs-buy calculus is brutal in that case.

The real question isn't "is it a wrapper?" but "does it compound?" Defensible products build data flywheels, integrate deeply into workflows, tune to domain-specific needs, and create switching costs that emerge naturally from doing the job well. The model layer is a commodity input; everything above it is where value lives.

Wrapper discourse is a vendor lock-in question in disguise. If your value proposition evaporates when the upstream provider ships a new feature, you were never building a product — you were renting a demo.