You ran the skill. The output looked good. You ran it again with the same input. Different format.
That is not a bug in Claude. It is a specification problem — and it is the first thing AEM, a skill-as-a-service platform for building production Claude Code tools, addresses before any skill ships. Claude produces exactly what its instructions describe. If the instructions leave the output format unspecified, Claude fills in the blanks — and those blanks get filled differently each run, because statistical prediction is not the same as deterministic execution.
TL;DR: Without a defined output format, a Claude Code skill produces consistent output within a session but drifts across sessions. Fields appear and disappear. Formats shift. The scope expands. Defining what the skill produces, the output contract, is what makes a skill a repeatable tool rather than an unpredictable one.
What Is Output Drift and Why Does It Happen?
Output drift is when the same skill, given the same input on different occasions, produces outputs with different structure, scope, or format: not different content but different structure, because Claude fills every unspecified formatting decision with whatever fits the current session context, and that context shifts between runs.
A competitive analysis skill might produce a JSON object on Monday and a markdown table on Thursday. Both are "competitive analyses." Neither run violated any instruction, because no instruction specified the format. Claude chose the format that fit the context. Context is different enough across sessions that the choice lands differently.
Three mechanisms cause drift:
- Format underspecification — If the output contract says "produce a competitive analysis" without naming the format, Claude selects one based on context. Different contexts, different formats.
- Scope underspecification — If the contract does not list what the skill does not produce, Claude makes a best-effort attempt at adjacent requests. A competitive analysis skill asked to "also include a SWOT summary" will include one, because nothing says it should not.
- Length underspecification — A deliverable described as "detailed" will vary in length across runs because "detailed" is relative to context. A deliverable described as "a JSON object with five fields, each field max 40 words" will not.
"When you give a model an explicit output format with examples, consistency goes from around 60% to over 95% in our benchmarks." — Addy Osmani, Engineering Director, Google Chrome (2024)
That 35-point gap is the cost of an undefined output format. It is not unique to Google's benchmarks. OpenAI's structured outputs evaluation found that gpt-4o-2024-08-06 with an explicit schema achieves 100% schema compliance, while the same model without schema enforcement scored under 40% on complex JSON schema tasks (OpenAI, 2024). In our builds, the same pattern holds: skills without explicit output contracts produce consistent output in roughly 60-65% of runs on out-of-session inputs.
What Does Output Drift Look Like in Practice?
Output drift in practice looks like a skill that runs correctly every time but produces a different structure each time: JSON on Monday, markdown on Thursday, JSON again next week but with field names that changed, so the downstream consumer breaks not because the skill failed but because the format was never locked into the contract. Consider a skill built to produce a competitor pricing table, run 20 times over two weeks. The outputs:
- Runs 1-5: JSON array, each competitor as an object with `name`, `price`, `tier` fields
- Runs 6-10: Markdown table with the same columns
- Runs 11-14: JSON array, but the `tier` field is sometimes called `plan_name` instead
- Runs 15-20: Markdown table, with an extra `notes` column that was not in runs 1-5
No instruction changed between runs. The skill file was not edited. The inputs were similar across all runs. The output format drifted anyway: small differences in prompt wording and session context led Claude to different formatting choices each time.
If another process is consuming this skill's output — a script, a spreadsheet import, a downstream agent — runs 6-20 break that process. The consumer was built for the JSON format from runs 1-5. The markdown tables error. The plan_name field goes unrecognized. The notes column creates parse errors.
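To make the breakage concrete, here is a minimal sketch of that downstream consumer, assuming it was written against the runs 1-5 format. The function name `load_pricing_rows` and the sample strings are hypothetical, not from any real skill output:

```python
import json

def load_pricing_rows(raw: str) -> list[dict]:
    """Strict consumer built against the runs 1-5 format:
    a JSON array of objects with name/price/tier fields."""
    rows = json.loads(raw)  # markdown tables (runs 6-10, 15-20) fail here
    return [
        # plan_name (runs 11-14) raises KeyError on "tier"
        {"name": r["name"], "price": r["price"], "tier": r["tier"]}
        for r in rows
    ]

# Runs 1-5 parse cleanly:
good = '[{"name": "Acme", "price": "$49/month", "tier": "Starter"}]'
print(load_pricing_rows(good))

# Runs 11-14 drift: tier renamed to plan_name
drifted = '[{"name": "Acme", "price": "$49/month", "plan_name": "Starter"}]'
try:
    load_pricing_rows(drifted)
except KeyError as e:
    print("consumer broke on drifted field:", e)
```

The consumer never had a bug. It simply encoded an assumption the skill's contract never guaranteed.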
Output drift does not look like failure. It looks like inconsistency. That is harder to catch, harder to debug, and more expensive to fix. An analysis of 2 million API calls found that LLM JSON responses fail parsing 8-15% of the time without structured output enforcement, and that number compounds across every downstream system reading the output (TokenMix, 2026).
How Does Defining the Output Prevent Drift?
A defined output contract gives Claude an explicit format to work within, converting every formatting decision from a statistical prediction into a detectable pass/fail check: if the output deviates from the named structure, fields, or length limits, that is a measurable failure rather than an acceptable variation. The process step that generates the output reads: "Format the output as a JSON array. Each element must contain exactly three fields: name (string), price (string, e.g., '$49/month'), and tier (string, e.g., 'Starter'). No additional fields. No markdown." With that instruction, the format cannot drift — any deviation from it is a detectable failure, not a statistical outcome.
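Because the instruction names an exact structure, the pass/fail check can be a script. Here is one possible validator for that three-field contract, a sketch only — the function name and error messages are illustrative, not part of any skill framework:

```python
import json

REQUIRED = {"name": str, "price": str, "tier": str}

def validate_output(raw: str) -> list[str]:
    """Pass/fail check for the contract: a JSON array whose elements
    have exactly the three named string fields. Returns a list of
    violations; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON (markdown or prose detected)"]
    if not isinstance(data, list):
        return ["output is not a JSON array"]
    errors = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"element {i} is not an object")
            continue
        extra = set(item) - set(REQUIRED)
        if extra:
            errors.append(f"element {i} has additional fields: {sorted(extra)}")
        for field, ftype in REQUIRED.items():
            if field not in item:
                errors.append(f"element {i} missing field '{field}'")
            elif not isinstance(item[field], ftype):
                errors.append(f"element {i} field '{field}' is not a string")
    return errors
```

Run a check like this after every skill invocation and drift stops being something you notice weeks later; it becomes a failed run you catch immediately.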
The scope boundary prevents adjacent-request creep. If the contract says "does not include SWOT analysis" and the rules section says "If the user requests SWOT analysis, respond: 'This skill produces a 3-field pricing table only. For SWOT analysis, use a different skill'" — the skill cannot expand its scope regardless of what the user asks.
Two things a defined output contract does not prevent:
- Content variation. The specific content in each field (what pricing data Claude finds, how it words the tier name) will vary across runs. Output contracts constrain structure, not substance.
- Input-driven variation. If your skill handles multiple output types based on input (e.g., "For 1 competitor: return a single JSON object. For multiple competitors: return a JSON array"), the contract must document each variant. An undocumented variant is an unspecified output.
The improvement is measurable. In our builds at Agent Engineer Master, adding an explicit output format to a previously unspecified skill reduced the average revision cycles before production bar pass from 6 to under 2 (Agent Engineer Master, internal tracking, 40+ builds, 2025).
What Does Retroactive Repair Look Like?
Retroactive output contract repair takes at least as long as writing the contract in Phase 1 would have taken, because you have to discover what the skill actually produces before you can specify what it should produce, and discovery requires examining every output variant the skill has emitted in production. The steps are:
- Documenting what the skill currently produces (which requires examining existing outputs to find all the variants)
- Deciding which variant is correct and writing that as the contract
- Updating the SKILL.md rules to enforce the correct format
- Retesting against the existing outputs to verify the rules work
- Communicating the format change to anything consuming the output
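The discovery step above can itself be scripted. One way to map the variants across saved outputs, a rough sketch with hypothetical function names and a deliberately shallow classification:

```python
import json
from collections import Counter

def fingerprint(raw: str) -> str:
    """Classify one historical output by structure, not content:
    the discovery step of retroactive contract repair."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "markdown/prose"
    if isinstance(data, list) and data and isinstance(data[0], dict):
        return "json-array fields=" + ",".join(sorted(data[0]))
    return type(data).__name__

def survey(outputs: list[str]) -> Counter:
    """Count structural variants across all saved outputs; each
    distinct fingerprint is a format the new contract must reconcile."""
    return Counter(fingerprint(o) for o in outputs)
```

Each distinct fingerprint in the survey is one more variant to decide on in step two, and one more thing the original Phase 1 spec would have prevented.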
That process takes roughly the same time as writing the contract in Phase 1. The difference: you also have to fix the downstream systems that broke during the drift period, and you have to handle the question of what to do with the inconsistent historical outputs. Unmanaged format changes average 15-20 hours per incident in emergency repair across affected integration points (Theneo, 2026).
In our experience building skills for clients, every retroactive contract is more work than the original would have been. Not because the process is different, but because the discovery phase — mapping all the ways the output drifted — takes time that the original spec would have made unnecessary.
"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)
For the full five-phase process including where output contracts fit, see From Prompt to Production: The Five-Phase Skill Engineering Process.
How Specific Does the Definition Need to Be?
The output contract definition needs to be specific enough that you can write a mechanical pass/fail check for it: if your definition cannot be evaluated by a script without human judgment, it is not specific enough to prevent drift, because Claude faces the same ambiguity your validator does and resolves it differently each run. The threshold question is: can you write a pass/fail check for the output format?
"A JSON object with five string fields" is testable. "A thorough JSON analysis" is not.
"Each field max 40 words" is testable. "Concise answers" is not.
"Does not include SWOT analysis" is testable. "Stays on topic" is not.
If your output definition cannot be used to write a mechanical validation check, it is not specific enough to prevent drift. The goal is not to constrain Claude's intelligence — it is to give Claude's intelligence a defined target to work toward instead of an open-ended goal to interpret.
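The testable claims above translate directly into code, which is the point of the threshold question. A minimal sketch of two such checks (function names are illustrative; the SWOT screen is a naive keyword match, not a real topic classifier):

```python
def field_within_word_limit(value: str, limit: int = 40) -> bool:
    """Mechanical check for 'each field max 40 words'."""
    return len(value.split()) <= limit

def stays_in_scope(text: str) -> bool:
    """Mechanical check for 'does not include SWOT analysis'.
    Unlike 'stays on topic', this needs no human judgment."""
    return "swot" not in text.lower()

assert field_within_word_limit("Starter tier, billed monthly")
assert not stays_in_scope("Strengths and weaknesses (SWOT): ...")
```

No equivalent function exists for "a thorough JSON analysis" or "concise answers": that is how you know those phrasings will drift.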
For the full breakdown of output contract structure, see What Is an Output Contract in a Claude Code Skill?.
Frequently Asked Questions
My skill produces different outputs — is that a problem?
Structural differences are a problem: different field names, different formats, different scope. Content differences are not: the skill naming a competitor's price differently on two runs because the phrasing was slightly different is expected variation. The output contract constrains structure, not content.
Does defining the output make the skill less flexible?
Only in the sense that "not drifting" reduces flexibility. Structural consistency is not a constraint on Claude's capabilities — it is a constraint on the output format. Claude can still exercise full judgment on what content to put in each field. The format just stays predictable.
Can I use a loose definition and tighten it later?
You can, but the cost scales with how long the skill runs before you tighten it. Every run on a loose definition is an output that either needs to be accepted as-is or corrected. Tighten the definition before the first production run, not after.
What if I don't know what output format I want yet?
Run the skill manually 3-5 times. Look at what Claude produces. Pick the format you like best and make it the contract. If you do not pick one, Claude will keep cycling through the formats it tried in those early runs.
How do I make Claude output JSON from my skill?
Two requirements: a clear deliverable spec in the output contract (naming the fields and types), and an explicit instruction in the process step that generates the output: "Format the response as a JSON object with the following fields: [list]." The spec alone is not enough — the step instruction is what tells Claude to format the output as JSON rather than describing it in prose.
Last updated: 2026-04-17