A Claude Code skill without an output contract produces something different every run. The format shifts. Fields appear and disappear. The scope boundary is wherever Claude decides it is that day.
An output contract fixes this. It is a written specification of exactly what the skill produces, what inputs it requires, and what it does not produce. You write it before touching SKILL.md. Everything in the skill file — the description, the steps, the rules — is an implementation of the contract. At Agent Engineer Master (AEM), the output contract is the first deliverable we produce on every commission: before a single SKILL.md line is written, the contract exists.
TL;DR: An output contract is the specification layer that sits before SKILL.md. It defines the deliverable (what the skill produces and in what format), the inputs (what the user provides), and the scope boundary (what the skill explicitly does not do). Without it, your skill runs on vibes and produces a different artifact each time.
What Does an Output Contract Actually Look Like?
An output contract is a short plain-text document, half a page or less, with three sections: a deliverable (what the skill produces and in what format), an inputs block (what the caller provides), and a scope boundary (what the skill explicitly does not produce). No special syntax required. You can write it in a text file in under ten minutes.
Plain text works fine.
Here is a minimal output contract for a competitive analysis skill:
Deliverable
-----------
A JSON object with five fields:
- name: string (company name as written)
- pricing: string (e.g., "$29/month per seat")
- differentiator: string (max 40 words, one key differentiator)
- weakness: string (max 40 words, one specific weakness)
- market_share: string (e.g., "12% US SMB market") or null if unknown
Inputs
------
- company_name: plain text, no URL
- target_market: plain text, one sentence describing the segment being analyzed
Scope Boundary (does NOT produce)
----------------------------------
- Does not source or verify live pricing data
- Does not update or merge with existing analyses
- Does not produce written prose reports or executive summaries
- If the user requests anything outside this scope, responds: "This skill produces a
5-field JSON object only. For [requested output], use a different skill."
That contract is testable. You can run the skill three times and check whether the output matches the deliverable specification each time. A contract that says "produces a good competitive analysis" is not testable — good means different things on different runs.
Without schema enforcement, prompt-only JSON extraction has failure rates of 5–20% depending on schema complexity. Failures cluster in production at the worst times: when input volume is high and field-name or structure variation breaks downstream processing (agenta.ai, LLM structured output benchmarks, 2024).
"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)
"When you give a model an explicit output format with examples, consistency goes from ~60% to over 95% in our benchmarks." — Addy Osmani, Engineering Director, Google Chrome (2024)
Why Do the Three Sections Matter Individually?
Each section of an output contract closes a specific failure mode: the deliverable constrains what Claude produces so the format does not drift between runs, the inputs section defines what Claude validates before starting so partial inputs cannot slip through, and the scope boundary specifies what Claude refuses so users cannot route tasks the skill was never built to handle.
The deliverable section is where most output contracts fail. "A report" and "a JSON object with five named fields" are both deliverables — but only one of them constrains Claude's output format. Without field names, Claude invents them. Without length limits, Claude writes 500-word analyses when you wanted 40 words. The deliverable section is not a goal. It is a spec.
Concrete test: can you write a pass/fail check for the output? If your deliverable says "a JSON object with five fields: name, pricing, differentiator, weakness, market_share" — you can write code that validates the output in 10 lines. If your deliverable says "a thorough competitive report" — you cannot. The deliverable is not specific enough to test.
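That pass/fail check can be sketched in a few lines of Python. The field names and 40-word limits come straight from the example contract earlier; the function name `check_output` is illustrative, not part of any required harness:

```python
import json

# Field names and word limits taken from the example contract's deliverable section.
REQUIRED_FIELDS = {"name", "pricing", "differentiator", "weakness", "market_share"}
WORD_LIMITED = {"differentiator": 40, "weakness": 40}

def check_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    errors = []
    data = json.loads(raw)
    if set(data) != REQUIRED_FIELDS:
        errors.append(f"fields are {sorted(data)}, expected {sorted(REQUIRED_FIELDS)}")
    for field, limit in WORD_LIMITED.items():
        value = data.get(field)
        if isinstance(value, str) and len(value.split()) > limit:
            errors.append(f"{field} exceeds {limit} words")
    market_share = data.get("market_share")
    if not (market_share is None or isinstance(market_share, str)):
        errors.append("market_share must be a string or null")
    return errors
```

Run the skill three times, feed each output through the check, and "does this look right?" becomes a list of concrete violations.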
The inputs section determines what validation Claude can perform before starting. A skill that takes "a company name" will start running on partial input — a URL, a description, an abbreviation — and produce an artifact that drifts from the deliverable. Specify the format. Specify the constraints. "Plain text, no URL" is four words that prevent a whole category of input errors.
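The same pre-flight validation can be expressed as code. This is a sketch of what the inputs block makes checkable; the URL heuristics are assumptions for illustration, not rules from the contract:

```python
def validate_inputs(company_name: str, target_market: str) -> list[str]:
    """Pre-flight checks derived from the example contract's inputs block."""
    errors = []
    # "Plain text, no URL" — a simple heuristic, not exhaustive URL detection.
    if "://" in company_name or company_name.lower().startswith("www."):
        errors.append("company_name must be plain text, not a URL")
    if not target_market.strip():
        errors.append("target_market is required: one sentence describing the segment")
    return errors
```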
The scope boundary is where the output contract earns its keep in production. Users try things outside scope. They always do. Without a documented boundary, Claude makes a best effort. That best effort is usually wrong in a way that is hard to catch — the output looks plausible but includes data the skill was never designed to produce reliably.
The scope boundary also becomes the rules section of SKILL.md. Every "does not" in the contract maps to a "Never..." rule in the body. If the contract says "does not source live data," the SKILL.md rule says: "Never attempt to fetch or verify live pricing data. If the user requests this, respond with: [specific refusal text]." The contract is the source of truth. The rule is the enforcement.
Research on structured generation confirms that constraining LLM output format simultaneously increases mean accuracy and reduces variance across prompt variations. In one evaluation, structured generation eliminated all model-ranking swaps across n-shot setups: in unstructured mode, rankings between models flipped depending on the prompt; in structured mode, the winner was consistent every time (HuggingFace, "Improving Prompt Consistency with Structured Generations," 2024).
How Do You Write a Deliverable That Is Specific Enough?
A deliverable is specific enough when you can write a mechanical pass/fail test against it: name the format (JSON, markdown, plain text), list every field by name, set a constraint on each (type, length limit, required or optional), and leave nothing unspecified. Anything you omit, Claude decides at runtime, and those runtime decisions vary across sessions and across users.
The test is reproducibility. Given the same input, two runs should produce structurally equivalent output. If the format changes between runs, the deliverable is not specific enough.
Three questions to ask about any deliverable:
- What format? JSON, markdown, plain text, a numbered list, a table. Name it explicitly.
- What fields or sections? List each one by name. If the output has structure, name the structure.
- What constraints? Field types, length limits, required vs optional. Anything Claude might vary, specify.
For skills that produce prose — a summary, a brief — the deliverable needs structural constraints even when format is flexible. "A three-paragraph summary: context (2 sentences), finding (2 sentences), recommendation (1 sentence)" is specific. "A good summary" is not.
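Even prose constraints like that are mechanically checkable. A sketch, assuming paragraphs are separated by blank lines and sentences end in terminal punctuation — both assumptions for illustration, not rules from the contract:

```python
import re

def check_summary(text: str) -> list[str]:
    """Check the 'three-paragraph summary' spec: 2 + 2 + 1 sentences."""
    errors = []
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) != 3:
        errors.append(f"expected 3 paragraphs, got {len(paragraphs)}")
        return errors
    expected = [2, 2, 1]  # context, finding, recommendation
    for i, (para, count) in enumerate(zip(paragraphs, expected), start=1):
        # Crude sentence split: terminal punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", para) if s]
        if len(sentences) != count:
            errors.append(f"paragraph {i}: expected {count} sentences, got {len(sentences)}")
    return errors
```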
In our builds, deliverable underspecification is the single most common reason a skill produces inconsistent output. The model is not guessing; it is filling in the specification you left blank. Give it nothing to fill in. Across 40+ tracked builds at Agent Engineer Master, skills where the deliverable section named specific fields and types passed their first production bar check in under 2 revision cycles. Skills where the deliverable said "a report" or "an analysis" averaged 6 revision cycles before reaching the same standard (Agent Engineer Master, internal tracking, 2025).
Anthropic's own documentation on structured outputs notes that without schema enforcement, "field names, optional values or entire structures varied from one call to the next." The problem is not model quality. It is the absence of a closed spec (Anthropic, "Structured outputs on the Claude Developer Platform," 2025).
The scale of the gap is measurable. When OpenAI introduced strict JSON Schema enforcement in August 2024, gpt-4o scored 100% on their complex JSON schema evals. The prior model, gpt-4-0613, scored under 40% without schema enforcement. The only variable was the presence of a closed output spec (OpenAI, "Introducing Structured Outputs in the API," 2024).
What Happens if You Skip the Output Contract?
Skipping the output contract does not prevent the skill from running, but it guarantees the skill produces inconsistent results: without a closed format spec, Claude treats each run as a fresh interpretation problem, the output format drifts across sessions, fields appear and disappear on identical inputs, and the retroactive repair work costs as much time as writing the contract upfront would have.
The skill runs. On simple inputs, the output looks correct. You ship it.
By the 10th run, some runs produce five fields, some produce six. The pricing field sometimes includes a range, sometimes a single number. The scope boundary is wherever Claude decided it was on that run. That drift is real, and it is unfixable without going back to Phase 1.
Fixing output consistency in a deployed skill means writing the output contract retroactively, then rewriting the SKILL.md rules to implement it, then retesting. That process takes about the same time as writing the contract in Phase 1. The difference: you also have to fix all the outputs that drifted before you noticed.
This is not a Claude problem. Claude does exactly what the instructions say. If the instructions leave the format unspecified, Claude uses its best judgment. Best judgment is consistent within a session. Across sessions, it drifts.
The compounding effect is significant. An agent with just a 5% per-step error rate in a 20-step workflow succeeds only 36% of the time overall, because failures at each step multiply rather than add (Latitude.so, "Why AI Agents Break in Production," 2025). A separate analysis from Patronus AI found that AI agents fail on 63% of complex multi-step tasks in real-world conditions, primarily because failure modes appear in the interactions between steps, not at any single step (VentureBeat, December 2025). Output contract gaps are precisely this kind of per-step failure: individually small, catastrophic at production volume.
Is an Output Contract the Same as an Eval?
No: an output contract defines what the skill should produce and is written before any SKILL.md content exists, while an eval is a set of structured test cases that verifies the built skill actually produces what the contract specifies, run only after the skill is complete. The contract is the design artifact; the eval is the test suite. You cannot write a good eval without a complete output contract to test against.
An eval test case consists of: an input, the expected output (derived from the contract), and a pass/fail condition. If the contract says the deliverable is a 5-field JSON object, the eval can check: "Does the output contain exactly five fields? Are all fields strings? Is the differentiator field under 40 words?" Those are mechanical checks. Without the contract, you're checking "does this look right?" — which is subjective and not repeatable.
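A minimal eval case in code looks like this. The `CASE` shape and `run_case` helper are illustrative, not a fixed harness format; the checks are derived directly from the example contract:

```python
import json

# One eval case: an input, plus pass/fail checks derived from the contract.
CASE = {
    "input": {"company_name": "Acme", "target_market": "US SMB accounting"},
    "checks": [
        ("exactly five fields",
         lambda out: set(out) == {"name", "pricing", "differentiator",
                                  "weakness", "market_share"}),
        ("differentiator under 40 words",
         lambda out: len(out["differentiator"].split()) <= 40),
    ],
}

def run_case(raw_output: str, case: dict) -> list[str]:
    """Return the names of failed checks for one skill run."""
    out = json.loads(raw_output)
    return [name for name, check in case["checks"] if not check(out)]
```

Every check is mechanical: it reads a field the contract named and applies a constraint the contract set. No contract, no checks.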
Anthropic's engineering team documents that after an agent is launched, evals with high pass rates "graduate" to become a regression suite run continuously to catch drift. That suite only works if the original output contract gave it something to test against (Anthropic Engineering, "Demystifying evals for AI agents," 2025).
For the relationship between output contracts and evaluation-first development, see From Prompt to Production: The Five-Phase Skill Engineering Process.
One limitation of output contracts: they define expected behavior for inputs within scope. They do not prevent users from providing out-of-scope inputs, and they do not automatically generate the SKILL.md rules that enforce the scope boundary. The contract is a design artifact. Translating it into SKILL.md instructions is still your job.
Frequently Asked Questions
Does every Claude Code skill need an output contract?
Every skill that runs more than once needs one, because without a contract the deliverable format is whatever Claude decides to produce on each run, which varies across sessions even for identical inputs. A genuine one-off skill you will never repeat can skip the formal contract. Any skill you intend to reuse, share, or hand off needs one.
How long should an output contract be?
Half a page or less is the right target: a deliverable block naming the output format and fields, an inputs block listing what the user provides, and a scope boundary listing what the skill does not produce. If your contract runs longer than one page, the skill is doing more than one thing and should be split.
Where does the output contract live?
The output contract does not need a permanent home, because it is a pre-build planning artifact rather than a runtime component: write it in a text file, confirm both parties agree on the deliverable, then use it to write the SKILL.md description and rules. Some teams keep it as contract.md in the skill folder. The only constraint is that it must exist before SKILL.md is opened.
Can the output contract change after the skill is built?
Yes, but every contract change requires a corresponding change to the SKILL.md rules that enforce it, plus a full retest against the revised deliverable spec. The contract and the implementation must stay in sync at all times; a contract that describes a 5-field JSON object while the SKILL.md produces a 6-field one is a contract in name only.
What if my skill produces different outputs depending on the input?
Variable outputs are fine as long as every variant is documented in the deliverable section with its own specification: "For single-company inputs, produces a 5-field JSON object. For multi-company comparisons, produces a JSON array where each element is a 5-field object." The contract can have branches. What it cannot have is branches you did not write down.
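A branched contract stays testable because the validator can dispatch on shape. A sketch, reusing the field set from the example contract; the function name is illustrative:

```python
import json

FIELDS = {"name", "pricing", "differentiator", "weakness", "market_share"}

def check_branched(raw: str) -> list[str]:
    """Validate either branch: a single 5-field object for one company,
    or a JSON array of such objects for a multi-company comparison."""
    data = json.loads(raw)
    objs = data if isinstance(data, list) else [data]
    errors = []
    for i, obj in enumerate(objs):
        if not isinstance(obj, dict) or set(obj) != FIELDS:
            errors.append(f"element {i} does not match the 5-field spec")
    return errors
```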
Is an output contract the same as SKILL.md documentation?
No, and the distinction matters for timing: the output contract is the pre-build spec you write before touching SKILL.md, while the SKILL.md description field and rules section are the implementation that enforces the contract's requirements at runtime. The contract is what you agreed to build; SKILL.md is the evidence you built it.
Last updated: 2026-04-17