Most prompts are written once and abandoned. A production skill is different: a reusable, testable artifact that solves a class of problems reliably across inputs it has never seen.

The gap between those two outcomes is not creativity. It is process. At Agent Engineer Master, every commission runs through the same five-phase build sequence — whether the skill takes 2 hours or 2 weeks. The sequence exists because we have watched engineers skip phases and regret it, usually around the third production failure. The broader picture confirms the pattern: 88% of AI agent projects never reach production (Digital Applied, AI Agent Scaling Analysis, 2025). Most fail before the build, not during it.

TL;DR: Building a production-ready Claude Code skill has five phases: Commission (define what it produces), Describe (write the description and frontmatter), Build (write the SKILL.md body), Support (add reference files, assets, and checkpoints), and Test (verify against real inputs). Most skills fail at Phase 1, because the commission is too vague to test against.


What Are the Five Phases of Skill Engineering?

The five phases are commission (the output contract), describe (the trigger and frontmatter), build (the SKILL.md body with overview, process steps, and rules), support (reference files, asset templates, and approval checkpoints), and test (an eval set of at least three cases run against the commission spec). Each phase produces a concrete artifact that becomes the input to the next phase: the commission feeds the description, the description constrains the body, the body determines what support infrastructure is needed, and the support files define what the eval set must cover. Skipping a phase is not a shortcut. It is deferred rework.

Phase  Name        Artifact
1      Commission  Output contract: deliverable, inputs, scope boundary
2      Describe    SKILL.md frontmatter: name, description, tool permissions
3      Build       SKILL.md body: overview, process steps, rules
4      Support     Reference files, asset templates, approval checkpoints
5      Test        Eval set: 3+ test cases with expected outputs

Phases 1 and 2 have a hard dependency: you cannot write a reliable description without knowing the output contract. Phase 3 depends on Phase 2 for the same reason — the scope you locked in the description determines which instructions belong in the body. Phases 4 and 5 have more flexibility. Small skills sometimes run them in parallel.


Phase 1: How Do You Define the Commission?

The commission is a one-page spec written before you open SKILL.md. It names what the skill produces, what inputs it requires and in what format, what the skill explicitly does not do, and why each boundary exists. Without it, you are writing instructions aimed at a target nobody has specified, and the SKILL.md body cannot be tested against anything concrete.

A well-formed commission has three parts:

  1. The deliverable: the specific artifact this skill produces, described in enough detail that two people reading it independently reach the same mental picture. Vague: "A competitive analysis report." Specific: "A JSON object with five fields: name (string), pricing (string, e.g., '$29/month per seat'), differentiator (string, max 40 words), weakness (string, max 40 words), market_share (string, e.g., '12% of the US SMB market')." The vague version produces a different artifact every run. The specific version produces the same structure every run. Same skill. Different commission.
  2. The inputs: what the user provides and in what format. "A company name" is not enough. "A company name (plain text, no URL) and a target market segment (plain text, one sentence max)" gives Claude something to validate before starting.
  3. The scope boundary: what the skill explicitly does not produce. "Does not source live pricing data. Does not update existing analyses. Does not produce written prose." Every item on this list corresponds to something a user will eventually try, and the skill needs to handle it gracefully, which means the skill body needs a rule for it.
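A commission this specific can be checked mechanically. Below is a minimal sketch in Python, assuming the five-field JSON contract from the deliverable example above; the field names and 40-word limits come from that commission, and the function name is hypothetical:

```python
import json

# The five fields from the example commission, all strings.
REQUIRED_FIELDS = {"name", "pricing", "differentiator", "weakness", "market_share"}
WORD_LIMITS = {"differentiator": 40, "weakness": 40}  # max words per the contract

def check_contract(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means the output passes."""
    problems = []
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    missing = REQUIRED_FIELDS - obj.keys()
    extra = obj.keys() - REQUIRED_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    for field in REQUIRED_FIELDS & obj.keys():
        if not isinstance(obj[field], str):
            problems.append(f"{field} is not a string")
    for field, limit in WORD_LIMITS.items():
        value = obj.get(field)
        if isinstance(value, str) and len(value.split()) > limit:
            problems.append(f"{field} exceeds {limit} words")
    return problems
```

Run it against each eval output in Phase 5; the vague "competitive analysis report" commission admits no such check, which is the practical difference between the two.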

We have tracked commission quality across 40+ builds at Agent Engineer Master. In builds where Phase 1 produced a vague output contract, 7 of 10 skills failed their first production bar check. The failure was not in the SKILL.md instructions — those looked correct. The failure was that the instructions were aimed at a target nobody had clearly specified.

The test for a good commission: describe the expected output to someone who did not write the skill. If they ask "but in what format?" or "does it do X too?" — the commission is incomplete.

For the full output contract specification, see What Is an Output Contract in a Claude Code Skill?.

"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)


Phase 2: How Do You Write the Description and Frontmatter?

The description field controls whether Claude selects the skill at all: it is the one line Claude's classifier reads at session startup to decide relevance. A description that drifts from the output contract produces a skill that either fails to trigger on valid prompts or triggers on prompts it cannot serve correctly. Write it immediately after locking the commission, while scope is still fresh; a description written later, from memory, is where that drift starts.

---
name: analyzing-competitors
description: "Analyze a company's competitive position: takes a company name and target market, produces a 5-field JSON object. Does NOT source live pricing data or update existing records."
---

Three constraints that separate working descriptions from broken ones:

  1. One line only: Code formatters like Prettier wrap long descriptions across multiple lines, and a multiline description is invisible to Claude's classifier at session startup. The skill silently stops triggering, with no error. This is the most common reason a skill "stops working" after the first commit.
  2. Under 1,024 characters: The runtime truncates longer descriptions. Truncation does not always happen at a sentence boundary. Your scope boundary, the "does NOT" clause, is at the end. That is the first part to disappear.
  3. Directive language: Testing across 650 activation trials showed directive descriptions ("Analyze a company's competitive position") achieve 100% activation reliability. Passive descriptions ("This skill can help analyze competitive positions when needed") sit at 77%. The 23-point gap is not theoretical.
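The first two constraints are mechanically checkable, and checking them in CI catches the formatter failure mode before it ships. A small lint sketch, assuming the description string has already been parsed out of the frontmatter (the function name is hypothetical; the 1,024-character limit is the figure cited above):

```python
MAX_DESCRIPTION_CHARS = 1024  # runtime truncation limit cited above

def lint_description(description: str) -> list[str]:
    """Flag the two silent failure modes: multiline and over-length descriptions."""
    problems = []
    if "\n" in description:
        problems.append("description spans multiple lines; the classifier will not see it")
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description is {len(description)} chars; "
            f"everything past {MAX_DESCRIPTION_CHARS} is truncated"
        )
    return problems
```

Directive phrasing, the third constraint, still needs a human read; no lint catches passive voice reliably.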

Each description consumes approximately 100 tokens at session startup (Claude Code documentation, 2025), regardless of whether the skill runs. Build a library of 30 skills and that is 3,000 tokens consumed before the first message. Keep descriptions precise.

For the full analysis of activation mechanics, see The SKILL.md Description Field: The One Line That Makes or Breaks Your Skill.


Phase 3: How Do You Build the SKILL.md Body?

The SKILL.md body defines what Claude does when the skill runs. It has exactly three sections: overview, process steps, and rules. Nothing else belongs in the body, because every addition beyond those three either duplicates the commission, conflicts with a rule, or adds instruction surface that cannot be tested against a clear pass/fail condition. The overview is one sentence naming the deliverable. The process steps are imperative commands Claude executes in sequence. The rules enforce the scope boundary from the commission. That structure is the whole body.

## Overview
[One sentence: what this skill produces, for whom, under what conditions.]

## Process
1. [Imperative command]
2. [Imperative command]
...

## Rules
- Never [specific constraint]
- Always [specific constraint]

On process steps, three rules matter:

  1. Imperative commands, not guidelines: "Read the input file" executes. "The model should consider reading the input file when appropriate" becomes a suggestion Claude weighs against everything else in context. Commands run. Suggestions compete.
  2. One action per step: A step that reads "Read the file, parse the competitor names, and generate a preliminary ranking" contains three actions with no individual success condition. Split it into three steps. You can test each one independently.
  3. Explicit parallelism: Claude follows the sequence you give. If two steps can run in parallel, mark them: "Steps 3 and 4 can run in parallel." If you say nothing, Claude runs them in sequence, which is slower and sometimes wrong.

On the rules section: rules enforce the scope boundary from Phase 1. Every item in the scope boundary ("does not source live data") becomes a rule in this section ("Never attempt to fetch live pricing data. If the user requests this, respond with: 'This skill does not source live data. Provide pricing as plain text or skip the pricing field.'"). Without the corresponding rule, the scope boundary is documentation. With it, it is an instruction. Scope creep is the leading cause of production failures in agent systems: agents that accumulate incremental additions transform from bounded tools into open-ended reasoning systems that become too complex to test and too difficult to debug (Digital Applied, AI Agent Scaling Gap, 2026).
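One way to keep the scope boundary and the rules section in sync is a rough coverage check: every boundary phrase from the commission should appear in at least one rule. A sketch under that assumption; the substring matching is deliberately crude and both names are hypothetical:

```python
def uncovered_boundaries(boundary_phrases: list[str], rules: list[str]) -> list[str]:
    """Return boundary phrases that no rule mentions (case-insensitive substring match).

    A non-empty result means part of the Phase 1 scope boundary is still
    documentation rather than an instruction.
    """
    lowered_rules = [rule.lower() for rule in rules]
    return [
        phrase for phrase in boundary_phrases
        if not any(phrase.lower() in rule for rule in lowered_rules)
    ]
```

Substring matching misses paraphrased rules, so treat a flagged phrase as a prompt to look, not proof of a gap.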

The SKILL.md body has a practical ceiling of roughly 500 lines before instruction fidelity drops. This is consistent with the finding that model attention degrades for instructions placed past the 60% depth of a long context (Liu et al., "Lost in the Middle," Stanford NLP Group, 2023, ArXiv 2307.03172). Research on instruction-following at scale reinforces this: DBRX's instruction-following failure rate increases from 5.2% at 8k tokens to 50.4% at 32k tokens (Databricks Long Context RAG Performance Study, 2024). Keep the body short. Move domain knowledge to reference files rather than embedding it.


Phase 4: How Do You Add Supporting Infrastructure?

Supporting infrastructure is everything the skill reads at runtime outside the SKILL.md body: reference files carrying domain knowledge, asset templates for consistent output format, and human-in-loop checkpoints that pause execution before irreversible actions. Phase 4 determines whether a skill is safe to run against real data or is merely a well-written SKILL.md file. It is optional for simple skills with no external dependencies and no irreversible actions, but any skill that publishes, sends, or deletes must have it.

  • Reference files carry domain knowledge that changes independently of the skill's process steps. The competitive analysis skill needs an output template — the exact JSON structure with field names and value format examples. That template lives in references/output-template.md, not in SKILL.md. When the format changes, you update one file. The SKILL.md body stays unchanged. Reference files load on demand, not at startup. They do not count against the 100-token-per-skill startup cost. The tradeoff: you must explicitly instruct Claude to read a specific file in the relevant process step, or that file never gets read. Research on long-context retrieval confirms the value of keeping loaded context small: tests across 18 frontier LLMs showed performance degrades at every increase in input length, with accuracy drops exceeding 30% when relevant information is buried past the midpoint of a long context (Chroma Research, Context Rot, 2025).
  • The one-level-deep rule: reference files must be self-contained. If output-template.md points to field-definitions.md, Claude will not follow the chain. It reads the file you name in the step and stops. Design reference files to be complete without external dependencies.
  • Human-in-loop checkpoints define where the skill pauses. Three types: approve/decline (Claude presents the artifact, user confirms or rejects before the skill continues), choose-from-N (Claude presents 3-5 variants, user selects one), and open-field (Claude requests specific input before proceeding).

Any skill that publishes, deletes, sends, or otherwise takes an action that cannot be undone without external coordination requires an approve/decline gate before that action. A skill that executes without showing the user what it is about to do is a design defect. The gate goes after artifact generation and before execution. Approval controls remain underprioritized in practice: only 42% of regulated enterprises plan to introduce approval and review controls in their AI agent stacks (Cleanlab, AI Agents in Production 2025, 95 production teams surveyed). The 58% without them are running skills that can act before the user has seen the output.

Completed skill folder structure after Phase 4:

.claude/skills/
  analyzing-competitors/
    SKILL.md
    references/
      output-template.md
      source-notes.md
    assets/
      approved-examples/
        example-saas-competitor.md

Phase 5: How Do You Test and Ship a Skill?

Testing means running the skill against real inputs and comparing the output to the commission spec from Phase 1, which is why a strong commission is a prerequisite to a meaningful test: a vague Phase 1 leaves nothing checkable to compare against. If Phase 1 was done correctly, you have an exact deliverable: format, fields, scope. The test either passes or it surfaces what the commission or the rules section got wrong, which is the point.

The minimum viable test set has three cases:

  1. Common case — a standard input the skill was designed for, well within scope
  2. Edge case — an unusual but valid input: ambiguous data, partial information, or an input that tests the scope boundary
  3. Out-of-scope case — an input the commission explicitly excludes

Run all three. If the skill attempts the out-of-scope case instead of declining it, the scope boundary from Phase 1 was not implemented in the rules section of Phase 3. Fix it there. Human review remains the dominant evaluation method in production: 59.8% of teams rely on it as their primary quality check, with LLM-as-judge approaches used by 53.3% as a scaling complement (LangChain State of AI Agents Report, 2024). For a three-case eval set, human review is both sufficient and fast.
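The three-case set is small enough to wire into a minimal harness. A sketch, assuming a run_skill callable that invokes the skill on a prompt and returns its raw text output, and a per-case pass/fail check; run_skill, the prompts, and the checks are all hypothetical placeholders for your skill's commission:

```python
from typing import Callable

# Each case: (label, input prompt, check) where check returns True on pass.
Case = tuple[str, str, Callable[[str], bool]]

def run_eval(run_skill: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run every case through the skill and report pass/fail per label."""
    return {label: check(run_skill(prompt)) for label, prompt, check in cases}

# The minimum viable set mirrors the three cases above.
cases: list[Case] = [
    ("common", "Analyze Acme Corp in the US SMB market",
     lambda out: out.strip().startswith("{")),           # expect the JSON object
    ("edge", "Analyze Acme Corp (pricing unknown) in a niche market",
     lambda out: out.strip().startswith("{")),           # still expect valid structure
    ("out-of-scope", "Fetch Acme Corp's live pricing page",
     lambda out: "does not source live data" in out),    # expect a decline, not an attempt
]
```

For a production check, swap the startswith heuristic for a full contract validation against the Phase 1 spec; the harness shape stays the same.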

The formal version of this is evaluation-first development — writing test cases before writing the SKILL.md body, so the build has a spec to work toward. See Evaluation-First Skill Development for the full approach. Evaluation practices remain sparse across the industry: only 52.4% of organizations run offline evaluations on test sets before deployment (LangChain State of AI Agents Report, 2024, 1,300+ professionals surveyed). The other 47.6% ship first and discover edge cases from users.

One clear limitation of Phase 5: it tests the skill against inputs you designed. It does not test inputs you have not yet imagined. The learnings loop — documenting observed failures in learnings.md and running the improvement process every 10 entries — handles the rest. Phase 5 is the starting line, not the finish line.

When the skill passes all three cases, set status: active in frontmatter and install it. The improvement loop begins at first real use.


Why Does the Production Bar Exist and What Are Its Four Checkpoints?

The production bar is the AEM standard for a deployable skill, defined as four specific checkpoints a skill must pass before it is installed: commission clarity (two readers agree on format and scope), activation reliability (triggers correctly on varied prompts), output consistency (same structure across 5 runs), and edge case handling (declines out-of-scope inputs rather than attempting them poorly). Not "it ran." Not "the output looks right." Each checkpoint has a concrete test you can run in under 2 hours on a well-commissioned skill.

  1. Commission clarity — two independent readers of the output contract agree on format, fields, and scope
  2. Activation reliability — the skill triggers on 10 varied prompts that should activate it; does not trigger on 5 that should not
  3. Output consistency — the same input produces structurally equivalent output across 5 independent runs
  4. Edge case handling — the skill declines or redirects out-of-scope inputs instead of attempting them poorly
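Checkpoint 3 reduces to a structural diff across runs. For a JSON-producing skill like the running example, "structurally equivalent" can be read as "same set of top-level keys"; a sketch under that assumption, with the five outputs collected from independent runs (the function name is hypothetical):

```python
import json

def structurally_consistent(outputs: list[str]) -> bool:
    """True if every run parses as a JSON object with the same set of top-level keys.

    Values may differ between runs; checkpoint 3 is about structure, not content.
    """
    key_sets = []
    for raw in outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            return False
        if not isinstance(obj, dict):
            return False
        key_sets.append(frozenset(obj))
    return len(set(key_sets)) == 1
```

Checkpoints 1, 2, and 4 need human judgment; this is the one that automates cleanly.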

A skill that passes checkpoints 1-3 but fails checkpoint 4 is a fair-weather skill. It works on the inputs you tested. It makes a questionable best effort on everything else.

The four checkpoints take under 2 hours for a well-commissioned skill. For a poorly commissioned skill, they surface the problems before production. That is the point of having a bar. Most teams skip this entirely and discover the gaps in production: fewer than 1 in 3 teams working with AI agents in production report being satisfied with their observability and guardrail coverage, and 63% plan to improve evaluations within the year (Cleanlab, AI Agents in Production 2025, 95 production teams surveyed). The production bar is the pre-deployment version of the work those teams are doing reactively.


Frequently Asked Questions

The most common failure is an underspecified commission: 7 of 10 failed bar checks in our builds trace back to Phase 1, where the output contract was too vague to test against and the SKILL.md body was effectively aimed at a target no one had specified clearly enough to verify. The questions below address the specific scenarios where engineers are most likely to skip a phase, misjudge scope, or ship without a usable eval set.

What is the most common failure in skill engineering?

An underspecified commission in Phase 1. Skills that enter Phase 3 without a clear output contract produce inconsistent output — not because the SKILL.md instructions are wrong, but because the target they were built to hit was never specified precisely enough to be measurably correct. In our builds, 7 of 10 failed bar checks trace back to Phase 1.

How long does the five-phase process take for a typical skill?

A single-domain skill covering a familiar task takes 2-4 hours end-to-end: 30-45 minutes for the commission, 30 minutes for the description and frontmatter, 30-60 minutes for the SKILL.md body, 30-60 minutes for reference files and checkpoints, and 30-60 minutes for testing. Multi-domain skills with complex output contracts take longer because Phase 1 takes more work.

Can I skip Phase 4 for a simple skill?

Phase 4 is optional when the skill has no domain knowledge that changes independently and no irreversible actions. The other phases are not optional. A skill without a commission is a prompt in a SKILL.md file. That is a different thing.

What is a fair-weather skill?

A fair-weather skill passes tests on expected inputs but fails on adjacent or out-of-scope inputs — it either attempts out-of-scope requests poorly rather than declining them, or it produces structurally inconsistent output when inputs deviate from the common case. The fix is in Phase 1 (tighten the scope boundary) or Phase 3 (add explicit rules for out-of-scope handling).

Do I need to write evals before the SKILL.md?

Not required, but it forces you to specify the output contract precisely enough that each test case has a clear pass/fail condition. Teams that skip this tend to discover the underspecification when users report inconsistent output in production.

How many reference files is too many for one skill?

More than 5-6 is a signal the skill is doing too many things. At that point, split it. Two narrow skills with 3 reference files each are more maintainable than one broad skill with 8. Narrow skills also compose more cleanly when chained.

When does a skill need an approval gate?

Any skill that publishes, sends, deletes, or takes any action that cannot be undone without external coordination needs an approve/decline gate before that action. No gate is a design defect, not a feature.


Last updated: 2026-04-17