Most developers start building a skill by writing SKILL.md. That is the third step, not the first.

TL;DR: The Agent Engineer Master Claude Code skill engineering workflow runs five phases: commission brief, design, build, test, deploy. The brief comes before any code. It defines exactly what the skill produces, what it explicitly does not produce, and what triggers it. Skipping the brief phase is the single most reliable way to guarantee two or three rounds of rework.


Why start with a brief instead of code?

The failure mode for most skills is not bad instructions. It is misspecified scope. A developer who skips the brief writes a SKILL.md that answers a question nobody asked: three capabilities bundled together, no clear trigger, and Claude with no reliable signal for when to stop. The brief forces you to answer four questions before touching a file.

A developer needs a code review skill, so they start writing the review logic. Two hours later they have a SKILL.md that checks for syntax issues, suggests refactors, flags security problems, and comments on architecture. Each individual capability is reasonable. Together, they form a skill nobody knows how to invoke consistently, and Claude doesn't know when to stop. A 2026 ETH Zurich evaluation found that overly detailed agent context files increased inference costs by over 20% without meaningfully improving task completion rates (ETH Zurich, "Evaluating AGENTS.md," arXiv 2602.11988, 2026). The same dynamic applies to unbounded SKILL.md scope.

"The failure mode isn't that the model is bad at the task, it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." -- Simon Willison, creator of Datasette and llm CLI (2024)

The brief forces the scope question before you write a line. What does this skill produce? What does it explicitly not produce? Who invokes it, and when? Answering these before building makes every subsequent decision faster.

The five-phase workflow is overhead for throwaway skills you will use once. If the task is exploratory and the requirements will change with every run, a quick conversation prompt costs less than a structured skill. The workflow pays off for any skill you plan to invoke more than a handful of times.

How do I write the commission brief?

Write a two-paragraph brief before touching SKILL.md. The brief covers exactly two things: what the skill produces and what it does not produce. Both paragraphs are required. The first constrains Claude's output format. The second constrains the scope. Together, they eliminate the two most common reasons a skill fails its first test session.

  1. Paragraph 1 (what the skill produces): Be specific about format, length, and structure. "A structured code review" is not specific. "A numbered list of findings, each with: the affected file and line range, the issue description in one sentence, and a concrete suggested fix" is specific.
  2. Paragraph 2 (what the skill does not produce): This is the most underrated constraint in skill design. Naming what the skill excludes eliminates a class of outputs that are technically correct but scope-creeping. "Does not perform security audits. Does not suggest architectural changes. Does not run tests." Three exclusions that prevent three ways the skill can drift.
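As a concrete sketch, a brief for the code review example above might read like this. The skill name and the exact exclusions are illustrative, not prescriptive:

```markdown
## Commission brief: code-review

**Produces:** A numbered list of findings for the diff under review. Each
finding has three fields: the affected file and line range, a one-sentence
issue description, and a concrete suggested fix.

**Does not produce:** Security audits, architectural suggestions, or test
runs. If the diff has no findings, the output is the single line
"No findings."
```

Two paragraphs, both constraints testable. Everything in the design and build phases traces back to these lines.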

The brief takes ten minutes to write and saves two hours of rework. In Agent Engineer Master commissions, skills that arrive with a completed brief ship in half the iterations of skills without one. Research on structured prompt specification found that adding explicit output constraints reduces iteration cycles from six to three (PO2G optimization study, arXiv 2501.10868, 2025).

What does the design phase cover?

The design phase produces three artifacts in order: the frontmatter description, the trigger condition list, and the output contract. Together they ensure Claude invokes the skill reliably and produces exactly what the brief specifies. Write them before touching SKILL.md. Each artifact feeds the next phase directly.

  1. Description first: Write the frontmatter description before the skill body. This one line controls when Claude invokes the skill. A weak description produces a fair-weather skill that triggers sometimes and misses other times. Get the description precise before writing anything else. Anthropic caps the combined description and when_to_use text at 1,536 characters in the skill listing (Claude Code docs, 2025), so every character needs to earn its place.
  2. Trigger conditions: List three to five specific prompts that should activate the skill. Then list two or three that should not. These become your test cases in Phase 4.
  3. Output contract: Define the deliverable precisely. Refer back to the brief. If your brief says "a numbered list of findings," the output contract specifies the exact fields in each finding. This becomes the reference your test assertions check against.
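A sketch of the three artifacts for the same hypothetical code review skill. The wording and prompts are illustrative; write your own against your brief:

```markdown
Description (frontmatter):
  Reviews a git diff and returns a numbered list of findings, each with a
  file and line range, a one-sentence issue, and a suggested fix. Use when
  the user asks for a code review of a diff, branch, or PR.

Should activate:
- "Review the changes on this branch"
- "Give me a code review of this diff"
- "Look over my PR before I merge"

Should NOT activate:
- "Is this architecture sound?"   (architecture, out of scope)
- "Run the test suite"            (test execution, out of scope)

Output contract: numbered list; each finding = file + line range,
one-sentence issue, suggested fix. Nothing else.
```

The two trigger lists become the Phase 4 test cases verbatim.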

The first version of a skill is usually a prompt wearing a trenchcoat. The description is what turns it into the real thing.

The Claude Code pillar From Prompt to Production: The Five-Phase Skill Engineering Process covers the design phase in more detail for teams building production skills at scale.

How do I build the SKILL.md?

Now write SKILL.md. The build phase has three ordered parts, and each one has a distinct job. Getting all three right in the first build reduces the test iterations needed in Phase 4. Miss one and the test phase finds it the hard way.

  1. Frontmatter: controls discovery. Name, description, and schema_types determine when Claude invokes the skill.
  2. Process steps: control execution. Each numbered step is one action, with any required decision rule inline.
  3. Output contract: tells Claude what to produce and what to refuse to produce.

Start with the frontmatter: name, description, and schema_types. Then write the process steps. Number them. One step, one action. If a step requires a judgment call, add the decision rule inline.
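Putting the three parts together, a minimal SKILL.md skeleton for the hypothetical code review skill might look like this. The `schema_types` value, the file-count threshold, and the reference-file path are illustrative assumptions, not fixed conventions:

```markdown
---
name: code-review
description: Reviews a git diff and returns a numbered list of findings.
  Use when the user asks for a code review of a diff, branch, or PR.
schema_types: [finding-list]
---

## Process

1. Read the diff. If it touches more than 20 files, review only files
   with logic changes (skip lockfiles and generated code).
2. For each issue, record the file and line range, a one-sentence
   description, and a suggested fix. For style questions, follow
   references/style-guide.md.
3. Emit the findings as a numbered list. If there are none, emit the
   single line "No findings."

## Output contract

Produces: a numbered list of findings, each with file and line range,
issue, and suggested fix.
Does not produce: security audits, architectural suggestions, test runs.
```

Note the judgment call in step 1 carries its decision rule inline, and the style guide lives in a reference file rather than in the skill body.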

Reference files go outside SKILL.md, not inside. If the skill needs a style guide, a list of approved patterns, or domain knowledge longer than 30 lines, create a separate reference file and point to it from the relevant step. The three-layer progressive disclosure architecture keeps startup costs low precisely because most content lives in reference files, not in the core SKILL.md.

Write the output contract section. Copy it from your brief. This section tells Claude what to produce and what to refuse to produce. Both halves matter.

A 150-line SKILL.md with clean structure beats a 400-line SKILL.md that embeds everything inline. Shorter loads faster and follows more reliably. Anthropic recommends keeping SKILL.md under 500 lines; when invoked, full skill content loads at under 5,000 tokens (Claude Code docs, 2025). Reference files that load only when needed stay outside that budget entirely.

For the anatomy of a well-structured SKILL.md, see What Goes in a SKILL.md File?.

How do I test a Claude Code skill?

Test in a fresh Claude Code session. Not the session you built the skill in. Fresh-session testing is the only confirmation that the skill works with the context a real user has: the SKILL.md and the trigger prompt, nothing else. Any extra context in your build session is invisible to a real user.

This is the Claude A / Claude B distinction. Claude A (the session where you built the skill) has context that a real user would not have: your reasoning, your prior exchanges, the mental model you developed while writing the instructions. Testing in that session gives false positives because Claude is filling gaps with information it should not have.

Claude B is a fresh session. No prior context. Only the SKILL.md and the prompt. This matters for more than just session isolation: Stanford NLP research found that model performance degrades significantly when relevant instructions appear in the middle of long input contexts, with the best results when instructions are at the start (Nelson Liu et al., "Lost in the Middle," arXiv 2307.03172, 2023). Keep the output contract near the top of SKILL.md.

Run your trigger condition list. Each of the "should activate" prompts should activate the skill. Each of the "should not activate" prompts should not. If any trigger test fails, revise the description and retest.

Then check output quality against your output contract. Does the skill produce exactly what the contract specifies, nothing more? If it adds unrequested sections or omits required fields, the instructions need tightening.

Most skills reach reliability in two to four fresh-session iterations (based on the iteration patterns documented in the Claude Code skill engineering research, 2026). Skills without a clear brief or output contract take significantly more.

How do I deploy a skill at the right level?

Install the skill at the right level and confirm it works in its final environment. Project-level installs go in the repo, committed via PR, visible to every team session. User-level installs are personal, scoped to your own sessions. Getting the level wrong means the skill is unavailable to the people who need it.
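Assuming the standard Claude Code skill locations (these are the documented defaults as of 2025; check your version's docs), the two levels map to:

```
your-repo/.claude/skills/code-review/SKILL.md   # project-level: committed, shared via PR
~/.claude/skills/code-review/SKILL.md           # user-level: personal, all your sessions
```

Same file, different home. The directory decides who sees it.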

If the skill is for a team, install it at project level in the repo. Open a PR with the SKILL.md and any reference files. The PR description is where you document why each design decision was made. Six months later, when someone wants to change the skill, that PR description is the institutional memory.

If the skill is personal, install it at user level. Test it in a fresh session on the actual tasks you plan to use it for, not the contrived tests from Phase 4. Real prompts behave differently from test cases.

The deploy phase is also when you verify the skill does not conflict with other installed skills. Run /skills after install. If two skills have overlapping descriptions, revise the one that is less specific. In a 2025 survey of teams running AI agents in production, 63% planned to improve observability within the year, citing lack of visibility into live agent behavior as the primary gap (Cleanlab, Engineering Leaders Survey, 2025). Post-deploy monitoring of your skill's trigger behavior is the same problem at smaller scale.

For guidance on installing at the right level, see What's the Difference Between Project-Level and User-Level Skills?.


Common questions about the skill development workflow

A simple skill, built with a brief, takes two to three hours across all five phases. The brief is ten minutes of that. The design phase is another thirty. The remaining time is split between writing SKILL.md and running fresh-session tests. Skills built without a brief take longer because scope decisions happen during testing, not before it.

  • How long does it take to build a skill from scratch? A simple skill with a clear brief takes two to three hours across all five phases, including testing. Complex skills with multiple reference files and output variations take four to eight hours. The brief phase is about ten minutes. Skipping it costs more time than it saves.

  • Can I skip the test phase for a simple skill? A skill that does one specific thing and has an unambiguous trigger can reasonably be deployed after a single fresh-session test. But "simple" is a judgment call developers get wrong more often than they expect. The test phase reveals whether your assumptions about the trigger conditions match Claude's behavior. That is worth one iteration even for simple skills.

  • What if I don't know exactly what the skill should produce yet? Build a prototype in a conversation first. Give Claude the task manually a few times and observe what you actually want. Then write the brief based on those observations. The brief works best when it reflects real outputs you have already seen and approved, not ideal outputs you are imagining.

  • Should I write evals before writing the SKILL.md? Yes, if the skill is going to production or will be used by multiple people. Evaluation-first development produces more reliable skills because the test cases force you to specify the correct behavior precisely before you write the instructions. For personal or exploratory skills, the test cases from Phase 4 are often enough.

  • How do I know when the skill is done? The skill is done when it produces the correct output for every test case in your trigger list, in every fresh session you run. "It worked three times" is not done. "It worked in 10 consecutive fresh sessions against the full test case list" is closer to done.

  • Can the brief change during the build phase? Yes, but stop and revise it before continuing if it does. Building against an outdated brief is how scope creep enters skills. The brief is not a formality; it is the specification you are testing against.

Last updated: 2026-05-02