At AEM, a skill-as-a-service platform where production Claude Code tools must pass a measurable quality bar before deployment, most production skills have between 4 and 8 steps. Three-step skills work for the simplest workflows. Skills with more than 10 steps start to lose instruction fidelity, because Claude has to hold too many actions in working memory at once.

Those are the ranges. The more useful question is why — and what the failure modes look like at each extreme.

TL;DR: The right step count for a Claude Code skill is the number of discrete actions the skill takes, each stated once and each independently testable. Too few steps means Claude fills in unstated actions however it sees fit. Too many means instructions in the middle of the sequence get dropped. Most skills land between 4 and 8.


What Does "Too Few Steps" Look Like?

Too few steps means the skill is underspecified: Claude performs the actions you named and then improvises the rest, making choices about format, field names, output length, and edge-case handling that you never specified. The output varies run to run because those underlying decisions are made differently each time.

A 2-step competitive analysis skill:

## Process
1. Identify the company's competitors.
2. Produce a competitive analysis.

What happens in practice: step 1 is executed clearly. Step 2 requires Claude to make decisions you did not make: what format, which fields, how long each section, whether to include pricing, whether to include a recommendation. Claude makes those choices. The choices vary across runs.

That is not a skill with the wrong number of steps. It is a spec problem that manifests as an output problem. The steps are too abstract because the output contract was never defined. Once you have a concrete output contract ("produce a JSON object with five named fields"), the steps become concrete too:

## Process
1. Read `references/output-template.md` to load the output format.
2. Identify the company's top 3 competitors based on the target market segment.
3. For each competitor, research the five required fields: name, pricing, differentiator,
   weakness, and market_share.
4. Format the output as the JSON array specified in the template.
5. Verify all five fields are present for each competitor. Flag any field as null if
   the information is not available.

Five steps. Each one describes a specific action. Claude does not need to fill in anything.
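Step 5's check can itself be made mechanical. A minimal sketch, assuming competitor records arrive as dicts; the function name and return shape here are illustrative, not part of the skill spec:

```python
# Hypothetical checker for a "verify all five fields" step: confirm
# every competitor record carries the five required fields, null out
# (None) any missing field, and record which fields were flagged.

REQUIRED_FIELDS = ["name", "pricing", "differentiator", "weakness", "market_share"]

def validate_competitors(competitors):
    """Return (validated, flagged): cleaned records plus a list of
    (competitor_index, field_name) pairs that had to be nulled."""
    validated, flagged = [], []
    for i, record in enumerate(competitors):
        clean = {}
        for field in REQUIRED_FIELDS:
            if record.get(field) is None:
                flagged.append((i, field))
            clean[field] = record.get(field)  # missing -> None (JSON null)
        validated.append(clean)
    return validated, flagged
```

Running this on a record that only has `name`, `pricing`, and `differentiator` returns the record with `weakness` and `market_share` set to None and both fields flagged.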

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, TypeScript compiler team, Anthropic (2024)

The data supports this: providing explicit output formats with examples moves consistency from roughly 60% to over 95% in structured output benchmarks (Addy Osmani, Engineering Director, Google Chrome, 2024). The difference is entirely in how tightly the output contract is specified, which is exactly what underspecified steps leave open.


What Does "Too Many Steps" Look Like?

Too many steps is a real problem, but it kicks in later than most people expect. A 12-step skill does not fall apart immediately; it falls apart on the steps in the middle, where research consistently shows that LLMs suffer the steepest accuracy drops due to the lost-in-the-middle effect (Liu et al., Stanford NLP Group, 2023, ArXiv 2307.03172).

The research quantifies it: in a 20-document QA task, model accuracy dropped by more than 20 percentage points when the relevant document was placed in the middle versus at the start or end (Liu et al., "Lost in the Middle," Stanford NLP Group, 2023, ArXiv 2307.03172). In a 15-step skill where each step is 2-3 lines, steps 8 through 11 sit at roughly the 60-70% depth mark. Those steps get dropped more often.

What "dropped" looks like in practice: Claude executes step 1 through 7, then skips steps 8-10 silently and jumps to step 11 or 12. The output looks roughly right — because the skipped steps were validation steps, or sub-steps within a larger action, or edge-case handling. Nothing breaks obviously. The quality is just lower than expected.

Three signals that your step count is too high:

  1. Claude skips steps. You run the skill and the output is missing something that step 9 should have handled.
  2. Steps are at different levels of abstraction. "Step 1: Research the company" and "Step 7: Ensure the market_share field is a string formatted as 'X% of Y market'" are not at the same level. The high-level step is hiding several actions.
  3. The skill does several different things. A skill that researches competitors, formats a report, and emails the report to a stakeholder is three skills. Split it.

How Do You Calibrate the Right Step Count?

The right step count for a Claude Code skill is governed by three structural rules that prevent both underspecification failure and mid-context instruction loss. Each rule can be verified mechanically against any SKILL.md before the skill goes to production, without running the skill at all.

These three rules are enough to catch every miscalibration we see in practice, including the most common one, which is steps that are really two or three actions bundled together.

  1. Rule 1: One action per step. If a step contains more than one verb (read and parse, analyze and format, validate and output), it contains more than one action. Split it. Each action is independently testable; combined actions are not.

  2. Rule 2: Same level of abstraction. All steps should be at approximately the same level of specificity. If one step is "identify the competitors" and another is "format the market_share field as a string," you have an abstraction mismatch. Either make the high-level step more specific or add a sub-step structure.

  3. Rule 3: If you have more than 10 steps, look for a split. A 12-step skill is often two 6-step skills in a trenchcoat. Ask: is there a natural handoff point where the output of the first phase becomes the input to the second? If yes, split there. Two 6-step skills that link cleanly are more maintainable than one 12-step skill that occasionally forgets the middle.
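As a sketch of what that mechanical verification can look like, here is a heuristic lint. Rule 1 is approximated by flagging steps that join two actions with "and" ("read and parse"), and Rule 3 by the raw step count; Rule 2 (abstraction level) is a judgment call and is not checked. The function name and regexes are assumptions, not an established tool:

```python
import re

# Heuristic lint for Rules 1 and 3 against a SKILL.md Process section.
# Rule 1: flag "<verb> and <verb>" bundles inside a single step.
# Rule 3: flag skills with more than 10 numbered steps.

STEP_RE = re.compile(r"^\s*\d+\.\s+(.*)$")

def lint_process(skill_md):
    # Collect the text of every numbered step line.
    steps = [m.group(1) for line in skill_md.splitlines()
             if (m := STEP_RE.match(line))]
    warnings = []
    for n, text in enumerate(steps, 1):
        if re.search(r"\b\w+ and \w+", text.lower()):
            warnings.append(f"step {n} may bundle two actions: {text!r}")
    if len(steps) > 10:
        warnings.append(f"{len(steps)} steps: look for a natural split point")
    return warnings
```

It will produce false positives ("pricing and availability" is one concept), which is fine for a pre-production check a human reviews.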

Decomposition research backs this up: on multi-step reasoning benchmarks, decomposed prompting achieves 50.6% exact match versus 36% for single monolithic prompts (GSM8K benchmark, LLM task decomposition research, OpenReview 2024). Splitting at the right point is not a style preference. It is a reliability intervention.


When Should You Use Sub-Steps?

Sub-steps are the right tool when a single top-level action has internal complexity that needs to be explicit but does not warrant its own step number. Use them to define field specifications, output formats, or conditional branches without inflating the step count or forcing Claude to infer what a field should contain.

3. For each competitor, collect the five required fields:
   - name: the company's name as publicly stated
   - pricing: the lowest publicly available tier price, formatted as "$X/month"
   - differentiator: the one feature or claim they lead on, max 30 words
   - weakness: one specific documented complaint from user reviews, max 30 words
   - market_share: percentage of the target segment if available, else null

That is one step with five sub-specifications. It does not add to the step count. It makes step 3 concrete without turning it into five separate numbered steps.

Sub-steps work for specification (what a field must contain) and for conditional logic (if X then Y, else Z). They do not work for sequential dependencies — if sub-step B requires the output of sub-step A, make them separate numbered steps. Sequential dependencies modelled as sub-steps are a known failure pattern in agentic task design: the model treats the sub-step as optional rather than blocking, because numbered steps carry stronger sequencing signals than indented bullet points (Anthropic, Claude Code prompting best practices, platform.claude.com, 2025).


Does Step Count Affect Skill Performance?

Step count is a proxy for the real variable: instruction load. Dense steps push more instruction into mid-context positions where accuracy degrades, so eight dense steps can underperform twelve short ones. What matters is instruction density, not step count. Chroma's 2025 research across 18 frontier models found accuracy drops of 30% or more at every input length tested (Chroma Research, "Context Rot," 2025).

In our builds, the symptoms of instruction overload appear at around 500 lines in the full SKILL.md body — not at a specific step count. If your 8-step skill has steps averaging 40 lines each, you are at 320 lines of instruction. That is fine. If your 6-step skill has steps averaging 100 lines each, you are at 600 lines — above the practical ceiling, and you will see step skipping.

The step count guidance (4-8 steps) assumes steps are 2-10 lines each. If your steps are longer than that, the more useful number to watch is total lines in the SKILL.md body, not step count. Independent benchmarks confirm that performance degradation at scale is not model-specific: in controlled tests across 5 open and closed models on math, question-answering, and code generation, performance degraded substantially as input length increased even with perfect evidence retrieval (An et al., ArXiv 2510.05381, 2025).
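A line-count check of this kind is trivial to automate. A sketch using the 500-line ceiling described above; the function name and report shape are assumptions:

```python
# Measure instruction load in a SKILL.md body: total non-blank lines
# plus a rough count of numbered steps. The 500-line default ceiling
# is the practical limit described in the text above.

def instruction_load(skill_body, ceiling=500):
    lines = [l for l in skill_body.splitlines() if l.strip()]
    # Rough step detection: non-blank lines that begin with a digit.
    steps = [l for l in lines if l.lstrip()[0:1].isdigit()]
    return {
        "body_lines": len(lines),
        "step_count": len(steps),
        "over_ceiling": len(lines) > ceiling,
    }
```

A 6-step skill with 100-line steps trips `over_ceiling` immediately; an 8-step skill with 40-line steps does not.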

One clear limitation: this guidance applies to single-phase skills where all steps run in sequence. Multi-phase skills — where the user interacts between phases, or where phases run in separate sessions — need a different architecture. The steps within each phase still follow the 4-8 rule, but the phases themselves are distinct skill invocations, not step 9 through 15 of a single skill.

For the full context on where process steps fit in the skill engineering process, see From Prompt to Production: The Five-Phase Skill Engineering Process. For the earlier article on writing process steps that Claude actually follows, see How Do I Write Step-by-Step Instructions for a Claude Code Skill?.


Frequently Asked Questions

Claude keeps skipping step 3 of my skill. What is wrong?

Three likely causes: (1) step 3 is implicit in the output of step 2, and Claude treats it as already done; (2) step 3's condition is never true on your test inputs, so Claude skips it correctly; (3) the SKILL.md body is long enough that step 3 sits past the 60% depth mark. Check the total line count of your skill body. If it is above 400 lines, split the skill or consolidate the steps above step 3.

Should my skill steps tell Claude which tools to use?

Yes, when the tool choice matters. "Use the Read tool to load references/output-template.md" is more reliable than "read the output template" — the latter leaves open whether Claude uses Read, Bash, or infers the content from memory. Specify the tool when you have one in mind. Leave it unspecified only when you genuinely do not care which tool Claude uses. OpenAI's prompt engineering documentation identifies tool specification as a key reliability lever: examples that name the exact function or action to use produce dramatically more consistent outputs than open-ended instructions (OpenAI, Prompt Engineering Guide, platform.openai.com, 2024).

Is it better to have one complex skill or several simple ones?

For maintainability and reliability, several simple ones. A skill with 3-5 steps that does one thing well is easier to test, easier to update, and easier to debug than a skill with 12 steps that does three things. The composition cost (invoking multiple skills in sequence) is lower than the debugging cost of a single skill that fails unpredictably. Multi-agent architectures that separate concerns into distinct agents consistently outperform monolithic single-agent systems on complex tasks, with some benchmarks showing accuracy gains of 40–80% on multi-hop reasoning when tasks are decomposed (LM2 / Society of Language Models research, ArXiv 2404.02255, 2024).

Can a skill step tell Claude to spawn a subagent?

Yes. A step can instruct Claude to delegate a specific task to a subagent. The step would name the subagent type, describe the task, and specify what output to expect back. This is a pattern suited to tasks that exceed roughly 500 lines of instruction load in a single skill, covered in the multi-agent architecture documentation. For most skills, subagents add latency and token cost without improving quality — use them when the delegated task genuinely exceeds what a process step can specify.

How do I make Claude run two steps in parallel inside my skill?

Add an explicit parallelism note in the step: "Steps 3 and 4 can run in parallel. Begin both before waiting for either to complete." Without that instruction, Claude runs all steps in sequence. Parallelism is never assumed.

What is the minimum viable number of steps for a production skill?

Three: one step to load or validate inputs, one step to execute the core task, one step to format and validate the output. Below three steps, the skill is likely too abstract to produce consistent output. The core task step is doing too many things, and Claude will fill in the unstated actions differently each run. Research on LLM instruction-following confirms the pattern: even when models can retrieve all relevant content with 100% exact match, task performance still degrades substantially as instruction load increases (An et al., "Context Length Alone Hurts LLM Performance," ArXiv 2510.05381, 2025). A three-step minimum keeps total instruction volume low enough that the model retains all of it through to the validation step.
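A minimal skeleton of that three-step shape, in the same style as the examples above (the reference path and template name are placeholders, not a prescribed layout):

## Process
1. Read `references/input-spec.md` and validate the input against it.
2. Execute the core task on the validated input.
3. Format the output as specified in the template. Verify every required field
   is present; flag any missing field as null.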


Last updated: 2026-04-17