Wrong output means the skill is activating but producing something other than what you specified. Start with the output contract: if you did not define precisely what the skill should produce, Claude is guessing. Fix the specification before debugging anything else.

TL;DR: Wrong output from an AEM Claude Code skill traces to one of three layers: the output contract, the instruction layer, or the reference loading layer. Diagnose the layer first, then fix it. Each layer has a distinct symptom pattern and a targeted fix. Fixing the wrong layer wastes time.

Why does a Claude Code skill produce wrong output?

Wrong output is almost always an underspecification problem at some layer: Claude is doing what it reasonably infers from the instructions it has, so the error is not in Claude's execution but in what Claude was told. Three root causes account for nearly all wrong-output failures, and each sits at a different layer of the skill specification.

  1. Cause 1: No output contract. The skill tells Claude what task to perform but not what the output should look like. "Write a code review" without specifying format, length, structure, and what NOT to include leaves Claude with near-unlimited creative latitude. The output will vary based on context, prior session history, and statistical defaults from training data.
  2. Cause 2: Ambiguous instructions. The process steps contain language that Claude can interpret two or more ways. "Summarize the key points" means different things depending on context: bullet points or prose? 3 sentences or 10? High-level overview or detailed breakdown? Claude makes a choice. Its choice may not match yours.
  3. Cause 3: Reference file interference. A reference file loaded by the skill contains instructions that conflict with the SKILL.md body, or loads at a moment in the process when its content distorts the output direction. This cause is the hardest to spot because the SKILL.md itself looks correct: the problem is in a file the SKILL.md reads.

"When you give a model an explicit output format with examples, consistency goes from ~60% to over 95% in our benchmarks." — Addy Osmani, Engineering Director, Google Chrome (2024)

That 35-point consistency gap exists entirely at the output contract layer. Skills with explicit output contracts consistently outperform skills without them on correctness metrics. The fix for wrong output almost always starts here.

How do you diagnose which cause is responsible?

Ask three questions in order. Each question isolates a single layer of the specification. The first question checks the output contract, the second checks the instruction layer, and the third checks reference file behavior. Answer all three before making any edits to the SKILL.md, or you risk fixing the wrong layer.

  1. Question 1: Did you define the output format explicitly? Open your SKILL.md and look for an output contract: a section that specifies the format, length, structure, and what the skill does NOT produce. If this section does not exist, or if it says "produce a clear and comprehensive output" rather than specifying concrete structure, you have a Cause 1 failure. Fix the output contract first.
  2. Question 2: Are your process steps unambiguous? Read each step as if you are seeing it for the first time, knowing nothing about your intent. Can you follow each step without making a judgment call? Any step that requires interpretation is a candidate for the ambiguity causing your wrong output. Tighten the step with structural constraints.
  3. Question 3: Did the wrong output appear on the first run or only after many runs? If the wrong output appeared immediately on a fresh session with no prior context, the problem is in the SKILL.md. If the output was correct initially and degraded over multiple uses, a reference file may be accumulating stale context, or session context is bleeding into the skill's execution.

In our diagnostic work on production skills, Cause 1 accounts for approximately 55% of wrong-output cases, Cause 2 for 35%, and Cause 3 for the remaining 10% (AEM production review, 2025-2026). Most developers skip to Cause 2 because instructions feel like the "real" part of the skill. Cause 1 is more frequent.

Research confirms the pattern. Ambiguity and underspecification appear in 45% of AI interaction failures across production systems, and in 79% of those cases the model generates output without surfacing the ambiguity at all ("Invisible Failures in Human-AI Interactions," arXiv:2603.15423, 2026). The skill owner sees a plausible-looking result and cannot tell it is wrong until they check it against a specification they may not have written.

How do you fix wrong output at the instruction layer?

The fix for ambiguous instructions is structural constraints. Replace every judgment call in the process steps with a concrete specification: a named format, a fixed count, or an explicit structure. Instructions that leave Claude with interpretation latitude produce inconsistent output. Instructions that leave no latitude produce consistent output. Rewrite each vague step using one of these patterns:

  • "Write a summary" becomes "Write a 3-sentence summary: one sentence stating what the code does, one sentence stating what it does not do, one sentence stating the recommended next action."
  • "List the key issues" becomes "List exactly 3-5 issues, each in the format: [Issue name]: [1-sentence description] [Severity: high/medium/low]."
  • "Review the design" becomes "Evaluate the design across three dimensions: performance implications, security implications, and maintainability. For each dimension, output one paragraph of 2-4 sentences."
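Inside a SKILL.md, the rewritten steps combine into a process section. The following is a sketch for a hypothetical code-review skill; the step wording and counts are illustrative, not prescriptive:

```markdown
## Process

1. Read the changed files in the pull request.
2. Write a 3-sentence summary: one sentence stating what the code does,
   one sentence stating what it does not do, one sentence stating the
   recommended next action.
3. List exactly 3-5 issues, each in the format:
   [Issue name]: [1-sentence description] [Severity: high/medium/low]
4. Do not add architectural recommendations or style suggestions.
```

Note that every step specifies either a fixed count, a named format, or an explicit exclusion; none of them asks for a judgment call on structure.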

Each rewrite replaces a vague instruction with a template. Templates are followed more reliably than prose guidance because they give Claude no interpretation latitude on structure, only on content. Structured prompting with explicit constraints increased task accuracy by 30% in Anthropic's own testing across classification and summarization tasks (Anthropic, 2025).

How do you fix wrong output at the output contract layer?

Write the output contract if you do not have one. An output contract is a section in your SKILL.md that specifies what to produce, what to exclude, and what a correct result looks like. Without it, Claude defaults to its training-data distribution for the task type, which is rarely what you want. A complete output contract has four parts:

  1. What the skill produces: format, length, sections, and data types
  2. What the skill does NOT produce: specific items Claude should exclude, even if they seem relevant
  3. One example output: 5-10 lines showing the exact structure, not the exact content
  4. Consumption context: who or what reads this output — human, downstream script, another agent
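A minimal output contract with all four parts might look like the following sketch; the skill, section names, and example content are hypothetical:

```markdown
## Output

Produces: a code review with exactly two sections, "Summary"
(3 sentences) and "Issues" (3-5 bullets in the format
[Issue name]: [1-sentence description] [Severity: high/medium/low]).

Does NOT produce: architectural recommendations, performance notes,
style suggestions, or praise for code that is already correct.

Example:
Summary: Adds retry logic to the API client. Does not change the
public interface. Recommended next action: merge after tests pass.
Issues:
- Missing timeout: retries can loop indefinitely [Severity: high]

Consumed by: a human reviewer reading the PR comment thread.
```

The example shows structure, not content: it is short enough to act as a template without anchoring Claude to specific wording.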

The "does NOT produce" list is as important as the specification. Claude's training data pulls it toward comprehensive responses. Without explicit exclusions, Claude expands scope. A code review skill without exclusions produces architectural recommendations, performance notes, and design suggestions that the requester never asked for. GitHub's analysis of 2,500+ agent specification repositories identified "clear boundaries" with explicit "never do" constraints as one of six components that distinguish agents that work from agents that fail (GitHub Engineering Blog, 2025).

Related: What Is an Output Contract in a Claude Code Skill.

How do you fix wrong output caused by reference files?

Reference file interference appears as correct output on simple inputs and wrong output on complex ones. The reference file loads additional context that shifts Claude's behavior for edge cases. It is the hardest cause to diagnose because the SKILL.md body looks correct: the interference source is a file the SKILL.md reads, not the SKILL.md itself.

The diagnostic: remove all reference file Read instructions from your SKILL.md temporarily. Run the skill on the input that produces wrong output. If the output is now correct, a reference file was interfering. Add reference files back one at a time, testing after each addition, until the interference reappears.
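The bisection can be sketched directly in the SKILL.md using comments; the reference file names here are hypothetical:

```markdown
<!-- Run 1: all Reads disabled. Output correct? A reference file
     is interfering. -->
<!-- Read references/review-checklist.md -->
<!-- Read references/style-guide.md -->

<!-- Run 2: restore one file at a time, re-testing after each. -->
Read references/review-checklist.md
<!-- Read references/style-guide.md -->

<!-- If the wrong output returns only after restoring
     style-guide.md, that file is the interference source. -->
```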

Load position matters as much as content. Research on long-context models found that instructions placed in the middle of context windows are followed at rates that make mid-context policy placement unreliable for production systems (Nelson Liu et al., Stanford NLP Group, "Lost in the Middle," arXiv:2307.03172, 2023). A reference file that loads mid-process can fall into the attention trough.

The fix depends on what the reference file is doing:

  • If the file contains outdated guidance, update it
  • If the file loads too early in the process, move the Read instruction to the step where the reference is actually needed
  • If the file conflicts with a process step, resolve the conflict explicitly in whichever location is more authoritative
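The second fix above, moving the Read to the step that needs it, can be sketched like this; the file name and steps are hypothetical:

```markdown
<!-- Before: reference loads up front, ahead of every step -->
Read references/edge-case-rules.md

1. Summarize the change.
2. Check the change against the edge-case rules.

<!-- After: reference loads only at the step that uses it -->
1. Summarize the change.
2. Read references/edge-case-rules.md, then check the change
   against those rules.
```

Loading late keeps the reference content adjacent to the instruction that uses it, which also reduces exposure to the mid-context attention trough.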

For a complete diagnostic framework covering all three layers, see How Do You Diagnose Whether a Skill Failure Is a Description Problem, Instruction Problem, or Reference Loading Problem.

When should you rebuild the skill instead of fixing it?

Rebuild when wrong output has three characteristics simultaneously: it is systematic (appears on most inputs, not just edge cases), it has persisted through multiple targeted fixes, and the fixes have made the SKILL.md longer without making the output better. That combination means the architecture is wrong, not the wording.

Systematic wrong output that resists targeted fixes means the skill's architecture does not match the task's actual requirements. Adding rules and constraints to a fundamentally misaligned skill increases noise without fixing the signal. At that point, defining the task from scratch — starting with the output contract and working backward to the process steps — is faster than iterative repair.

In practice, skills that reach iteration 4 or 5 without convergence on correct output are candidates for a rebuild. The early iterations of a skill design are fast. The late iterations, patching over accumulated misalignment, are slow and produce fragile results. This cost compounds: 45% of developers report that debugging AI-generated output is more time-consuming than debugging their own code, specifically because the specification mismatch is invisible until the output is compared against a contract (Stack Overflow Developer Survey, 2025).

The three-layer framework does not fix wrong output caused by model capability limits. If the task requires reasoning the model cannot perform at the required depth, or if the task is inherently too ambiguous to fully specify in a SKILL.md, adding more constraints will not help. In those cases the task scope needs to shrink or the model tier needs to change.

Related: Why Isn't My Claude Code Skill Working.

Frequently asked questions

Wrong output in a production skill falls into one of the three layers covered above: the output contract, the instruction layer, or the reference loading layer. The questions below address the patterns developers encounter most often, including output degradation over time, non-deterministic behavior, and skills that resist repeated fixes.

My skill output was correct for two weeks and then started producing wrong output. What changed?

Three possibilities: the reference files the skill reads were updated and now contain different guidance, you added a new skill to the project whose description interferes with this skill's activation context, or session length has increased and Claude is carrying more prior context that shapes the skill's output. Test in a fresh session first. If the fresh session produces correct output, the issue is session context bleed, not the skill itself.

How do I tell the difference between wrong output and output I didn't expect but that is actually correct?

Compare the output against the output contract, not against your intuition. If the output contract says "3-5 bullet points" and you receive 7 bullet points, that is wrong output regardless of whether the 7 points are good. If the output contract is ambiguous enough that the output satisfies it, the output contract needs tightening. Do not accept output that exceeds or misses the specification.

Should I add an example output to my SKILL.md to fix consistency?

Yes. An example output is the single fastest fix for many consistency problems. Claude treats an example as a template more reliably than it treats a description. One 10-line example output reduces format variation more than four paragraphs of format description do (AEM internal testing, 2025). Put the example in the SKILL.md body or in an assets file referenced from the body.
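Such an embedded example might look like this sketch; the section name and contents are hypothetical:

```markdown
## Example output

Summary: Adds input validation to the upload endpoint. Does not
change the storage format. Recommended next action: merge.

Issues:
- Unbounded file size: no upload size limit [Severity: high]
- Silent catch: validation errors are swallowed [Severity: medium]
```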

My skill produces different output every time. Is that a Claude problem or a skill problem?

Skill problem. Non-deterministic output is the signature of an underspecified output contract combined with ambiguous process steps. Claude fills the underspecified space with its best inference, which varies based on context. Tighten the output contract and add structural constraints to the process steps. If variation persists after both fixes, test whether it reduces in a fresh session — persistent variation in a fresh session indicates a fundamentally ambiguous specification.

I rewrote the process steps three times and the output is still wrong. What now?

Go back to the output contract. In our experience, skills that resist process-step fixes almost always have an output contract problem that was not diagnosed initially. The developer focused on instructions because instructions feel like the "active" part of the skill. Write the output contract from scratch, specifying format, length, sections, exclusions, and an example. Then check whether the process steps now make sense relative to a clearly specified output.

Last updated: 2026-04-20