A skill that doesn't trigger is easy to fix: the description is wrong. A skill that triggers but produces subtly wrong output is harder. The failure isn't visible from the outside. The skill fires, Claude writes something, and the output looks plausible, but it's missing a section, the format is slightly off, or the tone is wrong in a way that's hard to name. This is the failure class that does the most damage in production, because it bypasses the obvious checks.
TL;DR: When a Claude Code skill triggers but produces subtly wrong output, the fault is in the instruction layer, the reference layer, or the output contract. Run the minimal-input test first: strip the input down to the simplest possible case and check whether the output is correct. If the minimal input works but production inputs don't, the problem is instruction scope. If even the minimal input fails, start with the output contract and work backward.
Why Is Subtly Wrong Output Harder to Diagnose Than an Outright Failure?
Subtly wrong output passes the first check: the skill ran. You asked for a review summary, you got a review summary. The failure only appears when you read it carefully and notice the negative findings section is missing, the output format doesn't match your template, or two steps ran in the wrong order.
Outright failures have clear signals: the skill doesn't trigger, Claude returns an error, or a reference file fails to load. Subtle failures require you to know what correct output looks like, then work backward from the gap.
Three layers produce subtle output failures:
- The instruction layer: A step is present but under-specified. Claude follows it technically but misses the intent.
- The reference layer: A reference file loads but Claude de-prioritizes its content when the instruction body is long.
- The output contract layer: You specified format but not completeness. Claude fills in the structure but omits required substance.
Identifying which layer is responsible narrows your debugging to one file.
What Are the Most Common Causes of Subtly Wrong Output?
In order of frequency across AEM commissions:
Instructions that specify what to do, not what to avoid. A content generation skill that says "write a 300-word summary" produces a 300-word summary. Without "always include a negative finding if one exists," the model generates a positive summary whenever the content supports that reading, because omitting a negative finding is never explicitly forbidden.
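Here is a minimal sketch of what that constraint looks like inside a skill's instruction body. The skill, the step number, and the wording are hypothetical; the point is that the omission is forbidden explicitly rather than left to interpretation:

```
<!-- Before: says what to do -->
4. Write a 300-word summary of the review findings.

<!-- After: also says what must not be left out -->
4. Write a 300-word summary of the review findings.
   Always include a negative finding if one exists in the source material.
   If none exists, say so explicitly in one sentence.
```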
Reference files that load but don't anchor. When a reference file contains domain knowledge such as brand voice, formatting rules, or technical specifications, Claude reads it during execution. But as reference files grow token-dense, the model's attention to specific constraints weakens. The instruction step should explicitly invoke key constraints by name, not just say "refer to brand-voice.md." A step like "Apply the tone rules in brand-voice.md, specifically the three rules listed under Banned Phrases" performs better than "follow brand-voice.md" alone. In our builds, we see roughly a 40% drop in constraint adherence when instructions point at a file without naming the specific section.
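A sketch of the two invocation styles side by side, assuming a hypothetical references/brand-voice.md with a Banned Phrases section; the step numbers are illustrative:

```
<!-- Weak anchor: points at the file, leaves prioritization to the model -->
5. Follow brand-voice.md.

<!-- Stronger anchor: names the file, the section, and the action -->
5. Read references/brand-voice.md and apply the three rules listed under
   "Banned Phrases" to every sentence before finalizing the draft.
```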
Output contracts that specify structure without specifying substance. "Output a JSON object with fields: title, summary, tags" produces a JSON object every time. But if you don't specify what the summary field contains and how long it should be, Claude makes that judgment call on every run, and the calls vary across sessions.
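A sketch of the same contract with substance added; the field requirements and the references/tags.md file are assumptions for illustration, not a required schema:

```
Output a JSON object with exactly these fields:
- "title": the document title, verbatim, no rewording
- "summary": 2-3 sentences, 40-80 words, must mention at least one risk or
  limitation when the source contains one
- "tags": 3-5 lowercase strings drawn only from the list in references/tags.md
```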
"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)
How Do You Isolate the Root Cause Using the Minimal-Input Test?
The minimal-input test is the first step in any subtle-output debugging session.
- Create a minimal input that represents the simplest possible case your skill should handle. For a code review skill, use a 10-line function with one obvious issue.
- Run the skill on the minimal input. Check whether the output is correct for that case.
- If the minimal input produces correct output: the problem is input-dependent. Your skill works for simple cases but breaks on complex ones. The issue is instruction scope, not instruction content.
- If the minimal input produces wrong output: the problem is universal. Start with the output contract and work backward to the instruction steps.
We've traced this pattern in commissions where clients report a skill "works about 60% of the time." The failing 40% share a specific input characteristic: longer inputs, multi-part inputs, or inputs with edge-case formatting. The minimal-input test surfaces that distinction in under five minutes.
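One way to make the test repeatable is to keep the minimal case as a fixture next to the skill. The layout below is an assumption about how you might organize it, not a structure Claude Code requires:

```
review-skill/
├── SKILL.md
├── references/
│   └── rubric.md
└── tests/                     (hypothetical fixture folder for manual re-runs)
    ├── minimal-input.md       (10-line function with one obvious issue)
    ├── expected-minimal.md    (what a correct run on the minimal input contains)
    └── production-input.md    (a representative complex input that currently fails)
```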
How Do You Identify Which Reference File Is Causing the Problem?
Reference files load on demand during skill execution. If a file is loading but not contributing correctly, the debugging approach is elimination:
Step 1: Identify which reference files loaded during the failing run. Ask Claude directly in the same session: "Which reference files did you read to complete that last task?" Claude will report what it read, though the self-report isn't always complete.
Step 2: Comment out one reference file instruction at a time. Remove the step that tells Claude to read the file. Re-run the skill.
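What "commenting out" can look like in practice, assuming a hypothetical instruction body with two reference reads; HTML comments are one way to disable a step without deleting it:

```
3. Read references/brand-voice.md and apply the Banned Phrases rules.
<!-- Temporarily disabled for the elimination test:
4. Read references/formatting.md and apply the table layout rules.
-->
5. Write the summary using the output structure defined above.
```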
Step 3: When removing a reference file changes the output, you've found the source. The next question is whether the file's content is wrong or whether the instruction that loads it is wrong.
Two forms of the loading-instruction problem:
- "Read brand-voice.md" tells Claude to read the file. It does. But it doesn't say which parts to prioritize.
- "Read brand-voice.md and apply every rule in the Sentence Structure section before writing" gives Claude an explicit action on the content.
The second form produces consistent output. The first form produces variable output because interpretation is left to the model.
For more on how reference files interact with token economics, see How Does Progressive Disclosure Save Tokens and Improve Performance?.
What Does a Systematic Debugging Session Look Like in Practice?
The four-step protocol we use in AEM debugging commissions:
Step 1: Document the gap precisely. Write down specifically what the output does vs what it should do. "Missing negative findings section" is actionable. "The output is a bit off" is not. If you can't describe the gap precisely, you can't verify when it's fixed.
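A sketch of a gap note with hypothetical contents; what matters is that every line is checkable, not the exact format:

```
Expected:   summary with Findings, Negative Findings, and Recommendation sections
Actual:     Negative Findings section missing in 3 of 5 runs
Fixed when: Negative Findings section present in 5 consecutive runs on the same input
```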
Step 2: Run the minimal-input test. Does the gap appear on a simple case? If yes, go to the instruction layer. If no, go to input-complexity analysis.
Step 3: Test each step in isolation. Create a prompt that triggers only one step at a time. "Pretend you're at step 3 of my review skill. The input is [X]. Only execute step 3 and stop." Check whether step 3 produces the right output in isolation. Repeat for each step that touches the failing behavior.
Step 4: Add a specificity constraint. Once you've found the under-specified instruction, add one concrete constraint. Don't rewrite the step. "The findings section must include at least one item rated 1-5 on the severity rubric in rubric.md." That single constraint eliminates the ambiguity.
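Applied to the running example, the change is one added constraint, not a rewrite. The step text and the references/rubric.md path are illustrative:

```
<!-- Before -->
6. Write the findings section based on the issues identified in step 4.

<!-- After: same step, one constraint added -->
6. Write the findings section based on the issues identified in step 4.
   The findings section must include at least one item rated 1-5 on the
   severity rubric in references/rubric.md.
```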
Complex multi-reference skills take three to four debugging cycles. Beyond four cycles, the issue is usually structural: the skill was built without an output contract.
This pattern works for single-domain skills. For cross-domain orchestration where outputs feed into other skills, add one more step: verifying that the handoff artifact is correctly formed. That's a multi-agent debugging problem, not a single-skill one.
For more on the full diagnostic framework, see How Do You Systematically Diagnose a Skill Failure Across Discovery, Loading, and Execution Layers?.
Frequently Asked Questions
Can I use Claude to help debug its own skill output? Yes. In the same session where the wrong output occurred, ask: "Which instructions from the skill did you follow to produce that output? Which parts did you de-prioritize?" Claude will report what it read and how it weighted the instructions. Combine Claude's self-report with step-isolation testing, because self-reports miss what the model omitted silently.
How do I know if the problem is in my instruction body or in a reference file? Run the skill without reference files by commenting out the read instructions. If the skill produces correct output without reference files, the problem is in a reference file. If it fails even without reference files, the problem is in the instruction body.
My skill produces correct output in short sessions but wrong output in long sessions. Why? Long sessions accumulate context that competes with skill execution. Research by Nelson Liu et al. at Stanford found that models lose track of instructions placed in the middle of long contexts at a rate that makes mid-context instruction placement unreliable for production systems (arXiv:2307.03172, 2023). Re-test in a fresh session with only skill-relevant context. If the fresh session produces correct output, the problem is context bleed, not the skill.
Should I add more rules to fix subtle output problems? Usually not. Adding rules to a complex skill often creates new ambiguities. Instead, add specificity to existing rules. "Always include" outperforms "also include," which outperforms adding a new rule. See At What Point Does Adding More Rules Make a Skill Worse? for the threshold.
What's the fastest way to fix an output contract problem? Add an explicit example. Create a correct-output-example.md file in your skill's assets folder. Add a step: "The output must match the structure in assets/correct-output-example.md. Use it as the template." In our builds, adding a reference output example cuts format deviation by roughly 70%.
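A sketch of the layout and the added step, assuming the assets folder named above; the example file's contents are whatever a verified-correct run actually produced:

```
review-skill/
├── SKILL.md
└── assets/
    └── correct-output-example.md   (a saved output you have verified by hand)

Added step in SKILL.md:
7. The output must match the structure in assets/correct-output-example.md.
   Use it as the template; do not add or remove sections.
```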
How many debugging cycles does a typical subtle output fix take? Two to three cycles for most skills. First cycle identifies the layer (instruction, reference, or output contract). Second cycle applies the fix and verifies on the minimal-input test. Third cycle verifies on production-complexity inputs. Beyond four cycles, the underlying issue is structural.
Last updated: 2026-04-23