How Do I Collect Feedback on My Claude Code Skill's Performance?
Add a closing feedback gate as the last step of your Claude Code skill's process. The gate asks three questions, routes the answers to one of four destinations, and closes the session. Three minutes of friction at the end of each run produces a self-improving skill. Skipping it produces a skill that stays identical forever. This is the mechanism AEM installs in every production skill we ship: a structured feedback gate that feeds a self-improvement loop, turning each run into a data point rather than a one-time event. Anthropic's internal research (2024) found that employees who use Claude in 59% of their work report a +50% productivity gain. That compounding only happens when the skill itself improves between uses, not when it stays static.
TL;DR: The feedback gate is a mandatory final step that asks: was the output correct, was anything missing, was anything unwanted. Answers route to learnings.md (behavioral corrections), edge-cases.md (entity-specific facts), SKILL.md (universal rules), or approved-examples/ (quality benchmarks). Without it, every failure mode requires a manual debug cycle.
The simplest gate in the world is a few lines at the end of your SKILL.md:
```markdown
## Final Step: Feedback Gate
Ask the user: "Was the output correct? Anything missing? Anything that shouldn't have been there?"
If yes to any question, ask for specifics and route the observation to the appropriate file:
- Behavioral pattern (applies to all runs) → learnings.md
- Client/entity-specific fact → edge-cases.md
```
That is the minimum. Everything below builds on it.
What Three Questions Should Every Feedback Gate Ask?
Every feedback gate needs exactly three questions: was the output correct (catches commission failures where the skill produced the wrong thing), was anything missing (catches omissions where a required element was never generated), and was anything present that should not have been (catches unwanted additions the user had to delete). Each maps to a distinct failure mode in the spec.
- "Was the output correct?" — This catches commission failures: the skill produced something that does not meet the basic requirements. Output that is structurally correct but factually wrong, or right format but wrong content.
- "Was anything missing?" — This catches omissions: the skill did not include something the user needed. The most common gap between a skill's first version and a production-ready version is that the designer did not anticipate which information is always required.
- "Was anything present that should not have been?" — This catches additions: the skill included content that the user had to delete, or took actions the user did not authorize.
A diagnostic review of 23 Claude Code skill files found that 61% had structural issues (missing trigger conditions, ambiguous instructions, or conflicting rules) that silently degraded performance without producing visible errors (DEV Community, 2024). The three gate questions surface these failures systematically, because each one maps to a distinct failure category in the spec.
"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)
Each gate question targets a different type of ambiguity in the spec. Together they cover the full surface area of how a skill can fail.
How Do I Route Feedback to the Right Destination?
Routing is the most important decision in the feedback gate, and the one Claude gets wrong without explicit instructions: behavioral corrections belong in learnings.md, entity-specific facts belong in edge-cases.md, universal rules belong in SKILL.md, and high-quality outputs belong in approved-examples/. Get it wrong and the learnings file fills with factual exceptions, or the edge-cases file fills with behavioral patterns that load inconsistently.
The routing table:
| Observation type | Example | Destination |
|---|---|---|
| Behavioral correction | "Claude collapses numbered lists to prose" | learnings.md |
| Factual entity exception | "Client ABC invoices in EUR, not GBP" | edge-cases.md |
| Rule that always applies | "Always include a summary section at the top" | SKILL.md (edit directly) |
| High-quality output | Run that produced exactly the right output | approved-examples/ |
The routing decision belongs to the human, not Claude. When Claude routes without explicit instructions, it defaults to learnings.md for almost everything. That pollution happens fast: within two weeks of daily use without correct routing, the learnings file contains 20-30 entries that belong in edge-cases, and the edge-cases file is empty. Research on LLM context use found that models lose track of instructions placed mid-context at rates that make mid-context policy placement unreliable for production systems (Liu et al., Stanford NLP Group, "Lost in the Middle," arXiv 2307.03172, 2023). This is why routing rules need to be explicit instructions at the top of SKILL.md, not buried in learnings.
In our builds, we add an explicit routing question to the gate: "Is this observation about one specific client or entity, or does it apply across all inputs?" If one specific entity, it goes to edge-cases. If all inputs, it goes to learnings. If it should always apply regardless of input, it goes directly into SKILL.md.
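Under that scheme, the routing table reduces to a lookup keyed on the human's answer to the scope question. A minimal sketch (Python; the scope labels are illustrative, and the human supplies them, not the model):

```python
# Hypothetical routing helper encoding the table above. The scope answer
# comes from the human gate question ("entity", "all-inputs", or
# "always"); this function only encodes the table, it does not decide.
def route(observation: str, scope: str) -> str:
    destinations = {
        "entity": "edge-cases.md",     # client/entity-specific fact
        "all-inputs": "learnings.md",  # behavioral pattern across runs
        "always": "SKILL.md",          # universal rule, edit directly
    }
    if scope not in destinations:
        raise ValueError(f"unknown scope: {scope}")
    return destinations[scope]
```

Keeping the destination logic this mechanical is deliberate: the only judgment call left is the scope answer, which stays with the human.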
When Should Feedback Update SKILL.md Directly Instead of Going to Learnings?
Update SKILL.md directly when the observation describes a rule that should apply to every run, every time, without exception: if a new colleague would treat it as a standing instruction rather than a situational note, it belongs in SKILL.md, not in the learnings file where it risks being overlooked or consolidated away.
Signs an observation belongs in SKILL.md rather than learnings:
- It would apply to all users of the skill, not just the current user's context
- It would survive every future consolidation pass without being pruned
- It would not be rendered obsolete by a different input type
Signs an observation belongs in learnings:
- It applies when a specific input pattern appears, not universally
- It corrects a behavior that is right most of the time but wrong for this pattern
- It refines a general instruction rather than replacing it
The practical threshold: if you would tell a new colleague "this is a rule, not an exception," update SKILL.md. If you would tell them "watch out for this specific case," put it in learnings.
Iterative specification improvement compounds. Research on self-refining LLM outputs showed that iterative feedback loops improve task performance by ~20% on average compared to single-pass generation (Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback," NeurIPS 2023). That effect only applies when the feedback is routed correctly and the spec is updated, not when observations sit unprocessed in a growing learnings file.
For more on what belongs in SKILL.md versus reference files, see What Goes in a SKILL.md File.
Can the Feedback Collection Process Be Automated?
Partially: the three gate questions can run automatically at the end of every skill execution and the raw answers can be captured without human involvement, but the routing decision (which destination each observation belongs in) must stay human for the first 60 runs until you have mapped the real failure patterns your skill produces.
What can be automated:
- Triggering the three questions at the end of every run
- Appending the user's answer verbatim to a draft observations file
- Flagging when a run received a "no" answer to the first gate question
What should not be automated without routing rules:
- Writing new entries to learnings.md or edge-cases.md
- Editing SKILL.md directly based on gate answers
- Adding outputs to approved-examples without human review
Once you understand the failure patterns that actually appear in real use, you can write explicit routing rules and automate the routing step. For example: "If the feedback includes a client name or entity-specific fact, route to edge-cases.md. Otherwise, route to learnings.md." Those rules encode the routing logic you built up manually during the first 60 runs.
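A rule of that shape might look like the following sketch. `KNOWN_ENTITIES` is an assumption: a set of client/entity names you collected during the manual routing period, not something the skill ships with.

```python
# Hypothetical automated routing rule: entity-specific feedback goes to
# edge-cases.md, everything else to learnings.md. KNOWN_ENTITIES is an
# assumed list built during the first 60 manually routed runs.
KNOWN_ENTITIES = {"Client ABC", "Client XYZ"}

def auto_route(feedback: str) -> str:
    """Apply the rule: route to edge-cases.md if a known entity
    is named, otherwise default to learnings.md."""
    if any(name in feedback for name in KNOWN_ENTITIES):
        return "edge-cases.md"
    return "learnings.md"

def draft_entry(feedback: str) -> str:
    """Format the verbatim observation as a draft bullet, tagged with
    its proposed destination for human review before write-back."""
    return f"- {feedback} (proposed: {auto_route(feedback)})"
```

Note that even here the write-back can stay gated: `draft_entry` proposes a destination for review rather than editing the files directly.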
Automating the write-back before you understand the failure patterns produces a learnings file full of incorrect routing. The first 60 runs are the period where you learn what the skill actually struggles with. A 2025 survey of 306 AI practitioners found that 62% plan to invest in improved observability as their top priority, and that weak observability is the most common pain point in production AI systems (Cleanlab, "AI Agents in Production 2025"). The feedback gate is the skill-level equivalent: it makes the skill's behavior observable before anything gets routed automatically.
What Feedback Is NOT Worth Collecting?
Three categories of feedback waste space and reduce signal quality in the learnings file: subjective preferences with no actionable detail, one-off anomalies from inputs that cannot recur, and observations that duplicate instructions already in SKILL.md. Each adds file length without giving Claude any instruction it can act on in the next run.
- Subjective preferences with no actionable specificity — "The output could have been better" with no further detail. This cannot be written as a direct instruction. Ask the follow-up question until the observation becomes specific: "What specifically should have been different?"
- One-off anomalies that cannot recur — If the input was structurally malformed in a way that will never appear again (a corrupted file, a test run with random data), the observation is not generalizable. Log it in a separate debugging file, not in learnings.
- Feedback that duplicates what SKILL.md already says — If the skill's instructions already cover the case and Claude simply failed to follow them in that run, the fix is not to add a learnings entry. It is to check whether the SKILL.md instruction is clear enough, or whether the skill needs testing with a fresh context window (Claude A bias). Adding a learnings entry for a failure that is already covered creates noise and extra file length without fixing the root cause. Google DeepMind's OPRO research found that iterative prompt optimization improved task accuracy by up to 50% on structured benchmarks over human-written baselines (Yang et al., "Large Language Models as Optimizers," ICLR 2024). That improvement requires making the spec more specific, not adding redundant notes about failures the spec already handles.
How Do I Know If the Gate Is Actually Working?
Two signals confirm the feedback gate is functioning: the learnings file grows at 2-4 entries per week during the first month of real use, and the skill starts handling input variations it was not explicitly designed for, because earlier learnings entries captured the underlying patterns.
A learnings file that stays at zero after two weeks of daily use means the gate is not triggering, or the user is skipping it. The compounding signal is subtler: not just "the skill got the client name right" but "the skill handled a format we never tested because a learnings entry captured the underlying pattern."
In our builds, we check both signals at the two-week mark. If the learnings file has fewer than 6 entries after 14 days of daily use, we investigate whether the gate step is in the SKILL.md process, whether it triggers at the end of every run (not just failures), and whether the user understands the routing distinction. The LangChain State of Agent Engineering survey (2025) found that 89% of organizations now implement some form of observability for AI agents, and that detailed step tracing is the single most common quality investment. Production teams have learned the same lesson: without measurement, failure is invisible.
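The two-week check can be mechanized. A minimal sketch, assuming learnings entries are `- ` bullets (the entry format and the 6-entry threshold come from the text above; the function names are illustrative):

```python
# Hypothetical two-week health check on the learnings file.
def count_entries(learnings_text: str) -> int:
    """Count '- ' bullet entries, the assumed entry format."""
    return sum(1 for line in learnings_text.splitlines()
               if line.strip().startswith("- "))

def gate_status(learnings_text: str, days_in_use: int) -> str:
    """Flag a silent gate: fewer than 6 entries after 14 days of
    daily use means the gate is not triggering or is being skipped."""
    if days_in_use >= 14 and count_entries(learnings_text) < 6:
        return "investigate"
    return "ok"
```

This only automates the first signal (growth rate); the compounding signal, handling untested input variations, still needs human judgment.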
For a complete picture of how the learnings file and edge-cases file work together, see Can Claude Code Skills Get Better Over Time and the full self-improvement architecture in Claude Code Skills That Get Better Over Time.
Frequently Asked Questions About Skill Feedback Gates
Does the feedback gate need to run after every single run, or only when something goes wrong?
Every run. Not just failures. Runs that go perfectly are the source of approved-examples entries. Runs that almost went perfectly but had one minor issue produce the most valuable learnings entries, because the gap between "correct" and "almost correct" is exactly what the learnings file is designed to capture.
What if the user never gives feedback?
Design the gate so that completing the session requires engaging with it. Put it as the literal final step: the skill does not consider itself done until the gate has been passed. Some teams make this explicit: the skill returns "Gate open: [questions]. Respond with answers or type 'skip' to close." Even a 'skip' response tells you the run was accepted without issues.
Should I phrase the gate questions differently for different skill types?
Yes. The three core questions stay the same, but the framing should match the skill domain. A coding skill gate asks about correctness in terms of test outcomes. A content skill gate asks about tone, accuracy, and completeness. A data-processing skill gate asks about coverage and formatting. Keep the logic identical; adjust the vocabulary to the task.
Can I use the gate to collect feedback from users who are not the skill designer?
Yes, and this is where the gate becomes especially valuable. A skill designer tests their own work (Claude A bias). A different user running the skill for the first time surfaces failure modes the designer never encountered because their mental model of the task differs. Gate feedback from users who are not the designer is higher-signal than feedback from the designer.
What do I do with gate feedback from a user who misunderstood the skill's purpose?
Log it in a separate file from the learnings. This feedback is more useful for improving the skill's description and trigger conditions than for improving its behavior. If multiple users misunderstand the skill's purpose in the same way, the description field needs updating, not the learnings file.
How do I handle gate feedback that contradicts a previous learnings entry?
The newer entry wins, with one exception: if the older entry has been reinforced multiple times and the newer entry appears only once, investigate before updating. A new contradiction is either a genuine change in requirements (update both SKILL.md and learnings) or noise from a malformed input (log it separately and do not add to learnings).
Last updated: 2026-04-16