Claude Code Skills That Get Better Over Time: Self-Improvement Patterns

Skills degrade the moment the world moves on. Most teams build a skill, ship it, and discover three months later that it still outputs the old format, still ignores the new house style, still treats every input like it was the easy case from the demo. The fix is not to rebuild from scratch. It is to design for improvement from run one.

TL;DR: A Claude Code skill improves through three feedback channels: a learnings.md file for behavioral corrections, an edge-cases.md file for factual exceptions, and an approved-examples folder for quality benchmarks. A closing feedback gate routes observations to the right destination at the end of every run. Skills built with this infrastructure compound in quality across months of real use.

This article covers the full self-improvement infrastructure for Claude Code skill engineering, from the basic three-file setup to advanced patterns. It draws on AEM's production skill engineering work across client commissions, where we track quality metrics across the full lifecycle of each skill. For the evaluation foundation that makes self-improvement measurable, see Evaluation-First Skill Development: Write Tests Before Instructions.


How Do Claude Code Skills Actually Get Better Over Time?

Skills improve when real-world observations feed back into the skill's file structure through three dedicated channels: a learnings.md file for behavioral corrections, an edge-cases.md file for entity-specific exceptions, and an approved-examples folder for quality benchmarks that Claude references on each subsequent run. A feedback gate at the end of every run routes each observation to the right destination.

A skill in Claude Code is a collection of text files. SKILL.md contains the instructions; reference files carry domain knowledge; an approved-examples folder holds quality benchmarks. None of these update themselves. The self-improvement infrastructure is the system that keeps them current based on what actually happens when real users run the skill.

Three channels carry the signal:

  1. Learnings file (learnings.md): behavioral corrections from real runs. "When the input is a numbered list, Claude collapses it to prose by default. Preserve the list format."
  2. Edge-cases file (edge-cases.md): factual exceptions for specific entities, clients, or contexts. "Client Halverson invoices in GBP. Do not convert to USD."
  3. Approved-examples folder (approved-examples/): finished outputs that passed the bar check and serve as quality anchors for future runs.

Without all three, the skill is a fair-weather skill: it handles the cases it was designed for and fails silently on everything the designer did not anticipate. In skill engineering terms: a prompt in a trenchcoat with better metadata.

The infrastructure cost is low. Three files, one additional step in the skill's process. The quality payoff is cumulative.
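As a concrete starting point, the three-file setup can be scaffolded in a few lines. This is a sketch only: the `scaffold_skill` helper is hypothetical (not a Claude Code command), and the file names follow the conventions described above.

```python
from pathlib import Path

def scaffold_skill(skill_dir):
    """Create the three self-improvement channels inside a skill folder.
    Sketch only: existing files are never overwritten."""
    skill = Path(skill_dir)
    (skill / "approved-examples").mkdir(parents=True, exist_ok=True)
    headers = {
        "learnings.md": "# Learnings (behavioral corrections; 100-line hard cap)\n",
        "edge-cases.md": "# Edge cases (factual exceptions; 60-line hard cap)\n",
    }
    for name, header in headers.items():
        target = skill / name
        if not target.exists():
            target.write_text(header)
    return skill
```

Running `scaffold_skill("my-skill")` once leaves the folder ready; everything after that is routing and maintenance.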


What Is the Closing Feedback Gate and Why Is It Mandatory?

The closing feedback gate is the last step of every skill run: a structured three-question prompt that asks whether the output was correct, whether anything was missing, and whether anything extraneous appeared, then routes each observation to the appropriate file before the session closes. Skip this step and you lose every improvement signal from that run, permanently.

It asks three questions before the session closes:

  1. Was the output correct?
  2. Was anything missing that should have been present?
  3. Was anything present that should not have been?

The answers are routed to one of four destinations:

  • Behavioral correction ("Claude does X, should do Y") → learnings.md
  • Factual exception (specific client, entity, or format) → edge-cases.md
  • Structural rule that applies to every run → SKILL.md rules section
  • High-quality output ready to serve as a benchmark → approved-examples/

The routing decision belongs to the skill designer, not Claude. When Claude routes observations without instruction, it defaults to learnings.md for almost everything. That pollutes the behavioral signal with factual exceptions and fills the file past its useful ceiling faster than necessary.

Make the routing decision explicit in the gate step: "If the feedback is specific to one entity or client, route it to edge-cases.md. If it applies every time this input pattern appears, route it to learnings.md. If it is a rule that should always hold, update SKILL.md directly."
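That routing rule is mechanical enough to write down. A minimal sketch, assuming the designer supplies the classification flags (the function and flag names are illustrative, not part of Claude Code):

```python
def route_feedback(observation, entity_specific=False, always_applies=False,
                   is_benchmark=False):
    """Route one feedback-gate observation to its destination file.

    Precedence mirrors the gate instructions above: benchmark outputs
    first, then structural rules, then entity-specific exceptions,
    then the behavioral-correction default.
    """
    if is_benchmark:
        return "approved-examples/"   # quality anchor for future runs
    if always_applies:
        return "SKILL.md"             # structural rule: edit the rules section
    if entity_specific:
        return "edge-cases.md"        # factual exception for one client/entity
    return "learnings.md"             # behavioral correction (default)
```

Note the default: with no flags set, everything lands in learnings.md, which is exactly the pollution pattern described above when routing is left to Claude.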

"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)

The feedback gate is how ambiguous instructions get replaced with precise ones, one observed failure at a time.


How Large Can Learnings and Edge-Cases Files Get?

learnings.md and edge-cases.md each have a hard line cap because reference files that grow without a ceiling produce two measurable failure modes in skill output: contradictions Claude resolves arbitrarily by position, and important patterns buried past the attention zone where the model applies instructions reliably.

learnings.md: 100-line hard cap. Consolidate at 80 lines.

edge-cases.md: 60-line hard cap. Split into sub-files (such as edge-cases-clients.md, edge-cases-formats.md) when it hits 60 lines.
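A simple pre-run check can enforce both caps. This sketch hard-codes the thresholds stated above; the function name is illustrative:

```python
def cap_status(filename, line_count):
    """Classify a reference file against its consolidation and hard caps.
    Thresholds: learnings.md consolidates at 80 with a 100-line hard cap;
    edge-cases.md splits into sub-files at its 60-line cap."""
    caps = {"learnings.md": (80, 100), "edge-cases.md": (60, 60)}
    consolidate_at, hard_cap = caps[filename]
    if line_count >= hard_cap:
        return "over hard cap"      # consolidate or split before the next run
    if line_count >= consolidate_at:
        return "consolidate soon"   # schedule the 20-minute consolidation pass
    return "ok"
```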

These caps are not arbitrary. Nelson Liu et al. at Stanford NLP Group documented that "models placed in the middle of long contexts lose track of instructions at a rate that makes mid-context policy placement unreliable for production systems" (ArXiv 2307.03172, 2023). The same effect applies inside skill reference files. Instructions near line 90 of a 100-line file receive less attention than instructions in the first 30 lines.

Past the caps, two specific failure modes appear. Liu et al. measured a U-shaped attention degradation: near-perfect accuracy when the relevant content sits at the first or last of 20 context positions, falling below 40% when it sits in the middle at position 10. That degradation is the structural cause of both failure modes, because rules buried in the middle of an overgrown file are the least reliably followed.

  • Contradiction accumulation: a learning from week 1 contradicts a learning from week 8. Claude has no mechanism to resolve the conflict, so it picks whichever appears earlier in the file.
  • Signal dilution: when every run produces a minor observation, the file fills with low-signal entries. The genuinely important patterns, the ones that caused real failures, get buried among observations that were logged once and never reinforced.

How Do I Consolidate a Learnings File Without Losing Important Patterns?

Consolidation at the 80-line mark is a 20-minute task that distills 5–10 related entries into one consolidated statement, removes low-signal observations logged only once, and leaves the file holding only the highest-precision behavioral signal the skill's real-world runs have produced so far. Run it before the file exceeds 100 lines.

The process:

  1. Read all entries and group them by theme
  2. Write one consolidated statement that captures the essential signal from 3-5 related entries
  3. Delete the individual entries, keep the consolidated version
  4. Leave intact any entries that describe rare, high-stakes exceptions with no related entries to group with

Entries that appeared once and were not reinforced are safe to delete. If the pattern was real, it will reappear in future runs and get re-added. Keeping unreinforced entries increases contradiction risk without adding precision.
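The triage half of that process is mechanical enough to sketch. Here entries are assumed to be dicts with a theme tag and an optional high_stakes flag (an illustrative shape, not a format the skill files mandate); writing the consolidated statements stays a human step:

```python
from collections import Counter

def triage_learnings(entries):
    """Split learnings into: singletons to keep (rare, high-stakes),
    themes to consolidate (reinforced 2+ times), and implicit deletions
    (unreinforced singletons, which fall out of both buckets)."""
    counts = Counter(e["theme"] for e in entries)
    keep, to_consolidate = [], {}
    for e in entries:
        if counts[e["theme"]] >= 2:
            to_consolidate.setdefault(e["theme"], []).append(e["text"])
        elif e.get("high_stakes"):
            keep.append(e["text"])  # rare but critical: leave intact
        # anything else is dropped; if the pattern is real, it will recur
    return keep, to_consolidate
```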

Anthropic's engineering team documents the same principle for agent context management: compaction, which distills accumulated context into a high-fidelity summary, lets agents "continue with minimal performance degradation" while keeping the active context lean ("Effective context engineering for AI agents," Anthropic Engineering, 2025). The same principle drives learnings file consolidation.

Research on iterative self-refinement confirms the feedback precision principle: Madaan et al. (Self-Refine, NeurIPS 2023, arXiv 2303.17651) found that LLMs refining their own outputs through structured feedback improve by approximately 20% absolute on average across tasks, with improvement magnitude directly tied to the specificity of the feedback provided.

In our builds, a well-managed learnings file after four months of daily use holds 40-60 entries. All specific, none contradicting each other. That is the ceiling the 80-line consolidation rule enforces.


How Do Approved Examples Improve Skill Output Quality?

Approved examples work through distribution anchoring: a finished output in the approved-examples folder shows Claude the target format, length, and tone without the skill specifying every dimension in prose, because one concrete output teaches those dimensions more precisely than any written rule. One strong example teaches more about the expected output than 500 words of formatting rules, and the quality lift compounds as the folder grows across real runs.

"When you give a model an explicit output format with examples, consistency goes from ~60% to over 95% in our benchmarks." — Addy Osmani, Engineering Director, Google Chrome (2024)

The mechanism is distribution anchoring: a 2,000-word client proposal example pins down format, length, and tone at a precision that pages of formatting instructions rarely reach. Cattan et al. ("DoubleDipper," arXiv 2406.13632, 2024) demonstrated that in-context examples drawn from the same domain as the input produce +16 absolute points of improvement on average across QA benchmarks. The same domain-specificity principle applies when approved examples come from the skill's real production runs rather than generic samples.

The approved-examples folder:

  • Lives inside the skill folder at skill-name/approved-examples/
  • Contains real outputs from real runs that passed the full bar check
  • Is referenced in SKILL.md: "See approved-examples/ for the target quality level"
  • Gets updated when a new output clearly raises the bar

Add examples only when they would survive the bar check unchanged. A mediocre example anchors to mediocrity. One strong example in the folder is worth more than eight adequate ones.
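The "bar check first" rule can be enforced at the point of writing. A sketch, with `add_approved_example` as a hypothetical helper and the bar-check verdict supplied by the human reviewer:

```python
from pathlib import Path

def add_approved_example(examples_dir, name, output, passed_bar_check):
    """Write an output into approved-examples/ only if it passed the bar
    check unchanged; a mediocre example would anchor to mediocrity."""
    if not passed_bar_check:
        return None
    path = Path(examples_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(output)
    return path
```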

This pattern works best for skills that produce structured documents: proposals, briefs, reports, content pieces. It is less useful for skills that produce code (where correctness matters more than style) or skills that produce short, variable-length answers.


What Is the Verifier Pattern?

The verifier pattern uses three roles inside a single skill run to catch output failures before the user sees them: a Planner that structures the task, an Executor that produces the output, and a Verifier that evaluates it against defined criteria and triggers a revision if failures are found. No second model is required.

  • Planner — Takes the input and produces a structured plan. "For this proposal, the output needs an executive summary, three solution options, and a pricing table. Tone should match the client's formal register."
  • Executor — Follows the plan to produce the output.
  • Verifier — Evaluates the executor's output against the plan and the skill's quality criteria. Reports any failures. If failures exist, the executor revises with the failure note before the output is returned to the user.

The verifier runs as a final step inside the same context, with no second model call: "Now review your output against the following 4 criteria and report any gaps before proceeding."
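The loop shape itself is small. In this sketch, `produce` and `verify` stand in for the executor and verifier steps of the run (both callables are hypothetical; `verify` returns a list of failure notes, empty when the output passes):

```python
def run_with_verifier(produce, verify, max_revisions=1):
    """Executor/verifier loop in one context: draft, check against
    criteria, revise with the failure notes if anything failed."""
    output = produce(None)                # first draft, no failure notes yet
    for _ in range(max_revisions):
        failures = verify(output)
        if not failures:
            break
        output = produce(failures)        # revise with the failure notes
    return output, verify(output)         # final output plus remaining gaps
```

With `max_revisions=1` the cost ceiling is one extra generation per run, which matches the 20-30% token overhead discussed below.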

In our builds, the verifier pattern reduces first-draft failure rate from 25-30% down to 8-12% on structured document skills. The cost is 20-30% more tokens per run. For high-stakes outputs (client-facing documents, published content, compliance filings), that tradeoff is correct. For simple formatting tasks, it adds cost without proportionate benefit.

"The single biggest predictor of whether an agent works reliably is whether the instructions are written as a closed spec, not an open suggestion." — Boris Cherny, creator of Claude Code, Anthropic (2024)

The verifier forces exactly that: it converts vague quality expectations into a closed spec that runs on every output before the user sees it. The pattern is not the right tool for skills whose failure modes are hard to specify in criteria. If you cannot write the verification criteria, the pattern cannot catch the failures.

For a focused look at how the verifier works step by step, see What Is the Verifier Pattern.


What Is the Auto-Research Pattern?

Auto-research is a scheduled loop that runs the skill on real examples, evaluates output quality against a rubric, and writes proposed improvements back to the skill file for human review. It covers all three levels of the criteria framework, from objective hard rules through to pattern-matching behavioral consistency.

It differs from standard self-improvement (which is human-in-the-loop for every run) in one key way: the evaluation step runs without a human present. This requires a rubric precise enough to replace human judgment at the evaluation layer.

The three-level criteria framework used in our auto-research builds:

  1. Hard rules: objective requirements that must always be met. "Output must include all four required fields." These are checkable without model judgment.
  2. Pattern matching: behavioral consistency. "Output tone must match approved-examples." This requires LLM-as-judge to evaluate reliably.
  3. Deep creative: quality of reasoning or insight. "Does the output contain at least one non-obvious recommendation?" This is the hardest to automate.

Auto-research works reliably at levels 1 and 2. Level 3 improvements require human review before being applied to the skill file.
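Level 1 is the easy case to automate: it needs no model call at all. A minimal sketch, assuming the output has already been parsed into named fields:

```python
def check_hard_rules(output_fields, required_fields):
    """Level-1 check: objective requirements verifiable without model
    judgment. Returns the missing required fields (empty list = pass)."""
    return [f for f in required_fields if f not in output_fields]
```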

Research on automated prompt optimization confirms the pattern: Pryzant et al. ("Automatic Prompt Optimization with 'Gradient Descent' and Beam Search," EMNLP 2023, arXiv 2305.03495) found that automated instruction-editing loops can improve an initial prompt's performance by up to 31% across benchmark NLP tasks, with evaluation quality as the primary constraint on the improvement ceiling.

In commissions where we have run auto-research loops over a four-week cycle, the documented improvement range on the primary quality metric is 9-27%. The ceiling is almost always the quality of the evaluation criteria, not the optimization process itself.

For more on writing rubric criteria with enough precision for automated evaluation, see What Is a Rubric in a Claude Code Skill.


When Should I Use LLM-as-Judge?

LLM-as-judge uses a second model invocation to evaluate skill output against a defined rubric before returning it to the user. It produces an independent quality verdict that the generating model cannot reach about its own output, because the generator lacks external context on what the output was supposed to achieve. Use it when:

  • Output quality depends on judgment, not just structure
  • You have a rubric with concrete score descriptions (not just "good" vs "bad")
  • The cost of a wrong output exceeds the cost of an extra model call

Skip it when:

  • The skill already produces consistent output that passes its evals
  • You cannot write a rubric specific enough to give the judge something to evaluate against
  • Speed matters more than quality for this particular use case

The most common mistake is running LLM-as-judge against vague criteria. "Is this a good output?" produces yes 94% of the time regardless of actual quality. Zheng et al. ("Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS 2023, arXiv 2306.05685) identified position bias, verbosity bias, and self-enhancement bias as the three dominant failure modes when judge criteria are underspecified. Ye et al. (JudgeBench, arXiv 2410.12784, ICLR 2025) found that the strongest LLM judges achieve only 64% accuracy on hard discriminative pairs, dropping to near chance on the most challenging benchmark tasks. Hashemi et al. ("LLM-Rubric," ACL 2024, Microsoft Research) demonstrated that a 9-question multidimensional rubric predicts human quality judgments with RMS error below 0.5, a 2x improvement over uncalibrated single-question prompts. A rubric with four discriminating dimensions and anchored score descriptions closes much of that gap; vague prompts trigger exactly the biases those benchmarks document.
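In practice the fix is to put the anchors into the judge prompt itself. A sketch of the prompt shape, with two illustrative dimensions (the rubric content here is an assumption, not a standard):

```python
RUBRIC = {
    "completeness": {1: "required sections missing",
                     3: "all sections present",
                     5: "all sections present and fully developed"},
    "tone": {1: "register conflicts with approved examples",
             3: "register broadly matches approved examples",
             5: "indistinguishable from approved examples"},
}

def judge_prompt(output):
    """Build an LLM-as-judge prompt with anchored score descriptions
    instead of a vague 'is this good?' question."""
    lines = ["Score the output on each dimension from 1 to 5. Anchors:"]
    for dim, anchors in RUBRIC.items():
        for score in sorted(anchors):
            lines.append(f"  {dim} {score}: {anchors[score]}")
    lines.append("Return one line per dimension: <dimension>: <score>.")
    lines.append("--- OUTPUT ---")
    lines.append(output)
    return "\n".join(lines)
```

The anchors are what give the judge something to discriminate against; without them, every output clusters at the same score.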

"Developers don't adopt AI tools because they're impressive — they adopt them because they reduce friction on tasks they repeat every day." — Marc Bara, AI product consultant (2024)

The same principle applies to LLM-as-judge: a judge that returns unhelpful evaluations will be bypassed. Rubric precision is what makes it worth running.


How Do Self-Improvement Mechanisms Relate to Evaluation?

Self-improvement and evaluation are complements, not alternatives: evaluation checks whether the skill still meets its original specification, while self-improvement updates the specification and behavior based on gaps that real-world runs reveal. A skill needs both: without evals, self-improvement has no regression harness; without self-improvement, the evals pass while the skill fails on cases they never covered.

Evaluation (evals.json, rubrics) checks whether the skill meets its original spec. Self-improvement updates the spec and behavior based on what real-world runs reveal about the original spec's gaps.

A skill with evals but no self-improvement passes its tests and fails on cases the tests did not cover. A skill with self-improvement but no evals accumulates changes without any check on whether they broke something that was already working.

The production workflow:

  1. Write evals before writing instructions (evaluation-first development)
  2. Launch and collect real runs
  3. Route feedback through the feedback gate to the appropriate file
  4. Consolidate learnings.md at 80 lines
  5. Re-run evals to check for regression
  6. Repeat

Evals are a regression harness. Every time learnings are consolidated or SKILL.md is updated, the evals confirm the skill still passes its original criteria. Qu et al. ("Recursive Introspection," NeurIPS 2024, arXiv 2407.18219) demonstrated that iterative self-improvement with structured feedback produces 6-24% accuracy gains across math and reasoning benchmarks, with improvement magnitude directly correlated to feedback precision. Without a regression harness, those gains erode undetected.
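Step 5 of the workflow can be a few lines once the eval cases live in a file. This sketch assumes a minimal evals schema of an input plus must_contain strings (an illustrative shape, not a fixed Claude Code format), with `skill_fn` standing in for a full skill run:

```python
import json

def run_regression(evals_json, skill_fn):
    """Re-run every eval case after a consolidation or SKILL.md edit and
    return the failures; an empty list means no regression detected."""
    failures = []
    for case in json.loads(evals_json)["cases"]:
        output = skill_fn(case["input"])
        missing = [s for s in case["must_contain"] if s not in output]
        if missing:
            failures.append({"input": case["input"], "missing": missing})
    return failures
```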

For a deeper look at setting up the evaluation foundation, see What Are Evals in Claude Code Skills.


What Does Self-Improvement Look Like After 90 Days?

After 90 days of daily use with a feedback gate and maintained reference files, a production skill is substantially different from one without this infrastructure: the learnings file holds 40-60 behavioral corrections from real runs, the approved-examples folder contains 8-12 output benchmarks, and the edge-cases file captures entity-specific exceptions that would otherwise require re-explanation at the start of every session.

The learnings file holds 40-60 specific behavioral corrections from real runs. The approved-examples folder holds 8-12 high-quality benchmarks built from outputs accepted across the first three months. The edge-cases file holds the 15-20 entity-specific exceptions that would otherwise live as tribal knowledge, notes scattered in Notion, or a re-explanation at the start of every session.

The skill handles new variations of familiar inputs without instruction rewrites, because the learnings capture the underlying pattern rather than just the surface example.

This pattern has a limit worth naming: self-improvement can refine a well-designed skill significantly. It cannot fix a fundamentally wrong skill design. A skill with a broken output contract, a misleading description, or a process that requires capabilities Claude does not have will not improve itself out of those problems. Self-improvement is refinement, not rescue.

The underlying demand makes this infrastructure worth building: Stack Overflow's 2024 Developer Survey found that 62% of developers are actively using AI tools in their workflow, up from 44% in 2023. Teams that invest in skill quality infrastructure in 2024-2025 build the compounding advantage that teams treating AI as a demo tool will not have three years from now.


FAQ

Adding the self-improvement infrastructure to an existing skill takes under an hour: three files, a gate step, and a few reference lines in SKILL.md. The questions below cover implementation timing, routing decisions, automation thresholds, and how the feedback loop interacts with evals.

How do I start the self-improvement infrastructure on an existing skill?

Add three things to the skill folder: a learnings.md file, an edge-cases.md file, and an approved-examples/ subfolder. Then add a feedback gate as the final step in the SKILL.md process. The skill's instructions should reference all three: "Consult learnings.md and edge-cases.md before producing output. Consult approved-examples/ for the target quality level."

How long before self-improvement shows measurable results?

In our experience, the first meaningful behavioral shift appears after 10-15 runs with consistent feedback routing. At that point, the learnings file has enough signal to change how Claude handles the three or four most common edge cases it was previously getting wrong. Larger quality improvements take 4-6 weeks of daily use.

Can I automate the feedback gate?

The gate questions can be automated. The routing decision should stay human for the first 60 runs. Once you understand which patterns the skill struggles with, you can write explicit routing rules and automate them. Automating routing before you understand the failure patterns produces a learnings file full of noise routed incorrectly.

What happens if I run self-improvement on a skill without evals?

You can. The risk is regression without detection: a learning added in week 6 breaks something that worked in week 1, and without evals, you do not catch it until a user reports the failure. Evals are the difference between improvement with a safety net and improvement without one.

Does the verifier pattern work with every skill type?

No. The verifier pattern requires verification criteria specific enough to evaluate against. Skills that produce highly variable creative output (short social posts, brainstormed ideas, open-ended analysis) often lack criteria precise enough to make verification work. For structured document skills (proposals, reports, briefs, compliance filings), the criteria are usually clear enough.

How does LLM-as-judge differ from the verifier pattern?

The verifier pattern uses the same model in the same context to check its own output against criteria. LLM-as-judge uses a second model call, in a separate context, to evaluate the output from outside the generation context. LLM-as-judge is more expensive and more independent. The verifier pattern is cheaper and catches self-consistency failures. Use LLM-as-judge when you need independent evaluation; use the verifier pattern when you need the model to check its own work against a defined rubric.

My learnings file has 120 lines and the skill quality has dropped. What do I do?

Prune to the 40-50 most specific and reinforced entries. Delete anything that appeared only once, anything that contradicts another entry, and anything that restates what SKILL.md already says explicitly. Re-run your evals after the prune to check for regression. If the skill still underperforms after pruning, the root cause is likely a SKILL.md instruction that was silently overriding the learnings, not the learnings file length itself.

Should edge-cases.md have a line cap too?

Yes. 60 lines is the practical cap. Past 60 lines, the file becomes a lookup table that Claude cannot reliably scan mid-run. Split it into domain-specific sub-files (edge-cases-clients.md, edge-cases-formats.md) and load each only when the relevant context appears in the input.
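Conditional loading of the sub-files reduces to a keyword-to-file lookup. A sketch, with the trigger keywords and sub-file names as illustrative assumptions:

```python
def select_edge_case_files(input_text, routes):
    """Pick which edge-case sub-files to load for this run based on
    which trigger keywords appear in the input."""
    text = input_text.lower()
    return sorted({path for keyword, path in routes.items() if keyword in text})
```

Inputs that match no trigger load nothing, which keeps the context lean on runs where no edge case applies.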

Last updated: 2026-04-16