Can Claude Code Skills Get Better Over Time?

Yes. Claude Code skills built on the AEM skill engineering framework get better over time, but only when you build the improvement infrastructure from the start. Left to default, a skill performs identically on day 1 and day 90. With three files and one extra step in the skill process, it compounds in quality across months of real use.

TL;DR: A Claude Code skill improves through a learnings file (behavioral corrections), an edge-cases file (factual exceptions), and an approved-examples folder (quality benchmarks). A closing feedback gate routes new observations to the right destination after every run. This infrastructure adds less than one hour of setup time. Without it, every failure mode demands a manual fix cycle.

Most skills in the wild are static. They handle the cases they were designed for and silently fail on everything the designer did not anticipate. You notice at the third client complaint, or the fifth time you manually correct the same output format before sending. Those repeated corrections are data. The self-improvement system turns them into lasting changes. Research from the 2024 AI Index published by Stanford HAI found that 78% of organizations reported using AI in at least one business function in 2024, up from 55% in 2023, meaning the gap between teams with structured skill infrastructure and those without is widening fast (Stanford HAI, 2024 AI Index Report).


What Three Things Make a Claude Code Skill Get Better Over Time?

Three files form the self-improvement infrastructure: a learnings file that captures behavioral corrections, an edge-cases file that holds factual exceptions for specific entities, and an approved-examples folder that anchors output quality to real, verified outputs from production runs, with all three read by Claude at the start of every skill run. Each file serves a distinct function and updates at a different frequency.

  1. Learnings file (learnings.md): behavioral corrections from real runs. Each entry records a pattern the skill got wrong and what the correct behavior should be. "When the input contains a numbered list, Claude collapses it to prose. Preserve the list format." This file lives inside the skill folder and gets read every time the skill runs.
  2. Edge-cases file (edge-cases.md): factual exceptions for specific entities, clients, or formats. "Client Halverson invoices in GBP. Never convert to USD." These are not behavioral patterns, they are facts about specific entities the skill needs to handle correctly. They live separately from learnings because they have a different update frequency and a different pruning policy.
  3. Approved-examples folder (approved-examples/): real outputs from real runs that passed the full quality bar check. Claude reads these examples before producing output and anchors to their format, length, and tone. One strong example anchors quality better than 500 words of formatting instructions. Addy Osmani, Engineering Director at Google Chrome, measured this directly: giving a model an explicit output format with examples moves consistency from roughly 60% to over 95% (Addy Osmani, Google Chrome, 2024).

"Developers don't adopt AI tools because they're impressive — they adopt them because they reduce friction on tasks they repeat every day." — Marc Bara, AI product consultant (2024)

A skill that reduces friction the first time is useful. A skill that reduces friction on the first run and compounds that reduction over 90 days is a production asset.


What Is the Feedback Gate That Powers Self-Improvement?

The feedback gate is the last step of every skill run: a structured three-question checkpoint that routes observations from the current session into the right persistent file before the session closes and its context evaporates. Each imperfect run becomes a durable correction rather than a forgotten note. The gate asks three questions:

  1. Was the output correct?
  2. Was anything missing?
  3. Was anything present that should not have been?

The answers get routed to the right file. Behavioral corrections go to learnings.md. Factual exceptions go to edge-cases.md. Structural rules that should always apply get written directly into SKILL.md. High-quality outputs get saved to approved-examples/.
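That routing rule can be sketched as a small dispatcher. The category names below are illustrative assumptions, not canonical names from the framework:

```python
from pathlib import Path

# Map each feedback category to its destination file.
# Category names here are illustrative, not canonical.
ROUTES = {
    "behavioral": "learnings.md",   # pattern the skill got wrong
    "factual": "edge-cases.md",     # exception for a specific entity
    "structural": "SKILL.md",       # rule that should always apply
}

def route_feedback(skill_dir: str, category: str, note: str) -> Path:
    """Append one observation to the file its category maps to."""
    dest = Path(skill_dir) / ROUTES[category]
    with dest.open("a") as f:
        f.write(f"- {note}\n")
    return dest
```

An approved output is the one case this sketch omits: it gets copied into approved-examples/ as a whole file rather than appended as a line.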

Without the gate, observations stay in the session and evaporate. With the gate, every run that produces an imperfect output also produces an improvement to the skill that prevents the same failure next time. McKinsey research from 2023 found that developers waste over 30% of their working time on repetitive tasks that could be systematized (McKinsey, "Yes, You Can Measure Software Developer Productivity," 2023). The feedback gate is how a skill stops generating that waste.

For a detailed look at setting up the feedback gate, see How Do I Collect Feedback on My Skill's Performance.


How Much Better Does a Skill Actually Get?

For skills running daily with a maintained feedback gate, the quality shift is measurable within two weeks: the three or four most common failure modes are corrected by week four, and by month three the skill handles input variations the original design never anticipated. Frequency and consistency determine the pace; the direction is always forward.

In our builds, that trajectory holds because the learnings file captures underlying patterns rather than one-off fixes: a single entry about list handling, for example, carries over to input types the skill was never tested on.

The compounding is not linear. The first 10 learnings fix the obvious failures. The next 20 fix the edge cases. The next 20 refine the output quality on already-correct cases. Each layer builds on the previous one. The 2024 State of Developer Productivity report by Cortex found that 58% of respondents lose more than 5 hours per developer per week to unproductive work, with maintenance and bug fixes cited as a top drain by 26% of teams (Cortex, 2024).

The practical ceiling for a well-managed learnings file is about 80 lines. Past 80 lines, the file needs consolidation: grouping related entries, writing one consolidated pattern per theme, and deleting the individual entries. Past 100 lines without consolidation, the file starts to work against the skill rather than for it. Instructions in the middle of a 120-line file receive less attention from Claude than instructions in the first 30 lines (Nelson Liu et al., Stanford NLP Group, arXiv 2307.03172, 2023).
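The line-count check is easy to automate. A minimal sketch, assuming the 80- and 100-line thresholds described above:

```python
from pathlib import Path

SOFT_LIMIT = 80    # consolidate once the file passes this
HARD_LIMIT = 100   # file actively works against the skill past this

def learnings_status(path: str) -> str:
    """Classify a learnings file by line count against the thresholds."""
    lines = Path(path).read_text().count("\n")
    if lines > HARD_LIMIT:
        return "overdue: consolidate now"
    if lines > SOFT_LIMIT:
        return "due: group related entries and merge"
    return "ok"
```

A check like this can run as part of the feedback gate itself, so the consolidation reminder arrives exactly when the file crosses the threshold rather than weeks later.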

For a complete look at learnings file mechanics, see What Is a Learnings File in a Skill.


What Kinds of Improvements Does This System Produce?

Three types of improvement accumulate through the self-improvement infrastructure: format corrections that align output structure to what users actually want, content corrections that fill gaps the original design never anticipated, and quality improvements driven by an approved-examples benchmark that rises each time a better output is added. Each type compounds on the previous layer rather than replacing it.

  • Format improvements: Claude learns to preserve list structures, use the correct heading hierarchy, match the expected output length, and follow the document structure the user actually wants rather than the structure the original instructions implied.
  • Content improvements: Claude learns which information to include by default, which to ask for when missing, and which to exclude even when the input suggests it. These corrections come from real failure modes, not anticipated ones.
  • Quality improvements: The approved-examples folder raises the bar over time. As better outputs get added to the folder, the benchmark shifts upward. Claude's output distribution follows the examples.

This system works for skills that run repeatedly on similar inputs: content creation, proposal writing, code review, client communication, documentation. It works less well for skills that run once on completely unique inputs each time, where there is no pattern to learn from. A controlled study on AI-assisted coding found that developers completed a standardized task 55.8% faster with GitHub Copilot (Peng et al., arXiv 2302.06590, 2023). Repeating-task skills are where that kind of assistance compounds.


What Does a Skill Without Self-Improvement Cost You?

A static skill costs you the repeated correction cycle: every failure mode requires a human to open SKILL.md, diagnose the problem, write a fix, re-test, and deploy the updated skill, with no persistent record kept and no protection for the next user who hits the same failure. That manual process runs 20-45 minutes per failure mode and compounds across every user who hits the same unrecorded edge case after you.

In our experience, a skill used daily in real work surfaces one new failure mode per week during the first month. Four failure modes, four manual fix cycles: two hours of debugging that should have been 10 minutes of routed feedback.

The larger cost is tribal knowledge loss. Every correction you do not record in the learnings file lives only in the session. The next person to use the skill starts from the same failure modes. A skill that has been running for six months with a maintained learnings file gives every user the benefit of six months of corrections from the first run. A March 2023 survey found that 43% of enterprise developers spend 10-25% of their working time debugging issues in production applications (DevOps.com, 2023). A maintained learnings file routes that time into improvement instead.


FAQ

Do I need to set up self-improvement before I launch a skill?

Set up the three files before the first real use, not before testing. During development and testing, you are making changes to SKILL.md directly. The self-improvement infrastructure handles corrections that appear during real production use, not design-phase changes. Add it before the first user runs the skill for real work.

Can I add self-improvement to a skill I already built?

Yes. Add learnings.md, edge-cases.md, and the approved-examples/ folder to the existing skill folder. Add a feedback gate as the last step in the SKILL.md process. Update the SKILL.md instructions to reference the three files: "Consult learnings.md and edge-cases.md before producing output. See approved-examples/ for the target quality level." The skill starts improving from the first run after the setup.
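The retrofit can be verified with a quick check that SKILL.md actually mentions the new files. A sketch, with the reference list as an assumption about what your SKILL.md should name:

```python
from pathlib import Path

# Files a retrofitted SKILL.md should reference (illustrative list).
REQUIRED_REFERENCES = ["learnings.md", "edge-cases.md", "approved-examples"]

def missing_references(skill_md_path: str) -> list[str]:
    """Return the persistence files SKILL.md never mentions."""
    text = Path(skill_md_path).read_text()
    return [ref for ref in REQUIRED_REFERENCES if ref not in text]
```

An empty return value means the instructions reference all three destinations; anything else names the gap to fix before the first post-retrofit run.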

What if the skill only runs once per month?

Self-improvement still works, but the improvement cycle is slower. With one run per month and one feedback gate per run, the learnings file accumulates 12 entries per year. That is enough to fix 4-6 persistent failure modes. The system is worth running even at low frequency.

Does self-improvement work for skills that produce code?

Partially. Behavioral corrections (learnings.md) work well: "When the input is a TypeScript file, use type annotations in all function signatures." Edge cases work well: "The Payments module uses a non-standard error format. See edge-cases.md." Approved examples work less well, because code correctness is easier to specify in tests than in examples. For code-producing skills, evals.json is a stronger quality mechanism than the approved-examples folder.

How do I know if the self-improvement system is actually working?

Two signals. First, the learnings file grows: more entries means more observations are being captured. Second, the skill handles cases you did not explicitly design for: it correctly formats an input type you never tested because a learnings entry captured the underlying pattern. If the skill is not handling new cases better after 30 days of use, check whether the feedback gate is actually running at the end of each session and whether the SKILL.md instructions actually reference the learnings file.
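One cheap proxy for the first check, a sketch rather than part of the framework, is the modification time of learnings.md: if the skill has been running but the file has not changed in a month, the gate is probably not firing:

```python
import time
from pathlib import Path

def gate_looks_stalled(learnings_path: str, days: int = 30) -> bool:
    """True if the learnings file has not been modified in `days` days."""
    age = time.time() - Path(learnings_path).stat().st_mtime
    return age > days * 86400
```

A stalled file is only a symptom; the fix is confirming the feedback gate step is still the last instruction in SKILL.md.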

Can I run self-improvement on multiple skills at the same time?

Yes. Each skill has its own learnings file, edge-cases file, and approved-examples folder. The infrastructure is per-skill, not shared. Running it across multiple skills simultaneously adds no complexity: each skill improves independently from its own feedback data.

What is the biggest mistake people make with learnings files?

Not consolidating at 80 lines. The file grows past 100 lines without maintenance, contradictions accumulate, and the file starts to confuse Claude rather than guide it. Set a calendar reminder to check the line count every four weeks if the skill runs daily. Consolidation takes 20 minutes and keeps the file effective for another two months.

Last updated: 2026-04-16