What's the Curation Strategy for Maintaining a Public Skill Library at Scale?

TL;DR: The practical curation threshold is 30 active skills. Beyond it, Claude's discovery classifier starts making routing errors as too many descriptions compete for the same queries. Scale-ready curation means quarterly audits, explicit retirement criteria, and a governance model that prevents the library from growing faster than it gets quality-reviewed.


SkillsMP hosts over 700,000 community skills (SkillsMP, 2026). Nobody maintains most of them. That's not a scalability lesson — that's the warning.

If you maintain a public skill library, the moment you stop culling is the moment quality starts compressing toward the mean. Below 30 skills, discovery usually works on its own. Above it, the library starts working against you. This is what active curation looks like in AEM's production skill libraries, and the framework applies to any public Claude Code skill collection.


Why Does a Library Break Down Above 30 Skills?

At around 30 skills, Claude's internal classifier begins producing routing errors. Skill descriptions consume approximately 100 tokens each at startup (Anthropic, Claude Code documentation 2025): at 50 skills, that's 5,000 tokens before a user types a word. The descriptions start bleeding together, and Claude routes ambiguous queries to whichever skill has the most assertive description, not the most relevant one.

At the 30-skill threshold itself, that's 3,000 tokens of description context before any user input. The cumulative skill metadata budget is approximately 16,000 characters total: in a documented case with 63 installed skills, 21 of them (33%) were silently excluded from Claude's system prompt entirely (Anthropic, Claude Code GitHub issue #13099, 2025). In our audits of production skill libraries, the 30-skill boundary is where skill conflicts appear consistently. Below it, routing is clean. Above it, the symptoms are predictable: a skill triggers on inputs it was never designed to handle, or a newer skill silently overrides the one the user expected.
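The arithmetic is easy to sanity-check. Here is a minimal sketch, assuming the ~100-tokens-per-description figure above, a rough 4-characters-per-token ratio, and the ~16,000-character budget; none of this is Anthropic's actual loader logic, just the back-of-the-envelope math:

```python
# Rough estimate of startup overhead from skill descriptions.
# Assumptions (not Anthropic's loader logic):
#   - ~100 tokens per skill description at startup (cited above)
#   - ~4 characters per token (common rough heuristic)
#   - ~16,000-character cumulative metadata budget (cited above)

TOKENS_PER_DESCRIPTION = 100
CHARS_PER_TOKEN = 4
METADATA_BUDGET_CHARS = 16_000

def startup_overhead(num_skills: int) -> None:
    tokens = num_skills * TOKENS_PER_DESCRIPTION
    chars = tokens * CHARS_PER_TOKEN
    status = ("OVER BUDGET: some skills may be silently excluded"
              if chars > METADATA_BUDGET_CHARS else "within budget")
    print(f"{num_skills} skills -> ~{tokens} tokens (~{chars} chars): {status}")

for n in (30, 40, 50, 63):
    startup_overhead(n)
# 30 skills -> ~3000 tokens (~12000 chars): within budget
# 40 skills -> ~4000 tokens (~16000 chars): within budget
# 50 skills -> ~5000 tokens (~20000 chars): OVER BUDGET...
# 63 skills -> ~6300 tokens (~25200 chars): OVER BUDGET...
```

The 63-skill row is the documented exclusion case above: well past the budget, which is why a third of the library vanished from the system prompt.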

Curation is not aesthetic. It is what keeps the discovery layer functional.


What Does a Quarterly Curation Audit Look Like?

A quarterly audit runs three phases across two weeks. Phase 1 collects usage data and flags skills with fewer than 3 invocations in 90 days. Phase 2 runs those flagged skills against fresh test prompts to catch trigger failures. Phase 3 makes a retire, merge, repair, or promote decision for each shortlisted skill. A minimal bookkeeping sketch follows the phase list below.

  • Phase 1: Usage data collection (week 1): Log which skills were invoked in the last 90 days and how many times. Any skill with fewer than 3 invocations in 90 days goes on the retirement shortlist.

  • Phase 2: Quality review (weeks 1-2): For shortlisted skills, run them against a fresh test prompt. If a skill fails its original trigger eval (a prompt that should activate it does not), it needs repair or retirement.

  • Phase 3: Decision and cleanup (week 2): For each shortlisted skill, make one of four calls:

  1. Retire. Remove from the library entirely.
  2. Merge. Fold its functionality into a related, active skill.
  3. Repair. Fix the description or instructions and reset the clock.
  4. Promote. Escalate to higher visibility in library documentation.
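Here is the Phase 1 bookkeeping as a minimal sketch, assuming invocation logs are available as (skill name, timestamp) records; the log shape and the decision labels are illustrative conventions, not part of any Claude Code API:

```python
from collections import Counter
from datetime import datetime, timedelta
from enum import Enum

class Decision(Enum):  # the four Phase 3 calls
    RETIRE = "retire"
    MERGE = "merge"
    REPAIR = "repair"
    PROMOTE = "promote"

INVOCATION_FLOOR = 3             # fewer than 3 uses in 90 days -> shortlist
AUDIT_WINDOW = timedelta(days=90)

def shortlist(invocations: list[tuple[str, datetime]],
              all_skills: set[str],
              now: datetime) -> list[str]:
    """Phase 1: flag skills with fewer than 3 invocations in 90 days.
    Skills never invoked at all also land on the shortlist."""
    recent = Counter(name for name, ts in invocations
                     if now - ts <= AUDIT_WINDOW)
    return sorted(s for s in all_skills
                  if recent.get(s, 0) < INVOCATION_FLOOR)
```

Phases 2 and 3 stay manual. The value of scripting Phase 1 is that the shortlist becomes mechanical, so the quarterly reviewer spends the half day on judgment calls rather than log archaeology.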

The full process takes one person roughly half a day per quarter for a 30-50 skill library. Skipping it once means two quarters' worth of debt next time. For context on usage scale: 51% of professional developers used AI tools daily as of the 2025 Stack Overflow Developer Survey (Stack Overflow, 2025), which means production skill libraries are under continuous invocation pressure, not occasional use.


How Do You Decide Which Skills to Retire?

Three criteria trigger retirement: low invocation rate (fewer than 3 uses in 90 days), functional overlap (two skills with 60% or more shared trigger signals), and outdated instructions (deprecated Claude Code syntax or old API patterns that produce wrong output). Any skill hitting one criterion goes on the shortlist; hitting two means retire, not repair.

  • Low invocation rate: A skill used fewer than 3 times in 90 days is not earning its slot in the token budget. If it is genuinely needed but not used, it is probably undiscoverable, meaning the description needs fixing, not preserving.

  • Functional overlap: Two skills with overlapping trigger conditions create ambiguity for the classifier. Audit for pairs where descriptions contain 60% or more of the same intent signals; one way to approximate that check is sketched after this list. One absorbs the other.

  • Outdated instructions: Skills that reference deprecated Claude Code syntax or old API patterns route correctly but produce wrong output. A skill that worked 12 months ago is not safe to keep if nobody has checked it since.
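One cheap way to approximate the 60% overlap check is keyword-set Jaccard similarity across descriptions. This is a stand-in for whatever semantic comparison your pipeline uses, not a Claude Code feature, and the stopword list is deliberately minimal:

```python
import itertools
import re

STOPWORDS = {"a", "an", "and", "for", "of", "or", "the", "to", "when"}

def intent_signals(description: str) -> set[str]:
    """Crude keyword extraction: lowercased words minus stopwords."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in STOPWORDS}

def conflicting_pairs(descriptions: dict[str, str],
                      threshold: float = 0.6) -> list[tuple[str, str, float]]:
    """Flag every skill pair whose keyword sets overlap at or above
    the threshold (Jaccard similarity)."""
    flagged = []
    for a, b in itertools.combinations(descriptions, 2):
        sa = intent_signals(descriptions[a])
        sb = intent_signals(descriptions[b])
        score = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
        if score >= threshold:
            flagged.append((a, b, round(score, 2)))
    return flagged
```

A keyword Jaccard will miss paraphrased overlap that an embedding comparison would catch, but it is dependency-free and good enough to surface the obvious duplicate pairs during an audit.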

"The failure mode isn't that the model is bad at the task — it's that the task wasn't specified tightly enough. Almost every production failure traces back to an ambiguous instruction." — Simon Willison, creator of Datasette and llm CLI (2024)

Retirement is not failure. A retired skill replaced with a better one is the system working correctly. Keeping a broken skill in the library because removing it feels like admitting defeat is how library quality decays. Aqua Nautilus research on the npm ecosystem found that 21.2% of the top 50,000 packages were effectively deprecated, including archived or unmaintained repos, with users downloading deprecated packages 2.1 billion times weekly (Aqua Nautilus / Aqua Security, January 2024). Skill libraries face the same accumulation dynamic if retirement criteria are not enforced.


What Governance Model Works for Public Libraries?

For public libraries with multiple contributors, three patterns are in active use: gatekeeper (one reviewer approves all submissions), tiered (community, verified, and production tiers with explicit quality criteria), and fork-and-audit (each team maintains its own quality-filtered fork). Most libraries at scale converge on tiered, because it is the only model that handles submission volume without a single point of failure.

  • Gatekeeper model: One maintainer reviews and approves all submissions. Works cleanly under 50 skills. The bottleneck is the maintainer's available time, not quality.

  • Tiered model: Skills are tagged: community (unreviewed), verified (passed a submission checklist), or production (met the production bar). Contributors submit freely; users know what tier they're installing. The production bar must be defined explicitly: trigger eval must pass, output contract must be present, SKILL.md body under 300 lines.

  • Fork-and-audit model: Each team forks the shared library and maintains their own quality-filtered version. The main library functions as a staging ground; forks are the production libraries. This distributes curation work but fragments the install base.

Most public skill libraries at scale use the tiered model. It is the only one that handles submission volume without sacrificing quality signals. SkillsMP applies a minimum of 2 GitHub stars as its entry-level quality filter across 66,500+ listed skills (SkillsMP, January 2026), which illustrates why a tiered model needs explicit bar definitions at each level, not just volume controls.
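A tier bar is only explicit if it is checkable. Here is a minimal sketch of the production-bar check from the list above, assuming each skill lives in a directory with a SKILL.md and an evals/ folder of test prompts; that layout is an assumed convention, and actually running the trigger eval is out of scope for the sketch:

```python
from pathlib import Path

MAX_BODY_LINES = 300  # production-bar limit from the tier list above

def meets_production_bar(skill_dir: Path) -> tuple[bool, list[str]]:
    """Check the three stated production criteria. The directory layout
    (SKILL.md plus an evals/ folder of prompts) is an assumed convention."""
    failures: list[str] = []
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return False, ["missing SKILL.md"]
    text = skill_md.read_text()
    body_lines = len(text.splitlines())
    if body_lines > MAX_BODY_LINES:
        failures.append(f"body is {body_lines} lines (max {MAX_BODY_LINES})")
    if "output contract" not in text.lower():  # naive presence check
        failures.append("no output contract section found")
    evals = skill_dir / "evals"
    if not evals.is_dir() or not any(evals.iterdir()):
        failures.append("no trigger eval prompts found (pass/fail not run here)")
    return not failures, failures
```

The point is not this particular script; it is that each tier's bar reduces to checks a reviewer, or a CI job, can run the same way every quarter.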


What Process Prevents a Library From Bloating?

Four gates block every submission before it enters a public library: a trigger eval with 3 activating and 2 non-activating test prompts, an explicit output contract in SKILL.md, a description length check against the 1,024-character limit, and a semantic conflict scan against existing skills for 60% overlap. All four must pass.

Before any skill is added to a public library, it must clear all four of those gates. In detail, with a minimal check for gates 1 and 3 sketched after the list:

  1. Trigger eval required. Submission must include at least 3 test prompts that activate the skill and 2 that do not.
  2. Output contract required. The SKILL.md must specify what the skill produces and explicitly list what it does not produce.
  3. Description length check. Under 1,024 characters for the full SKILL.md description. Note: Claude Code v2.1.86 introduced a 250-character effective cap in the /skills listing, meaning only the first 250 characters of a description are shown to Claude for routing decisions (Anthropic, Claude Code v2.1.86 changelog, 2025). Front-load trigger keywords.
  4. Conflict scan. Run the new description against existing ones. Any pair with 60% or more semantic overlap returns a warning.
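Gates 2 and 4 can reuse the output-contract presence check and the conflicting_pairs scan sketched earlier. A minimal version of gates 1 and 3 follows, with an invented activate_*/reject_* file naming convention for the eval prompts:

```python
from pathlib import Path

DESCRIPTION_LIMIT = 1024  # full-description cap (gate 3)
ROUTING_WINDOW = 250      # chars shown for routing, per the changelog cited

def check_trigger_eval(skill_dir: Path) -> list[str]:
    """Gate 1. The activate_*/reject_* file naming is an invented
    convention, not a Claude Code requirement."""
    evals = skill_dir / "evals"
    activating = list(evals.glob("activate_*.txt")) if evals.is_dir() else []
    rejecting = list(evals.glob("reject_*.txt")) if evals.is_dir() else []
    problems = []
    if len(activating) < 3:
        problems.append(f"need >=3 activating prompts, found {len(activating)}")
    if len(rejecting) < 2:
        problems.append(f"need >=2 non-activating prompts, found {len(rejecting)}")
    return problems

def check_description(description: str) -> list[str]:
    """Gate 3: the length cap plus a naive front-loading heuristic."""
    problems = []
    if len(description) > DESCRIPTION_LIMIT:
        problems.append(f"{len(description)} chars exceeds {DESCRIPTION_LIMIT}")
    head = description[:ROUTING_WINDOW].lower()
    if not any(kw in head for kw in ("use when", "use for", "triggers on")):
        problems.append("no trigger phrasing in the first 250 characters")
    return problems
```

The front-loading heuristic is crude by design: it only proves that something trigger-shaped appears inside the routing window, which is the part of the description Claude actually sees.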

Without these gates, a public library accumulates prompt-in-a-trenchcoat submissions faster than the curation team can review them.

This pattern works for Git-hosted and marketplace-hosted skill libraries. For enterprise-managed libraries with SSO access controls and compliance requirements, you need additional governance layers that a simple trigger eval won't cover.

For what separates a production-grade skill from a community submission, see What Makes a Community Skill 'Production Ready' vs Just a Prompt in a File? and How Do You Design Skills Generic Enough for Public Distribution but Specific Enough to Be Genuinely Useful?.

For the mechanics of packaging skills before adding them to a library, see How Do I Package a Skill for Distribution to Others?.


FAQ

For libraries under 30 skills, curation overhead is low and discovery stays clean without active management. Above that threshold, the questions below cover the specific decisions that require deliberate answers: ownership, backward compatibility, invocation thresholds, and versioning policy for public distribution.

Can a public skill library have too many skills?

Yes. SkillsMP's 700,000+ skills demonstrate the failure mode at scale: volume with no quality floor. A library with 100 unreviewed skills is harder to use than one with 30 quality-reviewed ones. The token budget constraint is real: every skill occupies approximately 100 tokens at startup, and libraries above the 30-skill threshold start producing routing errors in practice. Claude Code's overall skill metadata budget is capped at 2% of the context window (Anthropic, Claude Code documentation 2025), which is approximately 4,000 tokens, roughly 16,000 characters, on the standard 200K-token context window.

Who should own curation in a community-maintained library?

Curation needs a named owner. A library with five contributors and no designated curator will default to nobody curating. A single rotating quarterly reviewer, even for a small library, prevents accumulation from becoming a structural quality problem.

How do you handle backward compatibility when retiring a skill others depend on?

Use a two-release cycle. Mark the skill deprecated in the current release, adding a deprecation note to the description. Remove it in the next release. Document the replacement. Dependent projects get one release window to migrate, which is the minimum for widely-installed skills.
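A sketch of the release-N half of that cycle, assuming your tooling can rewrite a skill's description string; the note format is invented for illustration:

```python
def mark_deprecated(description: str, replacement: str,
                    current_release: str, removal_release: str) -> str:
    """Release N: prepend a deprecation note naming the replacement.
    The note format is illustrative, not a Claude Code convention."""
    note = (f"DEPRECATED in {current_release}, removal planned for "
            f"{removal_release}; use '{replacement}' instead. ")
    return note + description
```

Prepending matters: the note lands inside the 250-character routing window, so the deprecation is visible where Claude makes routing decisions, not just to humans reading the library docs.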

What's the difference between a low-invocation skill that should be kept vs one that should be removed?

A low-invocation skill with a clean trigger eval and a clear use case that does not overlap with another skill should stay. A skill that fails its trigger eval, has overlapping descriptions, or lacks an output contract should be removed regardless of invocation count.

Should public libraries use semantic versioning the same way as private ones?

Yes, with one difference: the pre-deletion window. Private libraries can delete immediately. Public libraries need a deprecation window before removal, at minimum one release, and two releases for skills with significant install counts.


Last updated: 2026-04-27