skillscore 0.6.0

Lint and score AI agent skills (SKILL.md) against the official Claude, Codex, and Antigravity authoring guides. Offline, deterministic CLI.

skillscore: lint and score AI agent skills (SKILL.md)

skillscore — score your AI agent's SKILL.md 0 to 100 against the Claude, Codex, and Antigravity authoring guides

skillscore statically analyzes any AI agent skill — a SKILL.md manifest and its folder — and produces a 0–100 quality score, a letter grade, and a list of actionable findings, scored against the official skill authoring guides from Anthropic (Claude), Google (Antigravity), and OpenAI (Codex). Offline, deterministic, CI-friendly.

What is skillscore?

skillscore is a skill linter / SKILL.md validator / agent-skill quality checker / AI skill scorer. Agent skills are an open standard — a folder with a SKILL.md (YAML frontmatter + Markdown body) plus optional references/, examples/, scripts/, and assets/ — used by Claude Code, Codex, Antigravity, Gemini CLI, and Cursor. Because an agent keeps every skill's name and description in its context budget permanently, a vague or malformed skill is worse than no skill. skillscore catches exactly those problems before a skill ships.

See it in action

Score a single skill — here, the Flutter team's own flutter-add-widget-test (90/A) — with a full per-category breakdown and cited findings:

Terminal recording: skillscore scores the Flutter team's flutter-add-widget-test skill 90 out of 100 grade A with per-category bars and two findings, then skillscore explain shows the rule rationale and its Flutter authoring-guide source

Or scan several skills at once and drill into the lowest scorer — here, three skills from addyosmani/agent-skills:

Terminal recording: skillscore scores three agent-skills in one command (spec-driven-development 91/A, test-driven-development 88/B, performance-optimization 77/C), then a second command drills into performance-optimization showing the missing Safety section error and vague description warning

Token budget

Every scorecard now includes the BPE token cost of each skill, split by the two scopes in which agent runtimes load SKILL.md content:

  Tokens  description (permanent)    67 gpt-4   ~74 claude
          full manifest (active)   1474 gpt-4  ~1622 claude

Permanent is the per-prompt cost — the agent loads the description field on every call so it knows which skills exist. Active is the per-invocation cost, paid only when the agent decides to use the skill.

The counts use the cl100k_base BPE vocabulary (exact for GPT-4 and Codex; Claude estimates add 10% overhead).
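
The Claude figures in the scorecards in this README are consistent with taking the cl100k_base count, adding 10%, and rounding up (67 becomes ~74, 1474 becomes ~1622). A minimal Dart sketch of that arithmetic follows. The function name estimateClaudeTokens is hypothetical, and the round-up step is inferred from the sample numbers shown here, not confirmed from the package source.

```dart
/// Estimates a Claude token count from a cl100k_base count by adding 10%
/// and rounding up. Hypothetical helper: the name and the ceiling rounding
/// are assumptions inferred from the README's sample scorecards.
int estimateClaudeTokens(int cl100kTokens) {
  if (cl100kTokens < 0) {
    throw ArgumentError.value(cl100kTokens, 'cl100kTokens', 'must be >= 0');
  }
  // Integer form of ceil(n * 1.10); avoids floating-point rounding surprises.
  return (cl100kTokens * 11 + 9) ~/ 10;
}
```

With this sketch, estimateClaudeTokens(22) gives 25 and estimateClaudeTokens(185) gives 204, which matches the csv-to-xlsx sample output in the Quickstart.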

Tested on all 31 skills from google/skills:

Terminal recording: skillscore scans all 31 Google skills, then drills into the top scorer (95/A, 67-token description) and the lowest scorer (56/F, 142-token description), showing token counts alongside each scorecard

Description token counts across the Google skills repo ranged from 24 to 142, roughly a 6x spread. The 56/F gke-basics skill pays 142 tokens on every prompt just for discovery, while the 95/A agent-platform-tuning-management skill pays 67. Fewer tokens, better score, better skill.

Token counts also appear in --format json under a tokens key, making them available to dashboards and CI pipelines.

API-validated accuracy

The +10% Claude estimate was validated against the official Anthropic count_tokens API across all 31 Google skills:

Terminal recording: skillscore estimate for gke-basics followed by the Anthropic count_tokens API validation across all 31 Google skills, showing mean overhead of +10.2% and median of +10.0%

Metric                                 Value
Skills validated                       31 (all of google/skills)
Mean actual Claude overhead vs cl100k  +10.2%
Median                                 +10.0%
Range                                  +0% to +20% (varies with keyword density)

The heuristic is accurate on average. Individual skills with dense trigger-keyword lists in their descriptions (like gke-basics) run toward +18-20%; clean prose descriptions run toward 0-6%.

Editor integration

Prefer to score inside your IDE? The Skillscore VS Code extension wraps this CLI and adds inline diagnostics, hover tooltips, a sidebar score panel, and a live status-bar indicator — available for VS Code, Antigravity IDE, VSCodium, and Cursor.

Install from the VS Marketplace or Open VSX.


Quickstart

# Install
dart pub global activate skillscore

# Score a single skill (any name, any location)
skillscore path/to/SKILL.md

# Score every skill in a folder or monorepo
skillscore path/to/skills/

# Score multiple specific skills in one command
skillscore skill-a/ skill-b/ skill-c/

# Pick a target ruleset
skillscore my-skill/ --target claude

# Machine-readable output for CI / dashboards
skillscore my-skill/ --format json

# Gate CI: fail the build if any skill scores below 80
skillscore skills/ --min-score 80

Sample output (trimmed):

csv-to-xlsx  (skills/spreadsheet-skill/SKILL.md)
  Score: 72/100  Grade: C

  Tokens  description (permanent)    22 gpt-4   ~25 claude
          full manifest (active)    185 gpt-4  ~204 claude

  A  Frontmatter validity                     15/15  ██████████
  B  Description quality                      12/25  █████░░░░░
  C  Conciseness & token economy            10.5/15  ███████░░░
  D  Structure & progressive disclosure       15/15  ██████████
  E  Instruction quality                       9/20  █████░░░░░
  F  Content hygiene                          10/10  ██████████
  G  Safety & scripts                    no penalty

  WARNING B2_description_when  line 3
          Description has no trigger clause saying when to use the skill.
          fix: Add a trigger clause such as "Use when the user asks to ..."

Commands and flags

skillscore <path> [<path> ...]        Score one or more manifests, folders, or trees
skillscore rules                      List every rule: id, title, weight, targets, source guide
skillscore explain <rule-id>          Print a rule's rationale, the fix, and its source guide
skillscore eval init <path>           Scaffold evals.json from the skill's description
skillscore eval validate <path>       Validate and summarise evals.json
skillscore eval run <path>            Run trigger-rate evals offline (no API key required)
skillscore --version
skillscore --help

Flag             Values                                    Default    Purpose
--target         claude | antigravity | codex | universal  universal  Which guide's ruleset to apply
--format         pretty | json | sarif                     pretty     Output format (SARIF 2.1.0 renders in code-review tools)
--min-score <n>  0–100                                     -          Exit non-zero if any skill scores below n
--strict         -                                         off        Treat warning-level findings as errors
--quiet          -                                         off        Print only the final score line per skill
--no-color       -                                         off        Disable ANSI colors

Exit codes: 0 all skills meet the threshold · 1 a skill is below --min-score, or --strict and any error/warning exists · 2 usage error (bad path, unreadable file, invalid flag).
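
For wrapper scripts written in Dart, the three exit codes above can be turned into readable messages. The sketch below is illustrative only: describeExitCode is a hypothetical helper, not part of the skillscore package, and the message wording is paraphrased from the list above.

```dart
/// Maps a skillscore exit code to the meaning documented in this README.
/// Hypothetical helper for wrapper scripts; not part of the package API.
String describeExitCode(int code) {
  switch (code) {
    case 0:
      return 'ok: all skills meet the threshold';
    case 1:
      return 'failed: a skill is below --min-score, or --strict found an error/warning';
    case 2:
      return 'usage error: bad path, unreadable file, or invalid flag';
    default:
      // Anything else is not documented here, so report it verbatim.
      return 'unknown exit code $code';
  }
}
```

A script could feed it the exitCode of a ProcessResult returned by Process.runSync from dart:io, for example after running skillscore skills/ --min-score 80.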

How is the score calculated?

100 points are distributed across categories A–F. Each rule awards full, partial, or zero points; partial-credit formulas are documented in each rule's doc comment and shown by skillscore explain <id>. Category G (safety) is a penalty of up to −15 that applies only when the skill ships scripts or terminal commands. Profiles that exclude a rule (e.g. --target claude excludes the Codex-specific B4) are normalized back to a 0–100 scale, so scores are comparable across targets.

Grades: A 90–100 · B 80–89 · C 70–79 · D 60–69 · F below 60.
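
The normalization and grade bands described above can be sketched in a few lines of Dart. Both function names are hypothetical. The rescaling is assumed to be simple proportional scaling, since the exact formula is not given here, and the Category G penalty is left out because the text does not say at which step it is applied.

```dart
/// Rescales earned points to a 0-100 scale when a profile excludes rules.
/// Assumption: plain proportional rescaling; the Category G penalty is
/// deliberately omitted from this sketch.
double normalizeScore(double earnedPoints, double availablePoints) {
  if (availablePoints <= 0) {
    throw ArgumentError.value(availablePoints, 'availablePoints', 'must be > 0');
  }
  final scaled = earnedPoints / availablePoints * 100;
  return scaled.clamp(0, 100).toDouble();
}

/// Maps a 0-100 score to the letter grade bands listed above.
String gradeFor(num score) {
  if (score >= 90) return 'A';
  if (score >= 80) return 'B';
  if (score >= 70) return 'C';
  if (score >= 60) return 'D';
  return 'F';
}
```

As an illustration with made-up numbers: a profile that leaves 92 points available and a skill that earns 69 of them gives normalizeScore(69, 92) = 75.0, which gradeFor maps to C.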

The full rubric

Rule                       Title                                     Pts  Severity                      Targets                 Source
A1_frontmatter_present     YAML frontmatter delimited by ---         4    error                         all                     Anthropic
A2_name_format             name ≤64 chars, lowercase/digits/hyphens  4    error                         all                     Anthropic
A3_name_reserved_words     name avoids "anthropic"/"claude"          3    error (claude) / info         all                     Anthropic
A4_description_present     description present, ≤1024 chars          4    error                         all                     Anthropic
B1_description_what        States WHAT (opens with action verb)      6    warning                       all                     Anthropic
B2_description_when        States WHEN ("use when ...")              6    warning                       all                     Anthropic
B3_third_person            Written in third person                   5    warning                       all                     Anthropic
B4_frontloaded_triggers    Concrete keywords in first ~60 chars      4    warning                       codex, universal        Codex
B5_boundary_clause         Has a "do not use" boundary               4    warning (antigravity) / info  antigravity, universal  Antigravity
C1_body_length             Body ≤500 lines (linear to 0 at 1000)     6    warning                       all                     Anthropic
C2_explainer_bloat         No definitions of common knowledge        5    warning                       all                     Anthropic
C3_excessive_optionality   No long "or" chains                       4    info                          all                     Anthropic
D1_progressive_disclosure  Depth split into references/examples      5    info                          all                     Anthropic
D2_one_level_links         Reference links one level deep            5    warning                       all                     Anthropic
D3_reference_toc           Long reference files have a TOC           5    info                          all                     Anthropic
E1_anti_patterns           States anti-patterns explicitly           6    warning                       all                     Flutter
E2_workflow_checklist      Checklist or numbered workflow            5    warning                       all                     Anthropic
E3_feedback_loop           Validate → fix → repeat loop              5    warning                       all                     Anthropic
E4_code_example            At least one fenced code example          4    warning                       all                     Anthropic
F1_time_sensitive          No date-anchored statements that rot      4    warning                       all                     Anthropic
F2_forward_slashes         Paths use forward slashes only            3    error                         all                     Anthropic
F3_consistent_terminology  No synonym mixing (conservative)          3    info                          all                     Anthropic
G1_safety_section          Scripts/commands need a Safety section    −8   error                         antigravity, universal  Antigravity
G2_script_docs             Bundled scripts are documented            −7   warning                       all                     Anthropic

Run skillscore rules for the live table and skillscore explain <rule-id> for any rule's rationale and fix.
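
As an example of partial credit, the C1_body_length row ("Body ≤500 lines (linear to 0 at 1000)", 6 points) can be read as a piecewise-linear function. The Dart sketch below encodes that reading. The function name is hypothetical and the interpolation is an interpretation of the one-line summary in the table; skillscore explain C1_body_length remains the reference for the formula the tool actually uses.

```dart
/// Points for rule C1_body_length under one reading of the rubric row:
/// full credit up to 500 body lines, then a linear fall to 0 at 1000 lines.
/// Hypothetical helper; the interpolation is an assumption, not the
/// package's confirmed formula.
double c1BodyLengthPoints(int bodyLines, {double maxPoints = 6}) {
  if (bodyLines <= 500) return maxPoints;
  if (bodyLines >= 1000) return 0;
  // Between 500 and 1000 lines, credit shrinks in proportion to the overshoot.
  return maxPoints * (1000 - bodyLines) / 500;
}
```

Under this reading a 750-line body earns 3 of the 6 points, and anything at or beyond 1000 lines earns none.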

Eval harness

Static linting tells you a skill is well-formed. The eval harness tells you whether queries actually route to it correctly — the thing that matters once a skill is deployed. Three subcommands, one workflow, zero API keys:

# 1. Scaffold 20 queries from the skill's description
skillscore eval init my-skill/

# 2. Review and extend the generated queries
cat my-skill/evals.json

# 3. Run the eval — fully offline, no API key, no cost
skillscore eval run my-skill/

eval init reads the skill's description frontmatter and derives 20 queries — 10 trigger (a request the skill should handle) and 10 non-trigger (a request it should not). Every query is a real English sentence grounded in the skill's trigger and boundary clauses; the file is runnable immediately and easy to extend with project-specific queries.

Terminal: skillscore eval init pdf-form-filler/ — Created pdf-form-filler/evals.json, 20 queries scaffolded

eval validate parses evals.json, verifies it contains both trigger and non-trigger queries, and prints a structured summary of the test suite.

Terminal: skillscore eval validate pdf-form-filler/ — shows skill name, 10 trigger + 10 non-trigger, runs/query, threshold, 60 total checks

eval run executes 20 queries × 3 runs = 60 checks locally, streams live progress, then prints a per-query PASS/FAIL table. A trigger query passes when the heuristic scores it as triggered in at least 50% of runs; a non-trigger query passes when it stays below that threshold. FAILs on non-trigger queries show exactly which phrasings over-reach the skill's intended scope — and which boundary clause to tighten.

Terminal: skillscore eval run pdf-form-filler/ — live progress, then per-query PASS/FAIL table, 20 passed 0 failed

--format json on eval run emits a machine-readable result for dashboards and CI pipelines. The runs_per_query and trigger_threshold fields in evals.json can be tuned per-project.

How the scoring algorithm works

eval run uses a local heuristic — no model call, no network, no API key. It scores each query by matching content words against three semantic regions extracted from the skill's description field:

Region          Source                                        Role
Trigger terms   "Use when …" clause, scaffold words stripped  What the skill is activated by
Boundary terms  "Do not use …" clause                         What the skill explicitly excludes
What terms      First sentence of the description             The skill's primary capability

All text is lowercased, split on non-alphanumeric characters, stop-word filtered (the, a, use, when, user, asks, …), and suffix-stemmed (-ing, -tion, -ed, -es, -s) before any comparison.
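
The normalization steps just listed can be sketched in Dart as follows. This is an approximation, not the package's implementation: the stop-word set contains only the words named above (the full list is elided in this README), the minimum-stem-length guard is an assumption added so short words are not stripped to nothing, and the split pattern treats anything outside ASCII letters and digits as a separator.

```dart
// Only the stop words named in the README; the real list is longer.
const stopWords = {'the', 'a', 'use', 'when', 'user', 'asks'};

// Suffixes in the order the README lists them.
const suffixes = ['ing', 'tion', 'ed', 'es', 's'];

/// Strips the first matching suffix. The length guard (keep at least three
/// characters of stem) is an assumption of this sketch.
String stem(String word) {
  for (final suffix in suffixes) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.substring(0, word.length - suffix.length);
    }
  }
  return word;
}

/// Lowercases, splits on non-alphanumerics, drops stop words, then stems.
List<String> contentWords(String text) => text
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((word) => word.isNotEmpty && !stopWords.contains(word))
    .map(stem)
    .toList();
```

For instance, contentWords('Filling scanned PDFs') yields fill, scann, pdf. The stems are crude, but they only need to be consistent between the description and the query.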

The decision path for each query:

Flowchart: the HeuristicEvalClient scoring algorithm — meta-query check, clause term extraction, boundary exclusivity filter, content-word match count, wave noise, and final threshold comparison

Boundary exclusivity. A boundary term only penalises a query when it does not also appear in the trigger or what context. This prevents shared nouns (e.g. "pdf" in "Do not use for scanned PDFs") from falsely blocking trigger queries that legitimately mention the same noun.
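
In set terms, the exclusivity filter keeps only the boundary terms that appear in neither of the other two regions. A minimal Dart sketch, assuming the three regions have already been reduced to sets of stemmed terms (the function and parameter names are hypothetical):

```dart
/// Returns the boundary terms that may penalise a query: those that do not
/// also appear among the trigger or "what" terms.
Set<String> exclusiveBoundaryTerms(
  Set<String> boundaryTerms,
  Set<String> triggerTerms,
  Set<String> whatTerms,
) {
  return boundaryTerms.difference(triggerTerms.union(whatTerms));
}
```

With boundary terms {scann, pdf}, trigger terms {fill, pdf, form}, and what terms {fill, pdf}, only scann survives, so a trigger query that mentions a PDF is not blocked.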

Wave noise. A small deterministic offset, ((i % 7 − 3) / 3.5) × 0.07, cycles between −0.06 and +0.06 across successive calls (the 0.07 factor is scaled by at most 3/3.5). With three runs per query, a borderline query may trigger on two runs and not on one, modelling the natural stochasticity of a real model across repeated calls.
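
The offset formula is small enough to write out directly. In the Dart sketch below, waveNoise is a hypothetical name and callIndex is assumed to be a non-negative call counter; how the offset is combined with the match score is not shown here.

```dart
/// Deterministic offset for the i-th call, using the formula quoted above.
/// `callIndex` is assumed to be a non-negative call counter.
double waveNoise(int callIndex) => ((callIndex % 7 - 3) / 3.5) * 0.07;
```

The sequence repeats every seven calls: it starts near -0.06 at index 0, passes through 0 at index 3, and peaks near +0.06 at index 6, since 3 / 3.5 × 0.07 = 0.06.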

What PASS/FAIL means. A trigger query passes when triggered count ≥ trigger_threshold × runs_per_query (default: 2 of 3). A non-trigger query passes when it stays below that fraction. The heuristic measures textual alignment with the skill's declared intent — not actual model routing. Use it to catch obvious description problems early in the authoring loop.
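
The PASS/FAIL rule can be expressed as a single comparison. The Dart sketch below uses hypothetical names; the defaults of 3 runs and a 0.5 threshold mirror the "2 of 3" default described above.

```dart
/// Decides PASS/FAIL for one query from its triggered-run count.
/// Defaults (3 runs, 0.5 threshold) mirror the values described above.
bool queryPasses({
  required bool shouldTrigger,
  required int triggeredRuns,
  int runsPerQuery = 3,
  double triggerThreshold = 0.5,
}) {
  final reachedThreshold = triggeredRuns >= triggerThreshold * runsPerQuery;
  // Trigger queries must reach the threshold; non-trigger queries must not.
  return shouldTrigger ? reachedThreshold : !reachedThreshold;
}
```

With the defaults, the cutoff is 1.5 runs, so a trigger query needs 2 of 3 triggered runs to pass, and a non-trigger query passes with 0 or 1.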

How do I gate CI on skill quality?

# .github/workflows/skills.yml
- name: Lint agent skills
  run: |
    dart pub global activate skillscore
    skillscore skills/ --min-score 80 --no-color

--format json feeds dashboards; --format sarif produces a SARIF report that can be uploaded to GitHub code scanning so findings annotate pull requests.

FAQ

What is an agent skill? A folder with a SKILL.md manifest (YAML frontmatter + Markdown instructions) that teaches an AI agent a repeatable task. Optional subfolders hold references, examples, scripts, and assets.

Does skillscore work with Claude Code / Codex / Antigravity / Gemini CLI / Cursor? Yes. The SKILL.md format is shared across all of them. Score against one vendor's rules with --target, or use the default universal profile, which a portable skill should pass everywhere.

Is it offline? Completely. skillscore makes no network calls — both the linter and the eval harness run on local files only. The eval heuristic is deterministic given the same description and query set.

How do I score every skill in a monorepo? skillscore path/to/repo/ — it walks the tree, finds every folder with a SKILL.md (case-insensitive), and scores each one, deterministically ordered by path.

How do I score a specific set of skills in one command? Pass each path as a separate argument: skillscore skill-a/ skill-b/ skill-c/. You get a combined report with a summary line showing the count, average, and lowest score. Duplicate paths are silently deduplicated, so when arguments overlap (e.g. a tree root and one of its children) each skill is still scored only once. If one path is invalid, the rest still score and the bad path is reported as a warning.

Does my skill have to be named a certain way? No. skillscore is name-agnostic: the frontmatter name, the folder name, and the file name are all independent, and unusual names (including non-ASCII folder names) are handled — though rule A2 will tell you if the name field itself violates the official format.

What happens with malformed frontmatter? No crash: the relevant A-category errors are reported and every other rule that can still run does, so you always get a score.

How does skillscore compare to alternatives?

  • Vendor skill validators (e.g. quick checks built into agent CLIs) verify only schema validity — name format, description present. skillscore additionally scores quality: discoverability, conciseness, structure, instruction design, hygiene, and safety, with cited sources per rule.
  • Generic Markdown linters (markdownlint, Vale) check prose style, not skill semantics; they don't know what a frontmatter description must contain for an agent to find the skill.
  • Asking an LLM to review your skill is non-deterministic and unsuitable for CI gates. skillscore is static, reproducible, and exits with codes designed for pipelines. The two combine well.

Library use

skillscore is also a Dart library:

import 'package:skillscore/skillscore.dart';

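void main() {
  // Parse the manifest (frontmatter and body) from disk.
  final doc = SkillParser().parseFile('my-skill/SKILL.md');
  // Score it with the registered rules under the universal profile.
  final result = Scorer(RuleRegistry()).score(doc, Target.universal);
  // Print the 0-100 score and the letter grade.
  print('${result.score}/100 ${result.grade}');
}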

Contributing

New rules are one class + one registration — see CONTRIBUTING.md for the walkthrough and the project's design principles (every rule cites its source guide, deterministic output, offline only, name-agnostic). Use the "Propose a new rule" issue template to suggest one.

License

Apache-2.0. See CHANGELOG.md for release history.

Publisher: sayedalialkamel.com (verified publisher)

Dependencies

args, glob, path, tiktoken_tokenizer_gpt4o_o1, yaml