skillscore — lint and score AI agent skills (SKILL.md)



skillscore statically analyzes an AI agent skill (a SKILL.md manifest and its folder) and produces a 0–100 quality score, a letter grade, and a list of actionable findings. Rules are scored against the official skill authoring guides from Anthropic (Claude), Google (Antigravity), and OpenAI (Codex). It runs offline, is deterministic, and is CI-friendly.

What is skillscore?

skillscore is a linter, validator, and quality scorer for AI agent skills. Agent skills are an open standard: a folder with a SKILL.md (YAML frontmatter + Markdown body) plus optional references/, examples/, scripts/, and assets/ subfolders, used by Claude Code, Codex, Antigravity, Gemini CLI, and Cursor. Because an agent keeps every skill's name and description in its context permanently, a vague or malformed skill spends that context budget on every request and is worse than no skill. skillscore catches those problems before a skill ships.

See it in action

Here's skillscore grading the Flutter team's own flutter-add-widget-test skill — a 90/A — then explaining one of its findings, with the source guide the rule comes from:

[Terminal recording: skillscore scores the Flutter team's flutter-add-widget-test skill 90/100 (grade A) with per-category bars and two findings, then skillscore explain shows the rule's rationale and its Flutter authoring-guide source.]

Quickstart

# Install
dart pub global activate skillscore

# Score a single skill (any name, any location)
skillscore path/to/SKILL.md

# Score every skill in a folder or monorepo
skillscore path/to/skills/

# Pick a target ruleset
skillscore my-skill/ --target claude

# Machine-readable output for CI / dashboards
skillscore my-skill/ --format json

# Gate CI: fail the build if any skill scores below 80
skillscore skills/ --min-score 80

Sample output (trimmed):

csv-to-xlsx  (skills/spreadsheet-skill/SKILL.md)
  Score: 72/100  Grade: C

  A  Frontmatter validity                     15/15  ██████████
  B  Description quality                      12/25  █████░░░░░
  C  Conciseness & token economy            10.5/15  ███████░░░
  D  Structure & progressive disclosure       15/15  ██████████
  E  Instruction quality                       9/20  █████░░░░░
  F  Content hygiene                          10/10  ██████████
  G  Safety & scripts                    no penalty

  WARNING B2_description_when  line 3
          Description has no trigger clause saying when to use the skill.
          fix: Add a trigger clause such as "Use when the user asks to ..."

Commands and flags

skillscore <path>                Score a manifest, a skill folder, or a tree of skills
skillscore rules                 List every rule: id, title, weight, targets, source guide
skillscore explain <rule-id>     Print a rule's rationale, the fix, and its source guide
skillscore --version
skillscore --help
| Flag | Values | Default | Purpose |
| --- | --- | --- | --- |
| --target | claude, antigravity, codex, universal | universal | Which guide's ruleset to apply |
| --format | pretty, json, sarif | pretty | Output format (SARIF 2.1.0 renders in code-review tools) |
| --min-score <n> | 0–100 |  | Exit non-zero if any skill scores below n |
| --strict |  | off | Treat warning-level findings as errors |
| --quiet |  | off | Print only the final score line per skill |
| --no-color |  | off | Disable ANSI colors |

Exit codes: 0 all skills meet the threshold · 1 a skill is below --min-score, or --strict is set and any error- or warning-level finding exists · 2 usage error (bad path, unreadable file, invalid flag).
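
The exit-code contract can be sketched as a small Dart function. This is an illustration, not skillscore's source: `exitCodeFor` and its parameters are hypothetical names, and usage errors (exit code 2) are assumed to be caught earlier, while arguments and paths are validated.

```dart
/// Hypothetical sketch of the exit-code rules described above.
/// Usage errors (exit code 2) are assumed to be handled before scoring.
int exitCodeFor({
  required List<int> scores,
  int? minScore,
  bool strict = false,
  bool hasErrorOrWarning = false,
}) {
  // Copy to a final local so the null check still holds inside the closure.
  final threshold = minScore;
  final belowThreshold =
      threshold != null && scores.any((score) => score < threshold);
  if (belowThreshold) return 1;
  if (strict && hasErrorOrWarning) return 1;
  return 0;
}

void main() {
  // One skill scores below the threshold of 80, so the run fails.
  print(exitCodeFor(scores: [92, 72], minScore: 80)); // 1
  // No threshold and no --strict: findings alone do not fail the run.
  print(exitCodeFor(scores: [92, 72], hasErrorOrWarning: true)); // 0
  // --strict turns any error or warning into a failure.
  print(exitCodeFor(scores: [92], strict: true, hasErrorOrWarning: true)); // 1
}
```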

How is the score calculated?

100 points are distributed across categories A–F. Each rule awards full, partial, or zero points; partial-credit formulas are documented in each rule's doc comment and shown by skillscore explain <id>. Category G (safety) is a penalty of up to −15 that applies only when the skill ships scripts or terminal commands. Profiles that exclude a rule (e.g. --target claude excludes the Codex-specific B4) are normalized back to a 0–100 scale, so scores are comparable across targets.

Grades: A 90–100 · B 80–89 · C 70–79 · D 60–69 · F below 60.
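
A minimal Dart sketch of the normalization and grade mapping, for illustration only. The names `normalizedScore` and `gradeFor` are hypothetical and not part of the skillscore API. The category G penalty is left out, and whether skillscore rounds before assigning a grade is not stated here, so the sketch grades the unrounded value.

```dart
/// Rescales points earned under a profile to a 0–100 scale.
double normalizedScore(double earned, double available) {
  if (available <= 0) {
    throw ArgumentError.value(available, 'available', 'must be positive');
  }
  return earned / available * 100;
}

/// Maps a 0–100 score to the letter grades listed above.
String gradeFor(double score) {
  if (score >= 90) return 'A';
  if (score >= 80) return 'B';
  if (score >= 70) return 'C';
  if (score >= 60) return 'D';
  return 'F';
}

void main() {
  // Per the rubric table, the claude profile omits B4 and B5 (4 points each),
  // leaving 92 of the 100 base points available.
  final score = normalizedScore(69, 92);
  print('${score.toStringAsFixed(1)} ${gradeFor(score)}'); // 75.0 C
}
```

Dividing by the points actually available is what keeps a score under one target comparable with a score under another.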

The full rubric

| Rule | Title | Pts | Severity | Targets | Source |
| --- | --- | --- | --- | --- | --- |
| A1_frontmatter_present | YAML frontmatter delimited by --- | 4 | error | all | Anthropic |
| A2_name_format | name ≤64 chars, lowercase/digits/hyphens | 4 | error | all | Anthropic |
| A3_name_reserved_words | name avoids "anthropic"/"claude" | 3 | error (claude) / info | all | Anthropic |
| A4_description_present | description present, ≤1024 chars | 4 | error | all | Anthropic |
| B1_description_what | States WHAT (opens with action verb) | 6 | warning | all | Anthropic |
| B2_description_when | States WHEN ("use when ...") | 6 | warning | all | Anthropic |
| B3_third_person | Written in third person | 5 | warning | all | Anthropic |
| B4_frontloaded_triggers | Concrete keywords in first ~60 chars | 4 | warning | codex, universal | Codex |
| B5_boundary_clause | Has a "do not use" boundary | 4 | warning (antigravity) / info | antigravity, universal | Antigravity |
| C1_body_length | Body ≤500 lines (linear to 0 at 1000) | 6 | warning | all | Anthropic |
| C2_explainer_bloat | No definitions of common knowledge | 5 | warning | all | Anthropic |
| C3_excessive_optionality | No long "or" chains | 4 | info | all | Anthropic |
| D1_progressive_disclosure | Depth split into references/examples | 5 | info | all | Anthropic |
| D2_one_level_links | Reference links one level deep | 5 | warning | all | Anthropic |
| D3_reference_toc | Long reference files have a TOC | 5 | info | all | Anthropic |
| E1_anti_patterns | States anti-patterns explicitly | 6 | warning | all | Flutter |
| E2_workflow_checklist | Checklist or numbered workflow | 5 | warning | all | Anthropic |
| E3_feedback_loop | Validate → fix → repeat loop | 5 | warning | all | Anthropic |
| E4_code_example | At least one fenced code example | 4 | warning | all | Anthropic |
| F1_time_sensitive | No date-anchored statements that rot | 4 | warning | all | Anthropic |
| F2_forward_slashes | Paths use forward slashes only | 3 | error | all | Anthropic |
| F3_consistent_terminology | No synonym mixing (conservative) | 3 | info | all | Anthropic |
| G1_safety_section | Scripts/commands need a Safety section | −8 | error | antigravity, universal | Antigravity |
| G2_script_docs | Bundled scripts are documented | −7 | warning | all | Anthropic |

Run skillscore rules for the live table and skillscore explain <rule-id> for any rule's rationale and fix.
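
Two rubric rows are specific enough to sketch in Dart: the A2 name format and the C1 partial-credit curve. Both functions are hypothetical illustrations built only from the table's wording, not skillscore's actual rule classes. The sketch assumes an empty name fails A2 and that C1's 6 points fall linearly between 500 and 1000 body lines.

```dart
/// Hypothetical check for rule A2: the name is at most 64 characters and
/// uses only lowercase letters, digits, and hyphens. Assumes empty fails.
bool isValidSkillName(String name) =>
    RegExp(r'^[a-z0-9-]{1,64}$').hasMatch(name);

/// Hypothetical partial credit for rule C1: full points up to 500 body
/// lines, falling linearly to zero at 1000 lines.
double bodyLengthPoints(int bodyLines, {double maxPoints = 6}) {
  const fullCreditLimit = 500;
  const zeroCreditAt = 1000;
  if (bodyLines <= fullCreditLimit) return maxPoints;
  if (bodyLines >= zeroCreditAt) return 0;
  return maxPoints *
      (zeroCreditAt - bodyLines) /
      (zeroCreditAt - fullCreditLimit);
}

void main() {
  print(isValidSkillName('csv-to-xlsx')); // true
  print(isValidSkillName('CSV_to_XLSX')); // false
  print(bodyLengthPoints(400).toStringAsFixed(1)); // 6.0
  print(bodyLengthPoints(750).toStringAsFixed(1)); // 3.0
  print(bodyLengthPoints(1200).toStringAsFixed(1)); // 0.0
}
```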

How do I gate CI on skill quality?

# .github/workflows/skills.yml
- name: Lint agent skills
  run: |
    dart pub global activate skillscore
    skillscore skills/ --min-score 80 --no-color

--format json feeds dashboards; --format sarif uploads to GitHub code scanning so findings annotate pull requests.

FAQ

What is an agent skill? A folder with a SKILL.md manifest (YAML frontmatter + Markdown instructions) that teaches an AI agent a repeatable task. Optional subfolders hold references, examples, scripts, and assets.

Does skillscore work with Claude Code / Codex / Antigravity / Gemini CLI / Cursor? Yes. All of them share the SKILL.md format. Use --target to score against one vendor's rules, or keep the default universal profile, which a skill meant to be portable across agents should pass.

Is it offline? Completely. skillscore makes no network calls at runtime, analyzes local files only, and is fully deterministic — the same input always produces the same score and finding order.

How do I score every skill in a monorepo? skillscore path/to/repo/ — it walks the tree, finds every folder with a SKILL.md (case-insensitive), and scores each one, deterministically ordered by path.
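
The tree walk can be approximated with dart:io. This is a sketch of the described behavior, not skillscore's implementation: `findManifests` is a hypothetical helper, and sorting the path strings is an assumption about what "ordered by path" means.

```dart
import 'dart:io';

/// Hypothetical helper: returns the paths of all SKILL.md manifests under
/// [root], matching the file name case-insensitively, sorted by path string.
List<String> findManifests(Directory root) {
  final manifests = root
      .listSync(recursive: true, followLinks: false)
      .whereType<File>()
      .where((file) => file.uri.pathSegments.last.toLowerCase() == 'skill.md')
      .map((file) => file.path)
      .toList()
    ..sort();
  return manifests;
}

void main() {
  // Build a throwaway tree with two skills and one unrelated file.
  final root = Directory.systemTemp.createTempSync('skills_demo');
  File('${root.path}/b-skill/skill.md').createSync(recursive: true);
  File('${root.path}/a-skill/SKILL.md').createSync(recursive: true);
  File('${root.path}/a-skill/references/notes.md').createSync(recursive: true);

  // Print each manifest path relative to the temporary root.
  for (final path in findManifests(root)) {
    print(path.substring(root.path.length + 1));
  }
  root.deleteSync(recursive: true);
}
```

Sorting after collection keeps the order independent of how the file system happens to list directory entries.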

Does my skill have to be named a certain way? No. skillscore is name-agnostic: the frontmatter name, the folder name, and the file name are all independent, and unusual names (including non-ASCII folder names) are handled — though rule A2 will tell you if the name field itself violates the official format.

What happens with malformed frontmatter? No crash: the relevant A-category errors are reported and every other rule that can still run does, so you always get a score.

How does skillscore compare to alternatives?

  • Vendor skill validators (e.g. quick checks built into agent CLIs) verify only schema validity — name format, description present. skillscore additionally scores quality: discoverability, conciseness, structure, instruction design, hygiene, and safety, with cited sources per rule.
  • Generic Markdown linters (markdownlint, Vale) check prose style, not skill semantics; they don't know what a frontmatter description must contain for an agent to find the skill.
  • Asking an LLM to review your skill is non-deterministic and unsuitable for CI gates. skillscore is static, reproducible, and exits with codes designed for pipelines. The two combine well.

Library use

skillscore is also a Dart library:

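import 'package:skillscore/skillscore.dart';

void main() {
  // Parse the manifest (frontmatter and body) from disk.
  final doc = SkillParser().parseFile('my-skill/SKILL.md');
  // Score it with the registered rules under the universal profile.
  final result = Scorer(RuleRegistry()).score(doc, Target.universal);
  print('${result.score}/100 ${result.grade}');
}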

Contributing

New rules are one class + one registration — see CONTRIBUTING.md for the walkthrough and the project's design principles (every rule cites its source guide, deterministic output, offline only, name-agnostic). Use the "Propose a new rule" issue template to suggest one.

License

Apache-2.0. See CHANGELOG.md for release history.
