skillscore — lint and score AI agent skills (SKILL.md)

skillscore — score your AI agent's SKILL.md 0 to 100 against the Claude, Codex, and Antigravity authoring guides

skillscore statically analyzes an AI agent skill (a SKILL.md manifest and its folder) and produces a 0–100 quality score, a letter grade, and a list of actionable findings. Rules are drawn from the official skill authoring guides from Anthropic (Claude), Google (Antigravity), and OpenAI (Codex). It runs offline, is deterministic, and is CI-friendly.

What is skillscore?

skillscore is a skill linter and SKILL.md validator that scores agent-skill quality. Agent skills are an open standard: a folder with a SKILL.md (YAML frontmatter + Markdown body) plus optional references/, examples/, scripts/, and assets/ subfolders, used by Claude Code, Codex, Antigravity, Gemini CLI, and Cursor. Because an agent keeps every skill's name and description in its context budget permanently, a vague or malformed skill is worse than no skill. skillscore catches those problems before a skill ships.

See it in action

Here's skillscore grading the Flutter team's own flutter-add-widget-test skill — a 90/A — then explaining one of its findings, with the source guide the rule comes from:

[Terminal recording: skillscore scores the Flutter team's flutter-add-widget-test skill 90/100 (grade A) with per-category bars and two findings; skillscore explain then shows the rule's rationale and its Flutter authoring-guide source.]

Quickstart

# Install
dart pub global activate skillscore

# Score a single skill (any name, any location)
skillscore path/to/SKILL.md

# Score every skill in a folder or monorepo
skillscore path/to/skills/

# Pick a target ruleset
skillscore my-skill/ --target claude

# Machine-readable output for CI / dashboards
skillscore my-skill/ --format json

# Gate CI: fail the build if any skill scores below 80
skillscore skills/ --min-score 80

Sample output (trimmed):

csv-to-xlsx  (skills/spreadsheet-skill/SKILL.md)
  Score: 72/100  Grade: C

  A  Frontmatter validity                     15/15  ██████████
  B  Description quality                      12/25  █████░░░░░
  C  Conciseness & token economy            10.5/15  ███████░░░
  D  Structure & progressive disclosure       15/15  ██████████
  E  Instruction quality                       9/20  █████░░░░░
  F  Content hygiene                          10/10  ██████████
  G  Safety & scripts                    no penalty

  WARNING B2_description_when  line 3
          Description has no trigger clause saying when to use the skill.
          fix: Add a trigger clause such as "Use when the user asks to ..."

Commands and flags

skillscore <path>                Score a manifest, a skill folder, or a tree of skills
skillscore rules                 List every rule: id, title, weight, targets, source guide
skillscore explain <rule-id>     Print a rule's rationale, the fix, and its source guide
skillscore --version
skillscore --help

Flag	Values	Default	Purpose
`--target`	`claude` | `antigravity` | `codex` | `universal`	`universal`	Which guide's ruleset to apply
`--format`	`pretty` | `json` | `sarif`	`pretty`	Output format (SARIF 2.1.0 renders in code-review tools)
`--min-score <n>`	0–100	—	Exit non-zero if any skill scores below `n`
`--strict`	—	off	Treat warning-level findings as errors
`--quiet`	—	off	Print only the final score line per skill
`--no-color`	—	off	Disable ANSI colors

Exit codes: 0 when every skill meets the threshold · 1 when a skill scores below --min-score, or when --strict is set and any error- or warning-level finding exists · 2 on a usage error (bad path, unreadable file, invalid flag).
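The exit-code rules above can be sketched as a small decision function. This is an illustrative reading of the description, not skillscore's actual source: `Severity` and `exitCodeFor` are hypothetical names, and usage errors (exit 2) are assumed to be caught earlier, during argument parsing.

```dart
/// Hypothetical severity levels, mirroring the rubric's Severity column.
enum Severity { info, warning, error }

/// Returns 0 or 1 following the exit-code description in this README.
/// Exit code 2 (usage error) is assumed to be handled before scoring starts.
int exitCodeFor({
  required List<int> scores,
  required List<Severity> findings,
  int? minScore,
  bool strict = false,
}) {
  final threshold = minScore;
  // --min-score: fail if any skill scores below the threshold.
  final belowThreshold =
      threshold != null && scores.any((score) => score < threshold);
  // --strict: fail if any finding is warning-level or error-level.
  final strictFailure =
      strict && findings.any((severity) => severity != Severity.info);
  return (belowThreshold || strictFailure) ? 1 : 0;
}

void main() {
  final findings = [Severity.warning];
  print(exitCodeFor(scores: [72, 91], findings: findings, minScore: 80)); // 1
  print(exitCodeFor(scores: [72, 91], findings: findings)); // 0
  print(exitCodeFor(scores: [95], findings: findings, strict: true)); // 1
}
```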

How is the score calculated?

100 points are distributed across categories A–F. Each rule awards full, partial, or zero points; partial-credit formulas are documented in each rule's doc comment and shown by skillscore explain <id>. Category G (safety) is a penalty of up to −15 that applies only when the skill ships scripts or terminal commands. Profiles that exclude a rule (e.g. --target claude excludes the Codex-specific B4) are normalized back to a 0–100 scale, so scores are comparable across targets.

Grades: A 90–100 · B 80–89 · C 70–79 · D 60–69 · F below 60.
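As a rough illustration of the grade bands and of normalization, here is a minimal Dart sketch. `gradeFor` follows the bands listed above. `normalizedScore` is a hypothetical helper: rescaling earned points to the points a profile can award, applying the category G penalty after rescaling, and rounding to a whole number are all assumptions about how the tool might behave, not documented behavior.

```dart
/// Letter grade for a 0-100 score, using the bands listed in this README.
String gradeFor(num score) {
  if (score >= 90) return 'A';
  if (score >= 80) return 'B';
  if (score >= 70) return 'C';
  if (score >= 60) return 'D';
  return 'F';
}

/// Hypothetical normalization (an assumption, not skillscore's actual code):
/// rescale the points earned to the points the active profile can award,
/// subtract any category G penalty, and clamp the result to 0-100.
double normalizedScore({
  required double earned,
  required double available,
  double penalty = 0,
}) {
  if (available <= 0) return 0;
  final scaled = earned * 100 / available;
  return (scaled - penalty).clamp(0, 100).toDouble();
}

void main() {
  // Category totals from the sample output: 15 + 12 + 10.5 + 15 + 9 + 10.
  final sample = normalizedScore(earned: 71.5, available: 100).round();
  print('$sample ${gradeFor(sample)}'); // 72 C

  // A profile without B4 and B5 (4 points each) can award 92 points.
  final narrowed = normalizedScore(earned: 80, available: 92).round();
  print('$narrowed ${gradeFor(narrowed)}'); // 87 B
}
```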

The full rubric

Rule	Title	Pts	Severity	Targets	Source
`A1_frontmatter_present`	YAML frontmatter delimited by `---`	4	error	all	Anthropic
`A2_name_format`	`name` ≤64 chars, lowercase/digits/hyphens	4	error	all	Anthropic
`A3_name_reserved_words`	`name` avoids "anthropic"/"claude"	3	error (claude) / info	all	Anthropic
`A4_description_present`	`description` present, ≤1024 chars	4	error	all	Anthropic
`B1_description_what`	States WHAT (opens with action verb)	6	warning	all	Anthropic
`B2_description_when`	States WHEN ("use when ...")	6	warning	all	Anthropic
`B3_third_person`	Written in third person	5	warning	all	Anthropic
`B4_frontloaded_triggers`	Concrete keywords in first ~60 chars	4	warning	codex, universal	Codex
`B5_boundary_clause`	Has a "do not use" boundary	4	warning (antigravity) / info	antigravity, universal	Antigravity
`C1_body_length`	Body ≤500 lines (linear to 0 at 1000)	6	warning	all	Anthropic
`C2_explainer_bloat`	No definitions of common knowledge	5	warning	all	Anthropic
`C3_excessive_optionality`	No long "or" chains	4	info	all	Anthropic
`D1_progressive_disclosure`	Depth split into references/examples	5	info	all	Anthropic
`D2_one_level_links`	Reference links one level deep	5	warning	all	Anthropic
`D3_reference_toc`	Long reference files have a TOC	5	info	all	Anthropic
`E1_anti_patterns`	States anti-patterns explicitly	6	warning	all	Flutter
`E2_workflow_checklist`	Checklist or numbered workflow	5	warning	all	Anthropic
`E3_feedback_loop`	Validate → fix → repeat loop	5	warning	all	Anthropic
`E4_code_example`	At least one fenced code example	4	warning	all	Anthropic
`F1_time_sensitive`	No date-anchored statements that rot	4	warning	all	Anthropic
`F2_forward_slashes`	Paths use forward slashes only	3	error	all	Anthropic
`F3_consistent_terminology`	No synonym mixing (conservative)	3	info	all	Anthropic
`G1_safety_section`	Scripts/commands need a Safety section	−8	error	antigravity, universal	Antigravity
`G2_script_docs`	Bundled scripts are documented	−7	warning	all	Anthropic

Run skillscore rules for the live table and skillscore explain <rule-id> for any rule's rationale and fix.
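Two rubric rows describe checks concrete enough to sketch: the A2 name format and the C1 partial-credit curve. The Dart below is only an illustration under stated assumptions. The function names are hypothetical, the regular expression is one reading of "lowercase/digits/hyphens", and the real formulas are whatever skillscore explain reports.

```dart
// One reading of A2: 1 to 64 characters, lowercase letters, digits, hyphens.
final RegExp _namePattern = RegExp(r'^[a-z0-9-]{1,64}$');

/// Hypothetical check for the A2 name format.
bool isValidSkillName(String name) => _namePattern.hasMatch(name);

/// Hypothetical C1 curve: full points up to 500 body lines, then a linear
/// fall to zero at 1000 lines, as the rubric row describes.
double bodyLengthPoints(int bodyLines, {double maxPoints = 6}) {
  if (bodyLines <= 500) return maxPoints;
  if (bodyLines >= 1000) return 0;
  return maxPoints * (1000 - bodyLines) / 500;
}

void main() {
  print(isValidSkillName('csv-to-xlsx')); // true
  print(isValidSkillName('CSV to XLSX')); // false
  print(bodyLengthPoints(300)); // 6.0
  print(bodyLengthPoints(750)); // 3.0
  print(bodyLengthPoints(1200)); // 0.0
}
```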

How do I gate CI on skill quality?

# .github/workflows/skills.yml
# Assumes an earlier step has installed the Dart SDK (for example dart-lang/setup-dart).
- name: Lint agent skills
  run: |
    dart pub global activate skillscore
    skillscore skills/ --min-score 80 --no-color

--format json feeds dashboards; --format sarif produces a SARIF 2.1.0 report that can be uploaded to GitHub code scanning so findings annotate pull requests.

FAQ

What is an agent skill? A folder with a SKILL.md manifest (YAML frontmatter + Markdown instructions) that teaches an AI agent a repeatable task. Optional subfolders hold references, examples, scripts, and assets.

Does skillscore work with Claude Code / Codex / Antigravity / Gemini CLI / Cursor? Yes. The SKILL.md format is shared across all of them. Score against one vendor's rules with --target, or use the default universal profile, which a portable skill should pass everywhere.

Is it offline? Completely. skillscore makes no network calls at runtime, analyzes local files only, and is fully deterministic — the same input always produces the same score and finding order.

How do I score every skill in a monorepo? Run skillscore path/to/repo/. It walks the tree, finds every folder containing a SKILL.md (matched case-insensitively), and scores each one in deterministic path order.
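The tree walk described in that answer could look roughly like the following. This is a sketch using dart:io, not skillscore's implementation: `findSkillManifests` is a hypothetical name, and skipping symlinks and sorting plain path strings are assumptions.

```dart
import 'dart:io';

/// Returns the path of every file named SKILL.md (case-insensitive) under
/// [root], sorted so repeated runs list the skills in the same order.
List<String> findSkillManifests(String root) {
  final manifests = <String>[];
  final entities = Directory(root).listSync(recursive: true, followLinks: false);
  for (final entity in entities) {
    if (entity is! File) continue;
    final fileName = entity.uri.pathSegments.last;
    if (fileName.toLowerCase() == 'skill.md') manifests.add(entity.path);
  }
  manifests.sort();
  return manifests;
}

void main() {
  // Build a throwaway tree with two manifests and one unrelated file.
  final root = Directory.systemTemp.createTempSync('skills_demo');
  File('${root.path}/b-skill/SKILL.md').createSync(recursive: true);
  File('${root.path}/a-skill/skill.md').createSync(recursive: true);
  File('${root.path}/a-skill/notes.md').createSync(recursive: true);

  findSkillManifests(root.path).forEach(print);
  root.deleteSync(recursive: true);
}
```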

Does my skill have to be named a certain way? No. skillscore is name-agnostic: the frontmatter name, the folder name, and the file name are all independent, and unusual names (including non-ASCII folder names) are handled — though rule A2 will tell you if the name field itself violates the official format.

What happens with malformed frontmatter? No crash: the relevant A-category errors are reported and every other rule that can still run does, so you always get a score.
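One way a parser can avoid crashing on malformed frontmatter is to treat a missing or unterminated `---` block as "no frontmatter" and still hand the body to the remaining rules. The sketch below illustrates that idea only. `SplitManifest` and `splitManifest` are hypothetical names and this is not how skillscore itself is known to parse manifests.

```dart
/// Raw frontmatter text (null when missing or unterminated) plus the body.
class SplitManifest {
  final String? frontmatter;
  final String body;
  const SplitManifest(this.frontmatter, this.body);
}

/// Splits a manifest on its `---` delimiters without throwing on bad input.
SplitManifest splitManifest(String source) {
  final lines = source.split('\n');
  // No opening delimiter: everything is body, frontmatter rules can report it.
  if (lines.first.trim() != '---') return SplitManifest(null, source);
  // Look for the closing delimiter after the first line.
  final close = lines.indexWhere((line) => line.trim() == '---', 1);
  if (close == -1) return SplitManifest(null, source);
  return SplitManifest(
    lines.sublist(1, close).join('\n'),
    lines.sublist(close + 1).join('\n'),
  );
}

void main() {
  final ok = splitManifest('---\nname: csv-to-xlsx\n---\n# Body');
  print(ok.frontmatter); // name: csv-to-xlsx
  print(ok.body); // # Body

  final broken = splitManifest('---\nname: csv-to-xlsx\n# Body');
  print(broken.frontmatter); // null
}
```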

How does skillscore compare to alternatives?

Vendor skill validators (e.g. quick checks built into agent CLIs) verify only schema validity — name format, description present. skillscore additionally scores quality: discoverability, conciseness, structure, instruction design, hygiene, and safety, with cited sources per rule.
Generic Markdown linters (markdownlint, Vale) check prose style, not skill semantics; they don't know what a frontmatter description must contain for an agent to find the skill.
Asking an LLM to review your skill is non-deterministic and unsuitable for CI gates. skillscore is static, reproducible, and exits with codes designed for pipelines. The two combine well.

Library use

skillscore is also a Dart library:

import 'package:skillscore/skillscore.dart';

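void main() {
  // Parse the manifest from disk.
  final doc = SkillParser().parseFile('my-skill/SKILL.md');
  // Score it against the default universal profile.
  final result = Scorer(RuleRegistry()).score(doc, Target.universal);
  // Print the 0-100 score and the letter grade.
  print('${result.score}/100 ${result.grade}');
}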

Contributing

Adding a rule takes one class and one registration. See CONTRIBUTING.md for the walkthrough and the project's design principles (every rule cites its source guide, deterministic output, offline only, name-agnostic). Use the "Propose a new rule" issue template to suggest one.

License

Apache-2.0. See CHANGELOG.md for release history.