eval library
Eval subsystem for dart_agent_core.
See Anthropic — Demystifying evals for AI agents
for the underlying methodology, and doc/eval-guide.zh-CN.md for the
usage guide.
This entry point is intentionally separate from dart_agent_core.dart:
applications that don't need eval primitives should not pay the
import cost. Backward compatibility is preserved.
Classes
- AgentHarnessFactory
- Anthropic: an agent harness is the system that lets a model act as an agent. The framework does not assume any specific implementation — applications provide their own factory.
- AgentHarnessSession
- One trial worth of agent execution. Always created via the factory.
- Assertion
- One specific check inside a Score. Holds enough detail that a human can understand why it passed or failed without re-running the trial.
- BrokenTaskCandidate
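- A task that fails almost every time across multiple runs. This usually points to a bug in the task definition or grader configuration rather than a genuine agent limitation (Anthropic Step 2 calls this out explicitly).
- CalibrationConfig
- Default configuration: a gap greater than 0.15 counts as a "disagreement", and the 20 largest deviations are reported.
- CalibrationReport
- Correlation between LLM judge scores and human scores.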
- ClassificationMetrics
- Precision/Recall/F1 for binary classification tasks (e.g. Memory Agent).
- CodeGrader
- Convenience base for deterministic code-based graders.
- CompositeTraceExporter
- Fans out events to multiple exporters. Order is preserved within each exporter; cross-exporter order is not guaranteed.
- EvalClock
- Time source abstraction. Production agents call clock.now() instead of DateTime.now() so the eval runner can lock time per trial.
- EvalContext
- Per-trial context provided by an EvalEnvironment.
- EvalEnvironment
- Anthropic Step 4: each trial must run in an isolated, clean environment.
- EvalRunConfig
- Parsed L2 (runner-level) configuration. Sources: CLI args, env vars. Applications instantiate one in main() and pass fields to the runner.
- EvalRunDiff
- The diff between two EvalRunReports.
- EvalRunner
- Runs evaluation suites with bounded concurrency and optional rate limiting. See RFC §6.8 and §6.15.
- EvalRunReport
- Aggregated outcome of one EvalRunner.runSuite invocation.
- EvalSuite
- Anthropic: a collection of tasks measuring specific capabilities or behaviors. Tasks in a suite typically share a broad goal.
- EvalTask
- Anthropic: a task is a single test with defined inputs and success criteria. Implementations are pure data — they do not run the agent.
- EvalTranscriptRecorder
- Records a trial transcript from the shared AgentController.
- FileRecordingStore
- Filesystem-backed store. One JSON file per (hash) under rootDir.
- FileReportStore
- Filesystem-backed implementation.
- FixedEvalClock
- Fixed clock — always returns reference. Useful for deterministic trials.
- Grader
- A grader scores some aspect of an agent's performance for one trial.
- GraderRegistry
- Maps a grader name (as it appears in JSONL/JSON files) to a factory that can construct an actual Grader from a config map.
- GraduationCandidate
- A task that has reached a mature pass rate in each of the last N runs and is recommended for graduation to the regression suite.
- HumanGrader
- Anthropic Step 5 / 8: human graders are gold-standard for subjective dimensions, used both for direct scoring and for calibrating LLM judges.
- HumanLabeledTrial
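- A trial that has been labeled by a human. Used to calibrate the LLM judge.
- HumanReviewQueue
- Human review queue. The framework does not own a UI, but it provides an abstraction that lets the application layer plug in its own review platform (Langfuse Annotation Queue, a self-hosted web app, a Slack workflow, etc.).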
- InMemoryRecordingStore
- In-memory store; useful for tests of the eval subsystem itself and for short-lived runs.
- JsonEvalTask
- Data-driven EvalTask. Constructed from a parsed JSON map plus a list of graders already resolved by the loader.
- JsonlTraceExporter
- Writes one JSON object per line to a file. Easy to grep, easy to import into other tools.
- JudgeCalibrator
- Measures agreement between the LLM judge and human scores.
- JudgeScore
- LangfuseClient
- Sends LangfuseEvents in batches to POST /api/public/ingestion.
- LangfuseConfig
- Langfuse client configuration.
- LangfuseEvent
- Event envelope format for the Langfuse /api/public/ingestion endpoint.
- LangfuseTraceExporter
- Streams trial events to Langfuse via /api/public/ingestion.
- LLMRequestHash
- Computes a stable hash of an LLM request, used as the cache key for recording / replay.
- ModelGrader
- Convenience base for LLM-as-judge graders.
- NoopRateLimitGate
- No-op gate. Default when callers don't configure rate limiting.
- Outcome
- Anthropic: the final state of the environment at the end of a trial.
- PersistedRunReport
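- The "snapshot" returned when a historical run is loaded from a persistent store. Trials are preserved in full, but the suite field is a SuiteSnapshot: the real EvalSuite instance from the historical run (including runtime objects such as graders and the referenceSolution) can no longer be reconstructed. Cross-run analysis (saturation / graduation / diff) only reads metadata, so the snapshot is sufficient.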
- RateLimitGate
- Throttles outgoing LLM calls. Independent from runner concurrency.
- RecordingLLMClient
- Wraps an inner LLMClient and records every successful (request, response) pair into a RecordingStore. Reads still go through the inner client; the store is a write target only.
- RecordingStore
- Append-only key-value store for recorded LLM responses.
- ReferenceSolution
- Anthropic Step 2: a known working solution that passes all graders. Useful for proving a task is solvable and that graders are configured correctly.
- ReplayLLMClient
- Replays from a RecordingStore; falls back to an inner client (or throws) on miss.
- ReportStore
- Persists historical run reports. Append-only.
- RpmRateLimitGate
- Permits up to N requests per minute (token bucket, refill at +N/60s).
- RunIndexEntry
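- Lightweight metadata for one row of the index file.
- SaturationStatus
- Assesses suite health from the perspective of a single run (saturation rate plus the current values of the candidate lists).
- SaturationThresholds
- Thresholds and bucketing rules for saturation assessment. Anthropic Step 7: when tasks in a capability suite are generally passing at close to 100%, they should be "graduated" to the regression suite, and harder tasks should be added to the capability suite.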
- Score
- The output of a Grader for one trial.
- Sha256LLMRequestHash
- Default SHA-256 based implementation. Hashes the JSON-encoded (messages, tools, modelConfig, jsonOutput, trialSalt) tuple. Tool.executable closures are ignored (they're not in toJson).
- SuiteHealthAnalyzer
- Analyzes suite health across multiple runs. Tooling support for Anthropic Steps 7 and 8.
- SuiteHealthReport
- Suite health analysis across multiple runs.
- SuiteSnapshot
- A non-executable snapshot of suite metadata. Retains only the fields needed for analysis.
- SystemEvalClock
- Real clock — delegates to DateTime.now.
- TaskFilter
- Filter applied to suites and tasks at runtime.
- TaskTransition
- ToolCallRecord
- Record of a single tool invocation during a trial.
- TpmRateLimitGate
- Permits up to N tokens per minute (token-aware, refills similarly).
- TraceExporter
- Streams trial events to an external observability backend.
- Transcript
- Anthropic: a transcript is the complete record of a trial — messages, tool calls, reasoning, intermediate results, etc.
- TranscriptEvent
- Lightweight log of additional events that don't fit neatly into messages or tool calls (e.g. retries, plan changes, exceptions).
- TranscriptMetrics
- Quantitative metrics about a trial.
- Trial
- Metadata about one attempt at a task.
- TrialDisagreement
- TrialId
- Identifies one trial uniquely across runs.
- TrialResult
- All artifacts produced by one trial.
- WorkspaceDiff
- Captures filesystem-level changes between fixture setup and trial end.
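The RpmRateLimitGate entry above describes a token bucket that refills at N/60 tokens per second. As a rough illustration of that idea only, here is a minimal sketch. The class name TokenBucketSketch, its constructor, and tryAcquire are hypothetical and are not part of this library's API; the real gate may behave differently (for example, it may wait rather than return false). Time is passed in explicitly, in the spirit of the EvalClock abstraction, so the example is deterministic.

```dart
/// Hypothetical token-bucket sketch; NOT the library's RpmRateLimitGate.
class TokenBucketSketch {
  TokenBucketSketch(this.requestsPerMinute, DateTime start)
      : _tokens = requestsPerMinute.toDouble(), // assumption: bucket starts full
        _lastRefill = start;

  final int requestsPerMinute;
  double _tokens;
  DateTime _lastRefill;

  /// Returns true and consumes one token if a request may proceed at [now].
  bool tryAcquire(DateTime now) {
    final elapsedSeconds = now.difference(_lastRefill).inMicroseconds / 1e6;
    if (elapsedSeconds > 0) {
      // Refill at N/60 tokens per second, capped at the bucket capacity N.
      final refilled = _tokens + elapsedSeconds * requestsPerMinute / 60.0;
      _tokens = refilled > requestsPerMinute
          ? requestsPerMinute.toDouble()
          : refilled;
      _lastRefill = now;
    }
    if (_tokens >= 1.0) {
      _tokens -= 1.0;
      return true;
    }
    return false;
  }
}

void main() {
  final start = DateTime.utc(2024, 1, 1);
  final bucket = TokenBucketSketch(2, start); // 2 requests per minute

  print(bucket.tryAcquire(start)); // true
  print(bucket.tryAcquire(start)); // true
  print(bucket.tryAcquire(start)); // false (bucket is empty)
  // 30 seconds later one token has been refilled (30 * 2 / 60 = 1).
  print(bucket.tryAcquire(start.add(const Duration(seconds: 30)))); // true
}
```

Keeping the throttle separate from runner concurrency, as the RateLimitGate entry notes, means a suite can run many trials in parallel while still respecting a provider's per-minute quota.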
Enums
- EvalRunMode
- Run mode controlled by CLI / env.
- GraderKind
- Anthropic identifies three kinds of graders.
- ReferenceSolutionSource
- How the reference solution was produced.
- SuiteKind
- Anthropic: capability evals start at low pass rates and improve over time; regression evals stay near 100% and any decline is a red flag.
- TaskTransitionKind
- TrialStatus
- Final status of a trial.
Extensions
- EvalRunnerOps on EvalRunner
- EvalRunReportDiff on EvalRunReport
- Convenience: current.diffWith(baseline) reads more naturally than diffRunReports(current: current, baseline: baseline).
- EvalRunReportReporting on EvalRunReport
- Convenience methods on EvalRunReport that render Markdown / diff output. Implemented as extensions so the core EvalRunReport class doesn't have to import the reporting layer.
- JudgeCalibratorOps on JudgeCalibrator
- SuiteHealthAnalyzerOps on SuiteHealthAnalyzer
Constants
- transcriptViewerUsage → const String
- Command-line help text.
Functions
- diffRunReports({required EvalRunReport current, required EvalRunReport baseline, double significanceThreshold = 0.05}) → EvalRunDiff
- Computes the diff between two run reports.
- generateMarkdownReport(EvalRunReport report, {Map<String, String>? taskBucketMap, SuiteHealthReport? health, List<int> ksToReport = const [1, 3]}) → String
- Generates a Markdown summary of a single EvalRunReport.
- loadEvalSuiteFromDir(Directory root, {required GraderRegistry graderRegistry}) → EvalSuite
- Loads an EvalSuite from a directory laid out as:
- parseEvalRunArgs(List<String> args, {Map<String, String>? env, String? defaultRunName}) → EvalRunConfig
- Tiny CLI parser. Avoids pulling in package:args to keep deps lean.
- passAtK(List<bool> trialPasses, int k) → double
- Anthropic: probability of at least one success in k independent trials.
- passCaretK(List<bool> trialPasses, int k) → double
- Anthropic: probability that all k trials pass.
- runTranscriptViewer(List<String> args) → Future<int>
- Command-line entry point. bin/transcripts.dart simply calls await runTranscriptViewer(args).
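To make the two pass-rate metrics above concrete, the following sketch computes simple plug-in estimates from a list of trial outcomes, assuming trials are independent and share a single pass probability p estimated as the observed pass fraction. The function names passAtKSketch and passCaretKSketch are hypothetical; the library's passAtK and passCaretK may use a different estimator, so treat this as an illustration of the definitions, not of the implementation.

```dart
import 'dart:math' as math;

/// Observed pass fraction; throws on empty input or non-positive k.
double _passRate(List<bool> trialPasses, int k) {
  if (trialPasses.isEmpty || k <= 0) {
    throw ArgumentError('need at least one trial and k >= 1');
  }
  return trialPasses.where((passed) => passed).length / trialPasses.length;
}

/// Hypothetical sketch: P(at least one success in k trials) = 1 - (1 - p)^k.
double passAtKSketch(List<bool> trialPasses, int k) {
  final p = _passRate(trialPasses, k);
  return 1.0 - math.pow(1.0 - p, k).toDouble();
}

/// Hypothetical sketch: P(all k trials succeed) = p^k.
double passCaretKSketch(List<bool> trialPasses, int k) {
  final p = _passRate(trialPasses, k);
  return math.pow(p, k).toDouble();
}

void main() {
  final trials = [true, false, true, false]; // observed p = 0.5
  print(passAtKSketch(trials, 2)); // 0.75
  print(passCaretKSketch(trials, 2)); // 0.25
}
```

The two numbers diverge as k grows: with p = 0.5, the chance of at least one success rises toward 1 while the chance of all k successes falls toward 0, which is why reporting both gives a fuller picture of reliability.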
Typedefs
- GraderFactoryFunction = Grader Function(Map<String, dynamic> config)
- Builds a Grader instance from a JSON config blob. Each demo / application registers one factory per grader-name; the loader uses the registry to materialize graders referenced from data files.
- JudgeScorer = Future<JudgeScore?> Function(HumanLabeledTrial labeled)
- A judge-scoring callback supplied by the application.
Exceptions / Errors
- RecordingNotFoundException
- Thrown when a ReplayLLMClient cannot find a recorded response and has no fallback configured (or strictReplay is true).