eval library

Eval subsystem for dart_agent_core.

See Anthropic — Demystifying evals for AI agents for the underlying methodology, and doc/eval-guide.zh-CN.md for the usage guide.

This entry point is intentionally separate from dart_agent_core.dart: applications that don't need eval primitives should not pay the import cost. Backward compatibility is preserved.

Classes

AgentHarnessFactory
Anthropic: an agent harness is the system that lets a model act as an agent. The framework does not assume any specific implementation — applications provide their own factory.
AgentHarnessSession
One trial worth of agent execution. Always created via the factory.
Assertion
One specific check inside a Score. Holds enough detail that a human can understand why it passed or failed without re-running the trial.
BrokenTaskCandidate
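A task that fails almost every trial across multiple runs. This usually points to a bug in the task definition or grader configuration rather than a genuine agent limitation (Anthropic Step 2 calls this out explicitly).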
CalibrationConfig
Default configuration: a gap greater than 0.15 counts as a "disagreement", and the 20 largest deviations are reported.
CalibrationReport
Correlation between LLM judge scores and human scores.
ClassificationMetrics
Precision/Recall/F1 for binary classification tasks (e.g. Memory Agent).
CodeGrader
Convenience base for deterministic code-based graders.
CompositeTraceExporter
Fans out events to multiple exporters. Order is preserved within each exporter; cross-exporter order is not guaranteed.
EvalClock
Time source abstraction. Production agents call clock.now() instead of DateTime.now() so the eval runner can lock time per trial.
EvalContext
Per-trial context provided by an EvalEnvironment.
EvalEnvironment
Anthropic Step 4: each trial must run in an isolated, clean environment.
EvalRunConfig
Parsed L2 (runner-level) configuration. Sources: CLI args, env vars. Applications instantiate one in main() and pass fields to the runner.
EvalRunDiff
The diff between two EvalRunReports.
EvalRunner
Runs evaluation suites with bounded concurrency and optional rate limiting. See RFC §6.8 and §6.15.
EvalRunReport
Aggregated outcome of one EvalRunner.runSuite invocation.
EvalSuite
Anthropic: a collection of tasks measuring specific capabilities or behaviors. Tasks in a suite typically share a broad goal.
EvalTask
Anthropic: a task is a single test with defined inputs and success criteria. Implementations are pure data — they do not run the agent.
EvalTranscriptRecorder
Records a trial transcript from the shared AgentController.
FileRecordingStore
Filesystem-backed store. One JSON file per (hash) under rootDir.
FileReportStore
Filesystem-backed implementation.
FixedEvalClock
Fixed clock — always returns reference. Useful for deterministic trials.
Grader
A grader scores some aspect of an agent's performance for one trial.
GraderRegistry
Maps a grader name (as it appears in JSONL/JSON files) to a factory that can construct an actual Grader from a config map.
GraduationCandidate
Suggests graduating a task to the regression suite once it has reached a mature pass rate in each of the most recent N runs.
HumanGrader
Anthropic Step 5 / 8: human graders are gold-standard for subjective dimensions, used both for direct scoring and for calibrating LLM judges.
HumanLabeledTrial
A trial that has been labeled by a human. Used to calibrate LLM judges.
HumanReviewQueue
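Human review queue. The framework does not own a UI, but provides an abstraction that lets applications plug in their own review platform (Langfuse Annotation Queue, a self-hosted web app, a Slack workflow, etc.).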
InMemoryRecordingStore
In-memory store; useful for tests of the eval subsystem itself and for short-lived runs.
JsonEvalTask
Data-driven EvalTask. Constructed from a parsed JSON map plus a list of graders already resolved by the loader.
JsonlTraceExporter
Writes one JSON object per line to a file. Easy to grep, easy to import into other tools.
JudgeCalibrator
Measures agreement between an LLM judge and human scores.
JudgeScore
LangfuseClient
Sends LangfuseEvents in batches to POST /api/public/ingestion.
LangfuseConfig
Langfuse client configuration.
LangfuseEvent
Envelope format for events sent to Langfuse's /api/public/ingestion endpoint.
LangfuseTraceExporter
Streams trial events to Langfuse via /api/public/ingestion.
LLMRequestHash
Computes a stable hash of an LLM request, used as the cache key for recording / replay.
ModelGrader
Convenience base for LLM-as-judge graders.
NoopRateLimitGate
No-op gate. Default when callers don't configure rate limiting.
Outcome
Anthropic: the final state of the environment at the end of a trial.
PersistedRunReport
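The snapshot returned when a historical run is loaded from a persistent store. trials are preserved in full, but the suite field is a SuiteSnapshot: the real EvalSuite instance from the historical run (including runtime objects such as graders and referenceSolution) can no longer be reconstructed. Cross-run analysis (saturation / graduation / diff) reads only metadata, so the snapshot is sufficient.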
RateLimitGate
Throttles outgoing LLM calls. Independent from runner concurrency.
RecordingLLMClient
Wraps an inner LLMClient and records every successful (request, response) pair into a RecordingStore. Reads still go through the inner client; the store is a write target only.
RecordingStore
Append-only key-value store for recorded LLM responses.
ReferenceSolution
Anthropic Step 2: a known working solution that passes all graders. Useful for proving a task is solvable and that graders are configured correctly.
ReplayLLMClient
Replays from a RecordingStore; falls back to an inner client (or throws) on miss.
ReportStore
Persists historical run reports. Append-only.
RpmRateLimitGate
Permits up to N requests per minute (token bucket, refill at +N/60s).
RunIndexEntry
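Lightweight metadata for one line of the index file.
SaturationStatus
Evaluates suite health from the perspective of a single run (the saturation rate plus the current state of the candidate lists).
SaturationThresholds
Thresholds and bucketing decisions for saturation assessment. Anthropic Step 7: when tasks in a capability suite generally approach a 100% pass rate, they should be "graduated" to the regression suite, and harder tasks should be added to the capability suite.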
Score
The output of a Grader for one trial.
Sha256LLMRequestHash
Default SHA-256 based implementation. Hashes the JSON-encoded (messages, tools, modelConfig, jsonOutput, trialSalt) tuple. Tool.executable closures are ignored (they're not in toJson).
SuiteHealthAnalyzer
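Analyzes suite health across multiple runs. Tooling support for Anthropic Steps 7 and 8.
SuiteHealthReport
Suite health analysis across multiple runs.
SuiteSnapshot
A non-executable snapshot of suite metadata. Keeps only the fields needed for analysis.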
SystemEvalClock
Real clock — delegates to DateTime.now.
TaskFilter
Filter applied to suites and tasks at runtime.
TaskTransition
ToolCallRecord
Record of a single tool invocation during a trial.
TpmRateLimitGate
Permits up to N tokens per minute (token-aware, refills similarly).
TraceExporter
Streams trial events to an external observability backend.
Transcript
Anthropic: a transcript is the complete record of a trial — messages, tool calls, reasoning, intermediate results, etc.
TranscriptEvent
Lightweight log of additional events that don't fit neatly into messages or tool calls (e.g. retries, plan changes, exceptions).
TranscriptMetrics
Quantitative metrics about a trial.
Trial
Metadata about one attempt at a task.
TrialDisagreement
TrialId
Identifies one trial uniquely across runs.
TrialResult
All artifacts produced by one trial.
WorkspaceDiff
Captures filesystem-level changes between fixture setup and trial end.
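
The RpmRateLimitGate entry above mentions a token bucket that allows N requests per minute. The following is a minimal sketch of that idea only, not the library's implementation: the class name TokenBucketSketch, the tryAcquire method, and the injectable now callback are all hypothetical, and continuous refill at N/60 tokens per second is an assumption about how the refill rule is meant to be read.

```dart
import 'dart:math' as math;

/// Hypothetical token bucket; illustrates the technique, not RpmRateLimitGate itself.
class TokenBucketSketch {
  TokenBucketSketch(this.requestsPerMinute, {DateTime Function()? now})
      : _now = now ?? DateTime.now,
        _tokens = requestsPerMinute.toDouble() {
    _lastRefill = _now();
  }

  /// Bucket capacity, and the number of tokens restored per 60 seconds.
  final int requestsPerMinute;

  /// Injectable time source so the sketch can be exercised deterministically.
  final DateTime Function() _now;

  double _tokens;
  late DateTime _lastRefill;

  /// Refills based on elapsed time, then takes one token if available.
  bool tryAcquire() {
    final current = _now();
    final elapsedSeconds =
        current.difference(_lastRefill).inMicroseconds / 1e6;
    _lastRefill = current;

    // Assumed rule: continuous refill at N/60 tokens per second, capped at N.
    _tokens = math.min(
      requestsPerMinute.toDouble(),
      _tokens + elapsedSeconds * requestsPerMinute / 60,
    );

    if (_tokens >= 1) {
      _tokens -= 1;
      return true;
    }
    return false;
  }
}

void main() {
  var fakeNow = DateTime.utc(2024, 1, 1);
  final bucket = TokenBucketSketch(2, now: () => fakeNow);

  print(bucket.tryAcquire()); // true: bucket starts full with 2 tokens
  print(bucket.tryAcquire()); // true: second token
  print(bucket.tryAcquire()); // false: bucket is empty

  fakeNow = fakeNow.add(const Duration(seconds: 30));
  print(bucket.tryAcquire()); // true: 30 s at 2/min restores one token
}
```

Injecting the time source mirrors the EvalClock idea listed above: the bucket never reads the wall clock directly, so a test can advance time by hand. A real gate would presumably wait asynchronously rather than return false.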

Enums

EvalRunMode
Run mode controlled by CLI / env.
GraderKind
Anthropic identifies three kinds of graders.
ReferenceSolutionSource
How the reference solution was produced.
SuiteKind
Anthropic: capability evals start at low pass rates and improve over time; regression evals stay near 100% and any decline is a red flag.
TaskTransitionKind
TrialStatus
Final status of a trial.

Extensions

EvalRunnerOps on EvalRunner
EvalRunReportDiff on EvalRunReport
Convenience: current.diffWith(baseline) reads more naturally than diffRunReports(current: current, baseline: baseline).
EvalRunReportReporting on EvalRunReport
Convenience methods on EvalRunReport that render Markdown / diff output. Implemented as extensions so the core EvalRunReport class doesn't have to import the reporting layer.
JudgeCalibratorOps on JudgeCalibrator
SuiteHealthAnalyzerOps on SuiteHealthAnalyzer

Constants

transcriptViewerUsage → const String
Command-line help text.

Functions

diffRunReports({required EvalRunReport current, required EvalRunReport baseline, double significanceThreshold = 0.05}) EvalRunDiff
Computes the diff between two run reports.
generateMarkdownReport(EvalRunReport report, {Map<String, String>? taskBucketMap, SuiteHealthReport? health, List<int> ksToReport = const [1, 3]}) String
Generates a Markdown summary of one EvalRunReport.
loadEvalSuiteFromDir(Directory root, {required GraderRegistry graderRegistry}) EvalSuite
Loads an EvalSuite from a directory laid out as:
parseEvalRunArgs(List<String> args, {Map<String, String>? env, String? defaultRunName}) EvalRunConfig
Tiny CLI parser. Avoids pulling in package:args to keep deps lean.
passAtK(List<bool> trialPasses, int k) double
Anthropic: probability of at least one success in k independent trials.
passCaretK(List<bool> trialPasses, int k) double
Anthropic: probability that all k trials pass.
runTranscriptViewer(List<String> args) Future<int>
Command-line entry point. bin/transcripts.dart simply calls await runTranscriptViewer(args).
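
The two metrics described for passAtK and passCaretK above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it uses the plug-in estimate p = passes / trials and treats trials as independent, so pass@k becomes 1 - (1 - p)^k and pass^k becomes p^k. The estimator the library actually uses is not stated on this page, and the names passAtKSketch, passCaretKSketch, and _successRate are hypothetical.

```dart
import 'dart:math' as math;

/// Hypothetical helper: observed per-trial success rate p = passes / trials.
double _successRate(List<bool> trialPasses) {
  if (trialPasses.isEmpty) {
    throw ArgumentError('trialPasses must not be empty');
  }
  final passes = trialPasses.where((passed) => passed).length;
  return passes / trialPasses.length;
}

/// Sketch of pass@k: probability of at least one success in k independent
/// trials, assuming each trial succeeds with probability p.
double passAtKSketch(List<bool> trialPasses, int k) {
  if (k < 1) throw ArgumentError.value(k, 'k', 'must be >= 1');
  final p = _successRate(trialPasses);
  return 1 - math.pow(1 - p, k).toDouble();
}

/// Sketch of pass^k: probability that all k independent trials succeed.
double passCaretKSketch(List<bool> trialPasses, int k) {
  if (k < 1) throw ArgumentError.value(k, 'k', 'must be >= 1');
  final p = _successRate(trialPasses);
  return math.pow(p, k).toDouble();
}

void main() {
  // Three passes out of four trials gives p = 0.75.
  final trials = [true, true, true, false];
  print('pass@3 = ${passAtKSketch(trials, 3)}');
  print('pass^3 = ${passCaretKSketch(trials, 3)}');
}
```

With p = 0.75 and k = 3, pass@k rises toward 1 while pass^k falls to roughly 0.42, which is why the two metrics answer different questions: whether the agent can ever solve a task versus whether it solves it reliably.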

Typedefs

GraderFactoryFunction = Grader Function(Map<String, dynamic> config)
Builds a Grader instance from a JSON config blob. Each demo / application registers one factory per grader-name; the loader uses the registry to materialize graders referenced from data files.
JudgeScorer = Future<JudgeScore?> Function(HumanLabeledTrial labeled)
业务方提供的 judge 评分回调。

Exceptions / Errors

RecordingNotFoundException
Thrown when a ReplayLLMClient cannot find a recorded response and has no fallback configured (or strictReplay is true).