eval library

Classes

AgentHarnessFactory: Anthropic: an agent harness is the system that lets a model act as an agent. The framework does not assume any specific implementation — applications provide their own factory.
AgentHarnessSession: One trial worth of agent execution. Always created via the factory.
Assertion: One specific check inside a Score. Holds enough detail that a human can understand why it passed or failed without re-running the trial.
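BrokenTaskCandidate: A task that fails almost every trial across multiple runs. This usually points to a bug in the task definition or grader configuration rather than something the agent genuinely cannot do (Anthropic Step 2 calls this out explicitly).
CalibrationConfig: Default configuration: a gap greater than 0.15 counts as a disagreement, and the 20 most severe deviations are reported.
CalibrationReport: Correlation between LLM judge scores and human scores.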
ClassificationMetrics: Precision/Recall/F1 for binary classification tasks (e.g. Memory Agent).
CodeGrader: Convenience base for deterministic code-based graders.
CompositeTraceExporter: Fans out events to multiple exporters. Order is preserved within each exporter; cross-exporter order is not guaranteed.
EvalClock: Time source abstraction. Production agents call clock.now() instead of DateTime.now() so the eval runner can lock time per trial.
EvalContext: Per-trial context provided by an EvalEnvironment.
EvalEnvironment: Anthropic Step 4: each trial must run in an isolated, clean environment.
EvalRunConfig: Parsed L2 (runner-level) configuration. Sources: CLI args, env vars. Applications instantiate one in main() and pass fields to the runner.
EvalRunDiff: The diff between two EvalRunReports.
EvalRunner: Runs evaluation suites with bounded concurrency and optional rate limiting. See RFC §6.8 and §6.15.
EvalRunReport: Aggregated outcome of one EvalRunner.runSuite invocation.
EvalSuite: Anthropic: a collection of tasks measuring specific capabilities or behaviors. Tasks in a suite typically share a broad goal.
EvalTask: Anthropic: a task is a single test with defined inputs and success criteria. Implementations are pure data — they do not run the agent.
EvalTranscriptRecorder: Records a trial transcript from the shared AgentController.
FileRecordingStore: Filesystem-backed store. One JSON file per (hash) under rootDir.
FileReportStore: Filesystem-backed implementation.
FixedEvalClock: Fixed clock — always returns reference. Useful for deterministic trials.
Grader: A grader scores some aspect of an agent's performance for one trial.
GraderRegistry: Maps a grader name (as it appears in JSONL/JSON files) to a factory that can construct an actual Grader from a config map.
GraduationCandidate: A task whose pass rate has reached the maturity threshold in each of the last N runs and is therefore suggested for graduation to the regression suite.
HumanGrader: Anthropic Step 5 / 8: human graders are gold-standard for subjective dimensions, used both for direct scoring and for calibrating LLM judges.
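HumanLabeledTrial: A trial that has been labeled by a human. Used to calibrate LLM judges.
HumanReviewQueue: Human review queue. The framework does not own a UI, but provides an abstraction that lets the application layer plug in its own review platform (Langfuse Annotation Queue, a self-hosted web app, a Slack workflow, etc.).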
InMemoryRecordingStore: In-memory store; useful for tests of the eval subsystem itself and for short-lived runs.
JsonEvalTask: Data-driven EvalTask. Constructed from a parsed JSON map plus a list of graders already resolved by the loader.
JsonlTraceExporter: Writes one JSON object per line to a file. Easy to grep, easy to import into other tools.
JudgeCalibrator: Measures agreement between an LLM judge and human scores.
JudgeScore
LangfuseClient: Sends LangfuseEvents in batches to POST /api/public/ingestion.
LangfuseConfig: Langfuse client configuration.
LangfuseEvent: Envelope format for events sent to Langfuse /api/public/ingestion.
LangfuseTraceExporter: Streams trial events to Langfuse via /api/public/ingestion.
LLMRequestHash: Computes a stable hash of an LLM request, used as the cache key for recording / replay.
ModelGrader: Convenience base for LLM-as-judge graders.
NoopRateLimitGate: No-op gate. Default when callers don't configure rate limiting.
Outcome: Anthropic: the final state of the environment at the end of a trial.
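PersistedRunReport: The snapshot returned when a historical run is loaded from a persistent store. trials are preserved in full, but the suite field is a SuiteSnapshot: the real EvalSuite instance from the historical run (including runtime objects such as graders and referenceSolution) can no longer be reconstructed. Cross-run analysis (saturation / graduation / diff) reads only metadata, so this is sufficient.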
RateLimitGate: Throttles outgoing LLM calls. Independent from runner concurrency.
RecordingLLMClient: Wraps an inner LLMClient and records every successful (request, response) pair into a RecordingStore. Reads still go through the inner client; the store is a write target only.
RecordingStore: Append-only key-value store for recorded LLM responses.
ReferenceSolution: Anthropic Step 2: a known working solution that passes all graders. Useful for proving a task is solvable and that graders are configured correctly.
ReplayLLMClient: Replays from a RecordingStore; falls back to an inner client (or throws) on miss.
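ReportStore: Persists historical run reports. Append-only.
RpmRateLimitGate: Permits up to N requests per minute (token bucket, refilled at a rate of N per 60 seconds).
RunIndexEntry: Lightweight metadata for one line of the index file.
SaturationStatus: Assesses suite health from the perspective of a single run (saturation rate plus the current candidate lists).
SaturationThresholds: Thresholds and bucketing rules for saturation assessment. Anthropic Step 7: when tasks in a capability suite broadly approach a 100% pass rate, they should be "graduated" to the regression suite, and harder tasks should be added to the capability suite.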
Score: The output of a Grader for one trial.
Sha256LLMRequestHash: Default SHA-256 based implementation. Hashes the JSON-encoded (messages, tools, modelConfig, jsonOutput, trialSalt) tuple. Tool.executable closures are ignored (they're not in toJson).
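SuiteHealthAnalyzer: Analyzes suite health across multiple runs. Tooling support for Anthropic Step 7 / 8.
SuiteHealthReport: Suite health analysis across multiple runs.
SuiteSnapshot: A non-executable snapshot of suite metadata. Keeps only the fields needed for analysis.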
SystemEvalClock: Real clock — delegates to DateTime.now.
TaskFilter: Filter applied to suites and tasks at runtime.
TaskTransition
ToolCallRecord: Record of a single tool invocation during a trial.
TpmRateLimitGate: Permits up to N tokens per minute (token-aware, refills similarly).
TraceExporter: Streams trial events to an external observability backend.
Transcript: Anthropic: a transcript is the complete record of a trial — messages, tool calls, reasoning, intermediate results, etc.
TranscriptEvent: Lightweight log of additional events that don't fit neatly into messages or tool calls (e.g. retries, plan changes, exceptions).
TranscriptMetrics: Quantitative metrics about a trial.
Trial: Metadata about one attempt at a task.
TrialDisagreement
TrialId: Identifies one trial uniquely across runs.
TrialResult: All artifacts produced by one trial.
WorkspaceDiff: Captures filesystem-level changes between fixture setup and trial end.
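
The token-bucket behavior described for RpmRateLimitGate can be sketched as follows. This is a minimal illustration and not the library's implementation: the class name TokenBucketSketch, the tryAcquire method, and the choice to pass elapsed time in explicitly (so the example is deterministic) are all assumptions made for this sketch. It also assumes continuous refill at N/60 tokens per second, capped at N.

```dart
import 'dart:math' as math;

/// Hypothetical token bucket: at most [capacity] requests per minute.
/// Not the RpmRateLimitGate API; names here are illustrative only.
class TokenBucketSketch {
  TokenBucketSketch(this.capacity) : _tokens = capacity.toDouble();

  /// N, the number of requests permitted per minute.
  final int capacity;

  double _tokens;
  Duration _last = Duration.zero;

  /// Tries to take one token at time [now], measured from bucket creation.
  /// Returns true if the request may proceed.
  bool tryAcquire(Duration now) {
    final elapsedSeconds = (now - _last).inMicroseconds / 1e6;
    _last = now;
    // Refill at capacity / 60 tokens per second, never above capacity.
    _tokens = math.min(
      capacity.toDouble(),
      _tokens + elapsedSeconds * capacity / 60.0,
    );
    if (_tokens >= 1.0) {
      _tokens -= 1.0;
      return true;
    }
    return false;
  }
}
```

With a capacity of 2, two immediate calls succeed, a third is refused, and one more token becomes available after 30 seconds. A real gate would more likely wait asynchronously than return false; that part is left out of the sketch.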

Enums

EvalRunMode: Run mode controlled by CLI / env.
GraderKind: Anthropic identifies three kinds of graders.
ReferenceSolutionSource: How the reference solution was produced.
SuiteKind: Anthropic: capability evals start at low pass rates and improve over time; regression evals stay near 100% and any decline is a red flag.
TaskTransitionKind
TrialStatus: Final status of a trial.

Extensions

EvalRunnerOps on EvalRunner
EvalRunReportDiff on EvalRunReport: Convenience: current.diffWith(baseline) reads more naturally than diffRunReports(current: current, baseline: baseline).
EvalRunReportReporting on EvalRunReport: Convenience methods on EvalRunReport that render Markdown / diff output. Implemented as extensions so the core EvalRunReport class doesn't have to import the reporting layer.
JudgeCalibratorOps on JudgeCalibrator
SuiteHealthAnalyzerOps on SuiteHealthAnalyzer
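
The extension pattern described for EvalRunReportDiff and EvalRunReportReporting can be illustrated with stand-in types. Everything in this sketch (ReportSketch, ReportDiffSketch, diffReportsSketch, and the passRate field) is hypothetical and only mirrors the shape of the real API: a top-level function does the work, and an extension adds a method-style spelling without the core class importing the diff layer.

```dart
/// Hypothetical stand-in for a run report; only carries a pass rate.
class ReportSketch {
  const ReportSketch(this.passRate);
  final double passRate;
}

/// Hypothetical stand-in for a diff between two reports.
class ReportDiffSketch {
  const ReportDiffSketch(this.passRateDelta);
  final double passRateDelta;
}

/// Top-level function that owns the diff logic.
ReportDiffSketch diffReportsSketch({
  required ReportSketch current,
  required ReportSketch baseline,
}) {
  return ReportDiffSketch(current.passRate - baseline.passRate);
}

/// Extension that forwards to the top-level function, so callers can
/// write current.diffWith(baseline).
extension ReportDiffOpsSketch on ReportSketch {
  ReportDiffSketch diffWith(ReportSketch baseline) =>
      diffReportsSketch(current: this, baseline: baseline);
}
```

Because the extension lives next to the diff logic, ReportSketch itself stays free of any dependency on it, which is the design motivation the summaries above give.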

Constants

transcriptViewerUsage → const String: Command-line help text.

Functions

diffRunReports({required EvalRunReport current, required EvalRunReport baseline, double significanceThreshold = 0.05}) → EvalRunDiff: Computes the diff between two run reports.
generateMarkdownReport(EvalRunReport report, {Map<String, String>? taskBucketMap, SuiteHealthReport? health, List<int> ksToReport = const [1, 3]}) → String: Generates a Markdown summary of one EvalRunReport.
loadEvalSuiteFromDir(Directory root, {required GraderRegistry graderRegistry}) → EvalSuite: Loads an EvalSuite from a directory with a fixed layout; see the function's own documentation for the expected structure.
parseEvalRunArgs(List<String> args, {Map<String, String>? env, String? defaultRunName}) → EvalRunConfig: Tiny CLI parser. Avoids pulling in package:args to keep deps lean.
passAtK(List<bool> trialPasses, int k) → double: Anthropic: probability of at least one success in k independent trials.
passCaretK(List<bool> trialPasses, int k) → double: Anthropic: probability that all k trials pass.
runTranscriptViewer(List<String> args) → Future<int>: Command-line entry point. bin/transcripts.dart simply awaits runTranscriptViewer(args).
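
The two pass-rate metrics, passAtK and passCaretK, can be illustrated with a simple plug-in estimate. This is a sketch under stated assumptions, not the library's implementation: it assumes the per-trial success probability p is estimated as the observed pass fraction and that trials are independent, giving pass@k = 1 - (1 - p)^k and pass^k = p^k. The function names below are hypothetical, and the real functions may use a different estimator.

```dart
import 'dart:math' as math;

/// Observed fraction of passing trials; 0.0 for an empty list.
double empiricalPassRate(List<bool> trialPasses) {
  if (trialPasses.isEmpty) return 0.0;
  final passes = trialPasses.where((passed) => passed).length;
  return passes / trialPasses.length;
}

/// Plug-in estimate of the probability that at least one of k
/// independent trials passes: 1 - (1 - p)^k.
double passAtKSketch(List<bool> trialPasses, int k) {
  final p = empiricalPassRate(trialPasses);
  return 1.0 - math.pow(1.0 - p, k).toDouble();
}

/// Plug-in estimate of the probability that all k independent
/// trials pass: p^k.
double passCaretKSketch(List<bool> trialPasses, int k) {
  final p = empiricalPassRate(trialPasses);
  return math.pow(p, k).toDouble();
}
```

For an observed pass rate of 0.5 and k = 2, the sketch gives 0.75 for pass@k and 0.25 for pass^k, which shows how the two metrics diverge as k grows: pass@k rewards occasional success, while pass^k rewards consistency.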

Typedefs

GraderFactoryFunction = Grader Function(Map<String, dynamic> config): Builds a Grader instance from a JSON config blob. Each demo / application registers one factory per grader-name; the loader uses the registry to materialize graders referenced from data files.
JudgeScorer = Future<JudgeScore?> Function(HumanLabeledTrial labeled): Judge scoring callback supplied by the application.
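
The relationship between GraderFactoryFunction and GraderRegistry (one factory registered per grader name, looked up when a data file references that name) can be sketched with stand-in types. All names below, including GraderSketch, ExactMatchGraderSketch, the register and build methods, and the "expected" config key, are hypothetical and are not taken from the library.

```dart
/// Hypothetical stand-in for a grader.
abstract class GraderSketch {
  bool grade(String output);
}

/// Hypothetical grader that passes when the output equals [expected].
class ExactMatchGraderSketch implements GraderSketch {
  ExactMatchGraderSketch(this.expected);
  final String expected;

  @override
  bool grade(String output) => output == expected;
}

/// Same shape as the typedef above: builds a grader from a config map.
typedef GraderFactorySketch = GraderSketch Function(
    Map<String, dynamic> config);

/// Hypothetical registry mapping grader names to factories.
class GraderRegistrySketch {
  final Map<String, GraderFactorySketch> _factories = {};

  void register(String name, GraderFactorySketch factory) {
    _factories[name] = factory;
  }

  /// Builds the grader registered under [name], or throws if none exists.
  GraderSketch build(String name, Map<String, dynamic> config) {
    final factory = _factories[name];
    if (factory == null) {
      throw ArgumentError('No grader registered for "$name"');
    }
    return factory(config);
  }
}
```

An application would register each factory once at startup, for example `registry.register('exact_match', (c) => ExactMatchGraderSketch(c['expected'] as String))`, and a loader would then call build for every grader name it encounters in the data files.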

Exceptions / Errors

RecordingNotFoundException: Thrown when a ReplayLLMClient cannot find a recorded response and has no fallback configured (or strictReplay is true).
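
The miss-handling rule described for ReplayLLMClient and RecordingNotFoundException can be sketched as follows. This is a simplified, synchronous illustration with string keys and responses; the real client is presumably asynchronous and keyed by an LLMRequestHash. ReplaySketch, RecordingMissSketch, and the complete method are hypothetical names, while strictReplay is taken from the exception summary above.

```dart
/// Hypothetical stand-in for RecordingNotFoundException.
class RecordingMissSketch implements Exception {
  RecordingMissSketch(this.key);
  final String key;

  @override
  String toString() => 'RecordingMissSketch: no recording for "$key"';
}

/// Hypothetical replay client over an in-memory map of recordings.
class ReplaySketch {
  ReplaySketch(this.recordings, {this.fallback, this.strictReplay = false});

  final Map<String, String> recordings;

  /// Optional live client used on a miss when replay is not strict.
  final String Function(String key)? fallback;

  final bool strictReplay;

  /// Returns the recorded response for [key]. On a miss, throws when
  /// there is no fallback or strictReplay is true; otherwise delegates
  /// to the fallback.
  String complete(String key) {
    final hit = recordings[key];
    if (hit != null) return hit;
    final inner = fallback;
    if (inner == null || strictReplay) {
      throw RecordingMissSketch(key);
    }
    return inner(key);
  }
}
```

Strict replay is useful in CI, where an unexpected live call should fail loudly instead of silently spending tokens.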
