benchmark_harness_plus library

A statistically rigorous benchmarking harness for Dart.

This package provides reliable performance measurements using statistical best practices: median-based comparisons, coefficient of variation for reliability assessment, proper warmup phases, and outlier-resistant analysis.

Quick Start

import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  final benchmark = Benchmark(
    title: 'String Operations',
    variants: [
      BenchmarkVariant(name: 'concat', run: () => 'a' + 'b' + 'c'),
      BenchmarkVariant(name: 'interpolation', run: () => '${'a'}${'b'}${'c'}'),
    ],
  );

  final results = benchmark.run(log: print);
  printResults(results);
}

Why Use This Package?

Traditional benchmarking often uses mean (average) for comparisons, which is sensitive to outliers from GC pauses, OS scheduling, and CPU throttling. This package uses median as the primary metric, providing stable measurements even with occasional outliers.
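The effect is easy to see in a standalone sketch (independent of this package's own mean and median helpers): one GC-pause outlier drags the mean far from the typical run, while the median is unmoved.

```dart
// Standalone sketch: one outlier skews the mean but not the median.
double mean(List<double> xs) => xs.reduce((a, b) => a + b) / xs.length;

double median(List<double> xs) {
  final sorted = [...xs]..sort();
  final mid = sorted.length ~/ 2;
  return sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
}

void main() {
  // Five samples in microseconds; the last one hit a GC pause.
  final samples = [10.0, 10.0, 11.0, 10.0, 60.0];
  print(mean(samples));   // 20.2 - pulled far from the typical run
  print(median(samples)); // 10.0 - unaffected by the outlier
}
```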

The coefficient of variation (CV%) tells you how reliable your measurements are:

  • CV < 10%: Highly reliable
  • CV 10-20%: Acceptable
  • CV 20-50%: Directional only
  • CV > 50%: Unreliable (measurement is noise)
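The CV itself is just the sample standard deviation divided by the mean, expressed as a percentage. A standalone sketch of that formula (not the package's own cv implementation):

```dart
import 'dart:math' as math;

// Sketch of the usual CV% formula: sample std dev / mean * 100.
double cvPercent(List<double> xs) {
  final m = xs.reduce((a, b) => a + b) / xs.length;
  final sumSq = xs.fold(0.0, (s, x) => s + (x - m) * (x - m));
  final sd = math.sqrt(sumSq / (xs.length - 1)); // Bessel's correction
  return sd / m * 100;
}

void main() {
  // One large outlier pushes CV far past 50% - an unreliable run.
  print(cvPercent([10.0, 10.0, 11.0, 10.0, 60.0])); // ~110%
}
```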

Configuration

Use predefined configurations or create custom ones:

// Quick feedback during development
Benchmark(..., config: BenchmarkConfig.quick);

// Standard benchmarking (default)
Benchmark(..., config: BenchmarkConfig.standard);

// Important performance decisions
Benchmark(..., config: BenchmarkConfig.thorough);

// Custom configuration
Benchmark(..., config: BenchmarkConfig(
  iterations: 5000,
  samples: 15,
  warmupIterations: 1000,
));

Interpreting Results

  1. Look at CV% first - if > 20%, treat comparisons as directional only
  2. Compare medians - this is your primary metric
  3. Check mean vs median - large difference indicates outliers
  4. Look at the ratio - a ratio of 1.42x means the variant is 42% faster than the baseline
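The checklist above could be automated along these lines. Note this uses a hypothetical result shape - the field names (name, medianUs, meanUs, cvPercent) are assumptions for illustration, not the documented fields of BenchmarkResult:

```dart
// Hypothetical result shape - the real BenchmarkResult may differ.
class Result {
  final String name;
  final double medianUs, meanUs, cvPercent;
  Result(this.name, this.medianUs, this.meanUs, this.cvPercent);
}

void interpret(Result variant, Result baseline) {
  // 1. CV% first: above 20%, treat the comparison as directional only.
  if (variant.cvPercent > 20 || baseline.cvPercent > 20) {
    print('High CV% - treat this comparison as directional only.');
  }
  // 2. Medians are the primary comparison.
  final ratio = baseline.medianUs / variant.medianUs;
  // 3. Mean far from median suggests outliers in the run.
  if ((variant.meanUs - variant.medianUs).abs() / variant.medianUs > 0.2) {
    print('${variant.name}: mean diverges from median - outliers likely.');
  }
  // 4. Report the ratio, e.g. 1.42x = 42% faster than baseline.
  print('${variant.name} vs ${baseline.name}: ${ratio.toStringAsFixed(2)}x');
}
```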

Best Practices

  • Use at least 10 samples (20 for important decisions)
  • Each sample should take at least 10ms (adjust iterations accordingly)
  • Always warm up before measuring
  • Report CV% alongside results
  • Re-run when results seem surprising
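Sizing iterations so each sample clears the 10ms floor can be sketched as below; iterationsFor is an illustrative helper, not part of this package's API:

```dart
// Illustrative helper: pick an iteration count so one sample takes
// at least targetMs, given a rough per-operation time estimate.
int iterationsFor({required double estimatedNsPerOp, double targetMs = 10}) {
  final perOpMs = estimatedNsPerOp / 1e6; // ns -> ms
  return (targetMs / perOpMs).ceil();
}

void main() {
  // An operation estimated at 50 ns needs 200,000 iterations per sample.
  print(iterationsFor(estimatedNsPerOp: 50)); // 200000
}
```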

Classes

Benchmark
Runs benchmarks and collects statistically rigorous results.
BenchmarkComparison
Comparison between two benchmark results.
BenchmarkConfig
Configuration for benchmark runs.
BenchmarkResult
Results from benchmarking a single variant.
BenchmarkVariant
A benchmark variant to measure.

Enums

ReliabilityLevel
Describes the reliability level of a measurement based on its CV%.

Functions

cv(List<double> samples) → double
Calculates the coefficient of variation (CV) as a percentage.
formatComparison(BenchmarkComparison comparison) → String
Formats a comparison between two results.
formatDetailedResult(BenchmarkResult result) → String
Formats a detailed report for a single benchmark result.
formatResults(List<BenchmarkResult> results, {String? baselineName}) → String
Formats benchmark results as a table string.
formatResultsAsCsv(List<BenchmarkResult> results) → String
Formats results as CSV for export or further analysis.
max(List<double> samples) → double
Returns the maximum value in samples.
mean(List<double> samples) → double
Calculates the arithmetic mean (average) of a list of samples.
median(List<double> samples) → double
Calculates the median (middle value) of a list of samples.
min(List<double> samples) → double
Returns the minimum value in samples.
printReliabilityWarning(List<BenchmarkResult> results) → bool
Prints a reliability warning if any result has poor reliability.
printResults(List<BenchmarkResult> results, {String? baselineName}) → void
Prints benchmark results to the console.
reliabilityFromCV(double cvPercent) → ReliabilityLevel
Determines the reliability level based on coefficient of variation.
stdDev(List<double> samples) → double
Calculates the sample standard deviation of a list of samples.

Typedefs

BenchmarkLogger = void Function(String message)
Optional callback for benchmark progress reporting.