# benchmark_harness_plus

A statistically rigorous benchmarking harness for Dart. Provides median-based comparisons, coefficient of variation, proper warmup phases, and outlier-resistant measurements for reliable performance analysis.
## Why This Package?

The standard benchmark_harness package uses the mean (average) for measurements, which is sensitive to outliers from GC pauses, OS scheduling, and CPU throttling. This package uses the median as its primary metric, providing stable measurements even in the presence of occasional outliers.
```
Sample data with one GC pause: [5.0, 5.1, 4.9, 5.0, 50.0]

Mean:   14.0 us  (skewed by the outlier)
Median:  5.0 us  (accurate representation)
```
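To make the arithmetic concrete, here is a minimal plain-Dart sketch of both metrics on that sample (the odd-length case, where the median is simply the middle sorted value):

```dart
void main() {
  final samples = [5.0, 5.1, 4.9, 5.0, 50.0];

  // Mean: every sample pulls on the result, so the single 50 us
  // GC pause drags it from ~5.0 up to 14.0.
  final mean = samples.reduce((a, b) => a + b) / samples.length;

  // Median: the middle of the sorted list; the outlier occupies
  // one slot at the end and the middle value stays at 5.0.
  final sorted = [...samples]..sort();
  final median = sorted[sorted.length ~/ 2];

  print('mean: $mean, median: $median'); // mean: 14.0, median: 5.0
}
```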
## Features

- Median-based comparisons: Robust against outliers
- Coefficient of variation (CV%): Know how reliable your measurements are
- Proper warmup: JIT compilation and cache warming before measurement
- Randomized ordering: Reduces systematic bias from CPU throttling
- Multiple samples: Statistical confidence, not single-shot measurements
- Detailed reporting: Full statistics with reliability assessment
## Installation

Add to your pubspec.yaml:

```yaml
dev_dependencies:
  benchmark_harness_plus: ^1.0.0
```
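Alternatively, `dart pub add --dev benchmark_harness_plus` adds the entry and fetches the package in one step.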
## Quick Start

```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  final benchmark = Benchmark(
    title: 'String Operations',
    variants: [
      BenchmarkVariant(
        name: 'concat',
        run: () => 'a' + 'b' + 'c',
      ),
      BenchmarkVariant(
        name: 'interpolation',
        run: () => '${'a'}${'b'}${'c'}',
      ),
    ],
  );

  final results = benchmark.run(log: print);
  printResults(results);
}
```
Output:

```
[String Operations] Warming up 2 variant(s)...
[String Operations] Collecting 10 sample(s)...
[String Operations] Done.

Variant       | median | mean | fastest | stddev | cv% | vs base
----------------------------------------------------------------
concat        |   0.42 | 0.43 |    0.40 |   0.02 | 4.7 |       -
interpolation |   0.38 | 0.39 |    0.36 |   0.01 | 3.2 |   1.11x

(times in microseconds per operation)
```
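In this run, the 1.11x in the vs base column matches the ratio of the medians: 0.42 / 0.38 ≈ 1.11, i.e. interpolation is about 11% faster than the concat baseline.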
## Configuration

Use predefined configurations or create custom ones:

```dart
// Quick feedback during development (less accurate)
Benchmark(..., config: BenchmarkConfig.quick);

// Standard benchmarking (default)
Benchmark(..., config: BenchmarkConfig.standard);

// Important performance decisions (more accurate)
Benchmark(..., config: BenchmarkConfig.thorough);

// Custom configuration
Benchmark(..., config: BenchmarkConfig(
  iterations: 5000,       // Iterations per sample
  samples: 15,            // Number of samples to collect
  warmupIterations: 1000,
  randomizeOrder: true,   // Randomize variant order
));
```
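As a rough time budget for the custom configuration above: each variant executes samples × iterations = 15 × 5,000 = 75,000 measured runs, plus 1,000 warmup runs (assuming one warmup phase per variant), so an operation taking about 1 µs costs roughly 76 ms of wall-clock time per variant.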
## Understanding CV% (Coefficient of Variation)

CV% normalizes variability across different time scales. It tells you how reliable your measurements are:
| CV% | Reliability | Interpretation |
|---|---|---|
| < 10% | Excellent | Highly reliable, trust exact ratios |
| 10-20% | Good | Rankings are reliable |
| 20-50% | Moderate | Directional only, do not trust exact ratios |
| > 50% | Poor | Unreliable, measurement is mostly noise |
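CV% is the standard deviation expressed as a percentage of the mean: cv% = stddev / mean × 100. In the Quick Start output above, concat has a stddev of 0.02 and a mean of 0.43, giving 0.02 / 0.43 × 100 ≈ 4.7%, squarely in the "excellent" band. Each result also carries the package's own assessment: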
```dart
final result = benchmark.run().first;
print('Reliability: ${result.reliability}'); // excellent, good, moderate, or poor
```
## Interpreting Results

- Look at CV% first: If > 20%, treat comparisons as directional only
- Compare medians: This is your primary metric
- Check mean vs median: A large difference indicates outliers (a quick check is sketched after this list)
- Look at the ratio: 1.42x means 42% faster than baseline
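As a concrete version of the mean-vs-median check, here is a sketch; `looksSkewed` and its 10% threshold are illustrative only, not part of the package API, and it assumes you have the raw per-sample timings as a `List<double>`:

```dart
// Hypothetical helper: flags a sample set whose mean has been pulled
// well away from its median, which usually indicates outliers such as
// GC pauses. The 10% threshold is an arbitrary rule of thumb.
bool looksSkewed(List<double> samples) {
  final mean = samples.reduce((a, b) => a + b) / samples.length;
  final sorted = [...samples]..sort();
  final mid = sorted.length ~/ 2;
  final median = sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  return (mean - median).abs() / median > 0.10;
}
```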
## Detailed Analysis

```dart
import 'dart:io';

// Continuing from the Quick Start example above.
final results = benchmark.run();

// Detailed report for a single result
print(formatDetailedResult(results[0]));

// Compare two variants
final comparison = BenchmarkComparison(
  baseline: results[0],
  test: results[1],
);
print('Speedup: ${comparison.speedup.toStringAsFixed(2)}x');
print('Improvement: ${comparison.improvementPercent.toStringAsFixed(1)}%');
print('Reliable: ${comparison.isReliable}');

// Export as CSV (File comes from dart:io)
final csv = formatResultsAsCsv(results);
File('results.csv').writeAsStringSync(csv);
```
## Statistical Functions

The package exports individual statistical functions for custom analysis:

```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

final samples = [10.0, 11.0, 9.5, 10.2, 10.1];

print('Mean: ${mean(samples)}');
print('Median: ${median(samples)}');
print('Stddev: ${stdDev(samples)}');
print('CV%: ${cv(samples)}');
print('Range: ${min(samples)} - ${max(samples)}');
print('Reliability: ${reliabilityFromCV(cv(samples))}');
```
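For instance, here is a sketch of one custom analysis these building blocks allow: a trimmed mean that drops the single fastest and slowest sample before averaging (`trimmedMean` is illustrative, not a package export):

```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

// Illustrative custom analysis: drop the single fastest and slowest
// sample, then average what remains. Only mean() comes from the
// package; the trimming itself is plain Dart.
double trimmedMean(List<double> samples) {
  final sorted = [...samples]..sort();
  return mean(sorted.sublist(1, sorted.length - 1));
}

void main() {
  final samples = [5.0, 5.1, 4.9, 5.0, 50.0];
  print(trimmedMean(samples)); // ~5.03: the GC pause no longer skews it
}
```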
## Best Practices

- Use enough samples: Minimum 10, prefer 20 for important decisions
- Use enough iterations: Each sample should take at least 10ms (a calibration sketch follows this list)
- Warm up properly: JIT needs time to optimize hot paths
- Report CV%: Always show measurement stability
- Use median for comparisons: More robust than mean
- Re-run when in doubt: If results seem surprising, verify with another run
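For the 10ms rule, here is a hypothetical calibration helper (`calibrateIterations` is not a package API) that doubles the iteration count until one timed batch crosses the threshold; the result can then be passed as `iterations` in a custom `BenchmarkConfig`:

```dart
// Hypothetical helper for sizing BenchmarkConfig.iterations so that
// each sample takes at least ~10 ms. Not part of the package API.
int calibrateIterations(void Function() op, {int thresholdMicros = 10000}) {
  var iterations = 1;
  while (true) {
    final sw = Stopwatch()..start();
    for (var i = 0; i < iterations; i++) {
      op();
    }
    sw.stop();
    if (sw.elapsedMicroseconds >= thresholdMicros) return iterations;
    iterations *= 2; // batch too fast to time reliably; double and retry
  }
}
```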
## Common Pitfalls

- Sub-microsecond measurements: Inherently noisy, expect CV% > 50%
- First run bias: Always warm up before measuring
- Order effects: Randomize variant order across samples (enabled by default)
- Single sample: Never trust a single measurement
## Learn More

BENCHMARKING_GUIDE.md - In-depth explanation of:
- The statistical foundations behind each metric
- Benefits and downsides of mean, median, stddev, and CV%
- How to interpret results correctly
- What to do when measurements are unreliable
- How to choose the right configuration
MIGRATION_GUIDE.md - Migrating from benchmark_harness:
- Side-by-side code comparisons
- Step-by-step migration instructions
- Common migration patterns
- What you gain by switching
## License

MIT License. See LICENSE file for details.