benchmark_harness_plus library
A statistically rigorous benchmarking harness for Dart.
This package provides reliable performance measurements using statistical best practices: median-based comparisons, coefficient of variation for reliability assessment, proper warmup phases, and outlier-resistant analysis.
Quick Start
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  final benchmark = Benchmark(
    title: 'String Operations',
    variants: [
      BenchmarkVariant(name: 'concat', run: () => 'a' + 'b' + 'c'),
      BenchmarkVariant(name: 'interpolation', run: () => '${'a'}${'b'}${'c'}'),
    ],
  );

  final results = benchmark.run(log: print);
  printResults(results);
}
Why Use This Package?
Traditional benchmarking often uses the mean (average) for comparisons, but the mean is sensitive to outliers caused by GC pauses, OS scheduling, and CPU throttling. This package instead uses the median as its primary metric, which stays stable even when occasional outliers occur.
The coefficient of variation (CV%) tells you how reliable your measurements are:
- CV < 10%: Highly reliable
- CV 10-20%: Acceptable
- CV 20-50%: Directional only
- CV > 50%: Unreliable (measurement is noise)
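The thresholds above can be made concrete with a small standalone sketch. The helper names here are illustrative, not the package's internals, but the package's cv() and reliabilityFromCV() functions expose the same idea: sample standard deviation divided by the mean, expressed as a percentage, then bucketed.

```dart
import 'dart:math' as math;

// Illustrative re-implementation of the CV calculation (not the
// package's internal code): sample (n-1) standard deviation over mean.
double sketchCv(List<double> samples) {
  final mean = samples.reduce((a, b) => a + b) / samples.length;
  final variance =
      samples.map((s) => (s - mean) * (s - mean)).reduce((a, b) => a + b) /
          (samples.length - 1);
  return math.sqrt(variance) / mean * 100;
}

// The reliability buckets from the table above.
String bucket(double cvPercent) {
  if (cvPercent < 10) return 'highly reliable';
  if (cvPercent < 20) return 'acceptable';
  if (cvPercent < 50) return 'directional only';
  return 'unreliable';
}

void main() {
  final tight = [10.0, 10.2, 9.9, 10.1, 10.0]; // low spread
  final noisy = [10.0, 25.0, 9.0, 40.0, 11.0]; // GC-pause-like spikes
  print(bucket(sketchCv(tight))); // highly reliable (CV ~1%)
  print(bucket(sketchCv(noisy))); // unreliable (CV ~70%)
}
```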
Configuration
Use predefined configurations or create custom ones:
// Quick feedback during development
Benchmark(..., config: BenchmarkConfig.quick);

// Standard benchmarking (default)
Benchmark(..., config: BenchmarkConfig.standard);

// Important performance decisions
Benchmark(..., config: BenchmarkConfig.thorough);

// Custom configuration
Benchmark(..., config: BenchmarkConfig(
  iterations: 5000,
  samples: 15,
  warmupIterations: 1000,
));
Interpreting Results
- Look at CV% first: if it exceeds 20%, treat comparisons as directional only
- Compare medians: the median is your primary metric
- Check mean vs. median: a large difference indicates outliers
- Look at the ratio: for example, 1.42x means 42% faster than the baseline
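The mean-vs-median check is worth seeing in numbers. A single outlier (say, one GC pause) drags the mean far from the median while leaving the median untouched. The helpers below are hand-rolled for the sketch; the package's own mean() and median() compute the same standard statistics.

```dart
// Why the median is the primary metric: one outlier inflates the mean
// but barely moves the median.
double sketchMean(List<double> xs) => xs.reduce((a, b) => a + b) / xs.length;

double sketchMedian(List<double> xs) {
  final sorted = [...xs]..sort();
  final mid = sorted.length ~/ 2;
  return sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
}

void main() {
  // Nine steady samples plus one outlier from a simulated GC pause.
  final samples = [
    10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 100.0,
  ];
  print(sketchMean(samples)); // 19.0: dragged up by the outlier
  print(sketchMedian(samples)); // 10.0: unaffected
}
```

A mean of 19.0 against a median of 10.0 is exactly the "large difference" the checklist warns about: it signals outliers, not a genuinely slower workload.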
Best Practices
- Use at least 10 samples (20 for important decisions)
- Each sample should take at least 10ms (adjust iterations accordingly)
- Always warm up before measuring
- Report CV% alongside results
- Re-run when results seem surprising
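The warmup and sampling practices above follow a common loop shape, sketched below with dart:core's Stopwatch. The names and defaults are illustrative, not the package's internals; Benchmark handles this for you.

```dart
// Sketch of a warmup-then-sample measurement loop. Warm up first so the
// JIT has optimized the code, then time batches of iterations so each
// sample is long enough (aim for >= 10ms) to swamp timer overhead.
List<double> measure(
  void Function() run, {
  int warmupIterations = 1000,
  int iterations = 5000,
  int samples = 10,
}) {
  // Warmup phase: run but do not record.
  for (var i = 0; i < warmupIterations; i++) {
    run();
  }

  final results = <double>[];
  final sw = Stopwatch();
  for (var s = 0; s < samples; s++) {
    sw
      ..reset()
      ..start();
    for (var i = 0; i < iterations; i++) {
      run();
    }
    sw.stop();
    // Record microseconds per iteration for this sample.
    results.add(sw.elapsedMicroseconds / iterations);
  }
  return results;
}

void main() {
  final timings = measure(() => 'a' * 100);
  print('collected ${timings.length} samples');
}
```

If a sample finishes in well under 10ms, raise iterations until it does not; short samples are dominated by timer resolution and scheduling noise.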
Classes
- Benchmark - Runs benchmarks and collects statistically rigorous results.
- BenchmarkComparison - Comparison between two benchmark results.
- BenchmarkConfig - Configuration for benchmark runs.
- BenchmarkResult - Results from benchmarking a single variant.
- BenchmarkVariant - A benchmark variant to measure.
Enums
- ReliabilityLevel - Describes the reliability level of a measurement based on its CV%.
Functions
- cv(List<double> samples) → double - Calculates the coefficient of variation (CV) as a percentage.
- formatComparison(BenchmarkComparison comparison) → String - Formats a comparison between two results.
- formatDetailedResult(BenchmarkResult result) → String - Formats a detailed report for a single benchmark result.
- formatResults(List<BenchmarkResult> results, {String? baselineName}) → String - Formats benchmark results as a table string.
- formatResultsAsCsv(List<BenchmarkResult> results) → String - Formats results as CSV for export or further analysis.
- max(List<double> samples) → double - Returns the maximum value in samples.
- mean(List<double> samples) → double - Calculates the arithmetic mean (average) of a list of samples.
- median(List<double> samples) → double - Calculates the median (middle value) of a list of samples.
- min(List<double> samples) → double - Returns the minimum value in samples.
- printReliabilityWarning(List<BenchmarkResult> results) → bool - Prints a reliability warning if any result has poor reliability.
- printResults(List<BenchmarkResult> results, {String? baselineName}) → void - Prints benchmark results to the console.
- reliabilityFromCV(double cvPercent) → ReliabilityLevel - Determines the reliability level based on coefficient of variation.
- stdDev(List<double> samples) → double - Calculates the sample standard deviation of a list of samples.
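The statistics helpers are usable on their own, on any list of timing samples, not only on results produced by Benchmark.run. A hedged sketch, assuming the package is listed as a dependency (only signatures from the list above are used):

```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  // Timings in microseconds, collected however you like; note one outlier.
  final samples = [12.1, 11.8, 12.4, 30.5, 12.0];

  print('median: ${median(samples)}'); // robust central value
  print('mean:   ${mean(samples)}'); // pulled up by the outlier
  print('cv:     ${cv(samples).toStringAsFixed(1)}%');
  print('level:  ${reliabilityFromCV(cv(samples))}');
}
```

Comparing mean against median here reproduces the outlier check from "Interpreting Results" without running a full benchmark.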
Typedefs
- BenchmarkLogger = void Function(String message) - Optional callback for benchmark progress reporting.