normalizeTrailingSingleUpperTokenSplit function

String normalizeTrailingSingleUpperTokenSplit(
  String line
)

Splits alpha tokens that end with a stray trailing uppercase letter.

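OCR can drop the space before a trailing capital initial, producing tokens like ChaE or GadgetC. When a token consists of a capitalized stem (one uppercase letter followed by at least two lowercase letters) and ends with a single uppercase letter, the function inserts the missing space before that letter, provided the stem meets the configured minimum length. Tokens that do not fit this shape are left unchanged.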
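The behavior described above can be illustrated with a minimal, self-contained sketch. It reuses the same regular expression as the implementation but omits the library constants, whose values are not shown on this page; the helper name splitTrailingUpper is hypothetical and exists only for this example.

```dart
// Hypothetical stand-in for normalizeTrailingSingleUpperTokenSplit.
// Assumption: the minimum stem length check is omitted because the value of
// the library's private constant is not shown on this page.
final RegExp trailingUpperPattern = RegExp(r'\b([A-Z][a-z]{2,})([A-Z])\b');

String splitTrailingUpper(String line) => line.replaceAllMapped(
      trailingUpperPattern,
      // Group 1 is the capitalized stem, group 2 the trailing capital.
      (Match m) => '${m.group(1)} ${m.group(2)}',
    );

void main() {
  print(splitTrailingUpper('ChaE Smith'));     // Cha E Smith
  print(splitTrailingUpper('GadgetC rev 2'));  // Gadget C rev 2
  print(splitTrailingUpper('iPhone USB AbC')); // iPhone USB AbC
}
```

The last line stays unchanged because none of its tokens has the required shape: iPhone does not start with an uppercase letter, USB has no lowercase run, and AbC has only one lowercase letter before the trailing capital.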

Implementation

String normalizeTrailingSingleUpperTokenSplit(String line) {
  return line.replaceAllMapped(RegExp(r'\b([A-Z][a-z]{2,})([A-Z])\b'), (
    Match match,
  ) {
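    // Group 1 is the capitalized stem; regexGroupSecond (defined elsewhere
    // in the library) is expected to index the trailing capital.
    final String stem = match.group(1) ?? '';
    final String trailingUpper = match.group(regexGroupSecond) ?? '';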
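    // Leave the token as-is when the stem is shorter than the configured
    // minimum or no trailing capital was captured.
    if (stem.length < _trailingSingleUpperStemMinLength ||
        trailingUpper.isEmpty) {
      // group(0) is the full match and is never null for a Match.
      return match.group(0)!;
    }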

    return '$stem $trailingUpper';
  });
}