normalizeTrailingSingleUpperTokenSplit function
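Splits alphabetic tokens that end with a stray trailing uppercase letter.
OCR can drop the space before a trailing capital initial, producing tokens
like ChaE or GadgetC. When a token is otherwise clean title-case text (one
capital followed by at least two lowercase letters) and ends with a single
uppercase letter, insert the missing space, giving Cha E or Gadget C. Tokens
that start with a lowercase letter are not matched by the pattern and are
left unchanged.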
Implementation
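String normalizeTrailingSingleUpperTokenSplit(String line) {
  // Match a title-case stem (capital + 2 or more lowercase letters) followed
  // by exactly one uppercase letter, bounded on both sides by word boundaries.
  return line.replaceAllMapped(RegExp(r'\b([A-Z][a-z]{2,})([A-Z])\b'), (
    Match match,
  ) {
    final String stem = match.group(1) ?? '';
    final String trailingUpper = match.group(regexGroupSecond) ?? '';
    // Leave the token unchanged when the stem is too short to split safely.
    if (stem.length < _trailingSingleUpperStemMinLength ||
        trailingUpper.isEmpty) {
      return match.group(0) ?? line;
    }
    // Re-insert the space that OCR dropped before the trailing capital.
    return '$stem $trailingUpper';
  });
}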
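
To see which tokens the pattern above accepts, the following minimal sketch applies the same regular expression on its own. The helper name splitParts and the variable trailingUpperPattern are hypothetical and exist only for this illustration; the library constants regexGroupSecond and _trailingSingleUpperStemMinLength are not shown in the source, so the sketch avoids them and reads the second capture group by its literal index.

```dart
// Same pattern as in the implementation above.
final RegExp trailingUpperPattern = RegExp(r'\b([A-Z][a-z]{2,})([A-Z])\b');

/// Hypothetical helper for illustration only: returns [stem, trailingUpper]
/// when the token matches the pattern, or null when it does not.
List<String>? splitParts(String token) {
  final RegExpMatch? match = trailingUpperPattern.firstMatch(token);
  if (match == null) return null;
  return <String>[match.group(1)!, match.group(2)!];
}

void main() {
  const List<String> tokens = <String>[
    'ChaE', // title-case stem + one capital: matches
    'GadgetC', // matches
    'gadgetC', // starts lowercase: no match
    'AbC', // only one lowercase letter in the stem: no match
    'GadgetCD', // two trailing capitals: no match
  ];
  for (final String token in tokens) {
    print('$token -> ${splitParts(token)}');
  }
}
```

Because the pattern itself requires one capital plus at least two lowercase letters, every matched stem is already three or more characters long, so the stem-length check in the implementation only has an effect if the minimum-length constant is larger than three.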