post_process_line library

Line-level cleanup passes for OCR post-processing.

Handles noise line detection and merging, short noisy line normalization, and punctuation-heavy text filtering.

Functions

mergeNoiseLines(List<String> lines) → List<String>
Merges short noise-only lines into the following content line when useful.
normalizePriceLikeTableRow(String line) → String
Repairs receipt-style quantity/price rows with noisy decimal separators.
normalizePunctuationHeavyText(String text) → String
Normalizes lines that are overwhelmingly punctuation.
normalizeRegionPostalCodeSpacing(String line) → String
Repairs split 5-digit postal codes after uppercase region abbreviations.
normalizeRepeatedCommaSuffix(String line) → String
Splits all-caps merchant/location tokens that repeat the comma suffix.
normalizeShortNoisyLines(List<String> lines) → List<String>
Normalizes tiny noisy lines often produced by decorative serif fragments.
normalizeStandaloneUpperDigitTokenSplit(String line) → String
Splits short standalone uppercase-digit tokens in mixed-case table rows.
normalizeTrailingSingleUpperTokenSplit(String line) → String
Splits alpha tokens that end with a stray trailing uppercase letter.
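
Example

The summaries above do not show how any pass is implemented. As an illustration only, the following is a minimal sketch of what a pass such as normalizeRegionPostalCodeSpacing might do. The function name sketchRegionPostalCodeSpacing, the regular expression, and the rule that the region is exactly two uppercase letters are all assumptions made for this example; the library's actual matching rules may differ.

```dart
/// Hypothetical sketch, NOT the library's implementation.
///
/// Rejoins a 5-digit postal code that OCR split into two digit groups
/// after an uppercase region abbreviation, e.g. "CA 951 13" -> "CA 95113".
String sketchRegionPostalCodeSpacing(String line) {
  // Assumption: the region is exactly two uppercase letters, and the code
  // was split into exactly two digit groups separated by whitespace.
  final pattern = RegExp(r'\b([A-Z]{2})\s+(\d{1,4})\s+(\d{1,4})\b');

  return line.replaceAllMapped(pattern, (m) {
    final digits = '${m[2]}${m[3]}';
    // Rewrite only when the joined groups form exactly five digits;
    // otherwise return the matched text unchanged.
    if (digits.length != 5) return m[0]!;
    return '${m[1]} $digits';
  });
}
```

With this sketch, sketchRegionPostalCodeSpacing('SAN JOSE, CA 951 13') returns 'SAN JOSE, CA 95113', while a line such as 'QTY 2 10 50' is returned unchanged because no two-letter uppercase token precedes the digit groups. Requiring the region prefix is a deliberate choice here: it keeps the pass from gluing together unrelated numeric columns in table rows.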