post_process_line library
Line-level cleanup passes for OCR post-processing.
Handles noise line detection and merging, short noisy line normalization, and punctuation-heavy text filtering.
Functions
- mergeNoiseLines(List<String> lines) → List<String>: Merges short noise-only lines into the following content line when useful.
- normalizePriceLikeTableRow(String line) → String: Repairs receipt-style quantity/price rows with noisy decimal separators.
- normalizePunctuationHeavyText(String text) → String: Normalizes lines that are overwhelmingly punctuation.
- normalizeRegionPostalCodeSpacing(String line) → String: Repairs split 5-digit postal codes after uppercase region abbreviations.
- normalizeRepeatedCommaSuffix(String line) → String: Splits all-caps merchant/location tokens that repeat the comma suffix.
- normalizeShortNoisyLines(List<String> lines) → List<String>: Normalizes tiny noisy lines often produced by decorative serif fragments.
- normalizeStandaloneUpperDigitTokenSplit(String line) → String: Splits short standalone uppercase-digit tokens in mixed-case table rows.
- normalizeTrailingSingleUpperTokenSplit(String line) → String: Splits alpha tokens that end with a stray trailing uppercase letter.
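
To make the kind of repair listed above concrete, here is a minimal sketch of a postal-code fix in the spirit of normalizeRegionPostalCodeSpacing. It is not the library's implementation: the function name sketchRegionPostalCodeSpacing, the two-letter region assumption, and the single-split assumption are all hypothetical choices made for illustration.

```dart
/// Hypothetical sketch, not the library's code: rejoins a 5-digit postal
/// code that OCR split into two digit groups after an uppercase region
/// abbreviation. Assumes the abbreviation is exactly two letters and the
/// code was split exactly once.
String sketchRegionPostalCodeSpacing(String line) {
  // Two uppercase letters, whitespace, then two digit groups of 1-4 digits.
  final pattern = RegExp(r'\b([A-Z]{2})\s+(\d{1,4})\s+(\d{1,4})\b');
  return line.replaceAllMapped(pattern, (m) {
    final digits = '${m[2]}${m[3]}';
    // Rewrite only when the two fragments add up to exactly five digits;
    // otherwise return the matched text unchanged.
    return digits.length == 5 ? '${m[1]} $digits' : m[0]!;
  });
}

void main() {
  print(sketchRegionPostalCodeSpacing('Springfield, IL 627 04'));
  print(sketchRegionPostalCodeSpacing('Aisle CA 12 34'));
}
```

The length check keeps the rewrite conservative: digit groups that do not sum to five digits, such as quantities or prices that happen to follow two capital letters, are left as they were.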