isOcrRequiredPdfExtractionError static method
Returns true when a PDF extraction error indicates a scanned/image-only document — the kind OCR can recover.
The Rust parser surfaces a below-threshold error that shares the same
"… fewer than N non-whitespace …" prefix across three cases. It appends
the scanned-specific marker (scanned/image-only) only to the OCR-recoverable
ones, so this method keys on that marker rather than on the shared prefix alone:
- scanned/image-only PDFs with no text layer: marker present, OCR helps;
- mixed PDFs in which some pages are scanned and others failed to extract: marker still present, OCR recovers the scanned pages;
- PDFs where every page failed to extract (corrupt/unsupported content): no marker, OCR will not help.
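The three cases above can be exercised with a minimal sketch. The predicate is copied locally so the example runs on its own. The `Recovery` enum, the `chooseRecovery` helper, and the full error strings are hypothetical: only the shared prefix and the scanned/image-only marker are taken from this page, and the real parser's wording may differ.

```dart
// Local copy of the predicate so this sketch is self-contained.
bool isOcrRequiredPdfExtractionError(Object error) {
  final message = error.toString();
  return message.contains('PDF text extraction returned fewer than') &&
      message.contains('scanned/image-only');
}

// Hypothetical routing decision; not part of the documented API.
enum Recovery { runOcr, reportFailure }

Recovery chooseRecovery(Object error) =>
    isOcrRequiredPdfExtractionError(error)
        ? Recovery.runOcr
        : Recovery.reportFailure;

void main() {
  // Assumed message shapes: the threshold value and the text after the
  // prefix are invented for illustration.
  const prefix =
      'PDF text extraction returned fewer than 20 non-whitespace characters';
  final scanned = Exception('$prefix (scanned/image-only PDF)');
  final mixed = Exception('$prefix (scanned/image-only PDF; 2 pages failed)');
  final allFailed = Exception('$prefix (all pages failed to extract)');

  print(chooseRecovery(scanned)); // Recovery.runOcr
  print(chooseRecovery(mixed)); // Recovery.runOcr
  print(chooseRecovery(allFailed)); // Recovery.reportFailure
}
```

Matching on substrings keeps the check independent of the threshold N, at the cost of depending on the parser's message wording staying stable.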
Implementation
static bool isOcrRequiredPdfExtractionError(Object error) {
  final message = error.toString();
  return message.contains('PDF text extraction returned fewer than') &&
      message.contains('scanned/image-only');
}