isOcrRequiredPdfExtractionError method - DocumentParser class - document_parser library

Returns true when a PDF extraction error indicates a scanned/image-only document — the kind OCR can recover.

The Rust parser reports a below-threshold error whose message shares the same "… fewer than N non-whitespace …" prefix across three cases. It appends the scanned-specific marker only in the OCR-recoverable cases, so this method keys on that marker rather than on the shared prefix:

scanned/image-only PDFs with no text layer — marker present, OCR helps;
mixed PDFs that are scanned but also have some pages that failed to extract — marker still present, OCR recovers the scanned pages;
PDFs where every page failed to extract (corrupt/unsupported content) — no marker, OCR will not help.
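
The three cases above can be sketched as a plain substring check. This is a minimal illustration, not the library's implementation: the marker text, the exact prefix wording, the sample messages, and the helper name looksOcrRecoverable are all invented here, since this page does not give the real strings the Rust parser emits.

```dart
// Hypothetical shared prefix: every below-threshold error starts this way,
// so it cannot tell the three cases apart.
const String sharedPrefix = 'extracted fewer than 20 non-whitespace characters';

// Hypothetical scanned-specific marker; the real text is not documented here.
const String scannedMarker = '(scanned or image-only PDF)';

// Invented sample messages, one per case described above.
const String scannedOnly = '$sharedPrefix $scannedMarker';
const String mixed = '$sharedPrefix $scannedMarker; 2 pages failed to extract';
const String allFailed = '$sharedPrefix; every page failed to extract';

/// Returns true only when [message] carries the scanned-specific marker.
/// Matching on [sharedPrefix] instead would also accept [allFailed].
bool looksOcrRecoverable(String message) => message.contains(scannedMarker);
```

Under these assumptions, looksOcrRecoverable accepts scannedOnly and mixed and rejects allFailed, even though all three messages begin with the same prefix.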

isOcrRequiredPdfExtractionError static method
