DocTextExtractor

A Flutter package for extracting text from Word (.doc, .docx), PDF, and Markdown (.md) files and from Google Docs URLs.

DocTextExtractor is a lightweight Flutter package that extracts text from Word (.doc, .docx), PDF, and Markdown (.md) files and from Google Docs URLs. It parses legacy .doc files offline and recovers each document's real filename. It suits AI-driven apps such as NotteChat that need document-based chat and analysis over both legacy and modern formats.

Features

  • Word (.doc, .docx) Extraction: Parse legacy .doc files offline and .docx files via XML.
  • PDF Extraction: Extract text from PDFs using Syncfusion.
  • Google Docs Support: Download PDF exports from Google Docs URLs with real filename extraction.
  • Offline Support: Process local .doc, .docx, .md, and PDF files without internet.
  • Real Filename Extraction: Retrieve accurate document names from Content-Disposition headers or URLs.
  • Cross-Platform: Works on iOS, Android, and web via Flutter.
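
The real-filename feature can be pictured with a minimal sketch. It assumes the lookup order is the Content-Disposition header first, then the last path segment of the URL. The helper name resolveFilename and the 'document' fallback are hypothetical and not part of the package's documented API; the package's own logic may differ.

```dart
/// Hypothetical helper (not the package's API): picks a filename from a
/// Content-Disposition header value, falling back to the URL path.
String resolveFilename(Uri url, {String? contentDisposition}) {
  if (contentDisposition != null) {
    // Extended form, e.g. filename*=UTF-8''R%C3%A9sum%C3%A9.pdf
    final extended = RegExp(
      r"filename\*\s*=\s*[^']*'[^']*'([^;]+)",
      caseSensitive: false,
    ).firstMatch(contentDisposition);
    if (extended != null) {
      return Uri.decodeComponent(extended.group(1)!.trim());
    }

    // Plain form, e.g. filename="Report.pdf" or filename=report.doc
    final plain = RegExp(
      r'filename\s*=\s*"?([^";]+)"?',
      caseSensitive: false,
    ).firstMatch(contentDisposition);
    if (plain != null) {
      return plain.group(1)!.trim();
    }
  }

  // Fall back to the last non-empty path segment of the URL.
  final segments = url.pathSegments.where((s) => s.isNotEmpty);
  return segments.isEmpty ? 'document' : segments.last;
}
```

With this sketch, a header of attachment; filename="Report.pdf" yields Report.pdf, and a URL such as https://example.com/files/sample.doc with no header yields sample.doc.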

Installation

Add the package to your pubspec.yaml:

dependencies:
  doc_text_extractor: ^1.0.0

Run:

flutter pub get

Usage

Extract Text from a URL

import 'package:doc_text_extractor/doc_text_extractor.dart';

Future<void> main() async {
  final extractor = TextExtractor();
  try {
    // Extract text from a Google Docs URL
    final result = await extractor.extractText('https://docs.google.com/document/d/EXAMPLE_ID/edit');
    print('Filename: ${result['filename']}');
    print('Text: ${result['text']}');

    // Extract text from a .doc URL
    final docResult = await extractor.extractText('https://example.com/sample.doc');
    print('Filename: ${docResult['filename']}');
    print('Text: ${docResult['text']}');

    // Extract text from a .md URL
    final mdResult = await extractor.extractText('https://example.com/sample.md');
    print('Filename: ${mdResult['filename']}');
    print('Text: ${mdResult['text']}');
  } catch (e) {
    print('Error: $e');
  }
}

Extract Text from a Local File

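import 'dart:io';

import 'package:doc_text_extractor/doc_text_extractor.dart';
import 'package:flutter/widgets.dart';
import 'package:path_provider/path_provider.dart';

Future<void> main() async {
  // Plugins such as path_provider need the Flutter binding before runApp().
  WidgetsFlutterBinding.ensureInitialized();
  final extractor = TextExtractor();
  try {
    final dir = await getTemporaryDirectory();
    final filePath = '${dir.path}/sample.pdf';
    // sample.pdf must already exist in the temporary directory.
    if (!await File(filePath).exists()) {
      print('File not found: $filePath');
      return;
    }
    // isUrl: false marks the argument as a local path rather than a URL.
    final result = await extractor.extractText(filePath, isUrl: false);
    print('Filename: ${result['filename']}');
    print('Text: ${result['text']}');
  } catch (e) {
    print('Error: $e');
  }
}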

Dependencies

  • http: Fetches document URLs.
  • syncfusion_flutter_pdf: Extracts PDF text.
  • archive and xml: Parse .docx files.
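
As an illustration of how archive and xml can work together on a .docx file, here is a minimal sketch. It assumes the usual .docx layout, a ZIP container whose main body lives in word/document.xml, with text runs in w:t elements grouped into w:p paragraphs. The helper names textFromDocumentXml and extractDocxText are hypothetical; the package's internal parser is not shown in this README and may work differently.

```dart
import 'dart:convert';
import 'dart:typed_data';

import 'package:archive/archive.dart';
import 'package:xml/xml.dart';

/// Hypothetical helper: joins the text runs (w:t) inside each paragraph
/// (w:p) and separates paragraphs with newlines.
String textFromDocumentXml(String documentXml) {
  final document = XmlDocument.parse(documentXml);
  return document
      .findAllElements('w:p')
      .map((p) => p.findAllElements('w:t').map((t) => t.innerText).join())
      .join('\n');
}

/// Hypothetical helper: unzips the .docx bytes and reads the main
/// document part, word/document.xml.
String extractDocxText(Uint8List docxBytes) {
  final archive = ZipDecoder().decodeBytes(docxBytes);
  final entry = archive.findFile('word/document.xml');
  if (entry == null) {
    throw const FormatException('word/document.xml not found');
  }
  final xmlString = utf8.decode(entry.content as List<int>);
  return textFromDocumentXml(xmlString);
}
```

This sketch ignores headers, footers, and footnotes, which live in other parts of the container, and paragraphs nested inside text boxes would be emitted more than once.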

Limitations

  • Google Docs URLs must be publicly accessible or shared with export permissions.
  • Large files (over 10 MB) can take noticeable time to process; show a loading indicator while extraction runs.
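
The Google Docs limitation is easier to see with the export URL in hand. The sketch below assumes the common export pattern /document/d/<id>/export?format=pdf, which only returns a PDF when the document is readable by the requester. The helper name googleDocsPdfExportUrl is hypothetical and is not the package's API.

```dart
/// Hypothetical helper: turns a Google Docs edit/view URL into its PDF
/// export URL, or returns null when the input is not a Google Docs link.
Uri? googleDocsPdfExportUrl(String url) {
  final uri = Uri.tryParse(url);
  if (uri == null || uri.host != 'docs.google.com') return null;

  // Expected path shape: /document/d/<id>/...
  final segments = uri.pathSegments;
  if (segments.isEmpty || segments.first != 'document') return null;
  final marker = segments.indexOf('d');
  if (marker < 0 || marker + 1 >= segments.length) return null;

  final id = segments[marker + 1];
  if (id.isEmpty) return null;

  return Uri.https(
    'docs.google.com',
    '/document/d/$id/export',
    {'format': 'pdf'},
  );
}
```

For https://docs.google.com/document/d/EXAMPLE_ID/edit this produces https://docs.google.com/document/d/EXAMPLE_ID/export?format=pdf; a private document would answer that request with a sign-in page or an error instead of a PDF.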

Contributing

Contributions are welcome! Fork the repository, create a branch, and submit a pull request. Report issues at GitHub Issues.

License

MIT License. See LICENSE for details.
