docx_to_markdown 0.1.0 copy "docx_to_markdown: ^0.1.0" to clipboard
docx_to_markdown: ^0.1.0 copied to clipboard

Convert Word .docx (OOXML) files to GitHub-Flavored or CommonMark Markdown in pure Dart, with images, tables, lists, footnotes, math, and extensible hooks.

docx_to_markdown #

pub version pub points downloads style: lints License: MIT

Convert Word .docx (OOXML) files to clean, deterministic Markdown - in pure Dart.

docx_to_markdown parses a document into a structured intermediate representation (IR) and then renders it, rather than scraping text run by run. That extra step is what buys Pandoc-level structural fidelity - deeply nested lists, complex tables, footnotes, math, and definition lists survive the trip - while keeping the package 100% pure Dart with zero native dependencies. It runs anywhere Dart runs: server, CLI, Flutter on Android, iOS, macOS, Windows, Linux, and browser-based Flutter web apps.

final bytes = File('report.docx').readAsBytesSync();
final markdown = await DocxConverter(bytes).convert();

Features #

Content - paragraphs, headings (1-6), blockquotes, ordered/unordered/nested/mixed lists, task lists, tables (with HTML fallback for merged cells), images, links, internal bookmarks, footnotes, endnotes, comments, headers/footers, definition lists, and horizontal rules.

Formatting - bold, italic, strikethrough, underline, sub/superscript, inline code, highlight, and text color, with faithful handling of style-inherited and character-style run properties.

Output targeting - one input, tuned output: GitHub Flavored Markdown or strict CommonMark, with switchable conventions for tables, underline, images (Obsidian / Pandoc sizing), definition lists, and ordered-list markers.

Math - OMML equations are extracted with a readable text fallback, or routed through your own OMML-to-LaTeX converter via a hook.

Metadata - document properties (docProps/core.xml and docProps/custom.xml) are always parsed, and can optionally be emitted as a Pandoc-style YAML front matter block.

Extensible - rewrite links and image targets, transform or drop blocks and inlines, collect warnings, and reach the full IR - all without forking.

Safe - malformed archives, non-DOCX ZIPs, and missing parts degrade to partial output and warnings instead of crashing (opt into strict mode to throw instead).

Install #

dart pub add docx_to_markdown
# Flutter:
flutter pub add docx_to_markdown

Or add it to your pubspec.yaml:

dependencies:
  docx_to_markdown: ^0.1.0

Quick start #

import 'dart:io';
import 'package:docx_to_markdown/docx_to_markdown.dart';

Future<void> main() async {
  // The converter works on bytes, so load the file however you like.
  final bytes = await File('document.docx').readAsBytes();

  // Defaults: GitHub Flavored Markdown, footnotes on, tables auto.
  final markdown = await DocxConverter(bytes).convert();

  print(markdown);
}

By default, DocxConverter never touches the disk or network - you supply the bytes, and it returns a Markdown String. Disk image extraction happens only when you pass imageOutputDirectory. Invalid input (empty bytes or a non-ZIP file) throws a DocxPackageException.

Configuration #

Pass a DocxToMarkdownConfig to control the output. Every option has a sensible default, so override only what you need:

final config = DocxToMarkdownConfig(
  flavor: MarkdownFlavor.gfm,            // or .commonmark
  tableMode: TableMode.auto,             // pipes for simple tables, HTML for complex
  extractImages: true,
  imageSizeMode: ImageSizeMode.obsidian, // ![alt](img.png =300x200)
  includeComments: false,
  metadataMode: MetadataMode.yamlFrontMatter,
  trackChangesMode: TrackChangesMode.showDeletionsAsStrikethrough,

  // Map your template's custom paragraph styles onto Markdown constructs.
  paragraphStyleOverrides: {
    'CodeExample': ParagraphStyleOverride(isCodeBlock: true, codeLanguage: 'dart'),
    'Callout': ParagraphStyleOverride(isQuote: true),
    'ChapterTitle': ParagraphStyleOverride(headingLevel: 1),
  },
);

final converter = DocxConverter(bytes, config: config);

// On VM/native platforms, provide a directory to write extracted images.
final markdown = await converter.convert(imageOutputDirectory: 'assets/images');

Configs are immutable; use config.copyWith(...) to derive variants.

Web image handling #

On Flutter web or browser tests, pass bytes from your picker/uploader and use imageAssetSink to decide where image bytes go. The returned string becomes the Markdown image source:

final exportedImages = <DocxImageAsset>[];

final markdown = await DocxConverter(bytes).convert(
  imageAssetSink: (asset) {
    exportedImages.add(asset);
    return 'images/${asset.suggestedFilename}';
  },
);

The sink receives the image package path, suggested filename, MIME type, and bytes. It can return a Blob URL, upload URL, app-local key, data URI, or any source your Markdown renderer understands. Return null or an empty string to skip one image.

To remove all images from the output and avoid reading/exporting image bytes:

final markdown = await DocxConverter(
  bytes,
  config: DocxToMarkdownConfig(extractImages: false),
).convert();

imageOutputDirectory requires VM/native file IO. Use imageAssetSink on web.

Key options #

Output and format

Option Default Notes
flavor gfm gfm or commonmark. CommonMark renders tables as HTML.
tableMode auto auto, markdownOnly, or htmlOnly.
definitionListMode html html (<dl>), pandoc, or paragraphs.
pageBreakMode ignore ignore, thematicBreak (---), or htmlComment.
metadataMode none Set to yamlFrontMatter to emit front matter.

Content

Option Default Notes
extractImages true Use imageOutputDirectory on VM/native or imageAssetSink on web. Set false to skip all images.
imageSizeMode none none, obsidian (=WxH), or pandoc ({ width=... }).
includeFootnotes true GFM [^1] footnote syntax.
includeEndnotes false Treated like footnotes when enabled.
includeComments false Rendered as HTML comments.
includeHeadersFooters false Rendered as separate sections.
unknownElementPolicy keepText drop, keepText, or keepHtml for unsupported XML.

Formatting

Option Default Notes
underlineMode html <u>, ++text++, or ignore.
highlightMode none mark wraps in <mark>.
textColorMode none htmlSpan wraps in <span style="color:...">.
trackChangesMode acceptAll Accept, reject, or show deletions as ~~strikethrough~~.
lineBreakStyle hardBreak hardBreak, htmlBr, or newline.

Lists

Option Default Notes
orderedListNumbering alwaysOne keep preserves source numbers; alwaysOne emits 1..
orderedListMarker decimal preserveFormat emits Pandoc Roman/alpha markers.
tightLists true No blank lines between items.

Diagnostics

Option Default Notes
strict false Throw on structural problems instead of warning.
emitWarningsAsHtmlComments false Inline <!-- Warning: ... --> markers.

This is the common subset. See the full API reference for every option (code detection via monospaceFonts / codeBlockStyleName / treatShadedRunsAsCode, bulletMarkers, listIndent, maxImageWidth, preserveEmptyParagraphs, and more).

Document metadata #

Parsing always populates Document.metadata. By default nothing is rendered, so the Markdown body is unchanged. Set metadataMode to prepend a Pandoc-style YAML front matter block:

final config = DocxToMarkdownConfig(metadataMode: MetadataMode.yamlFrontMatter);
final markdown = await DocxConverter(bytes, config: config).convert();
---
title: Quarterly Report
author: A. M.
subject: This is the subject
keywords:
  - keyword 1
  - keyword 2
lang: en-US
created: "2025-08-04T18:53:49Z"
modified: "2025-08-04T18:53:49Z"
custom:
  Company: My Company
---

Core properties map to Pandoc-friendly keys (creator becomes author, language becomes lang), keywords is split into a list, and custom properties are nested under custom. Front matter is emitted only when metadata is present; producer statistics from docProps/app.xml (page/word counts) are intentionally excluded.

Hooks and the IR #

Hooks let you intercept conversion without forking the library. The package exports the full IR (Document, Block, Inline and friends), so transforms can match on concrete node types:

final hooks = DocxToMarkdownHooks(
  // Fix relative links.
  rewriteLinkTarget: (url, [ctx]) =>
      url.startsWith('/') ? 'https://example.com$url' : url,

  // Rewrite image references after export.
  rewriteImageTarget: (src, [ctx]) => 'https://cdn.example.com/${src.split('/').last}',

  // Drop every table from the output.
  transformBlock: (block, ctx) => block is TableBlock ? null : block,

  // Supply your own OMML -> LaTeX conversion (return null to use the text fallback).
  ommlToLatex: (ommlXml, [ctx]) => null,

  // Observe non-fatal issues.
  onWarning: (warning, ctx) => print('[${warning.code}] ${warning.message} @ ${ctx.location}'),
);

final markdown =
    await DocxConverter(bytes, config: DocxToMarkdownConfig(hooks: hooks)).convert();

Supported elements #

Word element Markdown output Notes
Paragraphs text Empty paragraphs collapsed unless preserveEmptyParagraphs.
Headings 1-6 # Heading From outline levels, style names, or overrides.
Lists - item / 1. item Deep nesting, mixed and ordered/unordered.
Task lists - [ ] / - [x] GFM checkboxes.
Tables | col | or <table> Simple tables use pipes; complex ones fall back to HTML.
Images ![alt](src) Extracted to disk, exported through an asset sink, or skipped; optional sizing.
Links [text](url) Web links and internal bookmarks.
Bold / italic **bold**, *italic* Plus ~~strikethrough~~ and sub/superscript.
Underline <u> / ++..++ Configurable via underlineMode.
Highlight / color <mark> / <span> Opt-in via highlightMode / textColorMode.
Inline / block code `code` / fenced Detected by font, shading, or style name.
Blockquotes > text From styles containing "Quote".
Footnotes / endnotes [^1]: ... GFM syntax; endnotes opt-in.
Definition lists <dl> / Pandoc / flat Configurable via definitionListMode.
Math (OMML) text or LaTeX Text fallback by default; LaTeX via ommlToLatex hook.
Metadata YAML front matter Opt-in via MetadataMode.yamlFrontMatter.

Limitations #

Worth knowing before you adopt:

  • SmartArt, charts, and embedded objects cannot be cleanly represented in Markdown. By default their text content is preserved (unknownElementPolicy: keepText) and a warning is emitted; the visual structure is lost.
  • Page geometry - sections, columns, margins, and page size - has no Markdown equivalent and is not modeled. Explicit page breaks can optionally become a thematic break or an HTML comment.
  • Headers, footers, comments, and endnotes are off by default; enable them explicitly when you want them in the output.

Warnings are surfaced through DocxToMarkdownHooks.onWarning (or inlined with emitWarningsAsHtmlComments), so you can always see what was downgraded.

Philosophy #

This package aims for "Pandoc-level" quality in pure Dart. It treats a .docx not as a bag of text but as a structured hierarchy, and prioritizes:

  1. Correctness - lists and tables should not break the surrounding layout.
  2. Safety - a malformed document should never crash the parser.
  3. Flexibility - you should be able to tune the output to match your CMS, wiki, or renderer.

Contributing #

Contributions are welcome. See docs_guide.md for the documentation philosophy this codebase follows ("Describe the Behavior, Not Just the Data").

License #

MIT - Copyright 2025 Omar Khaled Hanafy.

0
likes
160
points
30
downloads

Documentation

Documentation
API reference

Publisher

verified publishertomars.tech

Weekly Downloads

Convert Word .docx (OOXML) files to GitHub-Flavored or CommonMark Markdown in pure Dart, with images, tables, lists, footnotes, math, and extensible hooks.

Repository (GitHub)
View/report issues

Topics

#docx #markdown #ooxml #word #documents

License

MIT (license)

Dependencies

archive, collection, meta, path, xml

More

Packages that depend on docx_to_markdown