docx_to_markdown
Convert Word .docx (OOXML) files to clean, deterministic Markdown - in pure Dart.
docx_to_markdown parses a document into a structured intermediate representation (IR) and then renders it, rather than scraping text run by run. That extra step is what buys Pandoc-level structural fidelity - deeply nested lists, complex tables, footnotes, math, and definition lists survive the trip - while keeping the package 100% pure Dart with zero native dependencies. It runs anywhere Dart runs: server, CLI, Flutter on Android, iOS, macOS, Windows, Linux, and browser-based Flutter web apps.
final bytes = File('report.docx').readAsBytesSync();
final markdown = await DocxConverter(bytes).convert();
Features
Content - paragraphs, headings (1-6), blockquotes, ordered/unordered/nested/mixed lists, task lists, tables (with HTML fallback for merged cells), images, links, internal bookmarks, footnotes, endnotes, comments, headers/footers, definition lists, and horizontal rules.
Formatting - bold, italic, strikethrough, underline, sub/superscript, inline code, highlight, and text color, with faithful handling of style-inherited and character-style run properties.
Output targeting - one input, tuned output: GitHub Flavored Markdown or strict CommonMark, with switchable conventions for tables, underline, images (Obsidian / Pandoc sizing), definition lists, and ordered-list markers.
Math - OMML equations are extracted with a readable text fallback, or routed through your own OMML-to-LaTeX converter via a hook.
Metadata - document properties (docProps/core.xml and docProps/custom.xml) are always parsed, and can optionally be emitted as a Pandoc-style YAML front matter block.
Extensible - rewrite links and image targets, transform or drop blocks and inlines, collect warnings, and reach the full IR - all without forking.
Safe - malformed archives, non-DOCX ZIPs, and missing parts degrade to partial output and warnings instead of crashing (opt into strict mode to throw instead).
Install
dart pub add docx_to_markdown
# Flutter:
flutter pub add docx_to_markdown
Or add it to your pubspec.yaml:
dependencies:
docx_to_markdown: ^0.1.0
Quick start
import 'dart:io';
import 'package:docx_to_markdown/docx_to_markdown.dart';
Future<void> main() async {
// The converter works on bytes, so load the file however you like.
final bytes = await File('document.docx').readAsBytes();
// Defaults: GitHub Flavored Markdown, footnotes on, tables auto.
final markdown = await DocxConverter(bytes).convert();
print(markdown);
}
By default, DocxConverter never touches the disk or network - you supply the bytes, and it returns a Markdown String. Disk image extraction happens only when you pass imageOutputDirectory. Invalid input (empty bytes or a non-ZIP file) throws a DocxPackageException.
Configuration
Pass a DocxToMarkdownConfig to control the output. Every option has a sensible default, so override only what you need:
final config = DocxToMarkdownConfig(
flavor: MarkdownFlavor.gfm, // or .commonmark
tableMode: TableMode.auto, // pipes for simple tables, HTML for complex
extractImages: true,
imageSizeMode: ImageSizeMode.obsidian, // 
includeComments: false,
metadataMode: MetadataMode.yamlFrontMatter,
trackChangesMode: TrackChangesMode.showDeletionsAsStrikethrough,
// Map your template's custom paragraph styles onto Markdown constructs.
paragraphStyleOverrides: {
'CodeExample': ParagraphStyleOverride(isCodeBlock: true, codeLanguage: 'dart'),
'Callout': ParagraphStyleOverride(isQuote: true),
'ChapterTitle': ParagraphStyleOverride(headingLevel: 1),
},
);
final converter = DocxConverter(bytes, config: config);
// On VM/native platforms, provide a directory to write extracted images.
final markdown = await converter.convert(imageOutputDirectory: 'assets/images');
Configs are immutable; use config.copyWith(...) to derive variants.
Web image handling
On Flutter web or browser tests, pass bytes from your picker/uploader and use imageAssetSink to decide where image bytes go. The returned string becomes the Markdown image source:
final exportedImages = <DocxImageAsset>[];
final markdown = await DocxConverter(bytes).convert(
imageAssetSink: (asset) {
exportedImages.add(asset);
return 'images/${asset.suggestedFilename}';
},
);
The sink receives the image package path, suggested filename, MIME type, and bytes. It can return a Blob URL, upload URL, app-local key, data URI, or any source your Markdown renderer understands. Return null or an empty string to skip one image.
To remove all images from the output and avoid reading/exporting image bytes:
final markdown = await DocxConverter(
bytes,
config: DocxToMarkdownConfig(extractImages: false),
).convert();
imageOutputDirectory requires VM/native file IO. Use imageAssetSink on web.
Key options
Output and format
| Option | Default | Notes |
|---|---|---|
flavor |
gfm |
gfm or commonmark. CommonMark renders tables as HTML. |
tableMode |
auto |
auto, markdownOnly, or htmlOnly. |
definitionListMode |
html |
html (<dl>), pandoc, or paragraphs. |
pageBreakMode |
ignore |
ignore, thematicBreak (---), or htmlComment. |
metadataMode |
none |
Set to yamlFrontMatter to emit front matter. |
Content
| Option | Default | Notes |
|---|---|---|
extractImages |
true |
Use imageOutputDirectory on VM/native or imageAssetSink on web. Set false to skip all images. |
imageSizeMode |
none |
none, obsidian (=WxH), or pandoc ({ width=... }). |
includeFootnotes |
true |
GFM [^1] footnote syntax. |
includeEndnotes |
false |
Treated like footnotes when enabled. |
includeComments |
false |
Rendered as HTML comments. |
includeHeadersFooters |
false |
Rendered as separate sections. |
unknownElementPolicy |
keepText |
drop, keepText, or keepHtml for unsupported XML. |
Formatting
| Option | Default | Notes |
|---|---|---|
underlineMode |
html |
<u>, ++text++, or ignore. |
highlightMode |
none |
mark wraps in <mark>. |
textColorMode |
none |
htmlSpan wraps in <span style="color:...">. |
trackChangesMode |
acceptAll |
Accept, reject, or show deletions as ~~strikethrough~~. |
lineBreakStyle |
hardBreak |
hardBreak, htmlBr, or newline. |
Lists
| Option | Default | Notes |
|---|---|---|
orderedListNumbering |
alwaysOne |
keep preserves source numbers; alwaysOne emits 1.. |
orderedListMarker |
decimal |
preserveFormat emits Pandoc Roman/alpha markers. |
tightLists |
true |
No blank lines between items. |
Diagnostics
| Option | Default | Notes |
|---|---|---|
strict |
false |
Throw on structural problems instead of warning. |
emitWarningsAsHtmlComments |
false |
Inline <!-- Warning: ... --> markers. |
This is the common subset. See the full API reference for every option (code detection via
monospaceFonts/codeBlockStyleName/treatShadedRunsAsCode,bulletMarkers,listIndent,maxImageWidth,preserveEmptyParagraphs, and more).
Document metadata
Parsing always populates Document.metadata. By default nothing is rendered, so the Markdown body is unchanged. Set metadataMode to prepend a Pandoc-style YAML front matter block:
final config = DocxToMarkdownConfig(metadataMode: MetadataMode.yamlFrontMatter);
final markdown = await DocxConverter(bytes, config: config).convert();
---
title: Quarterly Report
author: A. M.
subject: This is the subject
keywords:
- keyword 1
- keyword 2
lang: en-US
created: "2025-08-04T18:53:49Z"
modified: "2025-08-04T18:53:49Z"
custom:
Company: My Company
---
Core properties map to Pandoc-friendly keys (creator becomes author, language becomes lang), keywords is split into a list, and custom properties are nested under custom. Front matter is emitted only when metadata is present; producer statistics from docProps/app.xml (page/word counts) are intentionally excluded.
Hooks and the IR
Hooks let you intercept conversion without forking the library. The package exports the full IR (Document, Block, Inline and friends), so transforms can match on concrete node types:
final hooks = DocxToMarkdownHooks(
// Fix relative links.
rewriteLinkTarget: (url, [ctx]) =>
url.startsWith('/') ? 'https://example.com$url' : url,
// Rewrite image references after export.
rewriteImageTarget: (src, [ctx]) => 'https://cdn.example.com/${src.split('/').last}',
// Drop every table from the output.
transformBlock: (block, ctx) => block is TableBlock ? null : block,
// Supply your own OMML -> LaTeX conversion (return null to use the text fallback).
ommlToLatex: (ommlXml, [ctx]) => null,
// Observe non-fatal issues.
onWarning: (warning, ctx) => print('[${warning.code}] ${warning.message} @ ${ctx.location}'),
);
final markdown =
await DocxConverter(bytes, config: DocxToMarkdownConfig(hooks: hooks)).convert();
Supported elements
| Word element | Markdown output | Notes |
|---|---|---|
| Paragraphs | text | Empty paragraphs collapsed unless preserveEmptyParagraphs. |
| Headings 1-6 | # Heading |
From outline levels, style names, or overrides. |
| Lists | - item / 1. item |
Deep nesting, mixed and ordered/unordered. |
| Task lists | - [ ] / - [x] |
GFM checkboxes. |
| Tables | | col | or <table> |
Simple tables use pipes; complex ones fall back to HTML. |
| Images |  |
Extracted to disk, exported through an asset sink, or skipped; optional sizing. |
| Links | [text](url) |
Web links and internal bookmarks. |
| Bold / italic | **bold**, *italic* |
Plus ~~strikethrough~~ and sub/superscript. |
| Underline | <u> / ++..++ |
Configurable via underlineMode. |
| Highlight / color | <mark> / <span> |
Opt-in via highlightMode / textColorMode. |
| Inline / block code | `code` / fenced |
Detected by font, shading, or style name. |
| Blockquotes | > text |
From styles containing "Quote". |
| Footnotes / endnotes | [^1]: ... |
GFM syntax; endnotes opt-in. |
| Definition lists | <dl> / Pandoc / flat |
Configurable via definitionListMode. |
| Math (OMML) | text or LaTeX | Text fallback by default; LaTeX via ommlToLatex hook. |
| Metadata | YAML front matter | Opt-in via MetadataMode.yamlFrontMatter. |
Limitations
Worth knowing before you adopt:
- SmartArt, charts, and embedded objects cannot be cleanly represented in Markdown. By default their text content is preserved (
unknownElementPolicy: keepText) and a warning is emitted; the visual structure is lost. - Page geometry - sections, columns, margins, and page size - has no Markdown equivalent and is not modeled. Explicit page breaks can optionally become a thematic break or an HTML comment.
- Headers, footers, comments, and endnotes are off by default; enable them explicitly when you want them in the output.
Warnings are surfaced through DocxToMarkdownHooks.onWarning (or inlined with emitWarningsAsHtmlComments), so you can always see what was downgraded.
Philosophy
This package aims for "Pandoc-level" quality in pure Dart. It treats a .docx not as a bag of text but as a structured hierarchy, and prioritizes:
- Correctness - lists and tables should not break the surrounding layout.
- Safety - a malformed document should never crash the parser.
- Flexibility - you should be able to tune the output to match your CMS, wiki, or renderer.
Contributing
Contributions are welcome. See docs_guide.md for the documentation philosophy this codebase follows ("Describe the Behavior, Not Just the Data").
License
MIT - Copyright 2025 Omar Khaled Hanafy.
Libraries
- docx_to_markdown
- Converts DOCX (OOXML) documents to Markdown with configurable output.