# Chunky
Chunky helps with the ingestion side of long-form AI and search pipelines: it extracts text from files, splits content into manageable chunks, adds optional overlap, and can turn those chunks into embeddings.
Everything is exported from:

```dart
import 'package:chunky/chunky.dart';
```
## Features

- Chunk in-memory strings or arbitrary `Stream<String>` sources.
- Chunk files with automatic file-type detection through `FileStringer`.
- Preserve context between chunks with word-aware overlap.
- Attach chunk metadata (`id`, `start`, `length`, `content`) for indexing and traceability.
- Generate embeddings with bounded concurrency through `Embedder`.
- Ingest plain text, XLSX, DOCX, PDF, and many Pandoc-supported formats.
- Replace or reorder built-in file handlers through the global `fileStringers` list.
## Installation
Chunky currently depends on Flutter because PDF extraction uses `syncfusion_flutter_pdf`, so install it in a Flutter package or app:

```sh
flutter pub add chunky
```
Optional external tools:

- `pandoc` when you want `PandocStringer` support for formats such as Markdown, EPUB, ODT, RTF, LaTeX, Org, and more.
- `ocrmypdf` when you want PDF ingestion through `PDFStringer`.
If you only work with in-memory strings, you do not need either external binary.
## Quick Start
```dart
import 'dart:io';

import 'package:chunky/chunky.dart';

Future<void> main() async {
  final chunker = Chunker(chunkSize: 300);

  await for (final chunk in chunker.transformString(
    'Chunk long-form text directly from memory.',
  )) {
    print('${chunk.id}: ${chunk.start}..${chunk.start + chunk.length}');
    print(chunk.content);
  }

  await for (final chunk in chunker.transformFile(File('notes.txt'))) {
    print('file chunk ${chunk.id}: ${chunk.content}');
  }

  final embedder = Embedder(
    chunker: chunker,
    overlap: 50,
    embedder: (content) async => <double>[content.length.toDouble()],
  );

  await for (final embedded in embedder.transform(
    Stream.value('Embed chunked content with overlap.'),
  )) {
    print(
      '${embedded.chunk.id}: ${embedded.embedding.length} dims from ${embedded.chunk.content}',
    );
  }
}
```
## Usage

### Chunk a String

Use `transformString` when your content is already in memory:
```dart
final chunker = Chunker(chunkSize: 500);

await for (final chunk in chunker.transformString(longArticle)) {
  print('chunk #${chunk.id}');
  print('start=${chunk.start}, length=${chunk.length}');
  print(chunk.content);
}
```
Chunk sizes are approximate rather than exact. Chunky tries to split content into readable segments near the requested size instead of cutting blindly at a character boundary.
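As a rough illustration of why sizes are approximate, here is a minimal word-aware splitter. This is not Chunky's actual implementation; it only shows how flushing a chunk whenever the next word would overshoot the target makes chunk lengths land near, rather than exactly at, `chunkSize`:

```dart
// Illustrative only: a size-approximate, word-aware splitter.
// Chunks end at word boundaries, so their lengths hover around
// chunkSize instead of matching it exactly.
List<String> roughChunks(String text, int chunkSize) {
  final chunks = <String>[];
  final buffer = StringBuffer();
  for (final word in text.split(' ')) {
    // Flush once adding the next word would push past the target size.
    if (buffer.isNotEmpty && buffer.length + word.length + 1 > chunkSize) {
      chunks.add(buffer.toString());
      buffer.clear();
    }
    if (buffer.isNotEmpty) buffer.write(' ');
    buffer.write(word);
  }
  if (buffer.isNotEmpty) chunks.add(buffer.toString());
  return chunks;
}
```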
### Chunk a Stream

If your text arrives progressively, pass a `Stream<String>` directly:
```dart
final stream = Stream.fromIterable([
  'Section one...\n',
  'Section two...\n',
  'Section three...\n',
]);

await for (final chunk in Chunker(chunkSize: 250).transform(stream)) {
  print(chunk.content);
}
```
### Add Overlap Between Chunks
Overlap helps preserve context for embeddings, search, and RAG:
```dart
final chunker = Chunker(chunkSize: 300);

await for (final chunk in chunker.transformWithOverlap(
  Stream.value(longText),
  overlap: 60,
)) {
  print('${chunk.id}: ${chunk.content}');
}
```
For files there is a convenience helper:
```dart
await for (final chunk in chunker.transformFileWithOverlap(
  File('report.md'),
  overlap: 80,
)) {
  print(chunk.content);
}
```
`overlap` is measured as a maximum character budget, but the overlap logic keeps whole words from the previous chunk instead of slicing in the middle of a token.
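A minimal sketch of what a word-preserving overlap budget can look like (illustrative only, not Chunky's code): walk backwards from the end of the previous chunk and keep whole words while they still fit within the character budget.

```dart
// Illustrative only: collect whole words from the tail of the previous
// chunk until the character budget (including joining spaces) is spent.
String wordAwareOverlap(String previousChunk, int overlap) {
  final words = previousChunk.split(' ');
  final tail = <String>[];
  var used = 0;
  for (final word in words.reversed) {
    final cost = tail.isEmpty ? word.length : word.length + 1; // +1 for space
    if (used + cost > overlap) break;
    tail.insert(0, word);
    used += cost;
  }
  return tail.join(' ');
}
```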
### Chunk a File

`transformFile` delegates to `FileStringer.streamFile`, which picks the first handler that supports the file extension:
```dart
final chunker = Chunker(chunkSize: 400);
final file = File('knowledge-base.docx');

await for (final chunk in chunker.transformFile(file)) {
  print(chunk.content);
}
```
If you only want extracted text without chunking, use `FileStringer.streamFile` directly:
```dart
await for (final piece in FileStringer.streamFile(File('data.xlsx'))) {
  print(piece);
}
```
### Generate Embeddings

`Embedder` composes chunking, overlap, and your embedding callback:
```dart
final embedder = Embedder(
  chunker: Chunker(chunkSize: 350),
  overlap: 75,
  embedder: (content) async {
    return myEmbeddingClient.embed(content);
  },
);

await for (final embedded in embedder.transform(
  Stream.value(documentText),
  semaphoreBuffer: 8,
)) {
  print(embedded.chunk.id);
  print(embedded.embedding);
}
```
`semaphoreBuffer` controls how many embedding jobs run in parallel; it defaults to 4.
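For intuition, bounded concurrency of this kind can be sketched with a small counting semaphore. The `Semaphore` and `mapBounded` helpers below are illustrative, not Chunky's internals: at most `limit` jobs are in flight at once, and the rest queue until a permit is released.

```dart
import 'dart:async';
import 'dart:collection';

// Illustrative counting semaphore: acquire() waits when no permits
// remain; release() wakes the oldest waiter or returns the permit.
class Semaphore {
  Semaphore(this._permits);
  int _permits;
  final _waiters = Queue<Completer<void>>();

  Future<void> acquire() {
    if (_permits > 0) {
      _permits--;
      return Future.value();
    }
    final completer = Completer<void>();
    _waiters.add(completer);
    return completer.future;
  }

  void release() {
    if (_waiters.isNotEmpty) {
      _waiters.removeFirst().complete();
    } else {
      _permits++;
    }
  }
}

// Run an async job over every item with at most `limit` jobs in flight.
Future<List<T>> mapBounded<S, T>(
  List<S> items,
  Future<T> Function(S) job, {
  int limit = 4,
}) {
  final semaphore = Semaphore(limit);
  return Future.wait(items.map((item) async {
    await semaphore.acquire();
    try {
      return await job(item);
    } finally {
      semaphore.release();
    }
  }));
}
```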
### Customize File Handlers

Chunky exposes the global `fileStringers` list so you can swap built-in handlers, change priority, or add your own:
```dart
fileStringers = const [
  XLSXFileStringer(),
  DOCXFileStringer(handleNumbering: true),
  PDFStringer(),
  PandocStringer(),
  TextFileStringer(),
];
```
This is useful if you want numbered DOCX paragraphs or if you need a custom stringer to run before the defaults.
## Supported File Types

Some extensions are supported by more than one handler. Chunky uses the first matching stringer in `fileStringers`.
| Handler | Formats | Notes |
|---|---|---|
| `TextFileStringer` | `txt`, `json`, `yaml`, `toml`, `xml`, `html`, `csv` | Reads line-by-line as UTF-8 text. |
| `XLSXFileStringer` | `xlsx` | Streams sheet names and comma-separated row values. |
| `DOCXFileStringer` | `docx` | Reads `word/document.xml` directly from the archive. Optional numbering support is available. |
| `PDFStringer` | `pdf` | Requires `ocrmypdf`; falls back to direct PDF text extraction when OCR sidecar text is unnecessary. |
| `PandocStringer` | Many formats | Requires `pandoc` for formats such as `md`, `markdown`, `epub`, `odt`, `rtf`, `latex`, `org`, `tsv`, and more. |
The default priority order is:

1. `XLSXFileStringer`
2. `DOCXFileStringer`
3. `PDFStringer`
4. `PandocStringer`
5. `TextFileStringer`
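The priority order amounts to first-match dispatch over the handler list. A simplified sketch (handler names from this README; the extension sets are abbreviated, see the table above for actual coverage):

```dart
// Illustrative only: priority-ordered map, first handler whose
// extension set contains the file's extension wins.
const handlerExtensions = <String, Set<String>>{
  'XLSXFileStringer': {'xlsx'},
  'DOCXFileStringer': {'docx'},
  'PDFStringer': {'pdf'},
  'PandocStringer': {'md', 'epub', 'odt', 'rtf', 'org'},
  'TextFileStringer': {'txt', 'json', 'yaml', 'csv'},
};

String? pickHandler(String extension) {
  for (final entry in handlerExtensions.entries) {
    if (entry.value.contains(extension)) return entry.key;
  }
  return null; // No handler supports this extension.
}
```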
## Main API Surface

- `Chunker(chunkSize: 300)`
- `Chunker.transformString(String input)`
- `Chunker.transform(Stream<String> rawFeed)`
- `Chunker.transformWithOverlap(Stream<String> rawFeed, {int overlap = 50})`
- `Chunker.transformFile(File file)`
- `Chunker.transformFileWithOverlap(File file, {int overlap = 50})`
- `FileStringer.streamFile(File file)`
- `Embedder.transform(Stream<String> rawFeed, {int semaphoreBuffer = 4})`
## When to Use Chunky
Chunky is a good fit when you need to:
- preprocess documents before embedding or vector indexing,
- normalize mixed document types into text,
- preserve chunk metadata for citations or traceability,
- add overlap to improve retrieval quality,
- build ingestion pipelines for RAG, semantic search, or summarization.