
ollama_embedder #

CLI tool written in Dart for generating text embeddings from files and folders using a local Ollama server.

Features #

Generate embeddings for files and directories – recursively walks directories and processes multiple files in a single run.
Work with a local Ollama server – checks installation, server availability and model presence before processing.
Two text‑preprocessing modes – technical (keeps code) and textual (focuses on pure text with [CODE] markers).
Advanced text cleaning – removes HTML noise, cookie banners, navigation, footers, emojis and decorative frames.
Smart chunking – splits long documents into overlapping chunks by paragraphs, sentences and word boundaries.
Robust embedding requests – retries on transient Ollama errors with helpful logging and hints.
Configurable behavior – tune server URL, model, timeouts, max file size, input/output paths and processing mode.
Structured JSON output – emits EmbeddingChunk arrays ready for ingestion into vector databases and RAG systems.
Test‑covered core – chunking, preprocessing and processing pipeline are covered by unit tests.

Installation #

  1. Install Dart SDK with a version compatible with pubspec.yaml (currently >=3.1.0 <4.0.0).
  2. Install Ollama (desktop or server):
    • Download it from https://ollama.ai and install.
    • Start the server:
      ollama serve
      
  3. Install the CLI globally from pub.dev:
    dart pub global activate ollama_embedder
    
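  4. Verify the installation (the executable should now be on your PATH):
    ollama_embedder --help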

Quick start #

Prerequisites #

  • Dart SDK installed (compatible with the version in pubspec.yaml, currently >=3.1.0 <4.0.0).
  • Ollama installed and running:
    • download from https://ollama.ai and install;
    • start the server:
      ollama serve
      
    • pull the embedding model you plan to use (for example):
      ollama pull nomic-embed-text
      

Key CLI options (see also --help):

  • -i, --input (required): file or directory to process.
  • -o, --output: directory where .embedding.json files will be written (by default the embedding_gen subdirectory is used).
  • -u, --url: Ollama server URL (default http://localhost:11434).
  • -m, --model: embedding model name (default nomic-embed-text).
  • --timeout: request timeout in milliseconds (default 60000).
  • -v, --verbose: verbose logging (shows retries and hints; recommended for long runs).
  • --mode: text‑processing mode – technical (keeps code) or textual (collapses code into [CODE] markers).

Examples:

ollama_embedder --input source
ollama_embedder -i source -u http://localhost:11434 -m nomic-embed-text
ollama_embedder -i source --verbose --mode textual

How it works #

The pipeline at a high level:

  1. Text preprocessing (TextPreprocessor):
    • normalize line breaks and whitespace, remove invisible characters;
    • strip HTML, cookie banners, footers, navigation;
    • replace URL/EMAIL/PATH/ID with markers;
    • carefully handle Markdown headings, lists and code;
    • either preserve or collapse code depending on TextProcessingMode.
  2. Chunking (ChunkingService):
    • if text length ≤ maxSingleChunk (default 3000 chars) – a single chunk;
    • otherwise split by paragraphs/sentences/words with overlaps (overlapChars).
  3. Embedding generation via Ollama (EmbeddingService; see the sketch after this list):
    • POST /api/embeddings with model and prompt;
    • retries with exponential backoff for transient and server‑stability issues;
    • special handling when the model is missing.
  4. Saving results (EmbeddingProcessor):
    • build an array of EmbeddingChunk;
    • generate doc_id from the file’s relative path;
    • write to <original_path>.embedding.json in the output directory.
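For illustration, a minimal sketch of step 3 as a standalone Dart function using the http package (one of the package's dependencies). The embed helper below is hypothetical, not the package's public API – the real EmbeddingService also adds retries, exponential backoff and missing-model handling:

import 'dart:convert';

import 'package:http/http.dart' as http;

/// Asks a local Ollama server for an embedding of [prompt].
Future<List<double>> embed(
  String prompt, {
  String url = 'http://localhost:11434',
  String model = 'nomic-embed-text',
}) async {
  final response = await http.post(
    Uri.parse('$url/api/embeddings'),
    headers: {'Content-Type': 'application/json'},
    body: jsonEncode({'model': model, 'prompt': prompt}),
  );
  if (response.statusCode != 200) {
    throw Exception('Ollama returned ${response.statusCode}: ${response.body}');
  }
  final body = jsonDecode(response.body) as Map<String, dynamic>;
  // The response carries the vector under the "embedding" key.
  return (body['embedding'] as List)
      .cast<num>()
      .map((v) => v.toDouble())
      .toList();
}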

Output format #

Each processed source file gets a corresponding JSON file with an array of chunks:

[
  {
    "doc_id": "source/test.md",
    "chunk_id": 0,
    "clean_content": "Cleaned single-line chunk text without line breaks...",
    "vector": [0.123, 0.456, "..."],
    "metadata": {
      "source": "source/test.md",
      "section": "full_doc",
      "type": "text",
      "created_at": "2025-01-01T12:00:00.000Z"
    }
  }
]
  • doc_id: relative file path (normalized with / separators).
  • chunk_id: sequential chunk number within a document.
  • clean_content: cleaned text with all line breaks replaced by spaces.
  • vector: embedding vector (size depends on the chosen model).
  • metadata: arbitrary metadata map with basic technical information.

The EmbeddingChunk model is defined in lib/models/embedding_chunk.dart, and the Document model in lib/models/document.dart is convenient for integration with vector databases.
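For example, a downstream consumer can read such a file with nothing more than dart:convert (the path below is illustrative; generated files mirror the input tree under the output directory):

import 'dart:convert';
import 'dart:io';

void main() {
  final raw =
      File('embedding_gen/source/test.md.embedding.json').readAsStringSync();
  final chunks = (jsonDecode(raw) as List).cast<Map<String, dynamic>>();
  for (final chunk in chunks) {
    final vector = (chunk['vector'] as List).cast<num>();
    print('${chunk['doc_id']} #${chunk['chunk_id']}: '
        '${vector.length}-dimensional vector');
  }
}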

Default Configuration #

  • ollamaUrl: http://localhost:11434;
  • ollamaModel: nomic-embed-text;
  • ollamaTimeoutMs: 60000;
  • embeddingExtension: .embedding.json;
  • maxFileSize: 10 MB (larger files are skipped);
  • defaultOutputSubdir: embedding_gen;
  • defaultTextProcessingMode: technical.

The CliConfig class in lib/config/cli_config.dart combines these values and allows overriding them via CLI arguments.
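For example, several of these defaults can be overridden in a single invocation:

ollama_embedder -i docs -o my_embeddings --timeout 120000 --mode textual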

Text preprocessing logic #

The core logic is implemented in lib/services/text_preprocessor.dart:

  • TextProcessingMode.technical:
    • preserves code blocks and inline code as much as possible;
    • whitespace and line‑break normalization do not break code structure;
    • useful for code‑centric use cases (code search, code‑RAG, hybrid search).
  • TextProcessingMode.textual:
    • collapses code into [CODE] markers;
    • focuses on natural‑language content (documentation, articles, descriptions).

Additionally:

  • HTML tags, comments and <script>/<style> blocks are removed, and entities are decoded;
  • cookie banners, footers, navigation blocks and pseudographics are stripped;
  • URLs, e‑mails, paths and long IDs are replaced with [URL], [EMAIL], [PATH], [ID];
  • punctuation noise such as !!!?? is normalized.
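For example, in textual mode a fragment like

  Contact us at support@example.com or visit https://example.com/docs!!!??

comes out roughly as (illustrative – the exact result depends on the cleaning rules):

  Contact us at [EMAIL] or visit [URL]!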

Chunking #

Chunking is handled by lib/services/chunking_service.dart:

  • maxChars: maximum chunk size (default 1500 characters).
  • overlapChars: overlap size between chunks (default 200 characters).
  • maxSingleChunk: maximum length that is allowed to remain a single chunk (default 3000 characters).
  • Chunks are labeled with sections such as intro, urls, lists, code, auto.

This makes embeddings more robust when searching over text fragments and reduces context loss.
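To make the overlap mechanics concrete, here is a simplified character-window sketch in Dart. chunkByChars is a hypothetical helper – the real ChunkingService additionally respects paragraph, sentence and word boundaries:

import 'dart:math';

/// Simplified character-window chunking with overlap.
List<String> chunkByChars(
  String text, {
  int maxChars = 1500,
  int overlapChars = 200,
  int maxSingleChunk = 3000,
}) {
  if (text.length <= maxSingleChunk) return [text]; // short text: one chunk
  final chunks = <String>[];
  var start = 0;
  while (start < text.length) {
    final end = min(start + maxChars, text.length);
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
    start = end - overlapChars; // step back to create the overlap
  }
  return chunks;
}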

Skipped files #

EmbeddingProcessor intentionally skips:

  • hidden files (starting with .);
  • service files (LICENSE, README.md);
  • already generated .embedding.json files;
  • binary files (.png, .jpg, .pdf, .zip, .exe, .dll, etc.).
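A compact sketch of these rules as a Dart predicate (shouldSkip is hypothetical – the actual implementation in EmbeddingProcessor may differ):

bool shouldSkip(String fileName) {
  const serviceFiles = {'LICENSE', 'README.md'};
  const binaryExtensions = {'.png', '.jpg', '.pdf', '.zip', '.exe', '.dll'};
  if (fileName.startsWith('.')) return true; // hidden files
  if (serviceFiles.contains(fileName)) return true; // service files
  if (fileName.endsWith('.embedding.json')) return true; // already generated
  return binaryExtensions.any(fileName.endsWith); // binary formats
}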

Test coverage #

The project has a solid automated test suite with overall line coverage around 78% across all core components:

File coverage:

  • text_preprocessor.dart – ≈85% (both technical and textual modes, cleaning rules and markers)
  • chunking_service.dart – ≈83% (chunk boundaries, overlaps and section labelling)
  • embedding_processor.dart – ≈75% (file traversal, skipping logic and output structure)
  • embedding_chunk.dart – 100% (model structure and JSON (de)serialization)

Test categories #

  • Text preprocessing: normalization, HTML/boilerplate removal, URL/EMAIL/PATH/ID markers, two processing modes.
  • Chunking logic: single vs multi‑chunk documents, overlaps, section tags (intro, urls, lists, code, auto).
  • Embedding pipeline: correct skipping of files, doc_id calculation, output file naming and locations.
  • Models & serialization: EmbeddingChunk and related models, JSON input/output stability.
  • Edge cases: very small and very large documents, empty/near‑empty content, service and binary files.

It is recommended to run the test suite after any changes to preprocessing, chunking (including its configuration) or the output format.

Limitations and recommendations #

  • Make sure Ollama is running and the model is pulled: ollama pull <model>.
  • For large corpora, monitor Ollama server load (use --verbose to see retries and hints).
  • Avoid changing the .embedding.json format if external systems (vector DB, RAG service, etc.) already depend on it.