text_indexing #

Dart library for creating an inverted index on a collection of text documents.

THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.

Skip to section:

Overview
Usage
API
Definitions
References
Issues

Overview #

This library provides an interface and implementation classes that build and maintain an inverted, positional, zoned index for a collection of documents, or corpus (see definitions).

Index construction flowchart

The indexer constructs three inverted index artifacts:

the dictionary that holds the vocabulary of terms and the frequency of occurrence for each term in the corpus;
the k-gram index that maps k-grams to terms in the dictionary; and
the postings index that holds a list of references to the documents for each term (the postings list).
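As a rough illustration of the first two artifacts, the sketch below builds a dictionary and a k-gram index from plain Dart maps. The function names `kGrams` and `buildKGramIndex` are hypothetical and are not part of the text_indexing API; the library's own data shapes may differ.

```dart
// Sketch only: plain Dart maps standing in for the dictionary and k-gram
// index. The function names are hypothetical, not part of text_indexing.

/// Returns the k-grams of [term], padding it with "$" at both ends.
Set<String> kGrams(String term, [int k = 3]) {
  final padded = '\$$term\$';
  final grams = <String>{};
  for (var i = 0; i + k <= padded.length; i++) {
    grams.add(padded.substring(i, i + k));
  }
  return grams;
}

/// Maps each k-gram to the set of terms that contain it.
Map<String, Set<String>> buildKGramIndex(Iterable<String> terms, [int k = 3]) {
  final index = <String, Set<String>>{};
  for (final term in terms) {
    for (final gram in kGrams(term, k)) {
      index.putIfAbsent(gram, () => <String>{}).add(term);
    }
  }
  return index;
}

void main() {
  // Dictionary: term -> frequency of occurrence in the corpus.
  final dictionary = <String, int>{'castle': 3, 'cast': 1, 'last': 2};
  final kGramIndex = buildKGramIndex(dictionary.keys);
  print(kGrams('castle')); // {$ca, cas, ast, stl, tle, le$}
  print(kGramIndex['ast']); // {castle, cast, last}
}
```

Looking up a k-gram returns candidate terms from the dictionary, which is one way such an index can support spelling correction or wildcard matching.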

In this implementation, a postings list is a hashmap of the document id (docId) to maps that point to positions of the term in the document's zones (fields). This allows query algorithms to score and rank search results based on the position(s) of a term in document fields, applying different weights to the zones.
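The nested structure described above can be sketched with Dart core types. This is an illustration under stated assumptions, not the library's implementation: the alias `PostingsSketch` and the functions `addZone` and `zoneWeightedScore` are hypothetical, terms are split on whitespace only, and the score is a simple weighted count of positions per zone.

```dart
// Sketch only: nested maps of core types, mirroring the structure described
// above. All names are hypothetical, not the library's API.

/// term -> docId -> zone -> positions of the term in that zone.
typedef PostingsSketch = Map<String, Map<String, Map<String, List<int>>>>;

/// Adds the terms of one document zone to [postings], recording positions.
void addZone(PostingsSketch postings, String docId, String zone, String text) {
  final terms = text
      .toLowerCase()
      .split(RegExp(r'\s+'))
      .where((t) => t.isNotEmpty)
      .toList();
  for (var position = 0; position < terms.length; position++) {
    postings
        .putIfAbsent(terms[position], () => <String, Map<String, List<int>>>{})
        .putIfAbsent(docId, () => <String, List<int>>{})
        .putIfAbsent(zone, () => <int>[])
        .add(position);
  }
}

/// Sums the number of occurrences per zone, multiplied by the zone weight.
double zoneWeightedScore(PostingsSketch postings, String term, String docId,
    Map<String, double> zoneWeights) {
  final zones = postings[term]?[docId] ?? const <String, List<int>>{};
  var score = 0.0;
  zones.forEach((zone, positions) {
    score += (zoneWeights[zone] ?? 0.0) * positions.length;
  });
  return score;
}

void main() {
  final PostingsSketch postings = {};
  addZone(postings, 'doc1', 'name', 'Castle gate');
  addZone(postings, 'doc1', 'description', 'The gate of the old castle');
  print(postings['castle']); // {doc1: {name: [0], description: [5]}}
  print(zoneWeightedScore(
      postings, 'castle', 'doc1', {'name': 1.0, 'description': 0.5})); // 1.5
}
```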

Index artifacts

Refer to the references to learn more about information retrieval systems and the theory behind this library.

Usage #

In the pubspec.yaml of your Flutter project, add the text_indexing dependency.

dependencies:
  text_indexing: <latest version>

In your code file add the text_indexing import.

import 'package:text_indexing/text_indexing.dart';

For small collections, instantiate an InMemoryIndex, pass it to a TextIndexer, then iterate over a collection of documents to add them to the index.

  // initialize an in-memory index for a JSON collection with two indexed fields
  final myIndex = InMemoryIndex(zones: {'name': 1.0, 'description': 0.5}, phraseLength: 2);

  // - initialize a [TextIndexer], passing in the in-memory index
  final indexer = TextIndexer(index: myIndex);

  // - iterate through the JSON collection "documents"
  await Future.forEach(documents.entries, (MapEntry<String, String> doc) async {
    // - index each document
    await indexer.index(doc.key, doc.value);
  });

The examples demonstrate the use of the TextIndexer.inMemory and TextIndexer.async factories.

API #

The API exposes the TextIndexer interface that builds and maintains an InvertedIndex for a collection of documents.

To maximise performance of the indexers, the API manipulates nested hashmaps of Dart core types int and String rather than defining strongly typed object models. To improve code legibility, the API makes use of type aliases throughout.
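A minimal sketch of the idea follows, assuming Dart 2.13 or later (which allows aliases for non-function types). The alias names `TermSketch`, `FrequencySketch` and `DictionarySketch` are hypothetical; the library's own alias names and shapes may differ.

```dart
// Hypothetical aliases for illustration; the library's own alias names and
// shapes may differ.
typedef TermSketch = String;
typedef FrequencySketch = int;
typedef DictionarySketch = Map<TermSketch, FrequencySketch>;

/// Increments the frequency in [dictionary] of each term in [terms].
void countTerms(DictionarySketch dictionary, Iterable<TermSketch> terms) {
  for (final term in terms) {
    dictionary.update(term, (count) => count + 1, ifAbsent: () => 1);
  }
}

void main() {
  final DictionarySketch dictionary = {};
  countTerms(dictionary, ['castle', 'gate', 'castle']);
  print(dictionary); // {castle: 2, gate: 1}
}
```

The aliases add no runtime cost: a `DictionarySketch` is still an ordinary `Map<String, int>`, so it can be passed to any code that expects one.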

InvertedIndex #

A mixin class implements the getTfIndex, getFtdPostings and getIdFtIndex methods.

Three implementation classes are provided:

the InMemoryIndex class is intended for fast indexing of a smaller corpus using in-memory dictionary, k-gram and postings hashmaps;
the AsyncCallbackIndex is intended for working with a larger corpus. It uses asynchronous callbacks to perform read and write operations on dictionary, k-gram and postings repositories; and
the CachedIndex is intended for working with a larger corpus. It uses asynchronous callbacks to perform read and write operations on dictionary, k-gram and postings repositories, but keeps a cache of the most popular terms and k-grams in memory for faster indexing and searching.
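The combination of asynchronous callbacks and an in-memory cache can be pictured with the generic sketch below. It is not the CachedIndex implementation: the class `CachedDictionarySketch`, the `DictionaryLoader` signature and the `cacheLimit` parameter are all assumptions made for illustration, and eviction is by insertion order for brevity rather than by term popularity.

```dart
// Sketch only: a read-through cache in front of an asynchronous repository
// callback. All names are hypothetical, not the library's API.

/// Asynchronously loads the frequencies of [terms] from a repository.
typedef DictionaryLoader = Future<Map<String, int>> Function(Set<String> terms);

class CachedDictionarySketch {
  CachedDictionarySketch(this.loader, {this.cacheLimit = 1000});

  final DictionaryLoader loader;
  final int cacheLimit;
  final Map<String, int> _cache = {};

  /// Returns the frequencies of [terms], calling [loader] only for cache misses.
  Future<Map<String, int>> getFrequencies(Iterable<String> terms) async {
    final wanted = terms.toSet();
    final missing = wanted.where((t) => !_cache.containsKey(t)).toSet();
    if (missing.isNotEmpty) {
      _cache.addAll(await loader(missing));
    }
    final result = <String, int>{
      for (final t in wanted)
        if (_cache.containsKey(t)) t: _cache[t]!
    };
    // Evict the oldest entries (insertion order) once the limit is exceeded.
    while (_cache.length > cacheLimit) {
      _cache.remove(_cache.keys.first);
    }
    return result;
  }
}

Future<void> main() async {
  var reads = 0;
  final repository = {'castle': 3, 'gate': 1};
  final dictionary = CachedDictionarySketch((terms) async {
    reads++;
    return {
      for (final t in terms)
        if (repository.containsKey(t)) t: repository[t]!
    };
  });
  print(await dictionary.getFrequencies(['castle'])); // {castle: 3}
  print(await dictionary.getFrequencies(['castle'])); // {castle: 3}
  print(reads); // 1
}
```

The second lookup is served from memory, so the repository callback runs only once.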

TextIndexer #

TextIndexer is an interface for classes that construct and maintain a dictionary, an inverted, positional, zoned index and a k-gram index.

Text or documents can be indexed by calling the following methods:

indexText indexes text from a text document;
indexJson indexes the fields in a JSON document; and
indexCollection indexes the fields of all the documents in a JSON document collection.

Use the unnamed factory constructor to instantiate a TextIndexer with the index of your choice or extend TextIndexerBase.

Definitions #

The following definitions are used throughout the documentation:

corpus - the collection of documents for which an index is maintained.
character filter - filters characters from text in preparation for tokenization.
dictionary - a hash of terms (vocabulary) to their frequency of occurrence in the corpus documents.
document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
document frequency (dFt) is the number of documents in the corpus that contain a term.
index-elimination - selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
inverse document frequency or iDft is equal to log (N / dFt), where N is the total number of documents in the corpus. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
JSON is an acronym for "JavaScript Object Notation", a common format for persisting data.
k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
postings - a separate index that records which documents each term in the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form (from Wikipedia).
stopwords - common words in a language that are excluded from indexing.
term frequency (Ft) is the frequency of a term in an index or indexed object.
term position is the zero-based index of a term in an ordered array of terms tokenized from the corpus.
text - the indexable content of a document.
token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
token filter - returns a subset of tokens from the tokenizer output.
tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
vocabulary - the collection of terms indexed from the corpus.
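To make the document frequency and inverse document frequency definitions concrete, here is a small worked sketch. It assumes N counts the documents in the corpus and uses the natural logarithm, since the definitions do not fix a base. The functions `tokenize`, `documentFrequencies` and `inverseDocumentFrequency` are hypothetical helpers, not library functions, and the tokenizer is deliberately naive (lower-casing, splitting on non-alphanumeric characters, dropping stopwords).

```dart
import 'dart:math' as math;

// Sketch only: hypothetical helpers, not part of text_indexing.

/// Splits [text] into lower-case terms, dropping any [stopwords].
List<String> tokenize(String text, {Set<String> stopwords = const {}}) => text
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((t) => t.isNotEmpty && !stopwords.contains(t))
    .toList();

/// Document frequency (dFt): the number of documents containing each term.
Map<String, int> documentFrequencies(Map<String, String> corpus,
    {Set<String> stopwords = const {}}) {
  final dft = <String, int>{};
  for (final text in corpus.values) {
    // toSet() ensures a term is counted once per document.
    for (final term in tokenize(text, stopwords: stopwords).toSet()) {
      dft.update(term, (n) => n + 1, ifAbsent: () => 1);
    }
  }
  return dft;
}

/// iDft = log(N / dFt), using the natural logarithm (an assumption).
double inverseDocumentFrequency(int n, int dft) => math.log(n / dft);

void main() {
  final corpus = {
    'doc1': 'The castle gate',
    'doc2': 'The old castle',
    'doc3': 'The castle and the moat',
    'doc4': 'A quiet moat',
  };
  final dft = documentFrequencies(corpus, stopwords: {'the', 'a', 'and'});
  print(dft['castle']); // 3
  print(dft['gate']); // 1
  final n = corpus.length;
  // The rarer term has the higher iDft.
  print(inverseDocumentFrequency(n, dft['gate']!) >
      inverseDocumentFrequency(n, dft['castle']!)); // true
}
```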

References #

Issues #

If you find a bug, please file an issue.

This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.

text_indexing 0.13.0+1