text_indexing 0.13.0+1
Dart library for creating an inverted index on a collection of text documents.
text_indexing #
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview #
This library provides an interface and implementation classes that build and maintain an (inverted, positional, zoned) index for a collection of documents or corpus (see definitions).
The indexer constructs three inverted index artifacts:
- the dictionary that holds the vocabulary of terms and the frequency of occurrence for each term in the corpus;
- the k-gram index that maps k-grams to terms in the dictionary; and
- the postings index that holds a list of references to the documents for each term (the postings list).
In this implementation, a postings list is a hashmap of the document id (docId) to maps that point to positions of the term in the document's zones (fields). This allows query algorithms to score and rank search results based on the position(s) of a term in document fields, applying different weights to the zones.
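The nested-map layout described above can be sketched with plain Dart maps. This is a minimal illustration, not the package's own classes: the addZone helper, the document ids and the zone names are all assumptions made for the example.

```dart
/// Illustrative postings layout (an assumption, not the package's types):
/// term -> docId -> zone -> positions of the term in that zone.
typedef SketchPostings = Map<String, Map<String, Map<String, List<int>>>>;

/// Records the position of every term in [terms] for one zone of one document.
void addZone(
    SketchPostings postings, String docId, String zone, List<String> terms) {
  for (var position = 0; position < terms.length; position++) {
    postings
        .putIfAbsent(terms[position], () => {})
        .putIfAbsent(docId, () => {})
        .putIfAbsent(zone, () => [])
        .add(position);
  }
}

void main() {
  final SketchPostings postings = {};
  addZone(postings, 'doc1', 'name', ['red', 'castle']);
  addZone(postings, 'doc1', 'description', ['a', 'castle', 'on', 'a', 'hill']);
  addZone(postings, 'doc2', 'name', ['hill', 'fort']);
  print(postings['castle']); // {doc1: {name: [1], description: [1]}}
  print(postings['a']); // {doc1: {description: [0, 3]}}
}
```

Because positions are grouped by zone, a query algorithm can weight a hit in the name zone differently from a hit in the description zone.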
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml of your Flutter project, add the text_indexing dependency.
dependencies:
  text_indexing: <latest version>
In your code file add the text_indexing import.
import 'package:text_indexing/text_indexing.dart';
For small collections, instantiate an InMemoryIndex (optionally passing empty Dictionary and Postings hashmaps) and pass it to the unnamed TextIndexer factory constructor, then iterate over a collection of documents to add them to the index.
// - initialize an in-memory index for a JSON collection with two indexed fields
final myIndex = InMemoryIndex(
    zones: {'name': 1.0, 'description': 0.5}, phraseLength: 2);
// - initialize a [TextIndexer], passing in the index
final indexer = TextIndexer(index: myIndex);
// - iterate through the json collection "documents"
await Future.forEach(documents.entries, (MapEntry<String, String> doc) async {
  // - index each document
  await indexer.index(doc.key, doc.value);
});
The examples demonstrate the use of the TextIndexer.inMemory and TextIndexer.async factories.
API #
The API exposes the TextIndexer interface that builds and maintains an InvertedIndex for a collection of documents.
To maximise performance of the indexers, the API manipulates nested hashmaps of Dart core types int and String rather than defining strongly typed object models. To improve code legibility, the API makes use of type aliases throughout.
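As a rough illustration of this design choice, type aliases can give readable names to maps of core types without introducing wrapper classes. The aliases and the countTerms helper below are hypothetical and written for this example; the package's own typedefs may be defined differently.

```dart
// Hypothetical aliases for illustration only.
typedef Term = String;
typedef Ft = int; // term frequency
typedef SketchDictionary = Map<Term, Ft>;

/// Adds [terms] to [dictionary], counting one occurrence per term instance.
void countTerms(SketchDictionary dictionary, Iterable<Term> terms) {
  for (final term in terms) {
    dictionary.update(term, (ft) => ft + 1, ifAbsent: () => 1);
  }
}

void main() {
  final SketchDictionary dictionary = {};
  countTerms(dictionary, ['red', 'castle', 'castle']);
  print(dictionary); // {red: 1, castle: 2}
}
```

The aliases cost nothing at runtime: a SketchDictionary is still an ordinary Map<String, int>, so any code that works with core maps also works with it.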
InvertedIndex #
A mixin class implements the getTfIndex, getFtdPostings and getIdFtIndex methods.
Three implementation classes are provided:
- the InMemoryIndex class is intended for fast indexing of a smaller corpus using in-memory dictionary, k-gram and postings hashmaps;
- the AsyncCallbackIndex is intended for working with a larger corpus. It uses asynchronous callbacks to perform read and write operations on dictionary, k-gram and postings repositories; and
- the CachedIndex is intended for working with a larger corpus. It uses asynchronous callbacks to perform read and write operations on [Dictionary], [KGramIndex] and [Postings] repositories, but keeps a cache of the most popular terms and k-grams in memory for faster indexing and searching.
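The idea behind combining asynchronous callbacks with a cache can be sketched as follows. This is a simplified, hypothetical illustration and not the API of AsyncCallbackIndex or CachedIndex: the class, the callback typedefs and the unbounded cache are all assumptions (a real cache would also evict less popular terms).

```dart
// Hypothetical names throughout; none of these belong to text_indexing.
typedef TermLoader = Future<int?> Function(String term);
typedef TermUpdater = Future<void> Function(String term, int frequency);

class CachedDictionarySketch {
  CachedDictionarySketch(this.loader, this.updater);

  final TermLoader loader; // reads a term frequency from the repository
  final TermUpdater updater; // writes a term frequency to the repository
  final Map<String, int> _cache = {}; // unbounded here; no eviction

  /// Returns the frequency of [term], calling [loader] only on a cache miss.
  Future<int> frequency(String term) async =>
      _cache[term] ??= await loader(term) ?? 0;

  /// Increments the frequency of [term] in the cache and the repository.
  Future<void> increment(String term) async {
    final next = await frequency(term) + 1;
    _cache[term] = next;
    await updater(term, next);
  }
}

Future<void> main() async {
  final repository = <String, int>{'castle': 2}; // stands in for a database
  var reads = 0;
  final dictionary = CachedDictionarySketch(
    (term) async {
      reads++;
      return repository[term];
    },
    (term, frequency) async {
      repository[term] = frequency;
    },
  );
  await dictionary.increment('castle');
  await dictionary.increment('castle');
  print(repository['castle']); // 4
  print(reads); // 1
}
```

The second increment is served from the cache, so the repository is read only once while every write is still passed through.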
TextIndexer #
TextIndexer is an interface for classes that construct and maintain a dictionary, inverted, positional, zoned index and k-gram index.
Text or documents can be indexed by calling the following methods:
- indexText indexes text from a text document;
- indexJson indexes the fields in a JSON document; and
- indexCollection indexes the fields of all the documents in a JSON document collection.
Use the unnamed factory constructor to instantiate a TextIndexer with the index of your choice or extend TextIndexerBase.
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation for tokenization.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
- document frequency (dFt) - the number of documents in the corpus that contain a term.
- index-elimination - selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
- inverse document frequency or iDft - equal to log (N / dFt), where N is the total number of documents in the corpus. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
- JSON - an acronym for "JavaScript Object Notation", a common format for persisting data.
- k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- term frequency (Ft) - the frequency of a term in an index or indexed object.
- term position - the zero-based index of a term in an ordered array of terms tokenized from the corpus.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
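Two of the definitions above, k-gram and inverse document frequency, are easy to check with a few lines of Dart. The sketch below is illustrative only: the function names are made up for this example and are not part of the package, the idf helper takes N to be the number of documents in the corpus, and it uses the natural logarithm because the definition does not fix a base.

```dart
import 'dart:math' as math;

/// Returns the k-grams of [term] after padding it with "$" at both ends.
Set<String> kGramsOf(String term, [int k = 3]) {
  final padded = '\$$term\$';
  return {
    for (var i = 0; i + k <= padded.length; i++) padded.substring(i, i + k),
  };
}

/// Inverse document frequency, log(N / dFt), using the natural logarithm.
double inverseDocumentFrequency(int corpusSize, int documentFrequency) =>
    math.log(corpusSize / documentFrequency);

void main() {
  print(kGramsOf('castle')); // {$ca, cas, ast, stl, tle, le$}
  // A rare term scores higher than a frequent one.
  print(inverseDocumentFrequency(1000, 10) >
      inverseDocumentFrequency(1000, 500)); // true
}
```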
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.