text_indexing 0.8.0 text_indexing: ^0.8.0 copied to clipboard
Dart library for creating an inverted index on a collection of text documents.
text_indexing #
Dart library for creating an inverted index on a collection of text documents.
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Skip to section:
Overview #
This library provides an interface and implementation classes that build and maintain an (inverted, positional, zoned) index for a collection of documents or corpus
(see definitions).
The TextIndexer constructs two artifacts:
- a
dictionary
that holds thevocabulary
ofterms
and the frequency of occurrence for eachterm
in thecorpus
; and - a
postings
map that holds a list of references to thedocuments
for eachterm
(thepostings list
).
In this implementation, our postings list
is a hashmap of the document id (docId
) to maps that point to positions of the term in the document's fields (zones). This allows query algorithms to score and rank search results based on the position(s) of a term in document fields, applying different weights to the zones.
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml
of your flutter project, add the text_indexing
dependency.
dependencies:
text_indexing: <latest version>
In your code file add the text_indexing
import.
import 'package:text_indexing/text_indexing.dart';
For small collections, instantiate a TextIndexer.inMemory
, (optionally passing empty Dictionary
and Postings
hashmaps), then iterate over a collection of documents to add them to the index.
// - initialize a in-memory [TextIndexer] with defaults for all parameters
final indexer =TextIndexer.inMemory();
// - iterate through the sample data
await Future.forEach(documents.entries, (MapEntry<String, String> doc) async {
// - index each document
await indexer.index(doc.key, doc.value);
});
The examples demonstrate the use of the TextIndexer.inMemory
and TextIndexer.async
factories.
API #
The API exposes the TextIndexer interface that builds and maintain an index for a collection of documents.
Three implementations of the TextIndexer interface are provided:
- the TextIndexerBase abstract base class implements the
TextIndexer.index
,TextIndexer.indexJson
andTextIndexer.emit
methods; - the InMemoryIndexer class is for fast indexing of a smaller corpus using in-memory dictionary and postings hashmaps; and
- the AsyncIndexer class, aimed at working with a larger corpus and asynchronous dictionaries and postings.
To maximise performance of the indexers the API manipulates nested hashmaps of DART core types int
and String
rather than defining strongly typed object models. To improve code legibility and maintainability the API makes use of type aliases throughout.
Type Aliases #
Dictionary
is an alias forMap<Term, Ft>
, a hashmap ofTerm
toFt
.DictionaryEntry
is an alias forMapEntry<Term, Ft>
, an entry in aDictionary
.DocId
is an alias forString
, used whenever a document id is referenced.DocumentPostings
is an alias forMap<DocId, FieldPostings>
, a hashmap of document ids toFieldPostings
.DocumentPostingsEntry
is an alias forMapEntry<DocId, FieldPostings>
, an entry in aDocumentPostings
hashmap.FieldPostings
is an alias forMap<FieldName, TermPositions>
, a hashmap ofFieldName
s toTermPositions
in the field withFieldName
.FieldPostingsEntry
is an alias forMapEntry<FieldName, TermPositions>
, an entry in aFieldPostings
hashmap.Ft
is an lias forint
and denotes the frequency of aTerm
in an index or indexed object (the term frequency).JSON
is an alias forMap<String, dynamic>
, a hashmap known as"Java Script Object Notation" (JSON)
, a common format for persisting data.JsonCollection
is an alias forMap<String, Map<String, dynamic>>
, a hashmap ofDocId
toJSON
documents.Pt
is an alias forint
, used to denote the position of aTerm
inSourceText
indexed object (the term position).TermPositions
is an alias forList<Pt>
, an orderedSet
of unique zero-basedTerm
positions inSourceText
, sorted in ascending order.
InvertedPositionalZoneIndex Interface #
The InvertedPositionalZoneIndex
is an interface for an inverted, positional zoned index on a collection of documents.
The InvertedPositionalZoneIndex
exposes the analyzer
field, a text analyser that extracts tokens from text.
The InvertedPositionalZoneIndex
exposes the following methods:
getDictionary
Asynchronously retrieves aDictionary
for a collection ofTerm
s from aDictionary
repository;upsertDictionary
inserts entries into aDictionary
repository, overwriting any existing entries;getPostings
asynchronously retrievesPostings
for a collection ofTerm
s from aPostings
repository; andupsertPostings
inserts entries into aPostings
repository, overwriting any existing entries.
TextIndexer Interface #
The text indexing classes (indexers) in this library implement TextIndexer
, an interface intended for information retrieval software applications. The design of the TextIndexer
interface is consistent with information retrieval theory and is intended to construct and/or maintain two artifacts:
- a hashmap with the vocabulary as key and the document frequency as the values (the
dictionary
); and - another hashmap with the vocabulary as key and the postings lists for the linked
documents
as values (thepostings
).
The dictionary and postings can be asynchronous data sources or in-memory hashmaps. The TextIndexer
reads and writes to/from these artifacts using the TextIndexer.index
.
Text or documents can be indexed by calling the following methods:
TextIndexer.indexJson
indexes the fields in aJSON
document;TextIndexer.indexText
indexes text from a text document.- The
TextIndexer.indexCollection
method indexes text from a collection ofJSON
documents, emitting thePostings
for each document in theTextIndexer.postingsStream
.
The TextIndexer.emit
method adds an event to the postingsStream
.
Listen to TextIndexer.postingsStream
to handle the postings list emitted whenever a document is indexed.
Implementing classes override the following fields:
TextIndexer.index
is theInvertedPositionalZoneIndex
that provides access to the indexDictionary
andPostings
and aITextAnalyzer
;TextIndexer.postingsStream
emits aPostings
whenever a document is indexed.
Implementing classes override the following asynchronous methods:
TextIndexer.indexText
indexes a text document;TextIndexer.indexJson
indexes the fields in a JSON document;TextIndexer.indexCollection
indexes the fields of all the documents in JSON document collection; andTextIndexer.emit
adds an event to theTextIndexer.postingsStream
after updating theTextIndexer.index
;
TextIndexerBase Class #
The TextIndexerBase
is an abstract base class that implements the TextIndexer.indexText
, TextIndexer.indexJson
, TextIndexer.indexCollection
and TextIndexer.emit
methods and the TextIndexer.postingsStream
field.
The TextIndexerBase.index
is updated whenever TextIndexerBase.emit
is called.
Subclasses of TextIndexerBase
must implement:
TextIndexer.index
; and
TextIndexerBase.controller
, aBehaviorSubject<Postings>
that controls theTextIndex.postingsStream
.
InMemoryIndex Class #
The InMemoryIndex
is a InvertedPositionalZoneIndex
interface implementation with in-memory Dictionary
and Postings
hashmaps:
InMemoryIndex.analyzer
is theITextAnalyzer
used to tokenize text for theInMemoryIndex
;InMemoryIndex.dictionary
is the in-memory term dictionary for the indexer. Pass adictionary
instance at instantiation, otherwise an emptyDictionary
will be initialized; andInMemoryIndex.postings
is the in-memory postings hashmap for the indexer. Pass apostings
instance at instantiation, otherwise an emptyPostings
will be initialized.
InMemoryIndexer Class #
The InMemoryIndexer
is a subclass of TextIndexerBase that builds and maintains an in-memory Dictionary
and Postings
and can be initialized using the TextIndexer.inMemory
factory.
The InMemoryIndexer
is suitable for indexing a smaller corpus. The InMemoryIndexer
may have latency and processing overhead for large indexes or queries with more than a few terms. Consider running InMemoryIndexer
in an isolate to avoid slowing down the main thread.
An example of the use of the TextIndexer.inMemory
factory is included in the examples.
AsyncCallbackIndex Class #
The AsyncCallbackIndex
is a InvertedPositionalZoneIndex
implementation class
that uses asynchronous callbacks to perform read and write operations on Dictionary
and Postings
repositories:
AsyncCallbackIndex.analyzer
is theITextAnalyzer
used to tokenize text for theAsyncCallbackIndex
;AsyncCallbackIndex.termsLoader
synchronously retrieves aDictionary
for a vocabulary from a data source;AsyncCallbackIndex.dictionaryUpdater
is callback that passes aDictionary
subset for persisting toDictionary
repository;AsyncCallbackIndex.postingsLoader
asynchronously retrieves aPostings
for a vocabulary from a data source; andAsyncCallbackIndex.postingsUpdater
passes aPostings
subset for persisting to aPostings
repository.
AsyncIndexer Class #
The AsyncIndexer
is a subclass of TextIndexerBase that asynchronously reads and writes from / to a Dictionary
and Postings
using asynchronous callbacks. A AsyncIndexer
can be initialized using the TextIndexer.async
factory.
The AsyncIndexer
is suitable for indexing a large corpus but may have latency and processing overhead. Consider running AsyncIndexer
in an isolate to avoid slowing down the main thread.
An example of the use of the TextIndexer.async
factory is included in the examples.
Definitions #
The following definitions are used throughout the documentation:
corpus
- the collection ofdocuments
for which anindex
is maintained.dictionary
- is a hash ofterms
(vocabulary
) to the frequency of occurence in thecorpus
documents.document
- a record in thecorpus
, that has a unique identifier (docId
) in thecorpus
's primary key and that contains one or more text fields that are indexed.index
- an inverted index used to look updocument
references from thecorpus
against avocabulary
ofterms
. The implementation in this package builds and maintains a positional inverted index, that also includes the positions of the indexedterm
in eachdocument
's fields (zones).index-elimination
- selecting a subset of the entries in an index where theterm
is in the collection ofterms
in a search phrase.postings
- a separate index that records whichdocuments
thevocabulary
occurs in. In this implementation we also record the positions of eachterm
in thetext
to create a positional invertedindex
.postings list
- a record of the positions of aterm
in adocument
. A position of aterm
refers to the index of theterm
in an array that contains all theterms
in thetext
.term
- a word or phrase that is indexed from thecorpus
. Theterm
may differ from the actual word used in the corpus depending on thetokenizer
used.text
- the indexable content of adocument
.token
- representation of aterm
in a text source returned by atokenizer
. The token may include information about theterm
such as its position(s) in the text or frequency of occurrence.tokenizer
- a function that returns a collection oftoken
s fromtext
, after applying a character filter,term
filter, stemmer and / or lemmatizer.vocabulary
- the collection ofterms
indexed from thecorpus
.
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, 2016 "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug please fill an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.