text_indexing 0.0.2
text_indexing #
Dart library for creating an inverted index on a collection of text documents.
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview #
This library provides an interface and implementation classes that build and maintain an inverted, positional index for a collection of documents, or corpus (see definitions).
The TextIndexer constructs two artifacts:
- a dictionary that holds the vocabulary of terms and the frequency of occurrence for each term in the corpus; and
- a postings map that holds a list of references to the documents for each term (the postings list).
In this implementation, our postings list is a hashmap of the document id (docId) to an ordered list of the positions of the term in the document. This allows query algorithms to score and rank search results based on the position(s) of a term in the results.
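The two artifacts described above can be sketched in plain Dart. This is a minimal illustration, not the package's implementation: the Dictionary and Postings type aliases, the indexDocument function and the naive tokenizer are all hypothetical, and the sketch assumes the dictionary stores the number of documents each term occurs in.

```dart
// Hypothetical type aliases for illustration; the package's own types may differ.
typedef Dictionary = Map<String, int>;
typedef Postings = Map<String, Map<String, List<int>>>;

/// Adds the terms of one document to [dictionary] and [postings].
void indexDocument(
    String docId, String text, Dictionary dictionary, Postings postings) {
  // Naive tokenizer (an assumption): lower-case, split on non-word characters.
  final terms = text
      .toLowerCase()
      .split(RegExp(r'\W+'))
      .where((t) => t.isNotEmpty)
      .toList();
  for (var position = 0; position < terms.length; position++) {
    final term = terms[position];
    final termPostings = postings.putIfAbsent(term, () => {});
    // Count the document only the first time the term is seen in it.
    if (!termPostings.containsKey(docId)) {
      dictionary[term] = (dictionary[term] ?? 0) + 1;
    }
    // Record the position of the term in this document, in order.
    termPostings.putIfAbsent(docId, () => []).add(position);
  }
}

void main() {
  final Dictionary dictionary = {};
  final Postings postings = {};
  indexDocument('doc1', 'the quick brown fox', dictionary, postings);
  indexDocument('doc2', 'the lazy dog and the fox', dictionary, postings);
  print(dictionary['the']); // 2
  print(postings['the']); // {doc1: [0], doc2: [0, 4]}
}
```

Keeping the positions ordered per document is what lets a query algorithm check, for example, whether two terms are adjacent.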
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml of your Flutter project, add the text_indexing dependency.
dependencies:
  text_indexing: <latest version>
In your code file, add the text_indexing import.
import 'package:text_indexing/text_indexing.dart';
For small collections, instantiate an InMemoryIndexer, passing empty Dictionary and Postings hashmaps, then iterate over a collection of documents.
// - initialize an [InMemoryIndexer]
final indexer = InMemoryIndexer(dictionary: {}, postings: {});
// - iterate through the sample data; `documents` is assumed to be a
//   Map<String, String> of docId to text
await Future.forEach(documents.entries, (MapEntry<String, String> doc) async {
  // - index each document
  await indexer.index(doc.key, doc.value);
});
The examples demonstrate the use of the InMemoryIndexer and PersistedIndexer.
API #
The API exposes the TextIndexer interface that builds and maintains an index for a collection of documents.
Three implementations of the TextIndexer interface are provided:
- the TextIndexerBase abstract base class implements the TextIndexer.index and TextIndexer.emit methods;
- the InMemoryIndexer class is for fast indexing of a smaller corpus using in-memory dictionary and postings hashmaps; and
- the PersistedIndexer class is aimed at working with a larger corpus and asynchronous dictionaries and postings.
TextIndexer Interface #
The text indexing classes (indexers) in this library implement TextIndexer, an interface intended for information retrieval software applications. The design of the TextIndexer interface is consistent with information retrieval theory and is intended to construct and/or maintain two artifacts:
- a hashmap with the vocabulary as keys and the document frequency as values (the dictionary); and
- another hashmap with the vocabulary as keys and the postings lists for the linked documents as values (the postings).
The dictionary and postings can be asynchronous data sources or in-memory hashmaps. The TextIndexer reads from and writes to these artifacts using the asynchronous methods TextIndexer.loadTerms, TextIndexer.updateDictionary, TextIndexer.loadTermPostings and TextIndexer.upsertTermPostings.
The TextIndexer.index method indexes text from a document and returns a list of PostingsList. Before returning, it calls TextIndexer.emit, passing the list of PostingsList.
The TextIndexer.emit method adds that list as an event to TextIndexer.postingsStream.
Listen to TextIndexer.postingsStream to handle the postings list emitted whenever a document is indexed.
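The index, emit and stream relationship described above can be sketched with a StreamController from dart:async. SketchIndexer is a hypothetical stand-in, not a class from this package, and it emits a simple term-to-positions map rather than the package's PostingsList type.

```dart
import 'dart:async';

/// Hypothetical stand-in for an indexer, for illustration only.
class SketchIndexer {
  final _controller = StreamController<Map<String, List<int>>>.broadcast();

  /// Emits the term positions of each document that is indexed.
  Stream<Map<String, List<int>>> get postingsStream => _controller.stream;

  /// Indexes [text], emits the result on [postingsStream] and returns it.
  Future<Map<String, List<int>>> index(String docId, String text) async {
    final positions = <String, List<int>>{};
    final terms =
        text.toLowerCase().split(' ').where((t) => t.isNotEmpty).toList();
    for (var i = 0; i < terms.length; i++) {
      positions.putIfAbsent(terms[i], () => []).add(i);
    }
    emit(positions);
    return positions;
  }

  /// Adds an event to the stream; called by [index].
  void emit(Map<String, List<int>> positions) => _controller.add(positions);

  /// Closes the stream once indexing is finished.
  Future<void> close() => _controller.close();
}

Future<void> main() async {
  final indexer = SketchIndexer();
  // Subscribe before indexing, because a broadcast stream does not buffer events.
  indexer.postingsStream.listen(
      (positions) => print('indexed ${positions.length} distinct terms'));
  await indexer.index('doc1', 'to be or not to be');
  await indexer.close();
}
```

A broadcast controller is used here so that several listeners (for example, a logger and a persistence layer) could subscribe at once; whether the package does the same is an assumption.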
Implementing classes override the following fields:
- TextIndexer.tokenizer is the Tokenizer instance used by the indexer to parse documents to tokens; and
- TextIndexer.postingsStream emits a list of PostingsList instances whenever a document is indexed.
Implementing classes override the following asynchronous methods:
- TextIndexer.index indexes text from a document, returning a list of PostingsList and adding it to the TextIndexer.postingsStream by calling TextIndexer.emit;
- TextIndexer.emit is called by TextIndexer.index, and adds an event to the postingsStream after updating the dictionary and postings data stores;
- TextIndexer.loadTerms returns a DictionaryTerm map for a collection of terms from a dictionary;
- TextIndexer.updateDictionary passes new or updated DictionaryTerm instances for persisting to a dictionary data store;
- TextIndexer.loadTermPostings returns a PostingsEntry map for a collection of terms from a postings source; and
- TextIndexer.upsertTermPostings passes new or updated PostingsEntry instances for upserting to a postings data store.
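The load and update methods listed above suggest a read-modify-write cycle against the data stores. The sketch below illustrates that cycle for the dictionary only. FakeDictionaryStore and addDocumentTerms are hypothetical names, the store is a plain map wrapped in asynchronous methods, and the signatures are assumptions rather than the package's actual API.

```dart
/// Hypothetical asynchronous dictionary store backed by an in-memory map.
class FakeDictionaryStore {
  final Map<String, int> _rows = {};

  /// Returns the stored frequency for each requested term that exists.
  Future<Map<String, int>> loadTerms(Iterable<String> terms) async => {
        for (final term in terms)
          if (_rows.containsKey(term)) term: _rows[term]!
      };

  /// Persists new or updated frequencies.
  Future<void> updateDictionary(Map<String, int> values) async =>
      _rows.addAll(values);
}

/// Increments the document frequency of every distinct term in one document.
Future<Map<String, int>> addDocumentTerms(
    FakeDictionaryStore store, Set<String> terms) async {
  // Load only the entries this document touches.
  final existing = await store.loadTerms(terms);
  // Terms not yet in the store start from zero.
  final updated = {for (final term in terms) term: (existing[term] ?? 0) + 1};
  await store.updateDictionary(updated);
  return updated;
}

Future<void> main() async {
  final store = FakeDictionaryStore();
  await addDocumentTerms(store, {'fox', 'dog'});
  final result = await addDocumentTerms(store, {'fox'});
  print(result); // {fox: 2}
}
```

Loading only the terms found in the current document keeps each round trip small, which matters when the store is a database rather than a map.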
TextIndexerBase Class #
The TextIndexerBase is an abstract base class that implements the TextIndexer.index and TextIndexer.emit methods.
Subclasses of TextIndexerBase may override the TextIndexerBase.emit method to perform additional actions whenever a document is indexed.
InMemoryIndexer Class #
The InMemoryIndexer is a subclass of TextIndexerBase that builds and maintains in-memory dictionary and postings hashmaps. These hashmaps are updated whenever InMemoryIndexer.emit is called at the end of the InMemoryIndexer.index method, so awaiting a call to InMemoryIndexer.index will provide access to the updated InMemoryIndexer.dictionary and InMemoryIndexer.postings maps.
The InMemoryIndexer is suitable for indexing a smaller corpus. An example of the use of InMemoryIndexer is included in the examples.
PersistedIndexer Class #
The PersistedIndexer is a subclass of TextIndexerBase that asynchronously reads and writes dictionary and postings data sources. These data sources are asynchronously updated whenever PersistedIndexer.emit is called by the PersistedIndexer.index method.
The PersistedIndexer is suitable for indexing a large corpus but may incur some latency penalty and processing overhead. Consider running PersistedIndexer in an isolate to avoid slowing down the main thread.
An example of the use of PersistedIndexer is included in the package examples.
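The suggestion to index in an isolate can be sketched with Isolate.run from dart:isolate (available from Dart 2.19, not on the web). The buildPostings function below is a hypothetical stand-in for real indexing work and does not use PersistedIndexer; it only shows how the work is moved off the main isolate.

```dart
import 'dart:isolate';

/// Builds a term -> docId -> positions map. Stands in for real indexing work.
Map<String, Map<String, List<int>>> buildPostings(
    Map<String, String> documents) {
  final postings = <String, Map<String, List<int>>>{};
  documents.forEach((docId, text) {
    final terms = text
        .toLowerCase()
        .split(RegExp(r'\s+'))
        .where((t) => t.isNotEmpty)
        .toList();
    for (var i = 0; i < terms.length; i++) {
      postings
          .putIfAbsent(terms[i], () => {})
          .putIfAbsent(docId, () => [])
          .add(i);
    }
  });
  return postings;
}

Future<void> main() async {
  final documents = {'doc1': 'a rose is a rose', 'doc2': 'rose water'};
  // Isolate.run spawns a worker isolate, runs the closure there and
  // returns its result to the calling isolate.
  final postings = await Isolate.run(() => buildPostings(documents));
  print(postings['rose']); // {doc1: [1, 4], doc2: [0]}
}
```

Values passed into and returned from the worker must be sendable between isolates, so open database connections would typically need to be created inside the closure instead of captured by it.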
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed term in each document.
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
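To make the term, token and tokenizer definitions concrete, here is a minimal tokenizer sketch. SketchToken, tokenize and the default stop-word set are hypothetical and unrelated to the package's Tokenizer class. The sketch applies only a character filter and a term filter, with no stemmer or lemmatizer.

```dart
/// Hypothetical token type: a term and its position in the text.
class SketchToken {
  final String term;
  final int position;
  const SketchToken(this.term, this.position);
}

/// Splits [text] into lower-cased tokens, dropping punctuation and stop words.
/// Positions are counted before stop words are removed.
List<SketchToken> tokenize(String text,
    {Set<String> stopWords = const {'a', 'the', 'of'}}) {
  final words = text
      .toLowerCase()
      .replaceAll(RegExp(r'[^a-z0-9\s]'), '') // character filter
      .split(RegExp(r'\s+'))
      .where((w) => w.isNotEmpty)
      .toList();
  return [
    for (var i = 0; i < words.length; i++)
      if (!stopWords.contains(words[i])) SketchToken(words[i], i) // term filter
  ];
}

void main() {
  for (final token in tokenize('The Theory of Retrieval!')) {
    print('${token.term} @ ${token.position}');
  }
  // prints "theory @ 1" then "retrieval @ 3"
}
```

Note how the term "theory" differs from the word "Theory" in the text: the term depends on the filters the tokenizer applies.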
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.