text_indexing 0.0.1
Dart library for creating an inverted index on a collection of text documents.
THIS PACKAGE IS PRE-RELEASE AND SUBJECT TO DAILY BREAKING CHANGES.
Objective #
The objective of this package is to provide an interface and implementation classes that build and maintain a term dictionary, which holds the vocabulary of terms and the frequency of occurrence of each term in the corpus, and a postings map, which holds the references to the documents for each term. In this implementation, the postings include the positions of each term in the documents, so that search algorithms can derive relevance on a per-document basis.
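To make the two structures concrete, here is a minimal, self-contained sketch of a positional inverted index held in two maps. It does not use this package: `TinyIndex`, its `add` method and the naive word splitter are hypothetical names chosen for illustration, and the sketch assumes each `docId` is added only once.

```dart
/// Hypothetical sketch, not the package's API: a positional inverted index
/// held in two in-memory maps.
class TinyIndex {
  /// term -> number of documents that contain the term.
  final Map<String, int> dictionary = {};

  /// term -> (docId -> zero-based positions of the term in that document).
  final Map<String, Map<String, List<int>>> postings = {};

  /// Indexes [text] under [docId] using a naive lowercase word splitter.
  /// Assumes each [docId] is added only once.
  void add(String docId, String text) {
    final terms = text
        .toLowerCase()
        .split(RegExp(r'[^a-z0-9]+'))
        .where((t) => t.isNotEmpty)
        .toList();
    for (var position = 0; position < terms.length; position++) {
      final term = terms[position];
      final docs = postings.putIfAbsent(term, () => {});
      if (!docs.containsKey(docId)) {
        // First time this term is seen in this document.
        dictionary.update(term, (n) => n + 1, ifAbsent: () => 1);
      }
      docs.putIfAbsent(docId, () => []).add(position);
    }
  }
}

void main() {
  final index = TinyIndex()
    ..add('doc1', 'The quick brown fox')
    ..add('doc2', 'The lazy dog and the fox');
  print(index.dictionary['the']); // 2
  print(index.postings['the']); // {doc1: [0], doc2: [0, 4]}
}
```

Storing positions rather than a bare document list is what lets a search layer score phrase matches or term proximity per document.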
Definitions #
The following definitions are used throughout the documentation:
- `corpus` - the collection of documents for which an index is maintained.
- `dictionary` - a hash of terms (the vocabulary) to their frequency of occurrence in the corpus documents. In this implementation, `Dictionary` is a type definition for a hashmap with the vocabulary as key and the document frequency as value.
- `document` - a record in the corpus that has a unique identifier (`docId`) in the corpus's primary key and contains one or more text fields that are indexed.
- `index` - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of each indexed term in each document.
- `postings` - a separate index that records which documents each term of the vocabulary occurs in. In this implementation we also record the positions of each term in the document to create a positional inverted index.
- `postings list` - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- `term` - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus, depending on the tokenizer used.
- `text` - the indexable content of a document.
- `token` - a representation of a term in a text source, returned by a tokenizer. The token may include information about the term, such as its position(s) in the text or its frequency of occurrence.
- `tokenizer` - a function that returns a collection of tokens from text after applying a character filter, term filter, stemmer and/or lemmatizer.
- `vocabulary` - the collection of terms/words indexed from the corpus.
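The relationship between text, tokenizer, token and term can be sketched as follows. This is an illustration only, not the tokenizer shipped with the package: `SimpleToken`, `tokenize` and the default stop-word set are hypothetical. Lowercasing and splitting stand in for the character filter, the stop-word check stands in for the term filter, and no stemmer or lemmatizer is applied.

```dart
/// Hypothetical token type: a term plus its zero-based position in the text.
class SimpleToken {
  final String term;
  final int position;
  const SimpleToken(this.term, this.position);

  @override
  String toString() => '$term@$position';
}

/// Naive tokenizer: lowercases, splits on non-alphanumerics, drops stop words.
List<SimpleToken> tokenize(String text,
    {Set<String> stopWords = const {'a', 'the', 'of'}}) {
  final tokens = <SimpleToken>[];
  var position = 0;
  for (final word in text.toLowerCase().split(RegExp(r'[^a-z0-9]+'))) {
    if (word.isEmpty) continue;
    // Positions count every word, so removed stop words leave gaps.
    if (!stopWords.contains(word)) tokens.add(SimpleToken(word, position));
    position++;
  }
  return tokens;
}

void main() {
  print(tokenize('The Art of Indexing a Corpus'));
  // [art@1, indexing@3, corpus@5]
}
```

Note that the emitted terms ("art") differ from the words in the text ("Art"), which is why a term may not match the corpus verbatim.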
Interface #
The text indexing classes (indexers) in this library inherit from `TextIndexer`, an interface intended for information retrieval software applications. The `TextIndexer` interface is consistent with information retrieval theory.
The inverted index comprises two artifacts:
- a `Dictionary`, a hashmap of `DictionaryEntry`s with the vocabulary as key and the document frequency as value; and
- a `Postings`, a hashmap of `PostingsEntry`s with the vocabulary as key and the postings lists for the linked documents as values.
The `Dictionary` and `Postings` can be asynchronous data sources or in-memory hashmaps. The `TextIndexer` reads from and writes to these artifacts using the asynchronous methods `loadTerms`, `updateDictionary`, `loadTermPostings` and `upsertTermPostings`.
The `index` method indexes text from a document and returns a list of `PostingsList` that is also emitted by `postingsStream`. To do so, `index` calls `emit`, passing the list of `PostingsList`.
The `emit` method adds an event to the `postingsStream`.
Listen to `postingsStream` to update your dictionary and postings map.
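The listen-and-update pattern described above can be sketched with a plain `StreamController` from `dart:async`. Nothing here comes from the package: `TermPositions` is a hypothetical stand-in for a postings list, `applyEvent` is a hypothetical helper, and the controller stands in for an indexer's `postingsStream`.

```dart
import 'dart:async';

/// Hypothetical stand-in for a postings list: one term's positions in one
/// document.
class TermPositions {
  final String term;
  final String docId;
  final List<int> positions;
  const TermPositions(this.term, this.docId, this.positions);
}

/// Applies one stream event to the dictionary and postings maps.
void applyEvent(List<TermPositions> event, Map<String, int> dictionary,
    Map<String, Map<String, List<int>>> postings) {
  for (final entry in event) {
    final docs = postings.putIfAbsent(entry.term, () => {});
    docs[entry.docId] = entry.positions;
    // Document frequency is the number of documents that list the term.
    dictionary[entry.term] = docs.length;
  }
}

Future<void> main() async {
  final controller = StreamController<List<TermPositions>>();
  final dictionary = <String, int>{};
  final postings = <String, Map<String, List<int>>>{};

  // The listener keeps both maps current as events arrive.
  controller.stream
      .listen((event) => applyEvent(event, dictionary, postings));

  // Stands in for an indexer emitting the result of indexing one document.
  controller.add(const [
    TermPositions('fox', 'doc1', [3]),
    TermPositions('quick', 'doc1', [1]),
  ]);

  // Completes once the listener has received all events.
  await controller.close();
  print(dictionary); // {fox: 1, quick: 1}
}
```

Deriving the document frequency from the postings map keeps the update idempotent if the same document is emitted twice.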
Implementing classes override the following fields:
- `Tokenizer` is the `Tokenizer` instance used by the indexer to parse documents to tokens; and
- `postingsStream` emits a list of `PostingsList` instances whenever a document is indexed.
Implementing classes override the following asynchronous methods:
- `index` indexes text from a document, returning a list of `PostingsList` and adding it to the `postingsStream` by calling `emit`;
- `emit` is called by `index`, and adds an event to the `postingsStream` after updating the `Dictionary` and `Postings`;
- `loadTerms` returns a `Dictionary` for a vocabulary from a `Dictionary` data source;
- `updateDictionary` passes new or updated `DictionaryEntry` instances for persisting to a `Dictionary`;
- `loadTermPostings` returns `PostingsEntry` entities for a vocabulary from `Postings`; and
- `upsertTermPostings` passes new or updated `PostingsEntry` instances for upserting to `Postings`.
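The load-then-persist cycle implied by `loadTerms` and `updateDictionary` might look roughly like the sketch below. It is written under assumptions and does not reproduce the package's signatures: `FakeDictionaryStore`, `load`, `upsert` and `recordDocument` are hypothetical names, and an in-memory map stands in for an asynchronous data source.

```dart
/// Hypothetical in-memory stand-in for an asynchronous dictionary source.
class FakeDictionaryStore {
  final Map<String, int> _rows = {};

  /// Returns the stored entries for [terms]; unknown terms are omitted.
  Future<Map<String, int>> load(Iterable<String> terms) async => {
        for (final term in terms)
          if (_rows.containsKey(term)) term: _rows[term]!,
      };

  /// Persists new or updated entries.
  Future<void> upsert(Map<String, int> entries) async {
    _rows.addAll(entries);
  }
}

/// Loads the entries for the terms of one newly indexed document, increments
/// each document frequency, and writes the result back.
Future<Map<String, int>> recordDocument(
    FakeDictionaryStore store, Set<String> terms) async {
  final current = await store.load(terms);
  final updated = {for (final term in terms) term: (current[term] ?? 0) + 1};
  await store.upsert(updated);
  return updated;
}

Future<void> main() async {
  final store = FakeDictionaryStore();
  await recordDocument(store, {'quick', 'fox'});
  print(await recordDocument(store, {'fox', 'dog'})); // {fox: 2, dog: 1}
}
```

Loading only the entries for the terms in the current document keeps each round trip proportional to the document size rather than to the whole vocabulary.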
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press. 2008
- University of Cambridge, 2016 "Information Retrieval", course notes, Dr Ronan Cummins
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Install #
In the `pubspec.yaml` of your Flutter project, add the following dependency:
dependencies:
  text_indexing: ^0.0.1
In your code file add the following import:
import 'package:text_indexing/text_indexing.dart';
Usage #
Examples are provided for the `InMemoryIndexer` and `PersistedIndexer`, two implementations of the `TextIndexer` interface that inherit from `TextIndexerBase`.
TextIndexerBase Class #
The `TextIndexerBase` is an abstract base class that implements the `TextIndexer.index` and `TextIndexer.emit` methods.
Subclasses of `TextIndexerBase` may override the `TextIndexerBase.emit` method to perform additional actions whenever a document is indexed.
InMemoryIndexer Class #
The `InMemoryIndexer` is a subclass of `TextIndexerBase` that builds and maintains in-memory `Dictionary` and `PostingMap` hashmaps. These hashmaps are updated whenever `InMemoryIndexer.emit` is called at the end of the `InMemoryIndexer.index` method, so awaiting a call to `InMemoryIndexer.index` will provide access to the updated `InMemoryIndexer.dictionary` and `InMemoryIndexer.postings` collections.
The `InMemoryIndexer` is suitable for indexing a smaller corpus. An example of the use of `InMemoryIndexer` is included in the package examples.
PersistedIndexer Class #
The `PersistedIndexer` is a subclass of `TextIndexerBase` that asynchronously reads from and writes to dictionary and postings data sources. These data sources are asynchronously updated whenever `PersistedIndexer.emit` is called by the `PersistedIndexer.index` method.
The `PersistedIndexer` is suitable for indexing and searching a large corpus, but may incur some latency penalty and processing overhead. An example of the use of `PersistedIndexer` is included in the package examples.
Issues #
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.