Create an inverted index on a collection of text documents.
THIS PACKAGE IS PRE-RELEASE AND SUBJECT TO DAILY BREAKING CHANGES.
Skip to section:
Overview
This library provides interfaces and implementation classes that build and maintain a (positional, zoned) inverted index for a collection of documents or corpus
(see definitions).
The indexer uses a tokenizer
to construct three artifacts:
- the
dictionary
holds thevocabulary
ofterms
and the frequency of occurrence for eachterm
in thecorpus
; - the
k-gram index
mapsk-grams
toterms
in thedictionary
; - the
keyword postings
index maps the keywords in the corpus to document references with the keyword score for the keyword in that document; and - the
postings
index holds a list of references to thedocuments
for eachterm
(postings list
). Thepostings list
includes the positions of theterm
in the document'szones
(fields), making thepostings
apositional, zoned inverted index
.
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Performance
A sample data set consisting of stock data for the U.S. markets was used to benchmark performance of TextIndexer and InvertedIndex implementations. The data set contains 20,770 JSON documents with basic information on each stock and the JSON data file is 22MB in size.
For the benchmarking tests we created an implementation InvertedIndex class that uses Hive as local storage and benchmarked that against InMemoryIndex. Both indexes were given the same phrase length (2), k-gram length (2) and zones ('name', 'symbol', 'ticker', 'description', 'hashTag')
.
Benchmarking was performed as part of unit tests in the VS Code IDE on a Windows 10 workstation with an Intel(R) Core(TM) i9-7900X CPU running at 3.30GHz and 64GB of DDR4-2666 RAM.
Indexing the corpus
The typical times taken by TextIndexer to index our sample dataset to 243,700 terms and 18,276 k-grams using InMemoryIndex vs the Hive-based index is shown below.
InvertedIndex | Elapsed time | Per document |
---|---|---|
InMemoryIndex | ~15 seconds | 0.68 mS |
Hive-based Index | ~41 minutes | 112 mS |
Building the index while holding all the index artifacts in memory is 165 times faster than placing them in a Hive box (a relatively fast local storage option).
If memory and the size of the corpus allows, InMemoryIndex is a clear winner. The memory required for the postings
, in particular, may not make this practical for larger document collections. The AsyncCallbackIndex class provides the flexibility to access each of the three index hashmaps from a different data source, so implementing applications can, for example, hold the dictionary
and k-gram
indexes in memory, with the postings
in local storage. Alternatively in-memory caching may also provide performance improvements for a corpus with many repeated terms.
Regardless of the InvertedIndex implementation, applications should avoid running the TextIndexer in the same thread as the user interface to avoid UI "jank".
The dictionary
, k-gram index
and postings
are 8MB, 41MB and 362MB in size, respectively, for our sample index of 243,700 terms and 18,276 k-grams.
Querying the indexes
Having created a persisted index on our sample data set, we ran a query on a search phrase of 9 terms we know are present in the sample data. The query requires a few round trips to each of the three indexes to match the terms, calculate the inverse document term frequencies
etc. The elapsed time for retrieving the data from the InMemoryIndex vs the Hive-based index is shown below.
InvertedIndex | Elapsed time |
---|---|
InMemoryIndex | ~22 mS |
Hive-based Index | ~205 mS |
As expected, the InMemoryIndex is quicker than the Hive-based index, but the differences are unlikely to be material in a real-world application, even for predictive text or auto-correct applications.
Usage
In the pubspec.yaml
of your flutter project, add the text_indexing
dependency.
dependencies:
text_indexing: <latest version>
In your code file add the text_indexing
import.
// import the core classes
import 'package:text_indexing/text_indexing.dart';
// import the typedefs, if needed
import 'package:text_indexing/type_definitions.dart';
// import the extensions, if needed
import 'package:text_indexing/extensions.dart';
For small collections, instantiate a TextIndexer with a InMemoryIndex, (optionally passing empty Dictionary
and Postings
hashmaps).
Call the TextIndexer.indexCollection method to a a collection of documents to the index.
// initialize an in=memory index for a JSON collection with two indexed fields
final myIndex = InMemoryIndex(
zones: {'name': 1.0, 'description': 0.5},
nGramRange: NGramRange(1, 2));
// - initialize a `TextIndexer`, passing in the index
final indexer =TextIndexer(index: myIndex);
// - index the json collection `documents`
await indexer.indexCollection(documents);
The examples demonstrate the use of the TextIndexer with a InMemoryIndex and AsyncCallbackIndex.
API
The API exposes the TextIndexer interface that builds and maintains an InvertedIndex for a collection of documents.
To maximise performance of the indexers the API performs lookups in nested hashmaps of DART core types.
The API contains a fair amount of boiler-plate, but we aim to make the code as readable, extendable and re-usable as possible:
- We use an
interface > implementation mixin > base-class > implementation class pattern
:- the
interface
is an abstract class that exposes fields and methods but contains no implementation code. Theinterface
may expose a factory constructor that returns animplementation class
instance; - the
implementation mixin
implements theinterface
class methods, but not the input fields; - the
base-class
is an abstract class with theimplementation mixin
and exposes a default, unnamed generative const constructor for sub-classes. The intention is thatimplementation classes
extend thebase class
, overriding theinterface
input fields with final properties passed in via a const generative constructor; and - the class naming convention for this pattern is
"Interface" > "InterfaceMixin" > "InterfaceBase"
.
- the
- To improve code legibility the API makes use of type aliases, callback function definitions and extensions. The typedefs and extensions are not exported by the text_indexing library, but can be found in the type_definitions and extensions mini-libraries. Import these libraries seperately if needed.
InvertedIndex
The InvertedIndex interface exposes properties and methods for working with Dictionary, KGramIndex and Postings hashmaps.
Two implementation classes are provided:
- the InMemoryIndex class is intended for fast indexing of a smaller corpus using in-memory dictionary, k-gram and postings hashmaps; and
- the AsyncCallbackIndex is intended for working with a larger corpus. It uses asynchronous callbacks to perform read and write operations on
dictionary
,k-gram
andpostings
repositories.
N-Gram Range
InvertedIndex.nGramRange
is the range of N-gram lengths to generate. The minimum n-gram length is 1.
If n-gram length is greater than 1, the index vocabulary also contains n-grams up to nGramRange.max
long, concatenated from consecutive terms. The index size is increased by a factor of[nGramRange.max
. The nGramRange
default is NGramRange(1,2)
for InMemoryIndex and AsyncCallbackIndex.
Zones
InvertedIndex.zones
is a hashmap of zone names to their relative weight in the index.
If zones
is empty, all the text fields of the collection will be indexed, which may increase the size of the index significantly.
K-gram length (k)
InvertedIndex.k
is the length of k-gram entries in the k-gram index.
The preferred k-gram length is 3, or a tri-gram
. This results in a good compromise between the length of the k-gram index and search efficiency.
TextIndexer
TextIndexer is an interface for classes that construct and maintain a InvertedIndex.
Text or documents can be indexed by calling the following methods:
- indexText indexes text;
- indexJson indexes the fields in a
JSON
document; and - indexCollection indexes the fields of all the documents in a JSON document collection.
Definitions
The following definitions are used throughout the documentation:
corpus
- the collection ofdocuments
for which anindex
is maintained.cosine similarity
- similarity of two vectors measured as the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their euclidian lengths (from Wikipedia).character filter
- filters characters from text in preparation of tokenization .Damerau–Levenshtein distance
- a metric for measuring theedit distance
between twoterms
by counting the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change oneterm
into the other (from Wikipedia).dictionary (in an index)
- a hash ofterms
(vocabulary
) to the frequency of occurence in thecorpus
documents.document
- a record in thecorpus
, that has a unique identifier (docId
) in thecorpus
's primary key and that contains one or more text fields that are indexed.document frequency (dFt)
- the number of documents in thecorpus
that contain a term.edit distance
- a measure of how dissimilar two terms are by counting the minimum number of operations required to transform one string into the other (from Wikipedia).etymology
- the study of the history of the form of words and, by extension, the origin and evolution of their semantic meaning across time (from Wikipedia).Flesch reading ease score
- a readibility measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document (from Wikipedia).Flesch-Kincaid grade level
- a readibility measure relative to U.S. school grade level. It is also calculated from sentence length and word length (from Wikipedia).IETF language tag
- a standardized code or tag that is used to identify human languages in the Internet. (from Wikepedia).index
- an inverted index used to look updocument
references from thecorpus
against avocabulary
ofterms
.index-elimination
- selecting a subset of the entries in an index where theterm
is in the collection ofterms
in a search phrase.inverse document frequency (iDft)
- a normalized measure of how rare aterm
is in the corpus. It is defined aslog (N / dft)
, where N is the total number of terms in the index. TheiDft
of a rare term is high, whereas theiDft
of a frequent term is likely to be low.Jaccard index
measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets (from Wikipedia).Map<String, dynamic>
is an acronym for"Java Script Object Notation"
, a common format for persisting data.k-gram
- a sequence of (any) k consecutive characters from aterm
. Ak-gram
can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.lemma or lemmatizer
- lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).n-gram
(sometimes also called Q-gram) is a contiguous sequence ofn
items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. Then-grams
typically are collected from a text or speechcorpus
. When the items are words,n-grams
may also be called shingles (from Wikipedia).Natural language processing (NLP)
is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data (from Wikipedia).Part-of-Speech (PoS) tagging
is the task of labelling every word in a sequence of words with a tag indicating what lexical syntactic category it assumes in the given sequence (from Wikipedia).Phonetic transcription
- the visual representation of speech sounds (or phones) by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the International Phonetic Alphabet (from Wikipedia).postings
- a separate index that records whichdocuments
thevocabulary
occurs in. In a positionalindex
, the postings also records the positions of eachterm
in thetext
to create a positional invertedindex
.postings list
- a record of the positions of aterm
in adocument
. A position of aterm
refers to the index of theterm
in an array that contains all theterms
in thetext
. In a zonedindex
, thepostings lists
records the positions of eachterm
in thetext
azone
.stem or stemmer
- stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form) (from Wikipedia).stopwords
- common words in a language that are excluded from indexing.term
- a word or phrase that is indexed from thecorpus
. Theterm
may differ from the actual word used in the corpus depending on thetokenizer
used.term filter
- filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking astemmer
and / orlemmatizer
.term expansion
- finding terms with similar spelling (e.g. spelling correction) or synonyms for a term.term frequency (Ft)
- the frequency of aterm
in an index or indexed object.term position
- the zero-based index of aterm
in an ordered array ofterms
tokenized from thecorpus
.text
- the indexable content of adocument
.token
- representation of aterm
in a text source returned by atokenizer
. The token may include information about theterm
such as its position(s) (term position
) in the text or frequency of occurrence (term frequency
).token filter
- returns a subset oftokens
from the tokenizer output.tokenizer
- a function that returns a collection oftoken
s fromtext
, after applying a character filter,term
filter, stemmer and / or lemmatizer.vocabulary
- the collection ofterms
indexed from thecorpus
.zone
- the field or zone of a document that a term occurs in, used for parametric indexes or where scoring and ranking of search results attribute a higher score to documents that contain a term in a specific zone (e.g. the title rather that the body of a document).
References
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, 2016 "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
- Wikipedia (4), "Synonym", from Wikipedia, the free encyclopedia
- Wikipedia (5), "Jaccard Index", from Wikipedia, the free encyclopedia
- Wikipedia (6), "Flesch–Kincaid readability tests", from Wikipedia, the free encyclopedia
- Wikipedia (7), "Edit distance", from Wikipedia, the free encyclopedia
- Wikipedia (8), "Damerau–Levenshtein distance", from Wikipedia, the free encyclopedia
- Wikipedia (9), "Natural language processing", from Wikipedia, the free encyclopedia
- Wikipedia (10), "IETF language tag", from Wikipedia, the free encyclopedia
- Wikipedia (11), "Phonetic transcription", from Wikipedia, the free encyclopedia
- Wikipedia (12), "Etymology", from Wikipedia, the free encyclopedia
- Wikipedia (13), "Part-of-speech tagging", from Wikipedia, the free encyclopedia
- Wikipedia (14), "N-gram", from Wikipedia, the free encyclopedia
- Wikipedia (15), "Cosine similarity", from Wikipedia, the free encyclopedia
Issues
If you find a bug please fill an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
Libraries
- extensions
- A mini-lbrary of
text_indexing
that exports all the extensions in thetext_indexing
package. - text_indexing
- Dart library for creating an inverted index on a collection of text documents.
- type_definitions
- A mini-lbrary of
text_indexing
that exports all the typedefs in thetext_indexing
package.