free_text_search
Search an inverted positional index and return ranked references to documents relevant to the search phrase.
THIS PACKAGE IS IN BETA DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview
The components of this library:
- parse a free-text phrase with query modifiers to a query;
- search the dictionary and postings of a text index for the query terms;
- perform iterative scoring and ranking of the returned dictionary entries and postings; and
- return ranked references to documents relevant to the search phrase.
Query phrases can include modifiers broadly consistent with Google search modifiers.
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage
In the pubspec.yaml of your Flutter project, add the free_text_search dependency.
dependencies:
  free_text_search: <latest version>
In your code file, add the free_text_search import.
import 'package:free_text_search/free_text_search.dart';
To parse a phrase, pass it to the QueryParser.parse method, including any modifiers, as shown in the snippet below.
// A phrase with all the modifiers
const phrase =
    '"athletics track" +surfaced arena OR stadium "Launceston" -hobart NOT help-me';
// Pass the phrase to a QueryParser instance parse method
final queryTerms = await QueryParser().parse(phrase);
// The following terms and their `[MODIFIER]` properties are returned
// "athletics track" [EXACT]
// "athletics" [OR]
// "track" [OR]
// "surfaced" [IMPORTANT]
// "arena" [AND]
// "stadium" [OR]
// "Launceston" [EXACT]
// "launceston" [OR]
// "hobart" [NOT]
// "help-me" [NOT]
// "help" [NOT]
The examples demonstrate the use of the QueryParser and PersistedIndexer.
API
FreeTextSearch class
The FreeTextSearch class exposes the search method that returns a list of SearchResult instances in descending order of relevance.
The length of the returned collection of SearchResult can be limited by passing a limit parameter to search. The default limit is 20.
After parsing the phrase to terms, the Postings and Dictionary for the query terms are asynchronously retrieved from the index:
- FreeTextSearch.dictionaryLoader retrieves Dictionary;
- FreeTextSearch.postingsLoader retrieves Postings;
- FreeTextSearch.configuration is used to tokenize the query phrase (defaults to English.configuration); and
- provide a custom tokenFilter if you want to manipulate tokens or restrict tokenization to tokens that meet specific criteria (default is TextAnalyzer.defaultTokenFilter).
Ensure that the FreeTextSearch.configuration and FreeTextSearch.tokenFilter match the TextAnalyzer used to construct the index on the target collection that will be searched.
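The asynchronous retrieval described above can be pictured with the following minimal sketch. It does not use the package: the typedefs LoadDictionary and LoadPostings, the in-memory maps, and the loader functions are all assumptions made for illustration, and the package's actual loader signatures may differ.

```dart
// Hypothetical loader signatures for illustration only; check the package
// API documentation for the real typedefs.
typedef LoadDictionary = Future<Map<String, int>> Function(
    Iterable<String> terms);
typedef LoadPostings = Future<Map<String, Map<String, List<int>>>> Function(
    Iterable<String> terms);

// A tiny in-memory index standing in for a persisted one.
const _dictionary = {'stadium': 2, 'track': 2, 'arena': 1};
const _postings = {
  'stadium': {
    'doc1': [1],
    'doc2': [6]
  },
  'track': {
    'doc1': [2],
    'doc2': [1]
  },
  'arena': {
    'doc3': [0]
  },
};

/// Returns the dictionary entries for those [terms] that are in the index.
Future<Map<String, int>> loadDictionary(Iterable<String> terms) async => {
      for (final term in terms)
        if (_dictionary.containsKey(term)) term: _dictionary[term]!,
    };

/// Returns the postings for those [terms] that are in the index.
Future<Map<String, Map<String, List<int>>>> loadPostings(
        Iterable<String> terms) async =>
    {
      for (final term in terms)
        if (_postings.containsKey(term)) term: _postings[term]!,
    };

Future<void> main() async {
  final LoadDictionary dictionaryLoader = loadDictionary;
  final LoadPostings postingsLoader = loadPostings;
  const terms = ['stadium', 'velodrome'];
  // Start both lookups before awaiting either, since they are independent.
  final dictionaryFuture = dictionaryLoader(terms);
  final postingsFuture = postingsLoader(terms);
  print(await dictionaryFuture);
  print(await postingsFuture);
}
```

Only terms that exist in the index are returned, so a query term such as "velodrome" simply yields no entry rather than an error.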
SearchResult class
The SearchResult model represents a ranked search result of a query against a text index:
- SearchResult.docId is the unique identifier of the document result in the corpus; and
- SearchResult.relevance is the relevance score awarded to the document by the scoring and ranking algorithm. Higher scores indicate increased relevance of the document.
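As a rough illustration of "descending order of relevance" with a result limit, the sketch below sorts scored documents and truncates the list. The RankedResult class and the rank function are hypothetical stand-ins, not the package's SearchResult or its scoring algorithm; only the default limit of 20 is taken from the text above.

```dart
// Hypothetical stand-in for illustration; not the package's SearchResult.
class RankedResult {
  final String docId;
  final double relevance;
  const RankedResult(this.docId, this.relevance);
}

/// Sorts [scores] (docId -> relevance) in descending order of relevance and
/// keeps at most [limit] results.
List<RankedResult> rank(Map<String, double> scores, {int limit = 20}) {
  final results = scores.entries
      .map((entry) => RankedResult(entry.key, entry.value))
      .toList()
    ..sort((a, b) => b.relevance.compareTo(a.relevance));
  return results.take(limit).toList();
}

void main() {
  final top = rank({'doc1': 0.4, 'doc2': 1.7, 'doc3': 0.9}, limit: 2);
  for (final result in top) {
    print('${result.docId}: ${result.relevance}');
  }
}
```

With a limit of 2, the lowest-scoring document (doc1) is dropped and doc2 is listed before doc3.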
QueryParser class
The QueryParser parses free text queries, returning a collection of QueryTerm objects that enumerate each term and its QueryTermModifier.
The QueryParser.configuration and QueryParser.tokenFilter should match the TextAnalyzer used to construct the index on the target collection that will be searched.
The QueryParser.parse method parses a phrase to a collection of QueryTerms that includes:
- all the original words in the phrase, except query modifiers ('AND', 'OR', '"', '+', '-', 'NOT'); and
- derived versions of all words returned by the QueryParser.configuration.termFilter, including child words and stems or lemmas of exact phrases.
A QueryTerm for a derived version of a term always has its QueryTerm.modifier property set to QueryTermModifier.OR, unless the term was marked QueryTermModifier.NOT in the query phrase.
FreeTextQuery class
The FreeTextQuery enumerates the properties of a text search query:
- FreeTextQuery.phrase is the unmodified search phrase, including all modifiers and tokens; and
- FreeTextQuery.terms is the ordered list of all terms extracted from the phrase used to look up results in an inverted index.
QueryTerm class
The QueryTerm object extends Token, and enumerates the properties of a term in a free text query phrase:
- QueryTerm.term is the term that will be looked up in the index;
- QueryTerm.termPosition is the zero-based position of the term in an ordered list of all the terms in the source text; and
- QueryTerm.modifier is the QueryTermModifier applied for this term. The default modifier is QueryTermModifier.AND.
QueryTermModifier Enumeration
The phrase can include the following modifiers to guide the search results scoring/ranking algorithm:
- terms or phrases wrapped in double quotes will be marked QueryTermModifier.EXACT (e.g. "athletics track");
- terms preceded by "OR" are marked QueryTermModifier.OR and are alternatives to the preceding term;
- terms preceded by "NOT" or "-" are marked QueryTermModifier.NOT to rank results lower if they include these terms;
- terms following the plus sign "+" are marked QueryTermModifier.IMPORTANT to rank results that include these terms higher; and
- all other terms are marked as QueryTermModifier.AND.
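The modifier rules above can be sketched as a small standalone parser. This is not the package's QueryParser: the Modifier enum, the Term class and parseModifiers are hypothetical names, and the sketch deliberately skips the derived terms (child words, stems, lemmas) that the real parser also returns.

```dart
// Hypothetical stand-ins for illustration; these are NOT the package's types.
enum Modifier { and, or, not, important, exact }

class Term {
  final String text;
  final Modifier modifier;
  const Term(this.text, this.modifier);

  @override
  String toString() => '"$text" [${modifier.name.toUpperCase()}]';
}

/// Splits [phrase] into quoted phrases and bare words, then assigns a
/// modifier to each one following the rules listed above.
List<Term> parseModifiers(String phrase) {
  // Matches a quoted phrase (optionally prefixed by + or -), or a run of
  // non-space characters.
  final pattern = RegExp(r'[+-]?"([^"]+)"|(\S+)');
  final terms = <Term>[];
  Modifier? pending; // Set by a preceding OR / NOT keyword.
  for (final match in pattern.allMatches(phrase)) {
    final quoted = match.group(1);
    if (quoted != null) {
      terms.add(Term(quoted, Modifier.exact));
      pending = null;
      continue;
    }
    var word = match.group(2)!;
    if (word == 'OR') {
      pending = Modifier.or;
      continue;
    }
    if (word == 'NOT') {
      pending = Modifier.not;
      continue;
    }
    if (word == 'AND') {
      pending = null;
      continue;
    }
    var modifier = pending ?? Modifier.and;
    if (word.startsWith('+') && word.length > 1) {
      modifier = Modifier.important;
      word = word.substring(1);
    } else if (word.startsWith('-') && word.length > 1) {
      modifier = Modifier.not;
      word = word.substring(1);
    }
    terms.add(Term(word, modifier));
    pending = null;
  }
  return terms;
}

void main() {
  const phrase =
      '"athletics track" +surfaced arena OR stadium "Launceston" -hobart NOT help-me';
  parseModifiers(phrase).forEach(print);
}
```

A leading "-" is treated as a modifier only at the start of a word, so a hyphenated word such as help-me is kept whole.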
Definitions
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this library relies on a positional inverted index that also includes the positions of the indexed term in each document.
- postings - a separate index that records which documents the vocabulary occurs in.
- postings list - a record of the positions of a term in a document and its fields. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
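To make the definitions concrete, here is a minimal sketch of a positional inverted index built over a two-document corpus. The type aliases, the naive tokenizer and the builder functions are assumptions for illustration and are not the package's data model. In particular, the dictionary here stores the number of documents that contain each term, which is only one possible reading of "frequency of occurrence".

```dart
// Illustrative data shapes only; the package's own types may differ.
// term -> number of documents that contain the term
typedef Dictionary = Map<String, int>;
// term -> docId -> positions of the term in the document text
typedef Postings = Map<String, Map<String, List<int>>>;

/// A naive tokenizer: lower-cases the text and splits on non-alphanumerics.
List<String> tokenize(String text) => text
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Records, for every term, the positions at which it occurs in each document.
Postings buildPostings(Map<String, String> corpus) {
  final postings = <String, Map<String, List<int>>>{};
  corpus.forEach((docId, text) {
    final terms = tokenize(text);
    for (var position = 0; position < terms.length; position++) {
      postings
          .putIfAbsent(terms[position], () => {})
          .putIfAbsent(docId, () => [])
          .add(position);
    }
  });
  return postings;
}

/// Derives the dictionary (term -> document frequency) from the postings.
Dictionary buildDictionary(Postings postings) =>
    postings.map((term, docs) => MapEntry(term, docs.length));

void main() {
  final postings = buildPostings({
    'doc1': 'The stadium track was resurfaced',
    'doc2': 'Athletics track and field at the stadium',
  });
  final dictionary = buildDictionary(postings);
  print(dictionary['track']); // document frequency of "track"
  print(postings['track']); // positions of "track" per document
}
```

Storing positions rather than just document ids is what allows exact-phrase matching, because adjacent terms can be checked for consecutive positions in the same document.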
References
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
Libraries
- free_text_search
- Dart library for creating an inverted index on a collection of text documents.