free_text_search 0.0.1-beta.2
free_text_search #
Search an inverted positional index and return ranked references to documents relevant to the search phrase.
THIS PACKAGE IS IN BETA DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Objective #
The components of this library:
- parse a free-text phrase to a query;
- search the dictionary and postings of a text index for the query terms;
- perform iterative scoring and ranking of the returned dictionary entries and postings; and
- return ranked references to documents relevant to the search phrase.
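The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's API: `parseQuery` and `search` are hypothetical names, the postings map is assumed to have the shape term -> docId -> positions, and raw term frequency stands in for the package's iterative scoring, which is not described here.

```dart
// Illustrative only: hypothetical names, not the package's API.

/// Splits a free-text phrase into lower-case query terms.
List<String> parseQuery(String phrase) => phrase
    .toLowerCase()
    .split(RegExp(r'\s+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Scores each document by how often the query terms occur in it and
/// returns the docIds ordered from most to least relevant.
/// Assumption: [postings] maps term -> docId -> positions in the text.
List<String> search(
    String phrase, Map<String, Map<String, List<int>>> postings) {
  final scores = <String, int>{};
  for (final term in parseQuery(phrase)) {
    // Terms missing from the index simply contribute nothing.
    final docs = postings[term] ?? const <String, List<int>>{};
    docs.forEach((docId, positions) {
      scores[docId] = (scores[docId] ?? 0) + positions.length;
    });
  }
  final ranked = scores.keys.toList()
    ..sort((a, b) {
      final byScore = scores[b]!.compareTo(scores[a]!);
      return byScore != 0 ? byScore : a.compareTo(b); // tie-break on docId
    });
  return ranked;
}
```

With postings in which "the" occurs once in doc1 and twice in doc2, and "quick" occurs once in each, `search('the quick', postings)` ranks doc2 ahead of doc1. A real ranker would likely weight terms (for example by inverse document frequency) and use the recorded positions for phrase matching.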
API #
class FreeTextQuery #
class QueryParser #
Usage #
TODO: describe usage.
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed term in each document.
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
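To make the definitions concrete, here is a minimal sketch of how a tokenizer, postings and dictionary could fit together. All names (`tokenize`, `Postings`, `buildPostings`, `buildDictionary`) are hypothetical and are not taken from this package. The tokenizer only lower-cases and splits on non-alphanumeric characters (no stemming or lemmatizing), and the dictionary is assumed to count total occurrences of each term across the corpus.

```dart
// Illustrative only: hypothetical names, not the package's API.

/// A naive tokenizer: lower-cases the text and splits it on anything
/// that is not a letter or digit. No stemmer or lemmatizer is applied.
List<String> tokenize(String text) => text
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Postings: term -> docId -> positions of the term in the document's text.
typedef Postings = Map<String, Map<String, List<int>>>;

/// Builds positional postings for a corpus given as docId -> text.
Postings buildPostings(Map<String, String> corpus) {
  final Postings postings = {};
  corpus.forEach((docId, text) {
    final terms = tokenize(text);
    for (var position = 0; position < terms.length; position++) {
      postings
          .putIfAbsent(terms[position], () => {})
          .putIfAbsent(docId, () => [])
          .add(position);
    }
  });
  return postings;
}

/// Dictionary: term -> number of occurrences across the whole corpus.
/// Assumption: "frequency" means total occurrences, not document count.
Map<String, int> buildDictionary(Postings postings) =>
    postings.map((term, docs) => MapEntry(
        term,
        docs.values
            .fold<int>(0, (sum, positions) => sum + positions.length)));
```

For a corpus of `{'doc1': 'The quick brown fox', 'doc2': 'The lazy dog and the quick cat'}`, the postings for "quick" are position 1 in doc1 and position 5 in doc2, and the dictionary entry for "the" is 3.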
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.