free_text_search #

Search an inverted positional index and return ranked references to documents relevant to the search phrase.

THIS PACKAGE IS IN BETA DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.

Objective #

The components of this library:

parse a free-text phrase to a query;
search the dictionary and postings of a text index for the query terms;
perform iterative scoring and ranking of the returned dictionary entries and postings; and
return ranked references to documents relevant to the search phrase.
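
The four steps above can be sketched roughly as follows. This is a minimal illustration only, not the package's implementation: `Postings`, `parsePhrase` and `search` are hypothetical names, the tokenization is a naive split on non-alphanumeric characters, and the score is a simple count of matched term occurrences rather than the iterative scoring the library performs.

```dart
// Hypothetical sketch; none of these names come from the package API.

/// Maps term -> (docId -> positions of the term in that document).
typedef Postings = Map<String, Map<String, List<int>>>;

/// Step 1: parse a free-text phrase to lower-case query terms.
List<String> parsePhrase(String phrase) => phrase
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Steps 2 to 4: look up each query term, score the documents and rank them.
List<MapEntry<String, int>> search(Postings postings, String phrase) {
  final scores = <String, int>{};
  for (final term in parsePhrase(phrase)) {
    final entries = postings[term] ?? const <String, List<int>>{};
    entries.forEach((docId, positions) {
      // Naive score (an assumption): total occurrences of matched terms.
      scores[docId] = (scores[docId] ?? 0) + positions.length;
    });
  }
  // Highest score first.
  final ranked = scores.entries.toList()
    ..sort((a, b) => b.value.compareTo(a.value));
  return ranked;
}

void main() {
  final Postings postings = {
    'inverted': {'doc1': [1], 'doc2': [0, 4]},
    'index': {'doc1': [2], 'doc2': [1, 5], 'doc3': [3]},
  };
  for (final hit in search(postings, 'Inverted index')) {
    print('${hit.key}: ${hit.value}');
  }
  // Prints:
  // doc2: 4
  // doc1: 2
  // doc3: 1
}
```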

Free text search overview

API #

class `FreeTextQuery` #

class `QueryParser` #

Usage #

TODO: describe usage.

Definitions #

The following definitions are used throughout the documentation:

corpus - the collection of documents for which an index is maintained.
dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and contains one or more text fields that are indexed.
index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of each indexed term in each document.
postings - a separate index that records which documents each term in the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
text - the indexable content of a document.
token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
vocabulary - the collection of terms indexed from the corpus.
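
To make these definitions concrete, here is a small sketch of how a dictionary, postings and postings lists might fit together in a positional inverted index. All names (`tokenize`, `PositionalIndex`, `add`) are hypothetical and are not taken from this package. The tokenizer is deliberately naive (no stemmer or lemmatizer), and the dictionary is assumed to count every occurrence of a term across the corpus.

```dart
// Hypothetical sketch of the data structures defined above.

/// A naive tokenizer: lower-cases the text and splits on non-alphanumerics.
/// Terms are returned in order, so a term's list index is its position.
List<String> tokenize(String text) => text
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((term) => term.isNotEmpty)
    .toList();

class PositionalIndex {
  /// dictionary: term -> frequency of occurrence in the corpus.
  final Map<String, int> dictionary = {};

  /// postings: term -> (docId -> postings list of term positions).
  final Map<String, Map<String, List<int>>> postings = {};

  /// Indexes the [text] of the document identified by [docId].
  void add(String docId, String text) {
    final terms = tokenize(text);
    for (var position = 0; position < terms.length; position++) {
      final term = terms[position];
      dictionary[term] = (dictionary[term] ?? 0) + 1;
      postings
          .putIfAbsent(term, () => {})
          .putIfAbsent(docId, () => [])
          .add(position);
    }
  }

  /// vocabulary: the collection of terms indexed from the corpus.
  Set<String> get vocabulary => dictionary.keys.toSet();
}

void main() {
  final index = PositionalIndex()
    ..add('doc1', 'The quick brown fox')
    ..add('doc2', 'The fox and the hound');
  print(index.dictionary['the']); // 3
  print(index.postings['fox']); // {doc1: [3], doc2: [1]}
  print(index.postings['the']); // {doc1: [0], doc2: [0, 3]}
}
```

Storing positions rather than only document ids is what distinguishes a positional index: it lets a search check whether query terms appear next to each other in the text.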

References #

Issues #

If you find a bug, please file an issue.

This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.

free_text_search 0.0.1-beta.2