free_text_search 0.0.1-beta.1

Parse a free-text phrase to a query, search a text index and return ranked references to documents relevant to the search phrase.

free_text_search #

THIS PACKAGE IS IN BETA DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.

Objective #

The components of this library:

  • parse a free-text phrase to a query;
  • search the dictionary and postings of a text index for the query terms;
  • perform scoring and ranking of the returned dictionary entries and postings through an iterative process; and
  • return ranked references to documents relevant to the search phrase.
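The four steps above can be sketched in a few lines of Dart. This is a minimal illustration only, not the package's API: `Postings`, `parseQuery` and `search` are hypothetical names, and the score is a plain sum of term frequencies standing in for the iterative scoring and ranking process described above.

```dart
/// Hypothetical stand-in for a positional inverted index:
/// term -> docId -> positions of the term in that document.
typedef Postings = Map<String, Map<String, List<int>>>;

/// Step 1: parse a free-text phrase into lower-cased query terms.
List<String> parseQuery(String phrase) => phrase
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Steps 2 to 4: look up each query term in the postings, score each
/// document by its summed term frequency (an assumed, simplified score)
/// and return the docIds ordered from highest to lowest score.
List<String> search(String phrase, Postings postings) {
  final scores = <String, int>{};
  for (final term in parseQuery(phrase)) {
    final docs = postings[term] ?? const <String, List<int>>{};
    docs.forEach((docId, positions) {
      scores[docId] = (scores[docId] ?? 0) + positions.length;
    });
  }
  final ranked = scores.keys.toList()
    ..sort((a, b) {
      // Highest score first; ties are broken by docId for a stable order.
      final byScore = scores[b]!.compareTo(scores[a]!);
      return byScore != 0 ? byScore : a.compareTo(b);
    });
  return ranked;
}

void main() {
  final Postings postings = {
    'text': {
      'doc1': [0, 4],
      'doc2': [2],
    },
    'search': {
      'doc1': [1],
    },
    'index': {
      'doc2': [0],
    },
  };
  print(search('Text search', postings)); // prints [doc1, doc2]
}
```

Here doc1 scores 3 (two occurrences of "text" plus one of "search") and doc2 scores 1, so doc1 is ranked first.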

Figure: Free text search overview

Definitions #

The following definitions are used throughout the documentation:

  • corpus - the collection of documents for which an index is maintained.
  • dictionary - a hash map from each term in the vocabulary to its frequency of occurrence in the corpus documents.
  • document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and contains one or more indexed text fields.
  • index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed term in each document.
  • postings - a separate index that records which documents each term in the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
  • postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
  • term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
  • text - the indexable content of a document.
  • token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
  • tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
  • vocabulary - the collection of terms indexed from the corpus.
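To make the definitions concrete, the sketch below builds a dictionary and positional postings from a two-document corpus. `Token`, `tokenize` and `TextIndex` are hypothetical names chosen for illustration and are not taken from this package or its dependencies. The tokenizer only lower-cases and splits on non-alphanumeric characters (no stemmer or lemmatizer), and the dictionary is assumed to count every occurrence of a term across the corpus.

```dart
/// Hypothetical token: a term plus its position in the text.
class Token {
  final String term;
  final int position;
  const Token(this.term, this.position);
}

/// Minimal tokenizer: lower-cases the text and splits it on
/// non-alphanumeric characters. No stemming or lemmatization is applied.
List<Token> tokenize(String text) {
  final terms = text
      .toLowerCase()
      .split(RegExp(r'[^a-z0-9]+'))
      .where((t) => t.isNotEmpty)
      .toList();
  return [for (var i = 0; i < terms.length; i++) Token(terms[i], i)];
}

/// Hypothetical index holding a dictionary and positional postings.
class TextIndex {
  /// term -> number of occurrences in the corpus (assumed definition).
  final Map<String, int> dictionary = {};

  /// term -> docId -> postings list (positions of the term in the text).
  final Map<String, Map<String, List<int>>> postings = {};

  /// Indexes the text of one document under its docId.
  void add(String docId, String text) {
    for (final token in tokenize(text)) {
      dictionary[token.term] = (dictionary[token.term] ?? 0) + 1;
      postings
          .putIfAbsent(token.term, () => <String, List<int>>{})
          .putIfAbsent(docId, () => <int>[])
          .add(token.position);
    }
  }
}

void main() {
  final index = TextIndex()
    ..add('doc1', 'Free text search')
    ..add('doc2', 'Search the text index, then search again');
  print(index.dictionary['search']); // prints 3
  print(index.postings['search']); // prints {doc1: [2], doc2: [0, 5]}
}
```

The vocabulary is the key set of `dictionary`, and each inner list in `postings` is one postings list.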

API #

class FreeTextQuery #

class QueryParser #

Usage #

TODO: describe usage.

Issues #

If you find a bug, please file an issue.

This project is a supporting package for a revenue project that has first call on our resources, so please be patient if we don't respond immediately to issues or pull requests.

Publisher

gmconsult.dev (verified publisher)

License

unknown (license)

Dependencies

porter_2_stemmer, text_analysis, text_indexing