text_analysis library

Dart text analyzer that extracts tokens from JSON documents for use in information retrieval systems.

Classes

English: A TextAnalyzer implementation for English language analysis.
LatinLanguageAnalyzer: A TextAnalyzer implementation for the analysis of Latin languages.
NGramRange: Enumerates a range of N-gram sizes (minimum and maximum length).
Porter2Stemmer: Dart implementation of the Porter stemming algorithm (see https://snowballstem.org/algorithms/), used to reduce a word to its stem, base, or root form.
SimilarityIndex: Object model for a term suggested as an alternative to another term. Used in spelling correction and term expansion.
TermCoOccurrenceGraph: A RAKE co-occurrence graph for evaluating the score of keywords extracted from text.
TermCoOccurrenceGraphBase: Base class that implements TermCoOccurrenceGraph and mixes in TermCoOccurrenceGraphMixin.
TermSimilarity: A static/abstract class that exposes methods for computing similarity of terms.
TextAnalyzer: An interface that exposes language-specific properties and methods used in text analysis.
TextDocument: The TextDocument object model enumerates properties for analysing a text document.
Token: A Token represents a term (word) present in a text source.
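
To illustrate the idea behind the N-gram range listed under NGramRange, the following minimal sketch builds every word n-gram whose length lies between a minimum and a maximum size. The function nGrams and its parameters minSize and maxSize are hypothetical names chosen for this example. They are not part of the text_analysis API, which may generate n-grams differently.

```dart
/// Returns all word n-grams of [terms] whose length lies between [minSize]
/// and [maxSize], inclusive.
///
/// Hypothetical helper for illustration only; not part of text_analysis.
List<String> nGrams(List<String> terms, {int minSize = 1, int maxSize = 2}) {
  if (minSize < 1 || maxSize < minSize) {
    throw ArgumentError(
        'Expected 1 <= minSize <= maxSize, got $minSize and $maxSize');
  }
  final result = <String>[];
  for (var n = minSize; n <= maxSize; n++) {
    // Slide a window of n terms across the list.
    for (var i = 0; i + n <= terms.length; i++) {
      result.add(terms.sublist(i, i + n).join(' '));
    }
  }
  return result;
}

void main() {
  // prints [text, analysis, library, text analysis, analysis library]
  print(nGrams(['text', 'analysis', 'library'], minSize: 1, maxSize: 2));
}
```

The n-grams are emitted in order of increasing size, so a caller that only needs single terms can stop reading once the entries start to contain spaces.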

PartOfSpeech: In grammar, a part of speech is a category of words that have similar grammatical properties.
PoSTag: Part-of-speech tags are used in natural language processing for part-of-speech tagging.
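
The TermCoOccurrenceGraph entry above refers to RAKE (Rapid Automatic Keyword Extraction) scoring. As a rough sketch of that technique, and not of the library's actual implementation, the example below scores candidate keyword phrases with the commonly used RAKE metric: each word receives degree divided by frequency, and a phrase's score is the sum of its word scores. The function name rakeScores and its input shape (a list of phrases that are already split into words) are assumptions made for this example.

```dart
/// Scores candidate keyword phrases with the RAKE deg(w) / freq(w) metric.
///
/// Hypothetical helper for illustration only; not part of text_analysis.
/// Each inner list in [phrases] is one candidate phrase split into words.
Map<String, double> rakeScores(List<List<String>> phrases) {
  final frequency = <String, int>{};
  final degree = <String, int>{};
  for (final phrase in phrases) {
    for (final word in phrase) {
      frequency[word] = (frequency[word] ?? 0) + 1;
      // Each occurrence co-occurs with every word in its phrase,
      // itself included.
      degree[word] = (degree[word] ?? 0) + phrase.length;
    }
  }
  final scores = <String, double>{};
  for (final phrase in phrases) {
    var score = 0.0;
    for (final word in phrase) {
      score += degree[word]! / frequency[word]!;
    }
    scores[phrase.join(' ')] = score;
  }
  return scores;
}

void main() {
  final scores = rakeScores([
    ['text', 'analysis'],
    ['analysis'],
    ['information', 'retrieval', 'systems'],
  ]);
  // Print each phrase with its score.
  scores.forEach((phrase, score) => print('$phrase: $score'));
}
```

Longer phrases made of words that rarely appear on their own score highest, which is why "information retrieval systems" outranks "analysis" in this input.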