text_analysis 0.9.1 | Dart package

Text analyzer that extracts tokens and k-grams from text. #

THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.

Skip to section:

Overview
Usage
API
Definitions
References
Issues

Overview #

To tokenize text in preparation of constructing a dictionary from a corpus of documents in an information retrieval system.

The tokenization process comprises the following steps:

a term splitter splits text to a list of terms at appropriate places like white-space and mid-sentence punctuation;
a character filter manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
the tokenizer converts the resulting terms to a collection of tokens that contain the term and a pointer to the position of the term in the source text.

This library also includes an extension method Set<KGram> kGrams([int k = 3]) on Term that parses a set of k-grams of length k from a term. The default k-gram length is 3 (tri-gram).

Text analysis

Refer to the references to learn more about information retrieval systems and the theory behind this library.

Usage #

In the pubspec.yaml of your flutter project, add the following dependency:

dependencies:
  text_analysis: <latest version>

In your code file add the following import:

import 'package:text_analysis/text_analysis.dart';

Basic English text analysis can be performed by using a TextAnalyzer instance with the default configuration and no token filter:

  /// Use a TextAnalyzer instance to tokenize the [text] using the default 
  /// [English] configuration.
  final document = await TextAnalyzer().tokenize(text);

For more complex text analysis:

implement a TextAnalyzerConfiguration for a different language or tokenizing non-language documents;
implement a custom ITextAnalyzeror extend TextAnalyzerBase; and/or
pass in a TokenFilter function to a TextAnalyzer to manipulate the tokens after tokenization as shown in the examples.

API #

The key members of the text_analysis library are briefly described in this section. Please refer to the documentation for details.

Skip to:

Type definitions
Object models
Interfaces
Implementation classes

Type definitions #

The API uses the following function type definitions and type aliases to improve code readability:

CharacterFilter is a function that manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
Ft is an lias for int and denotes the frequency of a Term in an index or indexed object (the term frequency).
IdFt is an alias for double, where it represents the inverse document frequency of a term, defined as idft = log (N / dft), where N is the total number of terms in the index and dft is the document frequency of the term (number of documents that contain the term).
JsonTokenizer is a function that returns Token collection from the fields in a JSON document hashmap of Zone to value;
kGram is an alias for String, used in the context of a sequence of k consecutive characters in a Term.
SourceText, Zone and Term are all aliases for the DART core type String when used in different contexts;
StopWords is an alias for Set<String>;
TermFilter is a function that manipulates a Term collection by splitting compound or hyphenated terms or applying stemming and lemmatization. The TermFilter can also filter out stopwords;
TermSplitter is a function that splits SourceText to an orderd list of Term at appropriate places like white-space and mid-sentence punctuation;
TokenFilter is a function that returns a subset of a Token collection, preserving its sort order; and
Tokenizer is a function that converts SourceText to a Token collection, preserving the order of the Term instances.

Object models #

The text_analysis library includes the following object-model classes:

a Token represents a Term present in a List<Token> with its position and optional zone name.

Interfaces #

The text_analysis library exposes two interfaces:

the TextAnalyzerConfiguration interface; and
the ITextAnalyzer interface.

TextAnalyzerConfiguration Interface

The TextAnalyzerConfiguration interface exposes language-specific properties and methods used in text analysis:

a TextAnalyzerConfiguration.termSplitter to split the text into terms;
a TextAnalyzerConfiguration.characterFilter to remove non-word characters.
a TextAnalyzerConfiguration.termFilter to apply a stemmer/lemmatizer or stopword list.

ITextAnalyzer Interface

The ITextAnalyzer is an interface for a text analyser class that extracts tokens from text for use in full-text search queries and indexes:

ITextAnalyzer.configuration is a TextAnalyzerConfiguration used by the ITextAnalyzer to tokenize source text.
Provide a ITextAnalyzer.tokenFilter to manipulate tokens or restrict tokenization to tokens that meet criteria for either index or count;
the ITextAnalyzer.tokenize function tokenizes text to a List<Token> that contains all the Tokens in the text; and
the ITextAnalyzer.tokenizeJson function tokenizes a JSON hashmap to a List<Token> that contains all the Tokens in the document.

Implementation classes #

The latest version provides the following implementation classes:

the English class implements TextAnalyzerConfiguration and provides text analysis configuration properties for the English language;
the TextAnalyzerBase abstract class implements ITextAnalyzer.tokenize and ITextAnalyzer.tokenizeJson; and
the TextAnalyzer class extends TextAnalyzerBase and implements ITextAnalyzer.tokenFilter and ITextAnalyzer.configuration as final fields with their values passed in as (optional) parameters (with defaults) at initialization.

English class

A basic TextAnalyzerConfiguration implementation for English language analysis.

The termFilter applies the following algorithm:

apply the characterFilter to the term;
if the resulting term is empty or contained in kStopWords, return an empty collection; else
insert the filterered term in the return value;
split the term at commas, periods, hyphens and apostrophes unless preceded and ended by a number;
if the term can be split, add the split terms to the return value, unless the (split) terms are in kStopWords or are empty strings.

The characterFilter function:

returns the term if it can be parsed as a number; else
converts the term to lower-case;
changes all quote marks to single apostrophe +U0027;
removes enclosing quote marks;
changes all dashes to single standard hyphen;
removes all non-word characters from the term;
replaces all characters except letters and numbers from the end of the term.

TextAnalyzerBase class

The TextAnalyzerBase class implements the ITextAnalyzer.tokenize method:

tokenizes source text using the configuration;
manipulates the output by applying tokenFilter; and, finally
returns a List<Token> enumerating all the Tokens in the source text.

TextAnalyzer Class

The TextAnalyzer class extends TextAnalyzerBase:

implements configuration and tokenFilter as final fields passed in as optional parameters at instantiation;
configuration is used by the TextAnalyzer to tokenize source text and defaults to English.configuration; and
provide nullable function tokenFilter if you want to manipulate tokens or restrict tokenization to tokens that meet specific criteria. The default is TextAnalyzer.defaultTokenFilter, that applies thePorter2Stemmer).

Definitions #

The following definitions are used throughout the documentation:

corpus- the collection of documents for which an index is maintained.
character filter - filters characters from text in preparation of tokenization.
dictionary - is a hash of terms (vocabulary) to the frequency of occurence in the corpus documents.
document - a record in the corpus, that has a unique identifier (docId) in the corpus's primary key and that contains one or more text zones/fields that are indexed.
index - an inverted index used to look up document references from the corpus against a vocabulary of terms. k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", dentoting the start of the [Term], and end with "$", denoting the end of the [Term]. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form (from Wikipedia).
stopwords - common words in a language that are excluded from indexing.
term frequency (Ft) is the frequency of a term in an index or indexed object.
term position is the zero-based index of a term in an ordered array of terms tokenized from the corpus.
text - the indexable content of a document.
token - representation of a term in a text source returned by a tokenizer. The token may include information about the term position in the source text.
token filter - returns a subset of tokens from the tokenizer output.
tokenizer - a function that returns a collection of tokens from the terms in a text source after applying a character filter and term filter.
vocabulary - the collection of terms indexed from the corpus.

References #

Issues #

If you find a bug please fill an issue.

This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.

text_analysis 0.9.1
text_analysis: ^0.9.1 copied to clipboard

Metadata

Text analyzer that extracts tokens and k-grams from text. #

Overview #

Usage #

API #

Type definitions #

Object models #

Interfaces #

TextAnalyzerConfiguration Interface

ITextAnalyzer Interface

Implementation classes #

English class

TextAnalyzerBase class

TextAnalyzer Class

Definitions #

References #

Issues #

← Metadata

Publisher

Metadata

License

Dependencies

More

text_analysis 0.9.1 text_analysis: ^0.9.1 copied to clipboard

Metadata

Text analyzer that extracts tokens and k-grams from text. #

Overview #

Usage #

API #

Type definitions #

Object models #

Interfaces #

TextAnalyzerConfiguration Interface

ITextAnalyzer Interface

Implementation classes #

English class

TextAnalyzerBase class

TextAnalyzer Class

Definitions #

References #

Issues #

← Metadata

Publisher

Metadata

License

Dependencies

More

text_analysis 0.9.1
text_analysis: ^0.9.1 copied to clipboard