text_analysis 0.3.1
text_analysis #
Text analyzer that extracts tokens from text for use in full-text search queries and indexes.
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview #
This package tokenizes text in preparation for constructing a dictionary from a corpus of documents in an information retrieval system.
The tokenization process comprises the following steps:
- a term splitter splits text into a list of terms at appropriate places such as white-space and mid-sentence punctuation;
- a character filter manipulates terms prior to stemming and tokenization (e.g. changing case and/or removing non-word characters);
- a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
- the tokenizer converts the resulting terms to a collection of tokens that contain the term and a pointer to the position of the term in the source text.
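The four steps above can be sketched in a few lines of plain Dart. This is a minimal illustration and not the package's implementation: `SketchToken`, `sketchStopWords`, `splitTerms`, `filterCharacters`, `filterTerm` and `tokenizeSketch` are hypothetical names, the stopword list is made up, and the position recorded here is simply the index of the raw term in the split list.

```dart
// Hypothetical token type for this sketch: a term plus its position.
// The package's own Token class may carry different fields.
class SketchToken {
  final String term;
  final int position;
  const SketchToken(this.term, this.position);
}

// Assumed stopword list, for illustration only.
const sketchStopWords = {'the', 'a', 'of', 'and'};

// 1. Term splitter: split the text at white-space.
List<String> splitTerms(String text) =>
    text.split(RegExp(r'\s+')).where((t) => t.isNotEmpty).toList();

// 2. Character filter: lower-case the term and drop everything except
//    ASCII letters, digits and hyphens.
String filterCharacters(String term) =>
    term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9-]'), '');

// 3. Term filter: drop stopwords and split hyphenated terms.
List<String> filterTerm(String term) {
  if (term.isEmpty || sketchStopWords.contains(term)) return [];
  return term
      .split('-')
      .where((t) => t.isNotEmpty && !sketchStopWords.contains(t))
      .toList();
}

// 4. Tokenizer: pair every surviving term with the index of the raw
//    term it came from.
List<SketchToken> tokenizeSketch(String text) {
  final tokens = <SketchToken>[];
  var position = 0;
  for (final raw in splitTerms(text)) {
    for (final term in filterTerm(filterCharacters(raw))) {
      tokens.add(SketchToken(term, position));
    }
    position++;
  }
  return tokens;
}
```

Calling `tokenizeSketch('The quick-brown fox.')` drops the stopword and splits the hyphenated term, so both halves share the position of the original term.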
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  text_analysis: <latest version>
In your code file add the following import:
import 'package:text_analysis/text_analysis.dart';
Basic English text analysis can be performed by using a TextAnalyzer instance with the default configuration and no token filter:
/// Use a TextAnalyzer instance to tokenize the [text] using the default
/// [English] configuration.
final document = await TextAnalyzer().tokenize(text);
For more complex text analysis:
- implement a TextAnalyzerConfiguration for a different language or for tokenizing non-language documents;
- implement a custom ITextAnalyzer or extend TextAnalyzerBase; and/or
- pass in a TokenFilter function to a TextAnalyzer to manipulate the tokens after tokenization, as shown in the examples.
API #
The package exposes two interfaces:
- the TextAnalyzerConfiguration interface; and
- the ITextAnalyzer interface.
The latest version provides the following implementation classes:
- the English class implements TextAnalyzerConfiguration and provides text analysis configuration properties for the English language;
- the TextAnalyzerBase abstract class implements ITextAnalyzer.tokenize; and
- the TextAnalyzer class extends TextAnalyzerBase and implements ITextAnalyzer.tokenFilter and ITextAnalyzer.configuration as final fields, with their values passed in as optional parameters (with defaults) at initialization.
TextAnalyzerConfiguration Interface #
The TextAnalyzerConfiguration interface exposes language-specific properties and methods used in text analysis:
- a TextAnalyzerConfiguration.sentenceSplitter to split the text at sentence endings (such as periods, exclamation marks and question marks) or line endings;
- a TextAnalyzerConfiguration.termSplitter to split the text into terms;
- a TextAnalyzerConfiguration.characterFilter to remove non-word characters; and
- a TextAnalyzerConfiguration.termFilter to apply a stemmer/lemmatizer or stopword list.
ITextAnalyzer Interface #
The ITextAnalyzer is an interface for a text analyzer class that extracts tokens from text for use in full-text search queries and indexes.
ITextAnalyzer.configuration is a TextAnalyzerConfiguration used by the ITextAnalyzer to tokenize source text.
Provide an ITextAnalyzer.tokenFilter to manipulate tokens or restrict tokenization to tokens that meet criteria for either index or count.
The tokenize function tokenizes source text using the ITextAnalyzer.configuration and then manipulates the output by applying ITextAnalyzer.tokenFilter.
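As a rough illustration of "tokenize, then apply the token filter", the sketch below runs an optional filter over the tokens after they have been produced. All names (`SketchToken`, `SketchTokenFilter`, `tokenizeThenFilter`, `longTermsOnly`) are hypothetical and do not mirror the package's signatures. The sketch is synchronous for brevity, whereas the usage example above awaits the package's tokenize call.

```dart
// Hypothetical token type: a term and its index in the list of terms.
class SketchToken {
  final String term;
  final int position;
  const SketchToken(this.term, this.position);
}

// Hypothetical filter signature: takes the tokens, returns a subset or a
// modified copy. The package's TokenFilter typedef may differ.
typedef SketchTokenFilter = List<SketchToken> Function(
    List<SketchToken> tokens);

// Tokenizes at white-space, then applies the filter if one was supplied.
List<SketchToken> tokenizeThenFilter(String text,
    {SketchTokenFilter? tokenFilter}) {
  final terms = text
      .toLowerCase()
      .split(RegExp(r'\s+'))
      .where((t) => t.isNotEmpty)
      .toList();
  final tokens = [
    for (var i = 0; i < terms.length; i++) SketchToken(terms[i], i)
  ];
  return tokenFilter == null ? tokens : tokenFilter(tokens);
}

// Example filter: keep only tokens whose term is longer than 3 characters.
List<SketchToken> longTermsOnly(List<SketchToken> tokens) =>
    tokens.where((t) => t.term.length > 3).toList();
```

Because the filter runs after tokenization, the surviving tokens keep the positions they had in the unfiltered list.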
English class #
A basic TextAnalyzerConfiguration implementation for English language analysis.
The termFilter applies the following algorithm:
- apply the characterFilter to the term;
- if the resulting term is empty or contained in kStopWords, return an empty collection; else
- insert the filtered term in the return value;
- split the term at commas, periods, hyphens and apostrophes, unless the punctuation is preceded and followed by a number;
- if the term can be split, add the split terms to the return value, unless the (split) terms are in kStopWords or are empty strings.
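The algorithm above might look roughly like the following. This is a sketch under stated assumptions, not the English class's code: `sketchStopWords` is a made-up stopword set standing in for kStopWords, `sketchCharacterFilter` is a trivial stand-in for the real characterFilter, and "preceded and followed by a number" is interpreted as "the punctuation mark sits between two digits".

```dart
// Made-up stopword set standing in for kStopWords.
const sketchStopWords = {'the', 'a', 'of', 's'};

// Trivial stand-in for the character filter (assumption): lower-case only.
String sketchCharacterFilter(String term) => term.trim().toLowerCase();

// True if the code unit at index i exists and is an ASCII digit.
bool _isDigit(String s, int i) {
  if (i < 0 || i >= s.length) return false;
  final c = s.codeUnitAt(i);
  return c >= 0x30 && c <= 0x39;
}

// Splits at , . - and ' except where the mark sits between two digits.
List<String> splitAtPunctuation(String term) {
  final parts = <String>[];
  final buffer = StringBuffer();
  for (var i = 0; i < term.length; i++) {
    final isPunctuation = ",.-'".contains(term[i]);
    final betweenDigits = _isDigit(term, i - 1) && _isDigit(term, i + 1);
    if (isPunctuation && !betweenDigits) {
      parts.add(buffer.toString());
      buffer.clear();
    } else {
      buffer.write(term[i]);
    }
  }
  parts.add(buffer.toString());
  return parts;
}

List<String> sketchTermFilter(String term) {
  final filtered = sketchCharacterFilter(term);
  // Empty terms and stopwords produce no output at all.
  if (filtered.isEmpty || sketchStopWords.contains(filtered)) return [];
  // The filtered term itself is always part of the result.
  final result = <String>[filtered];
  final parts = splitAtPunctuation(filtered);
  if (parts.length > 1) {
    // Add the split terms, skipping stopwords and empty strings.
    result.addAll(
        parts.where((p) => p.isNotEmpty && !sketchStopWords.contains(p)));
  }
  return result;
}
```

Keeping the unsplit term alongside its parts means a query for either the compound term or one of its components can match.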
The characterFilter function:
- returns the term if it can be parsed as a number; else
- converts the term to lower-case;
- changes all quote marks to a single apostrophe (U+0027);
- removes enclosing quote marks;
- changes all dashes to a single standard hyphen;
- removes all non-word characters from the term;
- removes all characters except letters and numbers from the end of the term.
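The steps listed for the character filter can be approximated as a chain of replacements. The function name `sketchCharacterFilter` and every regular expression below are assumptions made for illustration. In particular, the sketch treats only ASCII letters, digits, apostrophes and hyphens as word characters, which may not match what the package does.

```dart
String sketchCharacterFilter(String term) {
  // Numbers are returned untouched.
  if (num.tryParse(term) != null) return term;
  var t = term.toLowerCase();
  // Normalise curly quotes, double quotes and backticks to U+0027.
  t = t.replaceAll(RegExp('[\u2018\u2019\u201C\u201D"`]'), "'");
  // Remove enclosing quote marks.
  t = t.replaceAll(RegExp(r"^'+|'+$"), '');
  // Normalise Unicode dashes and the minus sign to a standard hyphen.
  t = t.replaceAll(RegExp('[\u2010-\u2015\u2212]'), '-');
  // Remove everything that is not a letter, digit, apostrophe or hyphen
  // (ASCII only in this sketch).
  t = t.replaceAll(RegExp(r"[^a-z0-9'-]"), '');
  // Strip any trailing characters that are not letters or digits.
  t = t.replaceAll(RegExp(r'[^a-z0-9]+$'), '');
  return t;
}
```

The order matters: quotes are normalised before the enclosing ones are stripped, so curly and straight quotes are handled by the same rule.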
The sentenceSplitter inserts _kSentenceDelimiter at sentence breaks and then splits the source text into a list of sentence strings. Sentence breaks are characters that match English.reLineEndingSelector or English.reSentenceEndingSelector. Empty sentences are removed.
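The "insert a delimiter, then split" approach can be sketched as below. The delimiter value and both regular expressions are assumptions made for this example; they stand in for _kSentenceDelimiter, English.reSentenceEndingSelector and English.reLineEndingSelector, whose actual definitions are not shown here.

```dart
// Assumed delimiter: any string unlikely to occur in real text.
const sketchSentenceDelimiter = '%~%';

List<String> splitSentences(String text) {
  final marked = text
      // Sentence endings: keep the punctuation, replace the white-space
      // that follows it with the delimiter.
      .replaceAllMapped(RegExp(r'([.!?]+)\s+'),
          (m) => '${m[1]}$sketchSentenceDelimiter')
      // Remaining line endings also count as sentence breaks.
      .replaceAll(RegExp(r'[\r\n]+'), sketchSentenceDelimiter);
  return marked
      .split(sketchSentenceDelimiter)
      .map((s) => s.trim())
      // Empty sentences are removed.
      .where((s) => s.isNotEmpty)
      .toList();
}
```

A naive pattern like this one also breaks after abbreviations such as "e.g.", so a production splitter would need extra rules for those cases.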
TextAnalyzerBase Class #
The TextAnalyzerBase class implements the ITextAnalyzer.tokenize method, which:
- tokenizes source text using the configuration;
- manipulates the output by applying tokenFilter; and, finally
- returns a TextSource enumerating the source text, Sentence collection and Token collection.
TextAnalyzer Class #
The TextAnalyzer class extends TextAnalyzerBase:
- it implements configuration and tokenFilter as final fields passed in as optional parameters at instantiation;
- configuration is used by the TextAnalyzer to tokenize source text and defaults to English.configuration; and
- provide the nullable function tokenFilter if you want to manipulate tokens or restrict tokenization to tokens that meet specific criteria. The default is TextAnalyzer.defaultTokenFilter, which applies the Porter2Stemmer.
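The relationship between an abstract base class that owns the tokenize logic and a concrete class that only supplies final fields can be sketched as follows. Every name here is hypothetical (`SketchConfiguration`, `SketchAnalyzerBase`, `SketchAnalyzer`, `defaultSketchFilter`), tokens are reduced to plain strings, and lower-casing stands in for the Porter2 stemmer, which is not reproduced.

```dart
typedef SketchTermSplitter = List<String> Function(String text);
typedef SketchTokenFilter = List<String> Function(List<String> terms);

// Hypothetical configuration holding a single language-specific function.
class SketchConfiguration {
  final SketchTermSplitter termSplitter;
  const SketchConfiguration(this.termSplitter);
}

List<String> _whitespaceSplitter(String text) =>
    text.split(RegExp(r'\s+')).where((t) => t.isNotEmpty).toList();

// Assumed default configuration, standing in for the English one.
const sketchEnglish = SketchConfiguration(_whitespaceSplitter);

// Stand-in for the default token filter: lower-cases instead of stemming.
List<String> defaultSketchFilter(List<String> terms) =>
    terms.map((t) => t.toLowerCase()).toList();

// The base class implements tokenize in terms of two abstract getters.
abstract class SketchAnalyzerBase {
  SketchConfiguration get configuration;
  SketchTokenFilter? get tokenFilter;

  List<String> tokenize(String text) {
    final terms = configuration.termSplitter(text);
    final filter = tokenFilter;
    return filter == null ? terms : filter(terms);
  }
}

// The concrete class only supplies the two values as final fields, both
// optional with defaults.
class SketchAnalyzer extends SketchAnalyzerBase {
  @override
  final SketchConfiguration configuration;
  @override
  final SketchTokenFilter? tokenFilter;

  SketchAnalyzer(
      {this.configuration = sketchEnglish,
      this.tokenFilter = defaultSketchFilter});
}
```

In this sketch, `SketchAnalyzer()` uses both defaults, while `SketchAnalyzer(tokenFilter: null)` switches the filter off because the field is nullable.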
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation for tokenization.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed term in each document.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and/or manipulates terms by invoking a stemmer and/or lemmatizer.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term position in the source text.
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from the terms in a text source after applying a character filter and term filter.
- vocabulary - the collection of terms indexed from the corpus.
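To make the dictionary, postings and postings list definitions concrete, here is a small sketch that builds them from an in-memory corpus. The function names `buildPostings` and `buildDictionary`, the nested-map layout and the white-space tokenization are all assumptions for illustration. The dictionary here counts total occurrences across the corpus, which is one possible reading of "frequency of occurrence".

```dart
// Positional postings: term -> docId -> positions of the term in that
// document, where a position is the index of the term in the list of all
// terms in the text.
Map<String, Map<String, List<int>>> buildPostings(
    Map<String, String> corpus) {
  final postings = <String, Map<String, List<int>>>{};
  corpus.forEach((docId, text) {
    final terms = text
        .toLowerCase()
        .split(RegExp(r'\s+'))
        .where((t) => t.isNotEmpty)
        .toList();
    for (var i = 0; i < terms.length; i++) {
      postings
          .putIfAbsent(terms[i], () => <String, List<int>>{})
          .putIfAbsent(docId, () => <int>[])
          .add(i);
    }
  });
  return postings;
}

// Dictionary: term -> total number of occurrences in the corpus
// (an assumed interpretation of "frequency").
Map<String, int> buildDictionary(
        Map<String, Map<String, List<int>>> postings) =>
    postings.map((term, docs) => MapEntry(
        term,
        docs.values.fold<int>(0, (sum, positions) => sum + positions.length)));
```

The vocabulary is then the key set of either map, and the postings list of a term in a document is the innermost list of positions.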
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- Cummins, R., "Information Retrieval", course notes, University of Cambridge, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.