text_analysis 0.6.5
Text analyzer that extracts tokens from text for use in full-text search queries and indexes.
text_analysis #
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview #
This library tokenizes text in preparation for constructing a dictionary from a corpus of documents in an information retrieval system.
The tokenization process comprises the following steps:
- a term splitter splits text to a list of terms at appropriate places like white-space and mid-sentence punctuation;
- a character filter manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
- a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
- the tokenizer converts the resulting terms to a collection of tokens that contain the term and a pointer to the position of the term in the source text.
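The four steps above can be sketched in plain Dart. This is a minimal illustration under stated assumptions, not the package's implementation: the names SketchToken, sketchStopWords, splitTerms, filterCharacters, filterTerm and sketchTokenize are hypothetical, the stop-word set is a tiny placeholder, and the position is taken to be the index of the term in the list produced by the term splitter.

```dart
/// Hypothetical stand-in for a token: a term plus its position.
class SketchToken {
  const SketchToken(this.term, this.position);

  /// The term after character and term filtering.
  final String term;

  /// Index of the originating term in the list returned by [splitTerms].
  final int position;
}

/// Placeholder stop-word set (assumption; a real list would be much longer).
const sketchStopWords = {'the', 'a', 'and', 'of'};

/// Step 1: split text into terms at white-space and mid-sentence punctuation.
List<String> splitTerms(String text) => text
    .split(RegExp(r'[\s,;:]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Step 2: change case and remove everything except ASCII letters,
/// digits and hyphens.
String filterCharacters(String term) =>
    term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9\-]'), '');

/// Step 3: drop stop words and split hyphenated terms into their parts.
List<String> filterTerm(String term) {
  if (term.isEmpty || sketchStopWords.contains(term)) return [];
  return term
      .split('-')
      .where((part) => part.isNotEmpty && !sketchStopWords.contains(part))
      .toList();
}

/// Step 4: convert the surviving terms to tokens that keep their position.
List<SketchToken> sketchTokenize(String text) {
  final tokens = <SketchToken>[];
  final terms = splitTerms(text);
  for (var i = 0; i < terms.length; i++) {
    for (final term in filterTerm(filterCharacters(terms[i]))) {
      tokens.add(SketchToken(term, i));
    }
  }
  return tokens;
}
```

In this sketch both parts of a hyphenated term share the position of the original term, so a phrase query could still locate them relative to their neighbours.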
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  text_analysis: <latest version>
In your code file add the following import:
import 'package:text_analysis/text_analysis.dart';
Basic English text analysis can be performed by using a TextAnalyzer instance with the default configuration and no token filter:
/// Use a TextAnalyzer instance to tokenize the [text] using the default
/// [English] configuration.
final document = await TextAnalyzer().tokenize(text);
For more complex text analysis:
- implement a TextAnalyzerConfiguration for a different language or for tokenizing non-language documents;
- implement a custom ITextAnalyzer or extend TextAnalyzerBase; and/or
- pass in a TokenFilter function to a TextAnalyzer to manipulate the tokens after tokenization as shown in the examples.
API #
The key members of the text_analysis library are briefly described in this section. Please refer to the documentation for details.
Type definitions #
The API uses the following function type definitions and type aliases to improve code readability:
- SourceText, FieldName and Term are all aliases for the Dart core type String when used in different contexts;
- StopWords is an alias for Set<String>;
- CharacterFilter is a function that manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
- JsonTokenizer is a function that returns a Token collection from the fields in a JSON document hashmap of FieldName to value;
- SentenceSplitter is a function that returns a list of sentences from SourceText. In English, the SourceText is split at sentence-ending marks such as periods, question marks and exclamation marks;
- TermFilter is a function that manipulates a Term collection by splitting compound or hyphenated terms or applying stemming and lemmatization. The TermFilter can also filter out stopwords;
- TermSplitter is a function that splits SourceText to an ordered list of Term at appropriate places like white-space and mid-sentence punctuation;
- TokenFilter is a function that returns a subset of a Token collection, preserving its sort order; and
- Tokenizer is a function that converts SourceText to a Token collection, preserving the order of the Term instances.
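As a rough illustration of how such aliases read in Dart, the sketch below declares the four aliases named above plus two function types. The aliases follow the text; the function shapes (SketchCharacterFilter and SketchTermSplitter) and the two example values are assumptions made for illustration, and the library's actual signatures may differ (for example, they may be asynchronous).

```dart
// Aliases as described in the text above.
typedef SourceText = String;
typedef FieldName = String;
typedef Term = String;
typedef StopWords = Set<String>;

// Assumed shapes for two of the function types. These are hypothetical
// stand-ins, not the library's declarations.
typedef SketchCharacterFilter = Term Function(Term term);
typedef SketchTermSplitter = List<Term> Function(SourceText source);

/// Example character filter: lower-cases the term.
final SketchCharacterFilter lowerCase = (term) => term.toLowerCase();

/// Example term splitter: splits at runs of white-space and drops
/// empty strings.
final SketchTermSplitter whitespaceSplitter = (source) =>
    source.split(RegExp(r'\s+')).where((term) => term.isNotEmpty).toList();
```

Because the aliases all resolve to String, they add no type safety; their value is in making signatures self-describing.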
Object models #
The text_analysis library includes the following object-model classes:
- a Token represents a Term present in a TextSource with its position and optional field name;
- a Sentence represents a SourceText not containing sentence-ending punctuation such as periods, question marks and exclamation marks, except where used in tokens, identifiers or other terms; and
- a TextSource represents a SourceText that has been analyzed to enumerate Sentence and Token collections.
Interfaces #
The text_analysis library exposes two interfaces:
- the TextAnalyzerConfiguration interface; and
- the ITextAnalyzer interface.
TextAnalyzerConfiguration Interface
The TextAnalyzerConfiguration interface exposes language-specific properties and methods used in text analysis:
- TextAnalyzerConfiguration.sentenceSplitter splits the text at sentence endings such as periods, exclamation marks and question marks, or at line endings;
- TextAnalyzerConfiguration.termSplitter splits the text into terms;
- TextAnalyzerConfiguration.characterFilter removes non-word characters; and
- TextAnalyzerConfiguration.termFilter applies a stemmer/lemmatizer or stopword list.
ITextAnalyzer Interface
The ITextAnalyzer is an interface for a text analyzer class that extracts tokens from text for use in full-text search queries and indexes:
- ITextAnalyzer.configuration is a TextAnalyzerConfiguration used by the ITextAnalyzer to tokenize source text;
- provide an ITextAnalyzer.tokenFilter to manipulate tokens or restrict tokenization to tokens that meet criteria for either index or count;
- the ITextAnalyzer.tokenize function tokenizes text to a TextSource object that contains all the Tokens in the text; and
- the ITextAnalyzer.tokenizeJson function tokenizes a JSON hashmap to a TextSource object that contains all the Tokens in the document.
Implementation classes #
The latest version provides the following implementation classes:
- the English class implements TextAnalyzerConfiguration and provides text analysis configuration properties for the English language;
- the TextAnalyzerBase abstract class implements ITextAnalyzer.tokenize; and
- the TextAnalyzer class extends TextAnalyzerBase and implements ITextAnalyzer.tokenFilter and ITextAnalyzer.configuration as final fields with their values passed in as (optional) parameters (with defaults) at initialization.
English class
A basic TextAnalyzerConfiguration implementation for English language analysis.
The termFilter applies the following algorithm:
- apply the characterFilter to the term;
- if the resulting term is empty or contained in kStopWords, return an empty collection; else
- insert the filtered term in the return value;
- split the term at commas, periods, hyphens and apostrophes unless preceded and followed by a number;
- if the term can be split, add the split terms to the return value, unless the (split) terms are in kStopWords or are empty strings.
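The algorithm above might look roughly like the following sketch. It is not the package's code: demoStopWords stands in for kStopWords, sketchTermFilter and _splitPoint are hypothetical names, the character filter is passed in as a plain function, and the split is assumed to operate on the filtered term.

```dart
/// Placeholder for the package's kStopWords (assumption).
const demoStopWords = {'the', 'a', 'of', 's'};

/// Matches a comma, period, hyphen or apostrophe, except where the
/// character has a digit on both sides (as in "3.14" or "1,000").
final _splitPoint = RegExp(r"(?<![0-9])[,.\-']|[,.\-'](?![0-9])");

/// Returns the filtered term followed by its parts, or an empty list if
/// the filtered term is empty or a stop word.
List<String> sketchTermFilter(
    String term, String Function(String) characterFilter) {
  final filtered = characterFilter(term);
  if (filtered.isEmpty || demoStopWords.contains(filtered)) return [];
  // The whole filtered term always goes into the result first.
  final result = <String>[filtered];
  final parts = filtered.split(_splitPoint);
  if (parts.length > 1) {
    // Add the parts, skipping empty strings and stop words.
    result.addAll(parts.where(
        (part) => part.isNotEmpty && !demoStopWords.contains(part)));
  }
  return result;
}
```

Keeping the unsplit term alongside its parts lets a query for either the compound or one of its components find the same position.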
The characterFilter function:
- returns the term if it can be parsed as a number; else
- converts the term to lower-case;
- changes all quote marks to a single apostrophe (U+0027);
- removes enclosing quote marks;
- changes all dashes to a single standard hyphen;
- removes all non-word characters from the term;
- removes all characters except letters and numbers from the end of the term.
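A minimal sketch of these steps follows. The function name sketchCharacterFilter is hypothetical, the regular expressions are assumptions rather than the patterns used by the English class, "non-word" is taken to mean anything other than ASCII letters, digits, hyphens and apostrophes, and "enclosing quote marks" is simplified to leading and trailing apostrophes.

```dart
/// Applies the character-filter steps listed above (ASCII-only sketch).
String sketchCharacterFilter(String term) {
  // Numbers are returned unchanged.
  if (num.tryParse(term) != null) return term;
  var result = term.toLowerCase();
  // Normalize curly quotes, double quotes and backticks to U+0027.
  result = result.replaceAll(RegExp('[\u2018\u2019\u201C\u201D"`]'), "'");
  // Remove leading and trailing apostrophes (simplified "enclosing" quotes).
  result = result.replaceAll(RegExp(r"^'+|'+$"), '');
  // Normalize Unicode dashes and the minus sign to the standard hyphen.
  result = result.replaceAll(RegExp('[\u2010-\u2015\u2212]'), '-');
  // Remove characters other than letters, digits, hyphens and apostrophes.
  result = result.replaceAll(RegExp(r"[^a-z0-9'\-]"), '');
  // Strip anything other than letters and digits from the end of the term.
  result = result.replaceAll(RegExp(r'[^a-z0-9]+$'), '');
  return result;
}
```

A filter for real text would need Unicode-aware letter classes; the ASCII ranges here are only to keep the example short.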
The sentenceSplitter inserts _kSentenceDelimiter at sentence breaks and then splits the source text into a list of sentence strings (sentence breaks are characters that match English.reLineEndingSelector or English.reSentenceEndingSelector). Empty sentences are removed.
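The delimiter-then-split approach can be sketched as below. The delimiter value and both regular expressions are assumptions standing in for _kSentenceDelimiter, English.reLineEndingSelector and English.reSentenceEndingSelector, whose actual definitions are not shown here; all names in the block are hypothetical.

```dart
/// Hypothetical delimiter: a control character unlikely to occur in text.
const sketchDelimiter = '\u001E';

/// Assumed selector: sentence-ending marks followed by white-space or the
/// end of the text, so the period in "3.14" is left alone.
final sketchSentenceEnding = RegExp(r'[.!?]+(?=\s|$)');

/// Assumed selector: one or more line-ending characters.
final sketchLineEnding = RegExp(r'[\r\n]+');

/// Marks sentence breaks with the delimiter, splits on it and removes
/// empty sentences.
List<String> sketchSentenceSplitter(String source) => source
    .replaceAll(sketchSentenceEnding, sketchDelimiter)
    .replaceAll(sketchLineEnding, sketchDelimiter)
    .split(sketchDelimiter)
    .map((sentence) => sentence.trim())
    .where((sentence) => sentence.isNotEmpty)
    .toList();
```

Sentence endings are replaced before line endings in this sketch because the look-ahead relies on the white-space that follows the punctuation still being present.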
TextAnalyzerBase Class
The TextAnalyzerBase class implements the ITextAnalyzer.tokenize method, which:
- tokenizes source text using the configuration;
- manipulates the output by applying tokenFilter; and, finally
- returns a TextSource enumerating the source text, Sentence collection and Token collection.
TextAnalyzer Class
The TextAnalyzer class extends TextAnalyzerBase:
- it implements configuration and tokenFilter as final fields passed in as optional parameters at instantiation;
- configuration is used by the TextAnalyzer to tokenize source text and defaults to English.configuration; and
- provide the nullable function tokenFilter if you want to manipulate tokens or restrict tokenization to tokens that meet specific criteria. The default is TextAnalyzer.defaultTokenFilter, which applies the Porter2Stemmer.
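To illustrate what a token filter does, the sketch below keeps only tokens whose term meets a minimum length while preserving their order. DemoToken and minLengthFilter are hypothetical names; the library's Token class and TokenFilter signature may differ (for instance, the real function type may be asynchronous), so treat this as a shape to adapt rather than code to pass to TextAnalyzer directly.

```dart
/// Hypothetical token: a term and its position in the source text.
class DemoToken {
  const DemoToken(this.term, this.position);

  final String term;
  final int position;
}

/// Returns the subset of [tokens] whose term has at least [minLength]
/// characters. The relative order of the tokens is preserved.
List<DemoToken> minLengthFilter(List<DemoToken> tokens,
        {int minLength = 3}) =>
    tokens.where((token) => token.term.length >= minLength).toList();
```

Positions are left untouched, so the surviving tokens still point at their original place in the source text.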
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation of tokenization.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index that also includes the positions of the indexed term in each document.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term position in the source text.
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from the terms in a text source after applying a character filter and term filter.
- vocabulary - the collection of terms indexed from the corpus.
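The relationship between the dictionary, postings and postings list can be made concrete with a small sketch. The function names and the nested-map layout are assumptions chosen for illustration, and the dictionary frequency is read here as the number of documents containing the term, which is only one possible interpretation of the definition above.

```dart
/// Builds positional postings: term -> docId -> positions of the term in
/// that document's term list.
Map<String, Map<String, List<int>>> buildPostings(
    Map<String, List<String>> termsByDocId) {
  final postings = <String, Map<String, List<int>>>{};
  termsByDocId.forEach((docId, terms) {
    for (var position = 0; position < terms.length; position++) {
      postings
          .putIfAbsent(terms[position], () => <String, List<int>>{})
          .putIfAbsent(docId, () => <int>[])
          .add(position);
    }
  });
  return postings;
}

/// Builds a dictionary: term -> number of documents that contain the term
/// (assumed reading of "frequency of occurrence").
Map<String, int> buildDictionary(
        Map<String, Map<String, List<int>>> postings) =>
    postings.map((term, docs) => MapEntry(term, docs.length));
```

Each inner list is a postings list in the sense defined above: the positions of one term in one document.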
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.