Tokenize text, compute document readability and compare terms in Natural Language Processing.
Overview
The text_analysis package provides methods to tokenize text, compute readability scores for a document and evaluate the similarity of terms. It is intended to be used in Natural Language Processing (NLP) as part of an information retrieval system.
It is split into three libraries:
- text_analysis is the core library that exports the tokenization, analysis and string similarity functions;
- extensions exports extension methods that are also provided as static methods of the TermSimilarity class; and
- type_definitions exports all the typedefs used in this package.
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Tokenization
Tokenization comprises the following steps:
- a term splitter splits text into a list of terms at appropriate places such as white-space and mid-sentence punctuation;
- a character filter manipulates terms prior to tokenization (e.g. changing case and/or removing non-word characters);
- a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
- the tokenizer converts terms to a collection of tokens that contain tokenized versions of the term and a pointer to the position of the tokenized term (n-gram) in the source text. The tokens are generated for keywords, terms and/or n-grams, depending on the TokenizingStrategy selected. The desired n-gram range can be passed in when tokenizing the text or document.
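The four steps above can be sketched in plain Dart without the package. This is a minimal illustration only: SketchToken, splitTerms, filterCharacters, filterTerm and the tiny stopword set are hypothetical names invented for this example, not the package's API, and a real analyzer also handles stemming, n-grams and keywords.

```dart
/// A token: the processed term plus the position of its source term.
class SketchToken {
  final String term;
  final int termPosition;
  const SketchToken(this.term, this.termPosition);

  @override
  String toString() => '$term@$termPosition';
}

// Hypothetical stopword list, far smaller than a real one.
const stopWords = {'the', 'a', 'of', 'and', 'is'};

/// Term splitter: split at white-space and mid-sentence punctuation.
List<String> splitTerms(String text) => text
    .split(RegExp(r'[\s,;:]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Character filter: lower-case and strip non-word characters
/// (hyphens and apostrophes are kept).
String filterCharacters(String term) =>
    term.toLowerCase().replaceAll(RegExp(r"[^a-z0-9\-']"), '');

/// Term filter: drop stopwords and split hyphenated terms.
List<String> filterTerm(String term) {
  if (term.isEmpty || stopWords.contains(term)) return const [];
  return term.split('-').where((part) => part.isNotEmpty).toList();
}

/// Tokenizer: runs the steps in order and records each term's position.
List<SketchToken> tokenize(String text) {
  final tokens = <SketchToken>[];
  final terms = splitTerms(text);
  for (var position = 0; position < terms.length; position++) {
    final filtered = filterCharacters(terms[position]);
    for (final term in filterTerm(filtered)) {
      tokens.add(SketchToken(term, position));
    }
  }
  return tokens;
}

void main() {
  print(tokenize('The egg-laying platypus is a mammal.'));
  // prints: [egg@1, laying@1, platypus@2, mammal@5]
}
```

Both parts of the hyphenated term share position 1 because the position refers to the term in the split source text, not to the index of the token in the output.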
Readability
The TextDocument enumerates a text document's paragraphs, sentences, terms and tokens and computes readability measures:
- the average number of words in each sentence;
- the average number of syllables per word;
- the Flesch reading ease score, a readability measure calculated from sentence length and word length on a 100-point scale; and
- the Flesch-Kincaid grade level, a readability measure relative to U.S. school grade level.
The TextDocument also includes a co-occurrence graph generated using the Rapid Automatic Keyword Extraction (RAKE) algorithm, from which the keywords (and keyword scores) can be obtained.
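Both Flesch measures are derived from the same two averages. The sketch below uses the standard published coefficients for the two formulas; the function names and the sample counts are made up for illustration, and the package's own syllable counting will determine the scores it actually reports.

```dart
/// Flesch reading ease: higher scores indicate easier text.
double fleschReadingEase(int words, int sentences, int syllables) =>
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);

/// Flesch-Kincaid grade level: approximate U.S. school grade.
double fleschKincaidGrade(int words, int sentences, int syllables) =>
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;

void main() {
  // Hypothetical counts: 100 words, 5 sentences, 150 syllables.
  print(fleschReadingEase(100, 5, 150).toStringAsFixed(1)); // prints 59.6
  print(fleschKincaidGrade(100, 5, 150).toStringAsFixed(1)); // prints 9.9
}
```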
String Comparison
The following measures of term similarity are provided as extensions on String:
- Damerau–Levenshtein distance is the minimum number of single-character edits (transpositions, insertions, deletions or substitutions) required to change one term into another;
- edit similarity is a normalized measure of Damerau–Levenshtein distance on a scale of 0.0 to 1.0, calculated by dividing the difference between the maximum edit distance (sum of the length of the two terms) and the computed editDistance, by the maximum edit distance;
- length distance returns the absolute value of the difference in length between two terms;
- character similarity returns the similarity of two terms as it relates to the collection of unique characters in each term on a scale of 0.0 to 1.0;
- length similarity returns the similarity in length between two terms on a scale of 0.0 to 1.0 on a log scale (1 - the log of the ratio of the term lengths); and
- Jaccard similarity measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
The TermSimilarity class enumerates all the similarity measures of two terms and provides the TermSimilarity.similarity property that combines the four similarity measures into a single value.
The TermSimilarity class also provides a function for splitting terms into k-grams, used in spell correction algorithms.
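To make the edit distance and edit similarity definitions concrete, here is a self-contained sketch. It implements the optimal string alignment variant of Damerau–Levenshtein distance (adjacent transpositions count as one edit, but a substring is never edited twice), which may differ from the package's implementation in rare cases. The function names mirror the terms used above but are written from scratch for this example.

```dart
import 'dart:math' as math;

/// Edit distance (optimal string alignment variant), case-insensitive.
int editDistance(String a, String b) {
  final s = a.toLowerCase();
  final t = b.toLowerCase();
  // d[i][j] is the distance between the first i characters of s
  // and the first j characters of t.
  final d =
      List.generate(s.length + 1, (_) => List<int>.filled(t.length + 1, 0));
  for (var i = 0; i <= s.length; i++) {
    d[i][0] = i;
  }
  for (var j = 0; j <= t.length; j++) {
    d[0][j] = j;
  }
  for (var i = 1; i <= s.length; i++) {
    for (var j = 1; j <= t.length; j++) {
      final cost = s[i - 1] == t[j - 1] ? 0 : 1;
      var best = math.min(
          math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
      // Transposition of two adjacent characters.
      if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]) {
        best = math.min(best, d[i - 2][j - 2] + 1);
      }
      d[i][j] = best;
    }
  }
  return d[s.length][t.length];
}

/// Edit similarity as described above: (max - distance) / max, where
/// max is the sum of the lengths of the two terms.
double editSimilarity(String a, String b) {
  final maxDistance = a.length + b.length;
  if (maxDistance == 0) return 1.0;
  return (maxDistance - editDistance(a, b)) / maxDistance;
}

void main() {
  print(editDistance('bodrer', 'border')); // prints 1
  print(editSimilarity('bodrer', 'border').toStringAsFixed(3)); // prints 0.917
}
```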
Usage
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  text_analysis: <latest version>
In your code file add the text_analysis library import. This will also import the Porter2Stemmer class from the porter_2_stemmer package.
// import the core classes
import 'package:text_analysis/text_analysis.dart';
To use the package's extensions and/or type definitions, also add any of the following imports:
// import the implementation classes, if needed
import 'package:text_analysis/implementation.dart';
// import the typedefs, if needed
import 'package:text_analysis/type_definitions.dart';
// import the extensions, if needed
import 'package:text_analysis/extensions.dart';
Basic English tokenization can be performed by using the English.analyzer static const instance with no token filter:
// Use the static English.analyzer instance to tokenize the text using the
// English analyzer (readabilityExample is assumed to be a String that holds
// the source text).
final tokens = await English.analyzer.tokenizer(readabilityExample,
    strategy: TokenizingStrategy.all, nGramRange: NGramRange(1, 2));
To analyze text or a document, hydrate a TextDocument to obtain the text statistics and readability scores:
// get some sample text
final sample =
    'The Australian platypus is seemingly a hybrid of a mammal and reptilian creature.';
// hydrate the TextDocument
final textDoc = await TextDocument.analyze(
    sourceText: sample,
    analyzer: English.analyzer,
    nGramRange: NGramRange(1, 3));
// print the `Flesch reading ease score`
print(
    'Flesch Reading Ease: ${textDoc.fleschReadingEaseScore().toStringAsFixed(1)}');
// prints "Flesch Reading Ease: 37.5"
To compare terms, call the desired extension on the term, or the static method from the TermSimilarity class:
// define a misspelt term
const term = 'bodrer';
// a collection of auto-correct options
const candidates = [
  'bord',
  'board',
  'broad',
  'boarder',
  'border',
  'brother',
  'bored'
];
// get a list of the terms ordered by descending similarity
final matches = term.matches(candidates);
// same as TermSimilarity.matches(term, candidates)
// print matches
print('Ranked matches: $matches');
// prints:
// Ranked matches: [border, boarder, bored, brother, board, bord, broad]
Please see the examples for more details.
API
The key interfaces of the text_analysis library are briefly described in this section. Please refer to the documentation for details.
The API contains a fair amount of boiler-plate, but we aim to make the code as readable, extendable and re-usable as possible:
- We use an interface > implementation mixin > base-class > implementation class pattern:
  - the interface is an abstract class that exposes fields and methods but contains no implementation code. The interface may expose a factory constructor that returns an implementation class instance;
  - the implementation mixin implements the interface class methods, but not the input fields; and
  - the base-class is an abstract class with the implementation mixin and exposes a default, unnamed generative const constructor for sub-classes. The intention is that implementation classes extend the base class, overriding the interface input fields with final properties passed in via a const generative constructor.
- To maximise performance of the indexers the API performs lookups in nested hashmaps of Dart core types. To improve code legibility the API makes use of type aliases, callback function definitions and extensions. The typedefs and extensions are not exported by the text_analysis library, but can be found in the type_definitions, implementation and extensions mini-libraries. Import these libraries separately if needed.
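The interface > implementation mixin > base-class > implementation class pattern described above can be illustrated with a toy example. Greeter and its related classes are hypothetical and exist only to show the shape of the pattern; they are not part of the package.

```dart
/// Interface: exposes fields and methods, contains no implementation code.
abstract class Greeter {
  /// Input field, left unimplemented by the mixin.
  String get name;

  String greet();

  /// Factory constructor that returns an implementation class instance.
  factory Greeter(String name) = _GreeterImpl;
}

/// Implementation mixin: implements the methods, but not the input fields.
mixin GreeterMixin implements Greeter {
  @override
  String greet() => 'Hello, $name';
}

/// Base class: abstract, mixes in the implementation and exposes a
/// default, unnamed generative const constructor for sub-classes.
abstract class GreeterBase with GreeterMixin {
  const GreeterBase();
}

/// Implementation class: overrides the input field with a final property
/// passed in via a const generative constructor.
class _GreeterImpl extends GreeterBase {
  @override
  final String name;

  const _GreeterImpl(this.name);
}

void main() {
  print(Greeter('world').greet()); // prints "Hello, world"
}
```

A consumer who needs different behaviour can extend GreeterBase and supply only the fields, or mix GreeterMixin into an unrelated class hierarchy.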
TermSimilarity
The TermSimilarity class provides the following measures of similarity between two terms:
- characterSimilarity returns the similarity of two terms as it relates to the collection of unique characters in each term on a scale of 0.0 to 1.0;
- editDistance returns the Damerau–Levenshtein distance, the minimum number of single-character edits (transpositions, insertions, deletions or substitutions) required to change one term into another;
- editSimilarity returns a normalized measure of Damerau–Levenshtein distance on a scale of 0.0 to 1.0, calculated by dividing the difference between the maximum edit distance (sum of the length of the two terms) and the computed editDistance, by the maximum edit distance;
- lengthDistance returns the absolute value of the difference in length between two terms;
- lengthSimilarity returns the similarity in length between two terms on a scale of 0.0 to 1.0 on a log scale (1 - the log of the ratio of the term lengths); and
- jaccardSimilarity returns the Jaccard Similarity Index of two terms.
To compare one term with a collection of other terms, the following static methods are also provided:
- editDistanceMap returns a hashmap of terms to their editDistance with a term;
- editSimilarityMap returns a hashmap of terms to their editSimilarity with a term;
- lengthSimilarityMap returns a hashmap of terms to their lengthSimilarity with a term;
- jaccardSimilarityMap returns a hashmap of terms to their Jaccard Similarity Index with a term;
- termSimilarityMap returns a hashmap of terms to their termSimilarity with a term;
- termSimilarities, editSimilarities, characterSimilarities, lengthSimilarities and jaccardSimilarities all return a list of SimilarityIndex values for candidate terms; and
- matches returns the best matches from terms for a term, in descending order of term similarity (best match first).
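The map and ranking helpers listed above share one idea: score every candidate against a term, then sort by score. The sketch below shows that idea in plain Dart. It is not the package's code: similarityMap, rank and characterJaccard are hypothetical names, and the stand-in measure (Jaccard index of the sets of unique characters) is an assumption chosen for brevity, so its ranking will not necessarily match the package's combined term similarity.

```dart
/// Stand-in similarity: Jaccard index of the sets of unique characters.
double characterJaccard(String a, String b) {
  final x = a.toLowerCase().split('').toSet();
  final y = b.toLowerCase().split('').toSet();
  if (x.isEmpty && y.isEmpty) return 1.0;
  return x.intersection(y).length / x.union(y).length;
}

/// Maps every candidate to its similarity with [term].
Map<String, double> similarityMap(String term, Iterable<String> candidates,
        double Function(String, String) similarity) =>
    {for (final candidate in candidates) candidate: similarity(term, candidate)};

/// Returns the candidates in descending order of similarity (best first).
List<String> rank(String term, Iterable<String> candidates,
    double Function(String, String) similarity) {
  final scores = similarityMap(term, candidates, similarity);
  return scores.keys.toList()
    ..sort((a, b) => scores[b]!.compareTo(scores[a]!));
}

void main() {
  final ranked =
      rank('bodrer', ['broad', 'border', 'brother'], characterJaccard);
  print(ranked); // prints [border, broad, brother]
}
```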
String comparisons are NOT case-sensitive.
The TermSimilarity class relies on extension methods that can be imported from the extensions library.
TextAnalyzer
The TextAnalyzer interface exposes language-specific properties and methods used in text analysis:
- characterFilter is a function that manipulates text prior to stemming and tokenization;
- termFilter is a filter function that returns a collection of terms from a term. It returns an empty collection if the term is to be excluded from analysis, multiple terms if the term is split (e.g. at hyphens), and/or modified term(s) if, for example, a stemmer algorithm is applied;
- termSplitter returns a list of terms from text;
- sentenceSplitter splits text into a list of sentences at sentence and line endings;
- paragraphSplitter splits text into a list of paragraphs at line endings;
- stemmer is a language-specific function that returns the stem of a term;
- lemmatizer is a language-specific function that returns the lemma of a term;
- tokenizer and jsonTokenizer are callbacks that return a collection of tokens from text or a document;
- keywordExtractor is a splitter function that returns an ordered collection of keyword phrases from text;
- termExceptions is a hashmap of words to token terms for special words that should not be re-capitalized, stemmed or lemmatized;
- stopWords are terms that commonly occur in a language and that do not add material value to the analysis of text; and
- syllableCounter returns the number of syllables in a word or text.
The LatinLanguageAnalyzer implements the TextAnalyzer interface methods for languages that use the Latin/Roman alphabet/character set.
The English implementation of TextAnalyzer is included in this library and mixes in the LatinLanguageAnalyzerMixin.
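As a rough illustration of what the paragraphSplitter and sentenceSplitter callbacks do, the following sketch splits text with regular expressions. The function names and patterns are assumptions made for this example; a production splitter for a real language also has to cope with abbreviations such as "U.S.", decimal numbers and quotations, which this sketch does not.

```dart
/// Splits text into paragraphs at line endings.
List<String> splitParagraphs(String text) => text
    .split(RegExp(r'[\r\n]+'))
    .map((paragraph) => paragraph.trim())
    .where((paragraph) => paragraph.isNotEmpty)
    .toList();

/// Splits text into sentences at sentence-ending punctuation and at
/// line endings.
List<String> splitSentences(String text) => RegExp(r'[^.!?\r\n]+[.!?]*')
    .allMatches(text)
    .map((match) => match.group(0)!.trim())
    .where((sentence) => sentence.isNotEmpty)
    .toList();

void main() {
  const text = 'Platypuses lay eggs. Do they swim? Yes!\nThey are mammals.';
  print(splitParagraphs(text).length); // prints 2
  print(splitSentences(text));
  // prints [Platypuses lay eggs., Do they swim?, Yes!, They are mammals.]
}
```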
TextDocument
The TextDocument object model enumerates a text document's paragraphs, sentences, terms, keywords, n-grams, syllable count and tokens and provides functions that return text analysis measures:
- averageSentenceLength is the average number of words in sentences;
- averageSyllableCount is the average number of syllables per word in terms;
- wordCount is the total number of words in the sourceText;
- fleschReadingEaseScore is a readability measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document; and
- fleschKincaidGradeLevel is a readability measure relative to U.S. school grade level. It is also calculated from sentence length and word length.
The TextDocumentMixin implements the averageSentenceLength, averageSyllableCount, wordCount, fleschReadingEaseScore and fleschKincaidGradeLevel methods.
A TextDocument can be hydrated with the unnamed factory constructor or by using the analyze or analyzeJson static methods. Alternatively, extend the TextDocumentBase class.
Definitions
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- cosine similarity - similarity of two vectors measured as the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their Euclidean lengths (from Wikipedia).
- character filter - filters characters from text in preparation for tokenization.
- Damerau–Levenshtein distance - a metric for measuring the edit distance between two terms by counting the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one term into the other (from Wikipedia).
- dictionary (in an index) - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- document frequency (dFt) - the number of documents in the corpus that contain a term.
- edit distance - a measure of how dissimilar two terms are by counting the minimum number of operations required to transform one string into the other (from Wikipedia).
- etymology - the study of the history of the form of words and, by extension, the origin and evolution of their semantic meaning across time (from Wikipedia).
- Flesch reading ease score - a readability measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document (from Wikipedia).
- Flesch-Kincaid grade level - a readability measure relative to U.S. school grade level. It is also calculated from sentence length and word length (from Wikipedia).
- IETF language tag - a standardized code or tag that is used to identify human languages on the Internet (from Wikipedia).
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
- index-elimination - selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
- inverse document frequency (iDft) - a normalized measure of how rare a term is in the corpus. It is defined as log (N / dft), where N is the total number of documents in the corpus. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
- Jaccard index - measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets (from Wikipedia).
- JSON - an acronym for "JavaScript Object Notation", a common format for persisting data. A JSON object is represented in Dart as a Map<String, dynamic>.
- k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
- lemma or lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- n-gram (sometimes also called Q-gram) - a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles (from Wikipedia).
- Natural language processing (NLP) - a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data (from Wikipedia).
- Part-of-Speech (PoS) tagging - the task of labelling every word in a sequence of words with a tag indicating what lexical syntactic category it assumes in the given sequence (from Wikipedia).
- Phonetic transcription - the visual representation of speech sounds (or phones) by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the International Phonetic Alphabet (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In a positional index, the postings also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text. In a zoned index, the postings lists record the positions of each term in the text of a zone.
- stem or stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form) (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and/or manipulates terms by invoking a stemmer and/or lemmatizer.
- term expansion - finding terms with similar spelling (e.g. spelling correction) or synonyms for a term.
- term frequency (Ft) - the frequency of a term in an index or indexed object.
- term position - the zero-based index of a term in an ordered array of terms tokenized from the corpus.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
- zone - the field or zone of a document that a term occurs in, used for parametric indexes or where scoring and ranking of search results attribute a higher score to documents that contain a term in a specific zone (e.g. the title rather than the body of a document).
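The k-gram definition above can be checked with a few lines of Dart. The kGrams function below is written for this example (the package exposes its own k-gram splitter, whose name and signature are not shown here); it pads the term with "$" at both ends and slides a window of k characters across it.

```dart
/// Returns the k-grams of [term], using "$" to mark the start and end
/// of the term. Terms shorter than the window yield one padded k-gram.
Set<String> kGrams(String term, [int k = 3]) {
  final padded = '\$${term.toLowerCase()}\$';
  if (padded.length <= k) return {padded};
  return {
    for (var i = 0; i + k <= padded.length; i++) padded.substring(i, i + k)
  };
}

void main() {
  print(kGrams('castle')); // prints {$ca, cas, ast, stl, tle, le$}
}
```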
References
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
- Wikipedia (4), "Synonym", from Wikipedia, the free encyclopedia
- Wikipedia (5), "Jaccard Index", from Wikipedia, the free encyclopedia
- Wikipedia (6), "Flesch–Kincaid readability tests", from Wikipedia, the free encyclopedia
- Wikipedia (7), "Edit distance", from Wikipedia, the free encyclopedia
- Wikipedia (8), "Damerau–Levenshtein distance", from Wikipedia, the free encyclopedia
- Wikipedia (9), "Natural language processing", from Wikipedia, the free encyclopedia
- Wikipedia (10), "IETF language tag", from Wikipedia, the free encyclopedia
- Wikipedia (11), "Phonetic transcription", from Wikipedia, the free encyclopedia
- Wikipedia (12), "Etymology", from Wikipedia, the free encyclopedia
- Wikipedia (13), "Part-of-speech tagging", from Wikipedia, the free encyclopedia
- Wikipedia (14), "N-gram", from Wikipedia, the free encyclopedia
- Wikipedia (15), "Cosine similarity", from Wikipedia, the free encyclopedia
Issues
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
Libraries
- extensions - Exports the extension methods exposed by this package. Also exports the extensions from the porter_2_stemmer package.
- implementation - Dart text analyzer that extracts tokens from JSON documents for use in information retrieval systems.
- text_analysis - Dart text analyzer that extracts tokens from JSON documents for use in information retrieval systems.
- type_definitions - Exports all the type definitions used in the text_analysis library.