text_analysis 0.12.0-1


Text analyzer that extracts tokens from text for use in full-text search queries and indexes.

GM Consult Pty Ltd

Tokenize text, compute document readability and term similarity. #

THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.


Overview #

The text_analysis library provides methods to tokenize text, compute readability scores for a document and compute the similarity of words. It is intended for use in Natural Language Processing (NLP) as part of an information retrieval system.

Refer to the references to learn more about information retrieval systems and the theory behind this library.

Tokenization

Tokenization comprises the following steps:

  • a term splitter splits text into a list of terms at appropriate places, such as white-space and mid-sentence punctuation;
  • a character filter manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
  • a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
  • the tokenizer converts the resulting terms to a collection of tokens that contain the term and a pointer to the position of the term in the source text.
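
The sketch below illustrates how these steps compose. It is illustrative only; the function bodies are simplified stand-ins for the analyzer's termSplitter, characterFilter and termFilter, not the library's actual implementations:

  // Illustrative only: simplified stand-ins for the analyzer's
  // termSplitter, characterFilter and termFilter functions.
  List<MapEntry<String, int>> tokenizeSimple(String text) {
    // 1. term splitter: split at white-space and mid-sentence punctuation.
    final terms = text.split(RegExp(r'[\s,;:.!?]+'));
    final tokens = <MapEntry<String, int>>[];
    var position = 0;
    for (final term in terms) {
      // 2. character filter: change case and remove non-word characters.
      final filtered = term.toLowerCase().replaceAll(RegExp(r"[^a-z0-9']"), '');
      // 3. term filter: exclude empty terms (a real filter would also split
      //    hyphenated or compound terms, apply stemming and drop stopwords).
      if (filtered.isNotEmpty) {
        // 4. tokenizer: emit the term with a pointer to its position.
        tokens.add(MapEntry(filtered, position));
      }
      position++;
    }
    return tokens;
  }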

A String extension method Set<KGram> kGrams([int k = 2]) parses a set of k-grams of length k from a term; the default k-gram length is 2 (bi-gram).
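
For example, tri-grams can be requested by passing k explicitly (the output shown in the comment is indicative; set ordering is not guaranteed):

  import 'package:text_analysis/extensions.dart';

  void main() {
    // Parse tri-grams from a term. The "$" characters mark the start and
    // end of the term (see the k-gram entry under Definitions below).
    print('castle'.kGrams(3)); // {$ca, cas, ast, stl, tle, le$}
  }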

Text analysis

Readability #

The TextDocument enumerates a text document's paragraphs, sentences, terms and tokens and computes readability measures:

  • the average number of words in each sentence;
  • the average number of syllables per word;
  • the Flesch reading ease score, a readability measure calculated from sentence length and word length on a 100-point scale; and
  • the Flesch-Kincaid grade level, a readability measure relative to U.S. school grade level.
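
For reference, both Flesch measures are standard formulas over those two averages, shown here as plain Dart functions independent of the package's API:

  // Standard Flesch reading ease formula, computed from the averages above.
  double fleschReadingEase(double wordsPerSentence, double syllablesPerWord) =>
      206.835 - 1.015 * wordsPerSentence - 84.6 * syllablesPerWord;

  // Approximates the U.S. school grade required to read the text.
  double fleschKincaidGradeLevel(
          double wordsPerSentence, double syllablesPerWord) =>
      0.39 * wordsPerSentence + 11.8 * syllablesPerWord - 15.59;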

String Comparison #

The following String extension methods can be used for comparing terms:

  • lengthDistance returns a normalized measure of the difference in length between two terms on a log (base 2) scale;
  • lengthSimilarity returns the similarity in length between two terms on a scale of 0.0 to 1.0 (equivalent to 1 - lengthDistance, with a lower bound of 0.0);
  • lengthSimilarityMap returns a hashmap that maps a collection of terms to their lengthSimilarity with the term;
  • jaccardSimilarity returns the Jaccard similarity index of two terms;
  • jaccardSimilarityMap returns a hashmap that maps a collection of terms to their Jaccard similarity index with the term;
  • termSimilarity returns a similarity index value between 0.0 and 1.0, defined as the product of jaccardSimilarity and lengthSimilarity. A term similarity of 1.0 means the two terms are equal in length and have an identical collection of k-grams;
  • termSimilarityMap returns a hashmap that maps a collection of terms to their termSimilarity with the term; and
  • matches returns the best matches from a collection of terms for a term, in descending order of term similarity (best match first).
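
For example, to compare a misspelled term against candidates (the single-argument call signatures shown here are assumptions; check the API documentation for the current signatures):

  import 'package:text_analysis/extensions.dart';

  void main() {
    // Compare a misspelled term to a candidate term.
    print('bodrer'.jaccardSimilarity('border'));
    print('bodrer'.termSimilarity('border'));
    // Rank candidate terms by similarity, best match first.
    print('bodrer'.matches(['bord', 'board', 'broad', 'border', 'brother']));
  }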

Usage #

In the pubspec.yaml of your Dart or Flutter project, add the following dependency:

dependencies:
  text_analysis: <latest version>

In your code file add the following import:

import 'package:text_analysis/text_analysis.dart';

To use the package's extensions, type definitions or the porter_2_stemmer library, also add any of the following imports:

import 'package:text_analysis/extensions.dart';
import 'package:text_analysis/type_definitions.dart';
import 'package:text_analysis/package_exports.dart';

Basic English tokenization can be performed by using a TextTokenizer instance with the default text analyzer and no token filter:

  /// Use a TextTokenizer instance to tokenize the [text] using the default 
  /// [English] analyzer.
  final document = await TextTokenizer().tokenize(text);

To analyze text or a document, hydrate a TextDocument to obtain the text statistics and readability scores:

      // get some sample text
      final sample =
          'The Australian platypus is seemingly a hybrid of a mammal and reptilian creature.';

      // hydrate the TextDocument
      final textDoc = await TextDocument.analyze(sourceText: sample);

      // print the `Flesch reading ease score`
      print(
          'Flesch Reading Ease: ${textDoc.fleschReadingEaseScore().toStringAsFixed(1)}');
      // prints "Flesch Reading Ease: 37.5"

For more complex text analysis:

  • implement a TextAnalyzer for a different language or non-language documents;
  • implement a custom TextTokenizer or extend TextTokenizerBase;
  • pass in a TokenFilter function to a TextTokenizer to manipulate the tokens after tokenization (a sketch follows below); and/or
  • extend TextDocumentBase.
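
As a sketch of the token filter option, a filter that drops very short tokens might look like this. The TokenFilter typedef is assumed here to be an async function from a token list to a token list, and tokenFilter is assumed to be the constructor parameter's name; check type_definitions.dart and the API documentation:

  /// A sketch of a custom token filter: drop tokens whose term is shorter
  /// than three characters (token.term is assumed to expose the term).
  Future<List<Token>> shortTokenFilter(List<Token> tokens) async =>
      tokens.where((token) => token.term.length > 2).toList();

  // Pass the filter to the tokenizer (the parameter name is assumed).
  final tokens =
      await TextTokenizer(tokenFilter: shortTokenFilter).tokenize(text);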

Please see the examples for more details.

API #

The key interfaces of the text_analysis library are briefly described in this section. Please refer to the documentation for details.

TextAnalyzer

The TextAnalyzer interface exposes language-specific properties and methods used in text analysis:

  • characterFilter is a function that manipulates text prior to stemming and tokenization;
  • termFilter is a filter function that returns a collection of terms from a term: an empty collection if the term is to be excluded from analysis, multiple terms if the term is split (e.g. at hyphens), and/or modified term(s) (e.g. after applying a stemmer algorithm);
  • termSplitter returns a list of terms from text;
  • sentenceSplitter splits text into a list of sentences at sentence and line endings;
  • paragraphSplitter splits text into a list of paragraphs at line endings; and
  • syllableCounter returns the number of syllables in a word or text.

The English implementation of TextAnalyzer is included in this library.
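
For example, the analyzer's component functions can be invoked directly. This is a hedged sketch: the English() constructor call follows the property list above, but the exact construction may differ between versions:

  import 'package:text_analysis/text_analysis.dart';

  void main() {
    final analyzer = English();
    // Split text into terms and sentences, and count syllables for a word.
    print(analyzer.termSplitter('The Australian platypus is a hybrid.'));
    print(analyzer.sentenceSplitter('One sentence. Another sentence.'));
    print(analyzer.syllableCounter('platypus')); // e.g. 3
  }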

TextTokenizer

The TextTokenizer extracts tokens from text for use in full-text search queries and indexes. It uses a TextAnalyzer and token filter in the tokenize and tokenizeJson methods that return a list of tokens from text or a document.

An unnamed factory constructor hydrates an implementation class.
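
A hedged sketch of tokenizing a JSON document follows; the zones argument, the field names and the shape of the call are assumptions based on the description above and the zone definition below:

  // Tokenize selected fields (zones) of a JSON document.
  final json = {
    'title': 'Platypus',
    'body': 'The Australian platypus is seemingly a hybrid of a mammal '
        'and reptilian creature.'
  };
  // The zones argument is an assumption; check the API documentation.
  final tokens = await TextTokenizer().tokenizeJson(json, ['title', 'body']);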

TextDocument #

The TextDocument object model enumerates a text document's paragraphs, sentences, terms and tokens and provides functions that return the text analysis measures described under Readability above.

Definitions #

  • corpus - the collection of documents for which an index is maintained.
  • character filter - filters characters from text in preparation of tokenization.
  • Damerau–Levenshtein distance - a metric for measuring the edit distance between two terms by counting the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one term into the other.
  • dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
  • document - a record in the corpus, that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
  • document frequency (dFt) - the number of documents in the corpus that contain a term.
  • edit distance - a measure of how dissimilar two terms are by counting the minimum number of operations required to transform one string into the other (Wikipedia (7)).
  • Flesch reading ease score - a readability measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document (Wikipedia(6)).
  • Flesch-Kincaid grade level - a readability measure relative to U.S. school grade level. It is also calculated from sentence length and word length (Wikipedia(6)).
  • index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
  • index-elimination - selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
  • inverse document frequency (iDft) - a normalized measure of how rare a term is in the corpus. It is defined as log (N / dft), where N is the total number of documents in the corpus. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
  • Jaccard index - measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets (from Wikipedia). A sketch follows after this list.
  • JSON - an acronym for "JavaScript Object Notation", a common format for persisting data.
  • k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
  • lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
  • Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data (from Wikipedia).
  • postings - a separate index that records which documents the vocabulary occurs in. In a positional index, the postings also record the positions of each term in the text to create a positional inverted index.
  • postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text. In a zoned index, the postings list records the positions of each term in the text of each zone.
  • term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
  • term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
  • stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form (from Wikipedia).
  • stopwords - common words in a language that are excluded from indexing.
  • term frequency (Ft) is the frequency of a term in an index or indexed object.
  • term position is the zero-based index of a term in an ordered array of terms tokenized from the corpus.
  • text - the indexable content of a document.
  • token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
  • token filter - returns a subset of tokens from the tokenizer output.
  • tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
  • vocabulary - the collection of terms indexed from the corpus.
  • zone - the field or zone of a document that a term occurs in, used for parametric indexes or where scoring and ranking of search results attribute a higher score to documents that contain a term in a specific zone (e.g. the title rather than the body of a document).
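
To tie several of these definitions together, the Jaccard index of two terms can be computed over their k-gram sets. A minimal, self-contained sketch (illustrative only; the package's jaccardSimilarity extension encapsulates this):

  // Jaccard index over k-gram sets: |A ∩ B| / |A ∪ B|.
  double jaccardIndex(String a, String b, [int k = 3]) {
    Set<String> kGramsOf(String term) {
      final padded = '\$$term\$'; // "$" marks the term's start and end
      return {
        for (var i = 0; i <= padded.length - k; i++) padded.substring(i, i + k)
      };
    }
    final gramsA = kGramsOf(a);
    final gramsB = kGramsOf(b);
    return gramsA.intersection(gramsB).length / gramsA.union(gramsB).length;
  }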

References #

Issues #

If you find a bug, please file an issue.

This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
