text_analysis 0.1.0+1
text_analysis #
Text analyzer that extracts tokens from text for use in full-text search queries and indexes.
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Objective #
The objective of this package is to provide utilities for analyzing and manipulating text in preparation for constructing a dictionary from a corpus of documents as part of text indexing in an information retrieval application.
The design of the package is consistent with information retrieval theory.
Definitions #
The following definitions are used throughout the documentation:
- corpus: the collection of documents for which an index is maintained.
- dictionary: a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document: a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index: an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index that also includes the positions of the indexed term in each document.
- lemmatizer: lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- postings: a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list: a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term: a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- stemmer: stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- text: the indexable content of a document.
- token: representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer: a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
- vocabulary: the collection of terms indexed from the corpus.
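The relationship between the dictionary, postings and postings lists defined above can be sketched in a few lines of Dart. This is a minimal illustration only: the names `Postings`, `Dictionary` and `indexDocument` are hypothetical and not part of the package API, whitespace splitting stands in for a real tokenizer, and the dictionary is assumed to count every occurrence of a term in the corpus.

```dart
/// Hypothetical aliases, not part of the text_analysis API.
/// Postings: term -> docId -> positions of the term in the document text.
typedef Postings = Map<String, Map<String, List<int>>>;

/// Dictionary: term -> number of occurrences in the corpus (an assumption).
typedef Dictionary = Map<String, int>;

/// Adds one document to [postings] and [dictionary].
/// Lower-casing and splitting on whitespace stand in for a real tokenizer.
void indexDocument(
    String docId, String text, Postings postings, Dictionary dictionary) {
  final terms = text
      .toLowerCase()
      .split(RegExp(r'\s+'))
      .where((term) => term.isNotEmpty)
      .toList();
  for (var position = 0; position < terms.length; position++) {
    final term = terms[position];
    // Postings list: record the position of the term in this document.
    postings
        .putIfAbsent(term, () => <String, List<int>>{})
        .putIfAbsent(docId, () => <int>[])
        .add(position);
    // Dictionary: count the occurrence of the term.
    dictionary[term] = (dictionary[term] ?? 0) + 1;
  }
}
```

Looking up `postings[term]` then yields every document that contains the term, together with the positions needed for phrase or proximity matching.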
Interfaces #
The package relies on two key interfaces:
- the ITextAnalyzer interface; and
- the TextAnalyzerConfiguration interface.
Interface ITextAnalyzer #
ITextAnalyzer is an interface for a text analyzer class that extracts tokens from text for use in full-text search queries and indexes.
ITextAnalyzer.configuration is a TextAnalyzerConfiguration used by the ITextAnalyzer to tokenize source text.
Provide an ITextAnalyzer.tokenFilter to manipulate tokens or restrict tokenization to tokens that meet criteria for either index or count.
The tokenize function tokenizes source text using ITextAnalyzer.configuration and then manipulates the output by applying ITextAnalyzer.tokenFilter.
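The two-step flow described above (tokenize first, then apply the token filter) might look like the following self-contained sketch. All names here (`SketchToken`, `SketchTokenFilter`, `sketchTokenize`, `dropShortTerms`) are hypothetical and do not reproduce the package's own types or signatures; the asynchronous signature is assumed only because the usage example in this README awaits `tokenize`.

```dart
/// Hypothetical token type; the package's own token class may differ.
class SketchToken {
  const SketchToken(this.term, this.position);
  final String term;
  final int position;
}

/// Hypothetical filter signature: receives all tokens, returns those to keep.
typedef SketchTokenFilter = Future<List<SketchToken>> Function(
    List<SketchToken> tokens);

/// Splits [text] into lower-case terms, then applies the optional [filter].
Future<List<SketchToken>> sketchTokenize(String text,
    {SketchTokenFilter? filter}) async {
  final terms = text
      .toLowerCase()
      .split(RegExp(r'[^a-z0-9]+'))
      .where((term) => term.isNotEmpty)
      .toList();
  final tokens = [
    for (var i = 0; i < terms.length; i++) SketchToken(terms[i], i)
  ];
  return filter == null ? tokens : await filter(tokens);
}

/// Example filter: keep only tokens whose term has more than two characters.
Future<List<SketchToken>> dropShortTerms(List<SketchToken> tokens) async =>
    tokens.where((token) => token.term.length > 2).toList();
```

Because the filter runs after tokenization, the positions of the surviving tokens still refer to the original text, which is what a positional index needs.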
Interface TextAnalyzerConfiguration #
The TextAnalyzerConfiguration interface exposes language-specific properties and methods used in text analysis:
- a characterFilter that manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
- a termFilter that returns a collection of terms from a term by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords by returning an empty collection;
- a sentenceSplitter that returns a list of sentences from text by splitting the text at sentence endings such as periods, exclamation marks and question marks, or at line endings; and
- a termSplitter that returns a list of terms from text by splitting the text at appropriate places like white-space and mid-sentence punctuation.
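To make the four roles concrete, here is a rough, self-contained sketch of how such functions could be written and chained. Every name below is hypothetical and the signatures are assumptions, not the package's own; the stopword set is a tiny illustrative sample, and stemming and lemmatization are omitted.

```dart
/// Hypothetical stopword sample; a real configuration would use a fuller list.
const sketchStopWords = {'a', 'an', 'the', 'of'};

/// Character filter: lower-cases a term and removes characters that are not
/// letters, digits, whitespace or hyphens.
String sketchCharacterFilter(String term) =>
    term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9\s-]'), '');

/// Term filter: drops stopwords by returning an empty list, and splits
/// hyphenated terms into their parts.
List<String> sketchTermFilter(String term) => sketchStopWords.contains(term)
    ? <String>[]
    : term.split('-').where((part) => part.isNotEmpty).toList();

/// Sentence splitter: breaks text after '.', '!' or '?' and at line endings.
List<String> sketchSentenceSplitter(String source) =>
    RegExp(r'[^.!?\r\n]+[.!?]*')
        .allMatches(source)
        .map((match) => match.group(0)!.trim())
        .where((sentence) => sentence.isNotEmpty)
        .toList();

/// Term splitter: breaks a sentence at white-space, commas, semicolons and
/// colons.
List<String> sketchTermSplitter(String sentence) => sentence
    .split(RegExp(r'[\s,;:]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Chains the four functions: sentences, then terms, then both filters.
List<String> sketchAnalyze(String source) => [
      for (final sentence in sketchSentenceSplitter(source))
        for (final term in sketchTermSplitter(sentence))
          ...sketchTermFilter(sketchCharacterFilter(term))
    ];
```

Under these assumptions, `sketchAnalyze('Well-known facts. The end!')` yields the terms `well`, `known`, `facts` and `end`; the stopword `the` is dropped by the term filter.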
Implementations #
The latest version provides the following implementation classes:
- the English class implements TextAnalyzerConfiguration and provides text analysis configuration properties for the English language; and
- the TextAnalyzer class implements ITextAnalyzer.tokenize using a token filter and text analysis configuration passed in as parameters at initialization.
Refer to the package API reference for more details.
Usage #
Basic English text analysis can be performed by using a TextAnalyzer instance with the default configuration and no token filter:
/// Use a TextAnalyzer instance to tokenize the [text] using the default
/// English configuration.
final document = await TextAnalyzer().tokenize(text);
For more complex requirements, override TextAnalyzerConfiguration and/or pass in a TokenFilter function to manipulate the tokens after tokenization as shown in the examples.
Install #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  text_analysis: <latest version>
In your code file add the following import:
import 'package:text_analysis/text_analysis.dart';
Examples #
Examples are provided.
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- Cummins, "Information Retrieval", course notes, University of Cambridge, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia