porter_2_stemmer 3.0.0
Reduce a word to its root form using the Porter English stemming algorithm.
Version 2.0.0 BREAKING CHANGES. The algorithm implementation has been moved to the Porter2StemmerMixin. The Porter2Stemmer is now an interface with a factory constructor. Please extend the Porter2StemmerBase implementation class instead of the Porter2Stemmer interface.
Dart implementation of the Porter Stemming Algorithm, used for reducing a word to its word stem, base or root form.
Overview #
This library is a Dart implementation of the popular "Porter" English stemming algorithm for use in information retrieval applications. It exports the Porter2Stemmer class and the Porter2StemmerExtension String extension.
The Porter stemming algorithm is Copyright (c) 2001, Dr Martin Porter and Copyright (c) 2002, Richard Boulton and licensed under the BSD 3-Clause License.
As of version 1.0.0, the Porter2Stemmer class achieves 99.66% accuracy when measured against the sample (Snowball) vocabulary. Taking into account the differences in implementation, this increases to 99.99%, or failure of 4 out of 29,417 terms.
Refer to the references to learn more about the theory behind information retrieval systems.
Departures from Snowball implementation #
Terms that match the following criteria (after stripping opening/closing quotation marks and the possessive apostrophe "'s") are returned unchanged as they are considered to be acronyms, identifiers or non-language terms that have a specific meaning:
- terms that are in all-capitals, e.g. TSLA; and
- terms that contain any non-word characters (anything other than letters, apostrophes and hyphens), e.g. apple.com, alibaba:xnys.
This behaviour can be overridden by pre-processing text with a character filter that converts terms to lower-case and/or strips out non-word characters.
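A minimal sketch of such a character filter is shown below. The function name characterFilter is hypothetical and not part of this package, and the sketch assumes that "letters" means ASCII letters only.

```dart
/// Matches any character other than ASCII letters, apostrophes and hyphens.
/// Restricting "letters" to ASCII is an assumption of this sketch.
final _nonWordChars = RegExp(r"[^a-zA-Z'\-]");

/// Hypothetical character filter (not part of the package): lower-cases
/// [term] and strips non-word characters so that the stemmer no longer
/// treats the term as an acronym or identifier.
String characterFilter(String term) =>
    term.toLowerCase().replaceAll(_nonWordChars, '');

void main() {
  for (final term in ['TSLA', 'apple.com', 'alibaba:xnys']) {
    print('$term -> ${characterFilter(term)}');
  }
}
```

The filtered term can then be passed to the stemmer in place of the original term.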
In this implementation of the English (Porter2) stemming algorithm:
- all quotation marks and apostrophes are converted to the standard single quote character U+0027;
- all leading and trailing quotation marks are stripped from the term before processing begins;
- in Step 5, the trailing "e" is not removed from stems that end in "ue". For example, "tongues" is stemmed as "tongue" (a strict implementation returns "tongu") and "picturesque" is returned unchanged rather than stemmed to "picturesqu"; and
- the exceptions and kInvariantExceptions are checked after every step in the algorithm to ensure exceptions are not missed at intermediate steps.
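The first two points in the list above (quote normalization and stripping) could be sketched roughly as follows. This is an illustration only, not the package's code: the function name normalizeQuotes and the exact set of quote-like characters are assumptions.

```dart
/// Quote-like characters folded to U+0027. The exact set used by the
/// package is not documented here, so this list is an assumption.
final _quoteLike = RegExp('[\u2018\u2019\u201C\u201D\u0022\u0060\u00B4]');

/// Leading or trailing runs of the single quote character U+0027.
final _edgeQuotes = RegExp(r"^'+|'+$");

/// Hypothetical helper: converts quotation marks and apostrophes to U+0027,
/// then strips them from the start and end of [term].
String normalizeQuotes(String term) =>
    term.replaceAll(_quoteLike, "'").replaceAll(_edgeQuotes, '');

void main() {
  print(normalizeQuotes('\u201Ctongues\u201D')); // tongues
  print(normalizeQuotes('don\u2019t')); // don't
}
```

Interior apostrophes are kept, so contractions and possessives survive for the later steps to handle.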
Additional default exceptions have been implemented as follows in the latest version:
- Terms that have no stem: { 'sky': 'sky', 'bye': 'bye', 'ugly': 'ugly', 'early': 'early', 'only': 'only', 'goodbye': 'goodbye', 'commune': 'commune', 'skye': 'skye', 'news': 'news', 'howe': 'howe', 'atlas': 'atlas', 'cosmos': 'cosmos', 'bias': 'bias', 'andes': 'andes' }
- Terms that have no stem at the end of Step 1(a): { 'inning': 'inning', 'proceed': 'proceed', 'goodbye': 'goodbye', 'commune': 'commune', 'herring': 'herring', 'earring': 'earring', 'outing': 'outing', 'exceed': 'exceed', 'canning': 'canning', 'succeed': 'succeed', 'doing': 'do' }
Validation #
A validator test is included in the repository test folder.
The "Porter2Stemmer: VALIDATOR" test iterates through a hashmap of terms to expected stems that contains 29,417 term/stem pairs.
As of version 1.0.0, the Porter2Stemmer class achieves 99.66% accuracy when measured against the sample (Snowball) vocabulary. Taking into account the differences in implementation, this increases to 99.99%, or failure of 4 out of 29,417 terms. The failed stems are:
- "congeners" => "congener" (expected "congen");
- "fluently" => "fluent" (expected "fluentli");
- "harkye" => "harki" (expected "harky"); and
- "lookye" => "looki" (expected "looky").
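The accuracy figure can be reproduced with a loop of the following shape. This is a sketch rather than the repository's actual test: the accuracy function, the toy stemmer and the three-entry sample are illustrative stand-ins, with the sample pairs taken from the terms quoted in this document.

```dart
/// Returns the fraction of [expected] term/stem pairs that [stem] reproduces.
double accuracy(Map<String, String> expected, String Function(String) stem) {
  if (expected.isEmpty) return 1.0;
  var passed = 0;
  expected.forEach((term, expectedStem) {
    if (stem(term) == expectedStem) passed++;
  });
  return passed / expected.length;
}

void main() {
  // Tiny illustrative sample; the real vocabulary has 29,417 pairs.
  const sample = {
    'congeners': 'congen',
    'fluently': 'fluentli',
    'skies': 'sky',
  };
  // Placeholder standing in for the package's stem function.
  String toyStem(String term) => term == 'skies' ? 'sky' : term;
  print(accuracy(sample, toyStem));

  // The reported figure: 4 failures out of 29,417 terms.
  print(((29417 - 4) / 29417 * 100).toStringAsFixed(2)); // 99.99
}
```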
API #
The API exposes the Porter2Stemmer class, an English language stemmer utility class, and the Porter2StemmerExtension String extension.
We use an interface > implementation mixin > base-class > implementation class pattern:
- the interface is an abstract class that exposes fields and methods but contains no implementation code. The interface may expose a factory constructor that returns an implementation class instance;
- the implementation mixin implements the interface class methods, but not the input fields;
- the base-class is an abstract class with the implementation mixin and exposes a default, unnamed generative const constructor for sub-classes. The intention is that implementation classes extend the base class, overriding the interface input fields with final properties passed in via a const generative constructor; and
- the class naming convention for this pattern is "Interface" > "InterfaceMixin" > "InterfaceBase".
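The pattern described above can be illustrated with a small stand-alone example. The TermFilter names below are hypothetical and unrelated to this package's classes; they only follow the "Interface" > "InterfaceMixin" > "InterfaceBase" naming convention.

```dart
/// Interface: exposes fields and methods, plus a factory constructor that
/// returns an implementation class instance.
abstract class TermFilter {
  factory TermFilter({Set<String> stopWords = const <String>{}}) =>
      _TermFilterImpl(stopWords);

  /// Input field, supplied by implementation classes.
  Set<String> get stopWords;

  /// Returns true if [term] should be kept.
  bool keep(String term);
}

/// Implementation mixin: implements the methods, but not the input fields.
mixin TermFilterMixin implements TermFilter {
  @override
  bool keep(String term) => !stopWords.contains(term.toLowerCase());
}

/// Base class: abstract, mixes in the implementation and exposes a default
/// const constructor for sub-classes.
abstract class TermFilterBase with TermFilterMixin {
  const TermFilterBase();
}

/// Implementation class: overrides the input field with a final property.
class _TermFilterImpl extends TermFilterBase {
  @override
  final Set<String> stopWords;

  const _TermFilterImpl(this.stopWords);
}

void main() {
  final filter = TermFilter(stopWords: {'the'});
  print(filter.keep('The')); // false
  print(filter.keep('stem')); // true
}
```

Callers depend only on the interface, while custom behaviour is added by extending the base class and overriding individual methods.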
Porter2Stemmer interface #
The Porter2Stemmer interface exposes the Porter2Stemmer.stem function that reduces a term to its word stem, base or root form by stepping through the five steps of the Porter2 (English) stemming algorithm.
Terms that match a key in Porter2Stemmer.exceptions (after stripping quotation marks and the possessive apostrophe 's) are stemmed by returning the corresponding value from Porter2Stemmer.exceptions.
The default exceptions used by Porter2Stemmer are:
{ 'skis': 'ski',
'skies': 'sky',
'dying': 'die',
'lying': 'lie',
'tying': 'tie',
'idly': 'idl',
'gently': 'gentl',
'singly': 'singl' }
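The exception lookup described above might look roughly like the sketch below. It is not the package's implementation: stemWithExceptions and the fallback parameter are hypothetical, the map holds only three of the default entries, and the order of the stripping steps is an assumption.

```dart
/// Three entries copied from the default exceptions listed above.
const sampleExceptions = {'skis': 'ski', 'skies': 'sky', 'dying': 'die'};

/// Hypothetical helper: strips surrounding single quotes and a trailing
/// possessive 's from [term], then returns the matching exception value,
/// or the result of [fallback] if there is no match.
String stemWithExceptions(String term, String Function(String) fallback) {
  var key = term.replaceAll(RegExp(r"^'+|'+$"), '');
  if (key.endsWith("'s")) {
    key = key.substring(0, key.length - 2);
  }
  return sampleExceptions[key] ?? fallback(key);
}

void main() {
  // The identity function stands in for the five algorithm steps.
  print(stemWithExceptions("'skies'", (t) => t)); // sky
  print(stemWithExceptions("skis's", (t) => t)); // ski
}
```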
Porter2StemmerMixin #
Each of the five stemmer steps is implemented as a public function that takes a term as parameter and returns a manipulated String after applying the step algorithm. The steps may therefore be overridden in sub-classes.
Terms that match the following criteria (after stripping quotation marks and the possessive apostrophe "'s") are returned unchanged as they are considered to be acronyms, identifiers or non-language terms that have a specific meaning:
- terms that are in all-capitals, e.g. TSLA; and
- terms that contain any non-word characters (anything other than letters, apostrophes and hyphens), e.g. apple.com, alibaba:xnys.
Terms that match a key in Porter2Stemmer.exceptions (after stripping quotation marks and the possessive apostrophe 's) are stemmed by returning the corresponding value from Porter2Stemmer.exceptions.
Terms may be converted to lowercase before processing if stemming of all-capitals terms is desired. Split terms that contain non-word characters to stem the term parts separately.
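One way to split and stem the parts separately is sketched below. The stemParts helper is hypothetical and the stem function is passed in as a parameter, so the sketch does not depend on the package. It again assumes ASCII letters only.

```dart
/// One or more characters other than ASCII letters, apostrophes and hyphens.
final _nonWord = RegExp(r"[^a-zA-Z'\-]+");

/// Hypothetical helper: lower-cases [term], splits it on runs of non-word
/// characters and applies [stem] to each non-empty part.
List<String> stemParts(String term, String Function(String) stem) => term
    .toLowerCase()
    .split(_nonWord)
    .where((part) => part.isNotEmpty)
    .map(stem)
    .toList();

void main() {
  // The identity function stands in for a real stemmer here.
  print(stemParts('alibaba:xnys', (t) => t)); // [alibaba, xnys]
}
```

With the package imported, the stem argument could be (t) => t.stemPorter2(), using the String extension described in the next sections.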
The algorithm steps are described fully in "The English (Porter2) stemming algorithm" (see References).
Porter2StemmerBase class #
The Porter2StemmerBase class is an implementation class that mixes in the Porter2StemmerMixin. A const constructor is provided for sub-classes.
Porter2StemmerExtension #
The Porter2StemmerExtension provides an extension method String.stemPorter2 that reduces a term to its word stem, base or root form using the Porter2 (English) stemming algorithm.
Pass the exceptions parameter (a hashmap of String:String) to apply custom exceptions to the algorithm. The default exceptions are the static const Porter2Stemmer.kExceptions.
This extension method is a shortcut to the Porter2Stemmer.stem method.
Usage #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  porter_2_stemmer: ^3.0.0
In your library add the following import:
// Import the `Porter2Stemmer` class as well as
// the String extension `term.stemPorter2()`
import 'package:porter_2_stemmer/porter_2_stemmer.dart';
// Optionally, import the stemmer's implementation extensions
import 'package:porter_2_stemmer/extensions.dart';
// Optionally, import the stemmer's implementation constants
import 'package:porter_2_stemmer/constants.dart';
A String extension is provided, and is the simplest way to get stemming.
final stem = term.stemPorter2();
To implement custom exceptions to the algorithm, provide the exceptions parameter (a hashmap of String:String) that provides the term (key) and its stem (value).
The code below instantiates a Porter2Stemmer instance, passing in a custom exception for the term "TSLA".
// Preserve the default exceptions.
final exceptions = Map<String, String>.from(Porter2Stemmer.kExceptions);
// Add a custom exception for "TSLA".
exceptions['TSLA'] = 'tesla';
// Instantiate the [Porter2Stemmer] instance using the custom [exceptions]
final stemmer = Porter2Stemmer(exceptions: exceptions);
// Get the stem for the [term].
final stem = stemmer.stem(term);
Examples are provided for both the extension method and the class method with custom exceptions.
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- Porter2 stemming algorithm - an English language stemmer developed as part of "Snowball", a small string processing language designed for creating stemming algorithms for use in information retrieval.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
References #
- Porter, Dr Martin and Boulton, Richard, "The English (Porter2) stemming algorithm", Snowball, 2002
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- Cummins, Dr Ronan, "Information Retrieval", course notes, University of Cambridge, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.