porter_2_stemmer 1.0.0+4
porter_2_stemmer #
Dart implementation of the English (Porter2) stemming algorithm, used to reduce a word to its word stem, base or root form.
Objective #
The objective of this package is to provide an English language stemmer utility class and string extension by implementing the English (Porter2) stemming algorithm.
The Porter Stemming Algorithm is Copyright (c) 2001, Dr Martin Porter and Copyright (c) 2002, Richard Boulton and licensed under the BSD 3-Clause License.
The design of the package is consistent with information retrieval theory.
As of version 1.0.0, the Porter2Stemmer achieves 99.66% accuracy when measured against the sample (Snowball) vocabulary. Taking into account the differences in implementation, this increases to 99.99%, with 4 failures out of 29,417 terms.
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- Porter2 stemming algorithm - an English language stemmer developed as part of "Snowball", a small string processing language designed for creating stemming algorithms for use in information retrieval.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
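To make the relationship between text, tokenizer, token and stemmer concrete, here is a minimal sketch. The `Token` class and `tokenize` function are hypothetical names invented for this illustration and are not part of porter_2_stemmer; the stemmer is passed in as a plain function so the sketch runs without the package.

```dart
/// A term together with the character offset at which it was found.
/// Hypothetical type for illustration; not part of porter_2_stemmer.
class Token {
  const Token(this.term, this.position);

  final String term;
  final int position;
}

/// Splits [text] into runs of letters, apostrophes and hyphens, applies
/// [stem] to each run and records where the run started.
List<Token> tokenize(String text, String Function(String) stem) {
  final wordPattern = RegExp(r"[A-Za-z'\-]+");
  return [
    for (final match in wordPattern.allMatches(text))
      Token(stem(match.group(0)!), match.start),
  ];
}

void main() {
  // Lower-casing stands in for a real stemmer so the sketch runs on its own.
  final tokens = tokenize('Skies over Skye', (t) => t.toLowerCase());
  for (final token in tokens) {
    print('${token.term} @ ${token.position}');
  }
}
```

In a real index the `stem` argument would be a stemmer such as the one this package provides, and the token would probably carry more than a single position.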
Implementation #
This package provides the Porter2Stemmer, an English language stemmer utility class, and the Porter2StemmerExtension String extension.
class Porter2Stemmer #
The Porter2Stemmer class exposes the Porter2Stemmer.stem function that reduces a term to its word stem, base or root form by stepping through the five steps of the Porter2 (English) stemming algorithm. Each of the five stemmer steps is implemented as a public function that takes a term as a parameter and returns a manipulated String after applying the step algorithm. The steps may therefore be overridden in sub-classes.
Terms that match the following criteria (after stripping quotation marks and the possessive apostrophe "'s") are returned unchanged as they are considered to be acronyms, identifiers or non-language terms that have a specific meaning:
- terms that are in all-capitals, e.g. TSLA; and
- terms that contain any non-word characters (anything other than letters, apostrophes and hyphens), e.g. apple.com, alibaba:xnys.
Terms that match a key in Porter2Stemmer.exceptions (after stripping quotation marks and the possessive apostrophe "'s") are stemmed by returning the corresponding value from Porter2Stemmer.exceptions.
Terms may be converted to lowercase before processing if stemming of all-capitals terms is desired. Split terms that contain non-word characters to stem the term parts separately.
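The pre-processing described above might look like the following sketch. The `stemParts` helper is a hypothetical name, not part of the package, and the stemmer is passed in as a function so the example runs on its own.

```dart
/// Matches runs of characters that are not letters, apostrophes or hyphens.
final _nonWord = RegExp(r"[^A-Za-z'\-]+");

/// Lower-cases [term], splits it on non-word characters and applies [stem]
/// to each non-empty part.
List<String> stemParts(String term, String Function(String) stem) {
  return term
      .toLowerCase()
      .split(_nonWord)
      .where((part) => part.isNotEmpty)
      .map(stem)
      .toList();
}

void main() {
  // The identity function stands in for a real stemmer.
  print(stemParts('Apple.com', (t) => t)); // [apple, com]
}
```

With the package imported, the second argument could be `(t) => t.stemPorter2()`, so that each part is stemmed separately.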
The algorithm steps are described fully in "The English (Porter2) stemming algorithm" (see References).
extension Porter2StemmerExtension on String #
The Porter2StemmerExtension extension provides an extension method String.stemPorter2 that reduces a term to its word stem, base or root form using the Porter2 (English) stemming algorithm.
Pass the exceptions parameter (a hashmap of String:String) to apply custom exceptions to the algorithm. The default exceptions are the static const Porter2Stemmer.kExceptions.
This extension method is a shortcut to the Porter2Stemmer.stem method.
Usage #
A String extension is provided, and is the simplest way to get stemming.
final stem = term.stemPorter2();
To implement custom exceptions to the algorithm, provide the exceptions parameter (a hashmap of String:String) that provides the term (key) and its stem (value).
The code below instantiates a Porter2Stemmer instance, passing in a custom exception for the term "TSLA".
// Preserve the default exceptions.
final exceptions = Map<String, String>.from(Porter2Stemmer.kExceptions);
// Add a custom exception for "TSLA".
exceptions['TSLA'] = 'tesla';
// Instantiate the [Porter2Stemmer] instance using the custom [exceptions]
final stemmer = Porter2Stemmer(exceptions: exceptions);
// Get the stem for the [term].
final stem = stemmer.stem(term);
Install #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  porter_2_stemmer: ^1.0.0
In your library add the following import:
import 'package:porter_2_stemmer/porter_2_stemmer.dart';
Examples #
Examples are provided for both the extension method and the class method with custom exceptions.
Departures from Snowball implementation #
Terms that match the following criteria (after stripping opening/closing quotation marks and the possessive apostrophe "'s") are returned unchanged as they are considered to be acronyms, identifiers or non-language terms that have a specific meaning:
- terms that are in all-capitals, e.g. TSLA; and
- terms that contain any non-word characters (anything other than letters, apostrophes and hyphens), e.g. apple.com, alibaba:xnys.
This behaviour can be overridden by pre-processing text with a character filter that converts terms to lower-case and/or strips out non-word characters.
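The two criteria can be expressed as a small predicate. This is an illustrative sketch, not the package's own code: `isReturnedUnchanged` is a hypothetical name, and the sketch assumes quotation marks and the possessive "'s" have already been stripped and that "letters" means ASCII letters.

```dart
/// Returns true if [term] is all-capitals or contains a character other
/// than a letter, apostrophe or hyphen.
bool isReturnedUnchanged(String term) {
  final hasNonWordCharacter = RegExp(r"[^A-Za-z'\-]").hasMatch(term);
  // The second comparison rules out terms that contain no letters at all.
  final isAllCapitals =
      term == term.toUpperCase() && term != term.toLowerCase();
  return hasNonWordCharacter || isAllCapitals;
}

void main() {
  for (final term in ['TSLA', 'apple.com', 'alibaba:xnys', 'tongues']) {
    print('$term: ${isReturnedUnchanged(term)}');
  }
}
```

A check like this can be used before stemming to decide whether a term should first be lower-cased or split.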
In this implementation of the English (Porter2) stemming algorithm:
- all quotation marks and apostrophes are converted to the standard single quote character U+0027;
- all leading and trailing quotation marks are stripped from the term before processing begins;
- in Step 5, the trailing "e" is not removed from stems that end in "ue". For example, "tongues" is stemmed as "tongue" (the strict implementation returns "tongu") and "picturesque" is returned unchanged rather than stemmed to "picturesqu"; and
- the exceptions and kInvariantExceptions are checked after every step in the algorithm to ensure exceptions are not missed at intermediate steps.
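The quotation-mark handling in the first two points could be sketched as follows. The set of characters treated as quotation marks is an assumption made for this example, and `normalizeQuotes` is a hypothetical helper; the package may recognise a different set.

```dart
/// Quotation marks and apostrophe look-alikes to normalise. This set is an
/// assumption; the package may recognise a different set.
final _quoteLike = RegExp('[\u2018\u2019\u201B\u201C\u201D"`\u00B4]');

/// Leading or trailing runs of U+0027.
final _edgeQuotes = RegExp(r"^'+|'+$");

/// Converts quote-like characters to U+0027, then strips them from both
/// ends of [term].
String normalizeQuotes(String term) =>
    term.replaceAll(_quoteLike, "'").replaceAll(_edgeQuotes, '');

void main() {
  print(normalizeQuotes('\u201Ctongues\u201D')); // tongues
  print(normalizeQuotes('o\u2019clock')); // o'clock
}
```

Quotes inside a term are kept as U+0027, so later steps only need to deal with one apostrophe character.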
Additional default exceptions have been implemented as follows in the latest version:
/// Collection of default exceptions used by [Porter2Stemmer].
static const kExceptions = {
  'skis': 'ski',
  'skies': 'sky',
  'dying': 'die',
  'lying': 'lie',
  'tying': 'tie',
  'idly': 'idl',
  'gently': 'gentl',
  'singly': 'singl',
};
/// Collection of terms that have no stem.
static const kInvariantExceptions = {
  'sky': 'sky',
  'bye': 'bye',
  'ugly': 'ugly',
  'early': 'early',
  'only': 'only',
  'goodbye': 'goodbye',
  'commune': 'commune',
  'skye': 'skye',
  'news': 'news',
  'howe': 'howe',
  'atlas': 'atlas',
  'cosmos': 'cosmos',
  'bias': 'bias',
  'andes': 'andes',
};
/// Collection of terms that have no stem at the end of Step 1(a).
static const kStep1AExceptions = {
  'inning': 'inning',
  'proceed': 'proceed',
  'goodbye': 'goodbye',
  'commune': 'commune',
  'herring': 'herring',
  'earring': 'earring',
  'outing': 'outing',
  'exceed': 'exceed',
  'canning': 'canning',
  'succeed': 'succeed',
  'doing': 'do',
};
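The idea of checking exceptions between steps can be illustrated with a toy pipeline. The two steps below are deliberately simplistic stand-ins, not the Porter2 steps, and `applySteps` is a hypothetical helper that uses a single exceptions map rather than the separate maps listed above.

```dart
/// Applies [steps] in order, returning early as soon as the intermediate
/// form matches a key in [exceptions].
String applySteps(
  String term,
  List<String Function(String)> steps,
  Map<String, String> exceptions,
) {
  var current = term;
  for (final step in steps) {
    final exception = exceptions[current];
    if (exception != null) return exception;
    current = step(current);
  }
  // Check once more after the final step.
  return exceptions[current] ?? current;
}

/// Toy step: removes [suffix] from the end of [term] if present.
String stripSuffix(String term, String suffix) => term.endsWith(suffix)
    ? term.substring(0, term.length - suffix.length)
    : term;

void main() {
  final steps = <String Function(String)>[
    (t) => stripSuffix(t, 's'),
    (t) => stripSuffix(t, 'ing'),
  ];
  // With the exception, processing stops at the intermediate form.
  print(applySteps('innings', steps, {'inning': 'inning'})); // inning
  // Without it, the second toy step over-stems the term.
  print(applySteps('innings', steps, {})); // inn
}
```

The intermediate form "inning" only appears after the first step, which is why a single lookup before stemming would miss it.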
Validation #
A validator test is included in the repository test folder.
The Porter2Stemmer: VALIDATOR test iterates through a hashmap of terms to expected stems that contains 29,417 term/stem pairs.
As of version 1.0.0, the Porter2Stemmer achieves 99.66% accuracy when measured against the sample (Snowball) vocabulary. Taking into account the differences in implementation, this increases to 99.99%, with 4 failures out of 29,417 terms. The failed stems are:
- "congeners" => "congener" (expected "congen");
- "fluently" => "fluent" (expected "fluentli");
- "harkye" => "harki" (expected "harky"); and
- "lookye" => "looki" (expected "looky").
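The accuracy figure is the share of term/stem pairs for which the stemmer returns the expected stem. A minimal sketch of that calculation follows; `accuracy` is a hypothetical helper and not the validator test from the repository.

```dart
/// Returns the percentage of entries in [expected] for which [stem]
/// produces the expected stem.
double accuracy(Map<String, String> expected, String Function(String) stem) {
  if (expected.isEmpty) return 0;
  var passed = 0;
  expected.forEach((term, expectedStem) {
    if (stem(term) == expectedStem) passed++;
  });
  return 100 * passed / expected.length;
}

void main() {
  // Tiny stand-in vocabulary; the identity function plays the stemmer.
  final sample = {'sky': 'sky', 'skies': 'sky'};
  print(accuracy(sample, (t) => t)); // 50.0

  // 4 failures out of 29,417 terms, as reported above.
  print((100 * (29417 - 4) / 29417).toStringAsFixed(2)); // 99.99
}
```

Passing the package's stemmer as the second argument, together with the Snowball vocabulary, would give the kind of figure reported in this section.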
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
References #
- Porter, Dr Martin and Boulton, Richard, 2002, "The English (Porter2) stemming algorithm", snowballstem.org/
- Manning, Raghavan and Schütze, 2008, "Introduction to Information Retrieval", Cambridge University Press
- University of Cambridge, 2016 "Information Retrieval", course notes, Dr Ronan Cummins
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia