porter_2_stemmer 0.0.8

Dart implementation of the Porter stemming algorithm (Porter2), used to reduce a word to its word stem, base or root form.

Install #

In the pubspec.yaml of your Flutter project, add the following dependency:

dependencies:
  porter_2_stemmer: <latest_version>

In your library add the following import:

import 'package:porter_2_stemmer/porter_2_stemmer.dart';

Usage #

A String extension is provided and is the simplest way to stem a term:

import 'package:porter_2_stemmer/porter_2_stemmer.dart';

/// Iterate through a collection of terms/words and print the stem for each
/// term.
void main() {
  // Collection of terms/words for which stems are printed.
  final terms = [
    'sky’s',
    'skis',
    'TSLA',
    'APPLE:NASDAQ',
    'consolatory',
    '"news"',
    "mother's",
    'generally',
    'consignment'
  ];

  // print a heading
  print('Example usage of Porter2Stemmer extension');

  /// Iterate through the [terms] and print the stem for each term.
  for (final term in terms) {
    // Get the stem for the [term] by calling the stemPorter2() extension
    // method.
    final stem = term.stemPorter2();

    // Print the [term => stem].
    print('$term => $stem');
  }
}

To apply custom exceptions to the algorithm, pass the exceptions parameter, a Map<String, String> that maps each term (key) to its stem (value).

The next example instantiates a Porter2Stemmer and passes in a custom exception for the term "TSLA".

import 'package:porter_2_stemmer/porter_2_stemmer.dart';

/// Instantiates a [Porter2Stemmer] instance using a custom exception for
/// the term "TSLA".
///
/// Prints the terms and their stems.
void main() {
  // Collection of terms/words for which stems are printed.
  final terms = [
    'sky’s',
    'skis',
    'TSLA',
    'APPLE:NASDAQ',
    'apple.com',
    'consolatory',
    '"news"',
    "mother's",
    'generally',
    'consignment'
  ];

  // print a heading
  print('Example usage of Porter2Stemmer.stem method');

  // Preserve the default exceptions.
  final exceptions = Map<String, String>.from(Porter2Stemmer.kExceptions);

  // Add a custom exception for "TSLA".
  exceptions['TSLA'] = 'tesla';

  // Instantiate the [Porter2Stemmer] instance using the custom [exceptions]
  final stemmer = Porter2Stemmer(exceptions: exceptions);

  /// Iterate through the [terms] and print the stem for each term.
  for (final term in terms) {
    // Get the stem for the [term].
    final stem = stemmer.stem(term);

    // Print the [term => stem].
    print('$term => $stem');
  }
}

What is the Porter Stemming Algorithm? #

A stemmer is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up information retrieval systems.

The English (Porter2) stemming algorithm implemented in Porter2Stemmer was developed as part of "Snowball", a small string processing language designed for creating stemming algorithms for use in information retrieval.

The Porter Stemming Algorithm is Copyright (c) 2001, Dr Martin Porter and Copyright (c) 2002, Richard Boulton and licensed under the BSD 3-Clause License.

Departures from Snowball implementation #

In this implementation of the English (Porter2) stemming algorithm:

  • all quotation marks and apostrophes are converted to a standard single quote character, U+0027 (ASCII hex 27);
  • all leading and trailing quotation marks are stripped from the term before processing begins; and
  • in Step 5, the trailing "e" is not removed from stems that end in "ue". For example, "tongues" is stemmed as "tongue" (a strict implementation returns "tongu") and "picturesque" is returned unchanged rather than stemmed to "picturesqu".
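
The first two departures amount to a small normalisation pass before stemming. The sketch below is only an illustration: normalizeQuotes is a hypothetical helper, not part of the package, and the set of quote-like characters it recognises is an assumption.

```dart
/// Hypothetical helper, not part of porter_2_stemmer: converts quote-like
/// characters in [term] to U+0027 and strips leading and trailing quotes.
String normalizeQuotes(String term) {
  // Assumed set of quote-like characters: curly single and double quotes,
  // the straight double quote, the backtick and the acute accent.
  final quoteLike = RegExp('[\u2018\u2019\u201C\u201D"`\u00B4]');
  final standardised = term.replaceAll(quoteLike, "'");
  // Strip any run of single quotes at the start or end of the term.
  return standardised.replaceAll(RegExp(r"^'+|'+$"), '');
}

void main() {
  print(normalizeQuotes('sky\u2019s')); // sky's
  print(normalizeQuotes('"news"')); // news
}
```

Converting every quote variant to one character first means the later possessive handling only has to look for a single form of "'s".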

Terms that match the following criteria (after stripping quotation marks and the possessive apostrophe "'s") are returned unchanged, as they are considered to be acronyms, identifiers or non-language terms that have a specific meaning:

  • terms that are in all-capitals, e.g. TSLA;
  • terms that contain any non-word characters (anything other than letters, apostrophes and hyphens), e.g. apple.com, alibaba:xnys.
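
The two criteria above could be checked roughly as follows. This is a sketch under assumptions: isReturnedUnchanged is a hypothetical name, only ASCII letters are treated as letters, and the package's own test may differ in detail.

```dart
/// Hypothetical predicate, not the package's API: true when [term] would be
/// treated as an acronym, identifier or other non-language term.
bool isReturnedUnchanged(String term) {
  // Drop a trailing possessive "'s" before testing.
  final base =
      term.endsWith("'s") ? term.substring(0, term.length - 2) : term;
  // Non-word character: anything other than letters, apostrophes and hyphens.
  final hasNonWordChar = RegExp(r"[^A-Za-z'-]").hasMatch(base);
  // All-capitals: contains a letter and upper-casing changes nothing.
  final hasLetter = RegExp(r'[A-Za-z]').hasMatch(base);
  final isAllCaps = hasLetter && base == base.toUpperCase();
  return hasNonWordChar || isAllCaps;
}

void main() {
  print(isReturnedUnchanged('TSLA')); // true
  print(isReturnedUnchanged('apple.com')); // true
  print(isReturnedUnchanged('generally')); // false
}
```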

This behaviour can be overridden by pre-processing text with a character filter that changes terms to lower-case and strips out non-word characters.
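
A minimal sketch of such a character filter is shown here. The name characterFilter is hypothetical, and dropping non-word characters outright (rather than replacing them with spaces) is an assumption that suits single terms only.

```dart
/// Hypothetical character filter, not part of porter_2_stemmer: lower-cases
/// [term] and removes everything except letters, apostrophes and hyphens.
String characterFilter(String term) =>
    term.toLowerCase().replaceAll(RegExp(r"[^a-z'-]"), '');

void main() {
  print(characterFilter('TSLA')); // tsla
  print(characterFilter('APPLE:NASDAQ')); // applenasdaq
  print(characterFilter('apple.com')); // applecom
}
```

The filtered term can then be passed to the stemPorter2() extension or to Porter2Stemmer.stem as in the usage examples.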

The default exceptions are:


  /// Collection of default exceptions used by [Porter2Stemmer].
  static const kExceptions = {
    'skis': 'ski',
    'skies': 'sky',
    'dying': 'die',
    'lying': 'lie',
    'tying': 'tie',
    'idly': 'idl',
    'gently': 'gentl',
    'singly': 'singl',
  };

  /// Collection of terms that have no stem.
  static const kInvariantExceptions = {
    'sky': 'sky',
    'bye': 'bye',
    'ugly': 'ugly',
    'early': 'early',
    'only': 'only',
    'goodbye': 'goodbye',
    'commune': 'commune',
    'skye': 'skye',
    'news': 'news',
    'howe': 'howe',
    'atlas': 'atlas',
    'cosmos': 'cosmos',
    'bias': 'bias',
    'andes': 'andes',
  };

  /// Collection of terms that have no stem at the end of Step 1(a).
  static const kStep1AExceptions = {
    'inning': 'inning',
    'proceed': 'proceed',
    'goodbye': 'goodbye',
    'commune': 'commune',
    'herring': 'herring',
    'earring': 'earring',
    'outing': 'outing',
    'exceed': 'exceed',
    'canning': 'canning',
    'succeed': 'succeed',
    'doing': 'do'
  };

Validation #

A validator test is included in the repository as part of the test folder.

The 'Porter2Stemmer: VALIDATOR' test iterates through a map of 29,417 term/stem pairs and compares each term's stem with the expected stem.

As of <latest_version>, the Porter2Stemmer achieves 99.66% accuracy when measured against the sample (Snowball) vocabulary. Taking into account the differences in implementation, this increases to 99.99%, with 4 of the 29,417 terms failing. The failed stems are:

  • "congeners" => "congener" (expected "congen");
  • "fluently" => "fluent" (expected "fluentli");
  • "harkye" => "harki" (expected "harky"); and
  • "lookye" => "looki" (expected "looky").
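
The quoted percentages can be checked with a little arithmetic. In the sketch below accuracyPercent is a hypothetical helper, and the failure count derived from the 99.66% figure is only approximate because that figure is itself rounded.

```dart
/// Percentage of terms stemmed as expected.
double accuracyPercent(int total, int failures) =>
    100 * (total - failures) / total;

void main() {
  const total = 29417; // term/stem pairs in the validator vocabulary
  // Four remaining failures round to the quoted 99.99%.
  print(accuracyPercent(total, 4).toStringAsFixed(2)); // 99.99
  // Approximate failure count implied by the rounded 99.66% figure.
  print((total * (1 - 0.9966)).round()); // 100
}
```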

Contributions #

Feel free to contribute to this project:

  • If you find a bug or want a feature, but don't know how to fix/implement it, please file an issue.
  • If you fixed a bug or implemented a feature, please send a pull request.

This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.

Publisher

gmconsult.com.au (verified publisher)