jaccardDistanceOf function

int jaccardDistanceOf(
  String source,
  String target, {
  int ngram = 1,
  bool ignoreCase = false,
  bool ignoreWhitespace = false,
  bool ignoreNumbers = false,
  bool alphaNumericOnly = false,
})

Returns the Jaccard distance between two strings.

Parameters

  • source is the variant string
  • target is the prototype string
  • if ignoreCase is true, character case is ignored.
  • if ignoreWhitespace is true, whitespace characters such as spaces, tabs, and newlines are ignored.
  • if ignoreNumbers is true, numbers are ignored.
  • if alphaNumericOnly is true, only letters and digits are matched.
  • ngram is the size of a single item group. If ngram = 1, each item is considered separately. If ngram = 2, two consecutive items are grouped together and treated as one.
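To make the ngram grouping concrete, here is a minimal sketch of how a string might be broken into groups of consecutive items. The helper name ngramsOf is hypothetical and is not the library's splitStringToSet, whose exact behavior (for example, on strings shorter than ngram) is not documented here.

```dart
// Hypothetical helper for illustration only; not the library's
// splitStringToSet. Collects every run of `n` consecutive UTF-16 code
// units of `text` into a set, so repeated groups are kept once.
Set<String> ngramsOf(String text, int n) {
  final grams = <String>{};
  for (var i = 0; i + n <= text.length; i++) {
    grams.add(text.substring(i, i + n));
  }
  return grams;
}

void main() {
  print(ngramsOf('banana', 1)); // {b, a, n}
  print(ngramsOf('banana', 2)); // {ba, an, na}
}
```

With n = 1 every character stands alone; with n = 2 overlapping pairs become the items that are compared.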

Details

Jaccard distance measures the number of distinct items (characters, or groups of characters when ngram > 1) that are present in one string but not the other. It is calculated by subtracting the size of the intersection of the source and target sets from the size of their union.
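The calculation described above can be sketched with Dart's built-in Set operations. This is an illustrative version under the assumption that both inputs are reduced to sets first; the name jaccardDistanceSketch is hypothetical, and the library's own jaccardDistance may be implemented differently.

```dart
// Hypothetical sketch of the set arithmetic: |A ∪ B| - |A ∩ B|.
// Not the library's jaccardDistance.
int jaccardDistanceSketch<T>(Iterable<T> source, Iterable<T> target) {
  final a = source.toSet();
  final b = target.toSet();
  final intersection = a.intersection(b).length;
  // Size of the union without building it: |A| + |B| - |A ∩ B|.
  final union = a.length + b.length - intersection;
  // Items that appear in exactly one of the two sets.
  return union - intersection;
}

void main() {
  // 'night' -> {n, i, g, h, t}, 'nacht' -> {n, a, c, h, t}
  // union = 7, intersection = 3 (n, h, t), distance = 4 (i, g, a, c)
  print(jaccardDistanceSketch('night'.codeUnits, 'nacht'.codeUnits)); // 4
}
```

Identical strings give a distance of 0, and strings with no items in common give the combined number of distinct items.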

See Also: tverskyIndex, jaccardIndex


Complexity: Time O(n log n) | Space O(n)

Implementation

int jaccardDistanceOf(
  String source,
  String target, {
  int ngram = 1,
  bool ignoreCase = false,
  bool ignoreWhitespace = false,
  bool ignoreNumbers = false,
  bool alphaNumericOnly = false,
}) {
  source = cleanupString(
    source,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
  target = cleanupString(
    target,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
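  if (ngram < 2) {
    // Single items: compare the strings by their UTF-16 code units.
    return jaccardDistance(
      source.codeUnits,
      target.codeUnits,
    );
  } else {
    // Grouped items: compare the sets produced by splitStringToSet.
    return jaccardDistance(
      splitStringToSet(source, ngram),
      splitStringToSet(target, ngram),
    );
  }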
}