tverskyIndexOf function - algorithmic.string.metrics library

Finds the Tversky similarity index between two strings.

Parameters

source is the variant string.
target is the prototype string.
alpha is the variant coefficient. Default is 0.5.
beta is the prototype coefficient. Default is 0.5.
If ignoreCase is true, character case is ignored.
If ignoreWhitespace is true, whitespace characters (spaces, tabs, newlines, etc.) are ignored.
If ignoreNumbers is true, numbers are ignored.
If alphaNumericOnly is true, only letters and digits are matched.
ngram is the size of a single item group. Default is 1. If ngram = 1, each item is considered separately. If ngram = 2, every two consecutive items are grouped together and treated as one.
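
To make the ngram grouping concrete, here is a minimal sketch of how a string could be split into overlapping groups of consecutive characters. The helper name ngramsOf is hypothetical and is not the library's splitStringToSet; how the library treats strings shorter than ngram is not covered by this page, so the sketch simply returns an empty set in that case.

```dart
// Hypothetical helper for illustration only; not part of the library.
// Collects every run of n consecutive UTF-16 code units as one item.
Set<String> ngramsOf(String text, int n) {
  final grams = <String>{};
  for (var i = 0; i + n <= text.length; i++) {
    grams.add(text.substring(i, i + n));
  }
  return grams;
}

void main() {
  print(ngramsOf('night', 1)); // {n, i, g, h, t}
  print(ngramsOf('night', 2)); // {ni, ig, gh, ht}
}
```

Note that in the implementation shown on this page, any ngram value below 2 skips grouping and compares the strings' code units directly.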

TIP: Set both ignoreNumbers and alphaNumericOnly to true to ignore everything except letters.

Details

The Tversky index is an asymmetric similarity measure between sets that compares a variant with a prototype. It is a generalization of the Sørensen–Dice coefficient and the Jaccard index.
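
For reference, the standard definition of the index can be written as follows, where X stands for the set of items from the variant (source) and Y for the set of items from the prototype (target). The symbols X and Y are notation chosen here, not names from the library.

```latex
S(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|},
\qquad \alpha, \beta \ge 0
```

Setting alpha = beta = 0.5 (the defaults here) gives the Sørensen–Dice coefficient, and alpha = beta = 1 gives the Jaccard index.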

It may return NaN depending on the values of alpha and beta.
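
One plausible way a NaN can arise is a zero denominator in the set formula. The sketch below is a direct, unoptimized transcription of the standard Tversky definition onto Dart sets; tverskySketch is a hypothetical name, and this is not the library's tverskyIndex, whose handling of edge cases may differ.

```dart
// Hypothetical illustration of the standard Tversky formula on sets.
// Not the library's implementation; edge-case behavior is an assumption.
double tverskySketch<E>(
  Set<E> source,
  Set<E> target, {
  double alpha = 0.5,
  double beta = 0.5,
}) {
  final common = source.intersection(target).length;
  final onlySource = source.difference(target).length;
  final onlyTarget = target.difference(source).length;
  final denominator = common + alpha * onlySource + beta * onlyTarget;
  // A zero numerator over a zero denominator yields NaN for doubles.
  return common / denominator;
}

void main() {
  final a = {'n', 'i', 'g', 'h', 't'};
  final b = {'n', 'a', 'c', 'h', 't'};

  // 3 shared items, 2 unique to each side: 3 / (3 + 1 + 1).
  print(tverskySketch(a, b)); // 0.6

  // No shared items and both coefficients zero: 0 / 0.
  print(tverskySketch({'a'}, {'b'}, alpha: 0, beta: 0)); // NaN
}
```

Callers who need a defined result in such cases can check the return value with isNaN before using it.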

See Also: tverskyIndex

Complexity: Time O(n log n) | Space O(n)

Implementation

double tverskyIndexOf(
  String source,
  String target, {
  int ngram = 1,
  double alpha = 0.5,
  double beta = 0.5,
  bool ignoreCase = false,
  bool ignoreWhitespace = false,
  bool ignoreNumbers = false,
  bool alphaNumericOnly = false,
}) {
  source = cleanupString(
    source,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
  target = cleanupString(
    target,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
  if (ngram < 2) {
    return tverskyIndex(
      source.codeUnits,
      target.codeUnits,
      alpha: alpha,
      beta: beta,
    );
  } else {
    return tverskyIndex(
      splitStringToSet(source, ngram),
      splitStringToSet(target, ngram),
      alpha: alpha,
      beta: beta,
    );
  }
}