tverskyIndexOf function
Finds the Tversky similarity index between two strings.
Parameters
- source is the variant string.
- target is the prototype string.
- alpha is the variant coefficient. Default is 0.5.
- beta is the prototype coefficient. Default is 0.5.
- If ignoreCase is true, character case is ignored.
- If ignoreWhitespace is true, whitespace characters (spaces, tabs, newlines, etc.) are ignored.
- If ignoreNumbers is true, numbers are ignored.
- If alphaNumericOnly is true, only letters and digits are matched.
- ngram is the size of a single item group. If n = 1, each item is considered separately. If n = 2, two consecutive items are grouped together and treated as one. Default is 1.
TIPS: You can set both ignoreNumbers and alphaNumericOnly to true to ignore everything except letters.
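To make the ngram parameter concrete, here is a minimal sketch of how a string might be grouped into items of size n. The helper name ngramsSketch is hypothetical and is not the library's splitStringToSet. The overlapping sliding window is also an assumption, since the description above does not say whether groups overlap.

```dart
// Hypothetical helper for illustration only; the library's own
// splitStringToSet may differ (for example, in whether groups overlap).
Set<String> ngramsSketch(String text, int n) {
  if (n < 1) {
    throw ArgumentError.value(n, 'n', 'must be at least 1');
  }
  final grams = <String>{};
  // Slide a window of width n across the string, one position at a time.
  for (var i = 0; i + n <= text.length; i++) {
    grams.add(text.substring(i, i + n));
  }
  return grams;
}

void main() {
  print(ngramsSketch('night', 1)); // {n, i, g, h, t}
  print(ngramsSketch('night', 2)); // {ni, ig, gh, ht}
}
```

Under this assumption, larger values of n make the comparison sensitive to character order, because two strings share an item only when they share a run of n consecutive characters.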
Details
Tversky index is an asymmetric similarity measure between sets that compares a variant with a prototype. It is a generalization of the Sørensen–Dice coefficient and Jaccard index.
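For reference, the standard set-based definition can be written as follows. Here X stands for the set of items from the variant and Y for the set from the prototype. Pairing alpha with the items found only in X and beta with the items found only in Y is an assumption based on the parameter descriptions, not something this page states explicitly.

```latex
S(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|}
```

With alpha = beta = 0.5 the denominator becomes (|X| + |Y|) / 2, which gives the Sørensen–Dice coefficient. With alpha = beta = 1 it becomes |X ∪ Y|, which gives the Jaccard index.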
It may return NaN depending on the values of alpha and beta.
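The following sketch shows one way a NaN can arise. It applies the standard Tversky formula directly to two sets, so the function name tverskySketch and its lack of guarding are assumptions for illustration. The library's tverskyIndex may handle these edge cases differently.

```dart
// Hypothetical, unguarded implementation of the Tversky formula;
// this is not the library's tverskyIndex.
double tverskySketch<T>(
  Set<T> source,
  Set<T> target, {
  double alpha = 0.5,
  double beta = 0.5,
}) {
  final common = source.intersection(target).length;
  final onlySource = source.difference(target).length;
  final onlyTarget = target.difference(source).length;
  // If all three terms of the denominator are zero, this is 0 / 0.0,
  // which evaluates to NaN for doubles.
  return common / (common + alpha * onlySource + beta * onlyTarget);
}

void main() {
  final a = 'night'.codeUnits.toSet();
  final b = 'nacht'.codeUnits.toSet();

  // 3 shared items, 2 unique to each side: 3 / (3 + 1 + 1).
  print(tverskySketch(a, b)); // 0.6

  // Disjoint sets with both coefficients at zero: 0 / 0.
  print(tverskySketch({1, 2}, {3, 4}, alpha: 0.0, beta: 0.0).isNaN); // true

  // Two empty sets also give 0 / 0, whatever alpha and beta are.
  print(tverskySketch(<int>{}, <int>{}).isNaN); // true
}
```

A caller that cannot tolerate NaN could check the result with isNaN and substitute a value that suits the application.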
See Also: tverskyIndex
Complexity: Time O(n log n) | Space O(n)
Implementation
double tverskyIndexOf(
  String source,
  String target, {
  int ngram = 1,
  double alpha = 0.5,
  double beta = 0.5,
  bool ignoreCase = false,
  bool ignoreWhitespace = false,
  bool ignoreNumbers = false,
  bool alphaNumericOnly = false,
}) {
  // Normalize both strings according to the ignore flags.
  source = cleanupString(
    source,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
  target = cleanupString(
    target,
    ignoreCase: ignoreCase,
    ignoreWhitespace: ignoreWhitespace,
    ignoreNumbers: ignoreNumbers,
    alphaNumericOnly: alphaNumericOnly,
  );
  if (ngram < 2) {
    // Single items: compare the strings' UTF-16 code units.
    return tverskyIndex(
      source.codeUnits,
      target.codeUnits,
      alpha: alpha,
      beta: beta,
    );
  } else {
    // Grouped items: compare the n-grams of each string.
    return tverskyIndex(
      splitStringToSet(source, ngram),
      splitStringToSet(target, ngram),
      alpha: alpha,
      beta: beta,
    );
  }
}