jaroWinklerSimilarity<E> function
Find the Jaro-Winkler similarity index between two lists of items.
Parameters
source and target are the two lists of items to compare.
threshold is the minimum Jaro similarity that must be exceeded before Winkler's increment is applied. It defaults to 0.7.
maxPrefixSize is the maximum prefix length to consider. If absent, the whole matching prefix is considered. Winkler's original formulation caps the prefix at 4 items; pass maxPrefixSize: 4 to apply that cap.
prefixScale is a constant scaling factor that controls how much the score is adjusted upwards for a common prefix. If absent, min(0.1, 1 / max(n, m)) is used, where n and m are the lengths of source and target.
Details
The Jaro similarity index between two lists of items is a weighted sum of the fraction of matched items in each list and the fraction of matched items that are not transposed. Winkler's modification increases this measure when the two lists begin with the same items.
See also: jaroSimilarity
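The adjustment can be written as a single formula. This is a restatement of the Implementation section rather than an independent definition; the symbols sim_j (the Jaro similarity), sim_w (the returned value), \ell (the length of the matching prefix, limited by maxPrefixSize) and p (prefixScale or its default) are notation introduced here for illustration.

```latex
\[
\mathrm{sim}_w =
\begin{cases}
\mathrm{sim}_j + \ell \, p \, (1 - \mathrm{sim}_j) & \text{if } \mathrm{sim}_j > \text{threshold} \\
\mathrm{sim}_j & \text{otherwise}
\end{cases}
\]
```

For example, with sim_j = 0.9, \ell = 2 and p = 0.1, the result is 0.9 + 2 * 0.1 * 0.1 = 0.92.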
If n is the length of source and m is the length of target,
Complexity: Time O(nm) | Space O(n+m)
Implementation
import 'dart:math'; // provides min and max

double jaroWinklerSimilarity<E>(
  List<E> source,
  List<E> target, {
  int? maxPrefixSize,
  double? prefixScale,
  double threshold = 0.7,
}) {
  double jaro = jaroSimilarity(source, target);
  // The Winkler bonus applies only when the Jaro similarity exceeds threshold
  if (jaro > threshold) {
    // Maximum length to search for a prefix match
    int len = min(source.length, target.length);
    if (maxPrefixSize != null && len > maxPrefixSize) {
      len = maxPrefixSize;
    }
    // Find the length of the matching prefix
    int l = 0;
    while (l < len && source[l] == target[l]) {
      l++;
    }
    // Add the Winkler bonus to the Jaro similarity index
    double p = prefixScale ?? min(0.1, 1 / max(source.length, target.length));
    jaro += l * p * (1 - jaro);
  }
  return jaro;
}
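Usage example
The following is a minimal, self-contained sketch of how the function behaves on a classic pair of strings. Because jaroSimilarity is not shown on this page, the sketch substitutes a textbook Jaro implementation under the hypothetical name jaroSketch; the library's own jaroSimilarity may differ in detail. jaroWinklerSketch is likewise a hypothetical name that mirrors the implementation above.

```dart
import 'dart:math';

/// Textbook Jaro similarity. This is an assumption standing in for the
/// library's jaroSimilarity, which is not shown on this page.
double jaroSketch<E>(List<E> source, List<E> target) {
  final n = source.length, m = target.length;
  if (n == 0 && m == 0) return 1.0;
  if (n == 0 || m == 0) return 0.0;
  // Two items match when they are equal and at most `window` positions apart.
  final window = max(0, max(n, m) ~/ 2 - 1);
  final sourceMatched = List<bool>.filled(n, false);
  final targetMatched = List<bool>.filled(m, false);
  var matches = 0;
  for (var i = 0; i < n; i++) {
    final lo = max(0, i - window);
    final hi = min(m - 1, i + window);
    for (var j = lo; j <= hi; j++) {
      if (!targetMatched[j] && source[i] == target[j]) {
        sourceMatched[i] = true;
        targetMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches == 0) return 0.0;
  // Walk the matched items of both lists in order; every position where they
  // disagree is half of a transposition.
  var outOfOrder = 0;
  var j = 0;
  for (var i = 0; i < n; i++) {
    if (!sourceMatched[i]) continue;
    while (!targetMatched[j]) {
      j++;
    }
    if (source[i] != target[j]) outOfOrder++;
    j++;
  }
  final transpositions = outOfOrder / 2;
  return (matches / n + matches / m + (matches - transpositions) / matches) / 3;
}

/// Mirrors the implementation above, using jaroSketch for the Jaro step.
double jaroWinklerSketch<E>(
  List<E> source,
  List<E> target, {
  int? maxPrefixSize,
  double? prefixScale,
  double threshold = 0.7,
}) {
  var score = jaroSketch(source, target);
  if (score > threshold) {
    var limit = min(source.length, target.length);
    if (maxPrefixSize != null && limit > maxPrefixSize) limit = maxPrefixSize;
    var l = 0;
    while (l < limit && source[l] == target[l]) {
      l++;
    }
    final p = prefixScale ?? min(0.1, 1 / max(source.length, target.length));
    score += l * p * (1 - score);
  }
  return score;
}

void main() {
  // Split the strings into lists of single-character items.
  final a = 'MARTHA'.split('');
  final b = 'MARHTA'.split('');
  print(jaroSketch(a, b).toStringAsFixed(4)); // 0.9444
  print(jaroWinklerSketch(a, b).toStringAsFixed(4)); // 0.9611
  print(jaroWinklerSketch(a, b, maxPrefixSize: 1).toStringAsFixed(4)); // 0.9500
}
```

The two lists share the prefix M, A, R, so the default call adds 3 * 0.1 * (1 - 0.9444) to the Jaro score, while maxPrefixSize: 1 limits the bonus to a single prefix item.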