string/text_similarity_utils library
Text similarity score (cosine similarity over TF vectors) — roadmap #437.
Functions
- cosineSimilarity(Map<String, int> a, Map<String, int> b) → double
  Cosine similarity between two term-frequency maps (0.0 to 1.0).
- termFrequencies(List<String> tokens) → Map<String, int>
  Term frequencies for a list of tokens (e.g. words).
- textSimilarity(String a, String b) → double
  Returns the cosine similarity of a and b when treated as bags of words.
- textToTf(String s) → Map<String, int>
  Tokenizes s by splitting on non-letters and lowercasing; returns the term-frequency (TF) map.
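
The four functions above can be sketched as follows. This is a minimal illustration based only on the signatures and descriptions listed here, not the library's actual source. Two details are assumptions: "letters" is taken to mean ASCII a-z after lowercasing, and an empty map is given a similarity of 0.0 to avoid dividing by zero.

```dart
import 'dart:math' as math;

// Illustrative re-implementations; the real library code may differ.

/// Counts how often each token occurs.
Map<String, int> termFrequencies(List<String> tokens) {
  final tf = <String, int>{};
  for (final token in tokens) {
    tf[token] = (tf[token] ?? 0) + 1;
  }
  return tf;
}

/// Lowercases [s], splits on runs of non-letters, and counts the tokens.
/// Assumption: only ASCII a-z count as letters.
Map<String, int> textToTf(String s) {
  final tokens = s
      .toLowerCase()
      .split(RegExp(r'[^a-z]+'))
      .where((t) => t.isNotEmpty) // drop empty pieces at the edges
      .toList();
  return termFrequencies(tokens);
}

/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Assumption: returns 0.0 when either map is empty.
double cosineSimilarity(Map<String, int> a, Map<String, int> b) {
  var dot = 0;
  for (final entry in a.entries) {
    dot += entry.value * (b[entry.key] ?? 0);
  }
  var sumSqA = 0;
  for (final v in a.values) {
    sumSqA += v * v;
  }
  var sumSqB = 0;
  for (final v in b.values) {
    sumSqB += v * v;
  }
  if (sumSqA == 0 || sumSqB == 0) return 0.0;
  return dot / math.sqrt(sumSqA * sumSqB);
}

/// Bag-of-words similarity of two strings.
double textSimilarity(String a, String b) =>
    cosineSimilarity(textToTf(a), textToTf(b));

void main() {
  print(textToTf('The cat, the hat'));
  print(textSimilarity('the cat sat', 'the cat ran'));
  print(textSimilarity('cat', 'dog'));
}
```

Because term counts are never negative, the score stays within the 0.0 to 1.0 range stated above: texts with no words in common score 0.0, and texts with identical word counts score 1.0.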