string/text_similarity_utils library

Text similarity scoring: cosine similarity over term-frequency (TF) vectors. Roadmap #437.

Functions

cosineSimilarity(Map<String, int> a, Map<String, int> b) double
Cosine similarity between two term-frequency maps, from 0.0 (no terms in common) to 1.0 (proportional term counts).
termFrequencies(List<String> tokens) Map<String, int>
Counts how often each token (e.g. a word) occurs in tokens.
textSimilarity(String a, String b) double
Returns the cosine similarity of a and b, each treated as a bag of words; word order is ignored.
textToTf(String s) Map<String, int>
Tokenizes s by splitting on non-letter characters and lowercasing the tokens; returns the resulting term-frequency (TF) map.
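
The following is a minimal sketch of how the four functions listed above could fit together, written only from their signatures and one-line descriptions. It is not the library's actual source. Two behaviors are assumptions: "letters" is taken to mean ASCII a-z after lowercasing, and cosineSimilarity is assumed to return 0.0 when either map is empty.

```dart
import 'dart:math' as math;

/// Counts how many times each token occurs in [tokens].
Map<String, int> termFrequencies(List<String> tokens) {
  final counts = <String, int>{};
  for (final token in tokens) {
    // Increment an existing count, or start at 1 for a new token.
    counts.update(token, (n) => n + 1, ifAbsent: () => 1);
  }
  return counts;
}

/// Lowercases [s], splits it on runs of non-letters, and counts the tokens.
/// Assumption: "letters" means ASCII a-z; the real library may differ.
Map<String, int> textToTf(String s) {
  final tokens = s
      .toLowerCase()
      .split(RegExp(r'[^a-z]+'))
      // Splitting can leave empty strings at the edges; drop them.
      .where((t) => t.isNotEmpty)
      .toList();
  return termFrequencies(tokens);
}

/// Cosine similarity: dot(a, b) / (|a| * |b|) over sparse count vectors.
/// Assumption: returns 0.0 when either vector is empty or all zeros.
double cosineSimilarity(Map<String, int> a, Map<String, int> b) {
  // Only terms present in both maps contribute to the dot product.
  var dot = 0;
  for (final entry in a.entries) {
    final other = b[entry.key];
    if (other != null) dot += entry.value * other;
  }

  // Squared Euclidean norms of each vector.
  var sumSqA = 0;
  for (final count in a.values) {
    sumSqA += count * count;
  }
  var sumSqB = 0;
  for (final count in b.values) {
    sumSqB += count * count;
  }

  // Guard against division by zero.
  if (sumSqA == 0 || sumSqB == 0) return 0.0;

  return dot / math.sqrt(sumSqA * sumSqB);
}

/// Compares two texts as bags of words.
double textSimilarity(String a, String b) =>
    cosineSimilarity(textToTf(a), textToTf(b));
```

Under these assumptions, textSimilarity('the cat sat', 'sat the cat') evaluates to 1.0 because word order is ignored, and textSimilarity('red apple', 'blue sky') evaluates to 0.0 because the texts share no words.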