string/duplicate_doc_utils library

Near-duplicate document detection via fingerprints (roadmap #438).

Functions

clusterNearDuplicates(List<String> documents, {double threshold = 0.85}) List<List<int>>
Greedily groups documents into near-duplicate clusters, returned as lists of indices into documents.
isNearDuplicate(String a, String b, {double threshold = 0.85}) bool
Returns true if a and b are near-duplicates (cosine similarity >= threshold).
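
The following is a minimal sketch of how fingerprint-based cosine similarity and greedy clustering could fit together. It is not the library's implementation. The fingerprint scheme (character trigram counts), the representative-based greedy strategy, and the names fingerprint, cosineSimilarity, isNearDuplicateSketch, and clusterNearDuplicatesSketch are all assumptions made for illustration.

```dart
import 'dart:math' as math;

/// Hypothetical fingerprint: counts of character trigrams after lowercasing
/// and collapsing whitespace. The real library's scheme is not documented.
Map<String, int> fingerprint(String text, {int n = 3}) {
  final normalized = text.toLowerCase().replaceAll(RegExp(r'\s+'), ' ').trim();
  final counts = <String, int>{};
  for (var i = 0; i + n <= normalized.length; i++) {
    final gram = normalized.substring(i, i + n);
    counts[gram] = (counts[gram] ?? 0) + 1;
  }
  return counts;
}

/// Cosine similarity of two sparse count vectors. Returns 0.0 when either
/// vector is empty, so texts shorter than one trigram never match.
double cosineSimilarity(Map<String, int> a, Map<String, int> b) {
  if (a.isEmpty || b.isEmpty) return 0.0;
  var dot = 0.0;
  a.forEach((gram, count) {
    dot += count * (b[gram] ?? 0);
  });
  double norm(Map<String, int> v) =>
      math.sqrt(v.values.fold<double>(0.0, (sum, c) => sum + c * c));
  return dot / (norm(a) * norm(b));
}

/// Assumed equivalent of isNearDuplicate: similarity >= threshold.
bool isNearDuplicateSketch(String a, String b, {double threshold = 0.85}) =>
    cosineSimilarity(fingerprint(a), fingerprint(b)) >= threshold;

/// Assumed greedy strategy: each document joins the first cluster whose
/// representative (its first member) is similar enough, else starts a new one.
List<List<int>> clusterNearDuplicatesSketch(List<String> documents,
    {double threshold = 0.85}) {
  final prints = [for (final d in documents) fingerprint(d)];
  final clusters = <List<int>>[];
  for (var i = 0; i < documents.length; i++) {
    var placed = false;
    for (final cluster in clusters) {
      if (cosineSimilarity(prints[cluster.first], prints[i]) >= threshold) {
        cluster.add(i);
        placed = true;
        break;
      }
    }
    if (!placed) clusters.add([i]);
  }
  return clusters;
}

void main() {
  final docs = [
    'The quick brown fox jumps over the lazy dog.',
    'The quick brown fox jumps over the lazy dog!',
    'An entirely unrelated sentence about databases.',
  ];
  // Clusters hold indices into docs.
  print(clusterNearDuplicatesSketch(docs));
}
```

Under this greedy strategy the result depends on document order, because a document is only compared against each cluster's first member. Raising threshold toward 1.0 yields more, smaller clusters.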