document_analysis

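A collection of document-analysis utilities: vector distance measurement, document similarity, word-vector matrix creation, and document tokenization.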

Getting Started

Add the dependency to the pubspec.yaml of your Dart (or Flutter) project:

dependencies:
  ...
  document_analysis: ^0.1.2

Vector Distance Measurement

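Because this is document-based analysis, distance measurements must range from 0 to 1 (normalized, unlike Euclidean distance). Distance measurements currently available: Jaccard distance (jaccardDistance) and cosine distance (cosineDistance).

Call: jaccardDistance(vector1, vector2) or cosineDistance(vector1, vector2)

  • vector1, vector2: Input vectors (List<double>)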

Usage:

List<double> vector1 = [0, 1, 1.5, 3, 2, 0.5];
List<double> vector2 = [1, 3, 3.5, 4, 0.5, 0];

print("Jaccard: ${jaccardDistance(vector1, vector2)}"); // 0.333...
print("Cosine: ${cosineDistance(vector1, vector2)}"); // 0.156...
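The two outputs above can be reproduced by hand. The following is a minimal sketch, not the package's actual implementation: it assumes the Jaccard distance is set-based (an index counts as present when its value is non-zero) and that both vectors have the same length. The names jaccardDistanceSketch and cosineDistanceSketch are hypothetical.

```dart
import 'dart:math';

/// Set-based Jaccard distance (assumption): an index is "present" when non-zero.
double jaccardDistanceSketch(List<double> a, List<double> b) {
  var intersection = 0;
  var union = 0;
  for (var i = 0; i < a.length; i++) {
    final inA = a[i] != 0;
    final inB = b[i] != 0;
    if (inA && inB) intersection++;
    if (inA || inB) union++;
  }
  // Assumption: two all-zero vectors are treated as identical (distance 0).
  return union == 0 ? 0.0 : 1 - intersection / union;
}

/// Cosine distance: 1 - (a . b) / (|a| * |b|).
double cosineDistanceSketch(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: a zero vector is maximally distant from everything.
  if (normA == 0 || normB == 0) return 1.0;
  return 1 - dot / (sqrt(normA) * sqrt(normB));
}

void main() {
  final vector1 = <double>[0, 1, 1.5, 3, 2, 0.5];
  final vector2 = <double>[1, 3, 3.5, 4, 0.5, 0];
  print(jaccardDistanceSketch(vector1, vector2).toStringAsFixed(3)); // 0.333
  print(cosineDistanceSketch(vector1, vector2).toStringAsFixed(3)); // 0.157
}
```

For non-negative vectors such as word counts, both values stay within the 0 to 1 range.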

Document Similarity

The document-similarity functions currently available are based on:

  • Word Frequency
  • Term-Frequency Inverse-Document-Frequency (TF-IDF)
  • Hybrid TF-IDF

Call: wordFrequencySimilarity(doc1, doc2, distanceFunction: jaccardDistance)

  • doc1, doc2: Input documents (String)
  • distanceFunction: Vector distance measurement, (vector1, vector2) => double

Usage:

String doc1 = "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
String doc2 = "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";

print("${wordFrequencySimilarity(doc1, doc2, distanceFunction: jaccardDistance)}"); // 0.769...
print("${wordFrequencySimilarity(doc1, doc2, distanceFunction: cosineDistance)}"); // 0.870...
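The outputs above are consistent with similarity being the complement of the chosen distance, computed over the two documents' word-frequency vectors. The sketch below illustrates that relationship only; it is an assumption about how the value is derived, not the package's code. The vectors are the word-frequency rows for doc1 and doc2 (13 distinct words), and similaritySketch, cosineDistanceSketch, and DistanceFunction are hypothetical names.

```dart
import 'dart:math';

/// Any normalized vector distance with this shape can be plugged in.
typedef DistanceFunction = double Function(List<double>, List<double>);

double _dot(List<double> a, List<double> b) {
  var sum = 0.0;
  for (var i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

/// Compact cosine distance; assumes neither vector is all zeros.
double cosineDistanceSketch(List<double> a, List<double> b) =>
    1 - _dot(a, b) / sqrt(_dot(a, a) * _dot(b, b));

/// Hypothetical helper: similarity = 1 - distance.
double similaritySketch(List<double> v1, List<double> v2,
        {required DistanceFunction distanceFunction}) =>
    1 - distanceFunction(v1, v2);

void main() {
  // Word-frequency vectors for doc1 and doc2 over their 13 distinct words.
  final row1 = <double>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0];
  final row2 = <double>[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1];
  final similarity =
      similaritySketch(row1, row2, distanceFunction: cosineDistanceSketch);
  print(similarity.toStringAsFixed(3)); // 0.870
}
```

Passing the distance as a function keeps the similarity logic independent of the measurement, which mirrors the distanceFunction parameter described above.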

Matrix Creation

Create a word-vector matrix from a collection of documents. Currently available:

  • Word Frequency
  • TF-IDF
  • Hybrid TF-IDF

Call: wordFrequencyMatrix([doc1, doc2])

  • [...]: All documents (List<String>)

Usage:

String doc1 = "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
String doc2 = "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";

print(wordFrequencyMatrix([doc1, doc2]));
//[[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
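To show where the rows and columns in the output above come from, here is a minimal sketch of a word-frequency matrix. It assumes a simple tokenizer (lowercase, then runs of letters, digits, and apostrophes) and columns ordered by first appearance; the package's own tokenization and column order may differ. The names tokenizeSketch and wordFrequencyMatrixSketch are hypothetical.

```dart
/// Hypothetical tokenizer: lowercase, keep runs of letters, digits, apostrophes.
List<String> tokenizeSketch(String doc) => RegExp(r"[a-z0-9']+")
    .allMatches(doc.toLowerCase())
    .map((m) => m.group(0)!)
    .toList();

List<List<double>> wordFrequencyMatrixSketch(List<String> docs) {
  final tokenized = docs.map(tokenizeSketch).toList();

  // Vocabulary in first-seen order; each word maps to its column index.
  final columnOf = <String, int>{};
  for (final tokens in tokenized) {
    for (final word in tokens) {
      columnOf.putIfAbsent(word, () => columnOf.length);
    }
  }

  // One row per document, one column per distinct word.
  final matrix = <List<double>>[];
  for (final tokens in tokenized) {
    final row = List<double>.filled(columnOf.length, 0.0);
    for (final word in tokens) {
      row[columnOf[word]!] += 1;
    }
    matrix.add(row);
  }
  return matrix;
}

void main() {
  const doc1 =
      "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
  const doc2 =
      "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";
  final matrix = wordFrequencyMatrixSketch([doc1, doc2]);
  print(matrix.length); // 2
  print(matrix[0].length); // 13
  print(matrix[0]); // eleven 1.0 entries followed by two 0.0 entries
}
```

Under these assumptions the first column is "report" (only in doc1) and the last two are "strategy" and "analytics" (only in doc2).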

Document Tokenizer

Tokenize documents (String) and compute several metrics:

Call: documentTokenizer(List<String> documentList, {minLen = 1, String Function(String) stemmer, List<String> stopwords})

  • documentList: All documents, as a List<String>
  • minLen: Minimum word occurrence to be considered in tokenization
  • stemmer: Stemming function
  • stopwords: Collection of common words that should be ignored in document analysis

Usage:

documentTokenizer([doc1, doc2, doc3]);

Returns a TokenizationOutput:

class TokenizationOutput {
  /// Count of each word across all documents
  Map<String, double> bagOfWords = {};
  /// Number of documents in which each word occurs (counted at most once per document)
  Map<String, double> wordInDocumentOccurrence = {};
  /// List of 'Bag of Words' maps, one per document
  List<Map<String, double>> documentBOW = [];
  /// Total number of words in each document
  List<int> documentTotalWord = [];
  /// Total number of distinct words across all documents
  int numberOfDistintWords = 0;
  /// Total number of words across all documents
  int totalNumberOfWords = 0;
}
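To make the difference between these metrics concrete, the sketch below computes a subset of them for two tiny documents. It is an illustration under assumptions: it splits on whitespace only and ignores minLen, stemmer, and stopwords. TokenStatsSketch and tokenStatsSketch are hypothetical names, not part of the package.

```dart
/// Hypothetical container mirroring a subset of the fields described above.
class TokenStatsSketch {
  final bagOfWords = <String, double>{};
  final wordInDocumentOccurrence = <String, double>{};
  final documentBOW = <Map<String, double>>[];
  final documentTotalWord = <int>[];
}

TokenStatsSketch tokenStatsSketch(List<String> documents) {
  final stats = TokenStatsSketch();
  for (final doc in documents) {
    // Assumption: whitespace splitting is enough for this illustration.
    final words = doc
        .toLowerCase()
        .split(RegExp(r'\s+'))
        .where((w) => w.isNotEmpty)
        .toList();

    final bow = <String, double>{};
    for (final w in words) {
      bow.update(w, (c) => c + 1, ifAbsent: () => 1.0);
      stats.bagOfWords.update(w, (c) => c + 1, ifAbsent: () => 1.0);
    }
    // Each document contributes at most 1 per distinct word.
    for (final w in bow.keys) {
      stats.wordInDocumentOccurrence
          .update(w, (c) => c + 1, ifAbsent: () => 1.0);
    }
    stats.documentBOW.add(bow);
    stats.documentTotalWord.add(words.length);
  }
  return stats;
}

void main() {
  final stats = tokenStatsSketch(['the cat sat', 'the cat and the hat']);
  print(stats.bagOfWords); // {the: 3.0, cat: 2.0, sat: 1.0, and: 1.0, hat: 1.0}
  print(stats.wordInDocumentOccurrence); // {the: 2.0, cat: 2.0, sat: 1.0, and: 1.0, hat: 1.0}
  print(stats.documentTotalWord); // [3, 5]
}
```

Here "the" appears three times in total but in only two documents, which is the distinction between bagOfWords and wordInDocumentOccurrence.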

General Info

A stemmer and stopwords can be passed through the chain document_similarity --> matrix_creator --> tokenizer.

Remarks

  • Hybrid TF-IDF is based on Sharifi, B., Hutton, M.-A. & Kalita, J. K., 2010. Experiments in microblog summarization. Social Computing (SocialCom), 2010 IEEE Second International Conference on, pp. 49-56.
