document_analysis 0.1.3


document_analysis #

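A collection of document-analysis utilities: vector distance measurement, document similarity, word-vector matrix creation, and document tokenization.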

Getting Started #

In your Dart (or Flutter) project's pubspec.yaml, add the dependency:

  document_analysis: ^0.1.3

Vector Distance Measurement #

Because this is document-based analysis, distance measurements must range between 0 and 1 (normalized, unlike Euclidean distance). Distance measurements currently available:

  • Jaccard distance: jaccardDistance
  • Cosine distance: cosineDistance

Call: jaccardDistance(vector1, vector2)

  • Input vector is List<double>


List<double> vector1 = [0, 1, 1.5, 3, 2, 0.5];
List<double> vector2 = [1, 3, 3.5, 4, 0.5, 0];

print("Jaccard: ${jaccardDistance(vector1, vector2)}");//0.333...
print("Cosine: ${cosineDistance(vector1, vector2)}");//0.156...
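To make the two numbers above easier to interpret, here is a minimal standalone sketch that reproduces them without the package. It assumes that the Jaccard distance is taken over the sets of non-zero positions and that the cosine distance is 1 minus the cosine of the angle between the vectors; both assumptions are consistent with the outputs shown, but the package's actual implementation may differ. The names sketchJaccardDistance and sketchCosineDistance are hypothetical, and the code assumes null-safe Dart and vectors of equal length.

```dart
import 'dart:math' as math;

// Hypothetical stand-in, not the package's own code: Jaccard distance over
// the sets of positions that hold a non-zero value.
double sketchJaccardDistance(List<double> a, List<double> b) {
  var intersection = 0;
  var union = 0;
  for (var i = 0; i < a.length; i++) {
    final inA = a[i] != 0;
    final inB = b[i] != 0;
    if (inA && inB) intersection++;
    if (inA || inB) union++;
  }
  // Two all-zero vectors are treated as identical here (an assumption).
  return union == 0 ? 0.0 : 1 - intersection / union;
}

// Hypothetical stand-in: cosine distance as 1 - cos(angle between a and b).
double sketchCosineDistance(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; report maximum distance (an assumption).
  if (normA == 0 || normB == 0) return 1.0;
  return 1 - dot / (math.sqrt(normA) * math.sqrt(normB));
}

void main() {
  final vector1 = <double>[0, 1, 1.5, 3, 2, 0.5];
  final vector2 = <double>[1, 3, 3.5, 4, 0.5, 0];
  print(sketchJaccardDistance(vector1, vector2).toStringAsFixed(3)); // 0.333
  print(sketchCosineDistance(vector1, vector2).toStringAsFixed(3)); // 0.157
}
```

Four positions are non-zero in both vectors and six are non-zero in at least one, so the Jaccard distance is 1 - 4/6. Both results stay within 0 and 1 for non-negative vectors, which is the normalization this section asks for.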

Document Similarity #

The document-similarity functions currently available are based on:

  • Word Frequency
  • Term-Frequency Inverse-Document-Frequency (TF-IDF)
  • Hybrid TF-IDF

Call: wordFrequencySimilarity(doc1, doc2, distanceFunction: jaccardDistance):

  • doc1, doc2: Input documents (String)
  • distanceFunction: Vector distance measurement, a function of the form (vector1, vector2) => double


String doc1 = "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
String doc2 = "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";

print("${wordFrequencySimilarity(doc1, doc2, distanceFunction: jaccardDistance)}");//0.769...
print("${wordFrequencySimilarity(doc1, doc2, distanceFunction: cosineDistance)}");//0.870...
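The scores above can be reproduced by hand if similarity is taken to be 1 minus the chosen distance between the two documents' word-frequency vectors. The sketch below works under that assumption, plus a simple tokenizer (lowercase, runs of letters and apostrophes) that is a guess rather than the package's own rule. The names wordCounts, sketchJaccardSimilarity and sketchCosineSimilarity are hypothetical, and the code assumes null-safe Dart.

```dart
import 'dart:math' as math;

// Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes.
Map<String, double> wordCounts(String doc) {
  final counts = <String, double>{};
  for (final m in RegExp(r"[a-z']+").allMatches(doc.toLowerCase())) {
    final word = m.group(0)!;
    counts[word] = (counts[word] ?? 0) + 1;
  }
  return counts;
}

// Similarity taken as 1 - Jaccard distance = |shared words| / |all words|.
double sketchJaccardSimilarity(String doc1, String doc2) {
  final a = wordCounts(doc1).keys.toSet();
  final b = wordCounts(doc2).keys.toSet();
  final union = a.union(b).length;
  return union == 0 ? 1.0 : a.intersection(b).length / union;
}

// Similarity taken as 1 - cosine distance = cosine of the count vectors.
double sketchCosineSimilarity(String doc1, String doc2) {
  final a = wordCounts(doc1);
  final b = wordCounts(doc2);
  var dot = 0.0;
  a.forEach((word, count) => dot += count * (b[word] ?? 0));
  double norm(Map<String, double> m) =>
      math.sqrt(m.values.fold<double>(0.0, (sum, v) => sum + v * v));
  final denom = norm(a) * norm(b);
  return denom == 0 ? 0.0 : dot / denom;
}

void main() {
  final doc1 =
      "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
  final doc2 =
      "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";
  print(sketchJaccardSimilarity(doc1, doc2).toStringAsFixed(3)); // 0.769
  print(sketchCosineSimilarity(doc1, doc2).toStringAsFixed(3)); // 0.870
}
```

The two documents share 10 of 13 distinct words, which gives 10/13 for the Jaccard-based score and 10/sqrt(11 × 12) for the cosine-based one.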

Matrix Creation #

Creates a word-vector matrix from a collection of documents. Currently available:

  • Word Frequency
  • TF-IDF
  • Hybrid TF-IDF

Call: wordFrequencyMatrix([doc1, doc2])

  • [...]: All documents, as a List<String>


String doc1 = "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
String doc2 = "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";

print(wordFrequencyMatrix([doc1, doc2]));
//[[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
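The printed matrix has one row per document and one column per distinct word. A minimal sketch of how such a matrix could be built follows; it assumes columns are ordered by first appearance across the documents and uses a simple lowercase tokenizer, which happens to reproduce the output above but is not confirmed to be the package's method. The name sketchWordFrequencyMatrix is hypothetical, and the code assumes null-safe Dart.

```dart
// Hypothetical sketch: one row per document, one column per distinct word,
// columns ordered by first appearance across the documents.
List<List<double>> sketchWordFrequencyMatrix(List<String> docs) {
  // Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
  final tokenized = [
    for (final doc in docs)
      RegExp(r"[a-z']+")
          .allMatches(doc.toLowerCase())
          .map((m) => m.group(0)!)
          .toList()
  ];

  // Assign each distinct word a column index the first time it is seen.
  final columnOf = <String, int>{};
  for (final words in tokenized) {
    for (final word in words) {
      columnOf.putIfAbsent(word, () => columnOf.length);
    }
  }

  // Count each word's occurrences per document into its column.
  final matrix = <List<double>>[];
  for (final words in tokenized) {
    final row = List<double>.filled(columnOf.length, 0.0);
    for (final word in words) {
      row[columnOf[word]!] += 1;
    }
    matrix.add(row);
  }
  return matrix;
}

void main() {
  final doc1 =
      "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
  final doc2 =
      "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";
  // Two rows of 13 columns: "report" is column 0, and the last two columns
  // are "strategy" and "analytics".
  print(sketchWordFrequencyMatrix([doc1, doc2]));
}
```

Each row can then be passed to a vector distance function, since all rows share the same column layout.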

Document Tokenizer #

Tokenizes documents (String) into multiple metrics:

Call: documentTokenizer(List<String> documentList, {minLen = 1, String Function(String) stemmer, List<String> stopwords})

  • documentList: All documents, in a List
  • minLen: Minimum word occurrence to be considered in tokenization
  • stemmer: Stemming function
  • stopwords: Collection of common words that should be ignored in document analysis


documentTokenizer([doc1, doc2, doc3]);

Outputs TokenizationOutput.

class TokenizationOutput {
  /// Count of each word across all documents
  Map<String, double> bagOfWords = {};
  /// Number of documents in which each word occurs (counted at most once per document)
  Map<String, double> wordInDocumentOccurrence = {};
  /// List of 'Bag of Words', one per document
  List<Map<String, double>> documentBOW = [];
  /// Total number of words in each document
  List<int> documentTotalWord = [];
  /// Total number of distinct words in all documents
  int numberOfDistintWords = 0;
  /// Total number of words in all documents
  int totalNumberOfWords = 0;
}
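
To illustrate how the counts listed above relate to each other, here is a minimal standalone sketch that fills the first four of them. It is not the package's tokenizer: the class SketchCounts and the function sketchTokenize are hypothetical, the tokenizing rule (lowercase, runs of letters and apostrophes) is a guess, and applying stopwords before stemming is an assumption. The code assumes null-safe Dart.

```dart
// Hypothetical container mirroring four of the documented fields.
class SketchCounts {
  final bagOfWords = <String, double>{};
  final wordInDocumentOccurrence = <String, double>{};
  final documentBOW = <Map<String, double>>[];
  final documentTotalWord = <int>[];
}

SketchCounts sketchTokenize(
  List<String> docs, {
  String Function(String)? stemmer,
  List<String> stopwords = const [],
}) {
  final out = SketchCounts();
  for (final doc in docs) {
    final bow = <String, double>{};
    var total = 0;
    for (final m in RegExp(r"[a-z']+").allMatches(doc.toLowerCase())) {
      var word = m.group(0)!;
      if (stopwords.contains(word)) continue; // drop ignored words
      if (stemmer != null) word = stemmer(word);
      bow[word] = (bow[word] ?? 0) + 1;
      total++;
    }
    bow.forEach((word, count) {
      out.bagOfWords[word] = (out.bagOfWords[word] ?? 0) + count;
      // Counted once per document, however often the word repeats in it.
      out.wordInDocumentOccurrence[word] =
          (out.wordInDocumentOccurrence[word] ?? 0) + 1;
    });
    out.documentBOW.add(bow);
    out.documentTotalWord.add(total);
  }
  return out;
}

void main() {
  final out = sketchTokenize(
    ['the cat saw the cat', 'a cat ran'],
    stopwords: ['the', 'a'],
  );
  print(out.bagOfWords); // {cat: 3.0, saw: 1.0, ran: 1.0}
  print(out.wordInDocumentOccurrence); // {cat: 2.0, saw: 1.0, ran: 1.0}
  print(out.documentTotalWord); // [3, 2]
}
```

In this example "cat" appears three times overall but in only two documents, which is the difference between bagOfWords and wordInDocumentOccurrence.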

General Info #

A stemmer and stopwords can be passed down the call chain: document_similarity --> matrix_creator --> tokenizer.

Remarks #

  • Hybrid TF-IDF is based on Sharifi, B., Hutton, M.-A. & Kalita, J. K., 2010. Experiments in microblog summarization. Social Computing (SocialCom), 2010 IEEE Second International Conference on, pp. 49-56.

[0.1.3] #

  • It's now possible to propagate stemmer and stopword params between functions

[0.1.2] #

  • Added optional stopwords parameter in tokenizer

[0.1.1+1] #

  • Fix health suggestion
  • Run dartfmt

[0.1.1] #

  • Updated example
  • Updated documentation

[0.1.0] - First Release. #

  • First Release


Examples #

Document_Similarity Sample #

Word-frequency similarity sample for a Dart project. Run it directly with dart document_similarity.dart

Flutter Sample #

Usage: Copy and paste the code into a new Flutter project (e.g. in main.dart), then run it. It performs a simple similarity check between two documents/strings with a Flutter GUI.

Includes WF, TF-IDF, and Hybrid TF-IDF samples.

Vector_Measurement Sample #

Run it directly with dart vector_measurement.dart

Notice For Flutter Developer #

Dart SDK binaries in Flutter are usually located in <flutter-sdk-path>\bin\cache\dart-sdk\bin.

Use this package as a library

1. Depend on it

Add this to your package's pubspec.yaml file:

  document_analysis: ^0.1.3

2. Install it

You can install packages from the command line:

with pub:

$ pub get

with Flutter:

$ flutter pub get

Alternatively, your editor might support pub get or flutter pub get. Check the docs for your editor to learn more.

3. Import it

Now in your Dart code, you can use:

import 'package:document_analysis/document_analysis.dart';

We analyzed this package on Mar 24, 2020, and provided a score, details, and suggestions below. Analysis was completed with status completed using:

  • Dart: 2.7.1
  • pana: 0.13.6

Health suggestions

Format lib/src/tokenizer.dart.

Run dartfmt to format lib/src/tokenizer.dart.


Dependencies

Package      Constraint        Resolved   Available
Direct dependencies
Dart SDK     >=2.2.2 <3.0.0