dictosaurus 0.0.1-beta.4
Extensions on String that provide dictionary and thesaurus functions (**PRE-RELEASE**).
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview #
The dictosaurus package provides language reference utilities used in information retrieval systems. It relies on three key indexes:
- a definitions index (see definitions) that maps the words in a language (vocabulary) to their definitions (meanings);
- a synonyms index that maps the vocabulary to a collection of synonyms, which may be empty; and
- a k-gram index that maps k-grams to the vocabulary.
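As a rough illustration of the three indexes described above, each can be pictured as an in-memory hashmap. The variable names and the sample entries below are hypothetical, hand-written for this sketch; they are not the data or the types shipped with the package.

```dart
// Hypothetical sample data, for illustration only.

/// Definitions index: maps each word in the vocabulary to its meaning.
const Map<String, String> definitionsIndex = {
  'castle': 'a large fortified building',
  'cattle': 'cows and similar farm animals',
};

/// Synonyms index: maps each word to a (possibly empty) set of synonyms.
const Map<String, Set<String>> synonymsIndex = {
  'castle': <String>{'fortress', 'stronghold'},
  'cattle': <String>{},
};

/// k-gram index (k = 3): maps each k-gram to the words that contain it.
/// Only a few of the k-grams of each word are listed here.
const Map<String, Set<String>> kGramIndex = {
  r'$ca': <String>{'castle', 'cattle'},
  'cas': <String>{'castle'},
  'cat': <String>{'cattle'},
  r'le$': <String>{'castle', 'cattle'},
};
```

Note that a word with no known synonyms still has an entry in the synonyms index; its collection is simply empty.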

Three utility classes provide dictionary and thesaurus functions:
- the Dictionary class exposes the Future function, looking up the meaning of the term in a vocabulary;
- the Thesaurus class exposes the Future<Set function, looking up the synonyms of the term in a synonyms index; and
- the AutoCorrect class exposes the Future<Map<String, List function that returns a set of unique alternative spellings for term by converting the term to k-grams and then finding the best matches for the (misspelt) term from the k-gram index, ordered in descending order of relevance (i.e. best match first).
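The k-gram lookup described for AutoCorrect can be sketched as follows. This is a minimal, synchronous illustration and not the package's implementation: the function names (kGrams, buildKGramIndex, jaccard, suggest) are hypothetical, and ranking by Jaccard similarity of the k-gram sets is an assumption made for this sketch.

```dart
/// Returns the k-grams of [term], after padding it with "$" at both ends.
Set<String> kGrams(String term, [int k = 3]) {
  final padded = r'$' + term + r'$';
  final grams = <String>{};
  for (var i = 0; i + k <= padded.length; i++) {
    grams.add(padded.substring(i, i + k));
  }
  return grams;
}

/// Builds a k-gram index that maps each k-gram to the words containing it.
Map<String, Set<String>> buildKGramIndex(Iterable<String> vocabulary,
    [int k = 3]) {
  final index = <String, Set<String>>{};
  for (final word in vocabulary) {
    for (final gram in kGrams(word, k)) {
      index.putIfAbsent(gram, () => <String>{}).add(word);
    }
  }
  return index;
}

/// Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
double jaccard(Set<String> a, Set<String> b) {
  final unionSize = a.union(b).length;
  return unionSize == 0 ? 0.0 : a.intersection(b).length / unionSize;
}

/// Suggests spellings for [term] from [index], best match first.
/// Assumption: relevance is the Jaccard similarity of the k-gram sets.
List<String> suggest(String term, Map<String, Set<String>> index,
    {int k = 3, int limit = 5}) {
  final termGrams = kGrams(term, k);
  // Candidates are all words that share at least one k-gram with the term.
  final candidates = <String>{};
  for (final gram in termGrams) {
    candidates.addAll(index[gram] ?? const <String>{});
  }
  final scores = {
    for (final word in candidates) word: jaccard(termGrams, kGrams(word, k))
  };
  // Sort by descending score; ties are broken alphabetically.
  final ranked = candidates.toList()
    ..sort((a, b) {
      final byScore = scores[b]!.compareTo(scores[a]!);
      return byScore != 0 ? byScore : a.compareTo(b);
    });
  return ranked.take(limit).toList();
}
```

With an index built from the vocabulary castle, cattle and table, suggest('castel', index) returns castle ahead of cattle, because castle shares three tri-grams with the misspelt term and cattle shares only one; table shares none and is not a candidate.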
The DictoSaurus composition class leverages a Dictionary, Thesaurus and AutoCorrect which it uses to expose the Future, Future<Set and Future<Map<String, List functions.
The DictoSaurus also exposes the Future<List function that looks up the term in the Dictionary, Thesaurus and AutoCorrect classes to return a term-expansion in descending order of relevance (best match first). If the term is found in the Dictionary it will appear as the first element of the returned list. If it is not found in the Dictionary it will not be in the returned list as it is likely to be misspelt.
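The ordering rules for the term-expansion can be illustrated with a small sketch. The function expandTerm and its parameters are hypothetical and synchronous, unlike the package's Future-based functions. Placing synonyms ahead of spelling suggestions is an assumption of this sketch; the text above only fixes the position of the term itself.

```dart
/// Expands [term] into a list of related terms, best match first.
///
/// Hypothetical sketch: [suggestionsFor] stands in for a spelling
/// corrector that returns alternatives ordered best match first.
List<String> expandTerm(
  String term, {
  required Set<String> vocabulary,
  required Map<String, Set<String>> synonymsIndex,
  required List<String> Function(String term) suggestionsFor,
  int limit = 10,
}) {
  // A Set literal in Dart preserves insertion order and removes duplicates.
  final expansion = <String>{};
  final known = vocabulary.contains(term);
  if (known) {
    // A term found in the dictionary is always the first element.
    expansion.add(term);
    // Assumption: synonyms rank ahead of spelling suggestions.
    expansion.addAll(synonymsIndex[term] ?? const <String>{});
  }
  for (final suggestion in suggestionsFor(term)) {
    // A term that is not in the dictionary is never added to the result.
    if (known || suggestion != term) expansion.add(suggestion);
  }
  return expansion.take(limit).toList();
}
```

For a known term such as castle the result starts with castle itself; for a misspelt term such as castel the result contains only the suggested alternatives.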
The DictoSaurus.english static const instance uses the included vocabulary, synonymsIndex and kGramIndex hashmaps. For other languages or a custom implementation, initialize the DictoSaurus using the DictoSaurus.async factory constructor, which uses asynchronous callbacks to vocabulary, synonymsIndex and kGramIndex APIs. The DictoSaurus.async factory constructor has a named, required parameter TextAnalyzerConfiguration configuration. The optional named parameter int k (the k-gram length) defaults to 3 (tri-gram).
If the DictoSaurus is used as a term expander in an information retrieval system, the DictoSaurus.configuration must use the same tokenizing algorithm as the index.
Refer to the references to learn more about information retrieval systems.
Usage #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  dictosaurus: <latest_version>
In your code file add the following import:
import 'package:dictosaurus/dictosaurus.dart';
TODO: describe usage.
API #
The API exposes
We use an interface > implementation mixin > base-class > implementation class pattern:
- the interface is an abstract class that exposes fields and methods but contains no implementation code. The interface may expose a factory constructor that returns an implementation class instance;
- the implementation mixin implements the interface class methods, but not the input fields;
- the base-class is an abstract class with the implementation mixin and exposes a default, unnamed generative const constructor for sub-classes. The intention is that implementation classes extend the base class, overriding the interface input fields with final properties passed in via a const generative constructor; and
- the class naming convention for this pattern is "Interface" > "InterfaceMixin" > "InterfaceBase".
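A minimal sketch of this pattern is shown below. The Greeter classes are hypothetical and are not part of the dictosaurus API; they only illustrate how the four pieces fit together and how the naming convention applies.

```dart
/// Interface: exposes fields and methods, but contains no implementation.
abstract class Greeter {
  /// Input field, supplied by implementation classes.
  String get greeting;

  /// Method, implemented by [GreeterMixin].
  String greet(String name);

  /// Factory constructor that returns an implementation class instance.
  factory Greeter(String greeting) = _GreeterImpl;
}

/// Implementation mixin: implements the methods, but not the input fields.
mixin GreeterMixin implements Greeter {
  @override
  String greet(String name) => '$greeting, $name!';
}

/// Base class: abstract, mixes in the implementation and exposes a
/// default, unnamed generative const constructor for sub-classes.
abstract class GreeterBase with GreeterMixin {
  const GreeterBase();
}

/// Implementation class: extends the base class and overrides the input
/// field with a final property passed in via a const constructor.
class _GreeterImpl extends GreeterBase {
  @override
  final String greeting;

  const _GreeterImpl(this.greeting);
}
```

Calling Greeter('Hello').greet('Ada') returns the string "Hello, Ada!". A custom implementation would extend GreeterBase in the same way as _GreeterImpl does, supplying its own greeting field.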
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation of tokenization.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus, that has a unique identifier (docId) in the corpus's primary key and that contains one or more text zones/fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
- k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia (2)).
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- synonym - a word, morpheme, or phrase that means exactly or nearly the same as another word, morpheme, or phrase in a given language (from Wikipedia (4)).
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia (3)).
- stopwords - common words in a language that are excluded from indexing.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
- term frequency (Ft) - the frequency of a term in an index or indexed object.
- term position - the zero-based index of a term in an ordered array of terms tokenized from the corpus.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term position in the source text.
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from the terms in a text source after applying a character filter and term filter.
- vocabulary - the collection of terms indexed from the corpus or the words in a language.
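To make the postings, postings list and term position definitions concrete, the sketch below builds a small positional inverted index. The function names and the naive tokenizer (lower-casing and splitting on non-letters) are assumptions for illustration only; a real tokenizer would also apply character and term filters.

```dart
/// Maps each term in [text] to its zero-based term positions.
/// Assumption: a naive tokenizer that lower-cases and splits on non-letters.
Map<String, List<int>> termPositions(String text) {
  final terms = text
      .toLowerCase()
      .split(RegExp(r'[^a-z]+'))
      .where((t) => t.isNotEmpty)
      .toList();
  final positions = <String, List<int>>{};
  for (var i = 0; i < terms.length; i++) {
    positions.putIfAbsent(terms[i], () => <int>[]).add(i);
  }
  return positions;
}

/// Builds positional postings from [corpus] (docId -> text).
/// The result maps term -> docId -> positions of the term in that document.
Map<String, Map<String, List<int>>> buildPostings(Map<String, String> corpus) {
  final postings = <String, Map<String, List<int>>>{};
  corpus.forEach((docId, text) {
    termPositions(text).forEach((term, positions) {
      postings.putIfAbsent(term, () => <String, List<int>>{})[docId] =
          positions;
    });
  });
  return postings;
}
```

For a corpus with document d1 containing "The castle on the hill" and document d2 containing "A hill fort", the postings for "hill" reference both documents, and the postings list for "the" in d1 holds the positions 0 and 3.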
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- Cummins, R., "Information Retrieval", course notes, University of Cambridge, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
- Wikipedia (4), "Synonym", from Wikipedia, the free encyclopedia
- Wikipedia (5), "Jaccard Index", from Wikipedia, the free encyclopedia
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
