Hyphenator

Implementation of a hyphenation algorithm for various languages, based on TeX hyphenation pattern definitions.

  • Offers a Widget hyphenating a String and wrapping the result based on the available width.
  • Offers various function calls to hyphenate a String at all possible positions.

The package seems to work fine for Western languages; other languages have yet to be evaluated.

The TeX patterns used in the algorithm can be found at tug.org.
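To make the pattern idea concrete, here is a minimal, self-contained sketch of TeX-style pattern matching (often called Liang's algorithm). It is not this package's implementation: the names `parsePatterns` and `sketchSyllables`, the nine-entry pattern list and the `leftMin`/`rightMin` defaults are assumptions chosen for illustration, and real language files contain thousands of patterns.

```dart
// Tiny illustrative pattern subset in TeX notation. A digit scores the gap
// at its position: odd values allow a break, even values forbid one.
const samplePatterns = [
  'hy3ph', 'he2n', 'hena4', 'hen5at', '1na', 'n2at', '1tio', '2io', 'o2n',
];

/// Splits each pattern into its letters and one score per gap,
/// e.g. 'he2n' becomes 'hen' -> [0, 0, 2, 0].
Map<String, List<int>> parsePatterns(List<String> patterns) {
  final result = <String, List<int>>{};
  for (final pattern in patterns) {
    final letters = StringBuffer();
    final scores = <int>[0];
    for (final ch in pattern.split('')) {
      final digit = int.tryParse(ch);
      if (digit != null) {
        scores[scores.length - 1] = digit; // score for the current gap
      } else {
        letters.write(ch);
        scores.add(0); // open the gap after this letter
      }
    }
    result[letters.toString()] = scores;
  }
  return result;
}

/// Returns the syllables of [word] according to [patterns].
List<String> sketchSyllables(
  String word,
  Map<String, List<int>> patterns, {
  int leftMin = 2,
  int rightMin = 2,
}) {
  // Dots mark the word boundaries, as in TeX.
  final dotted = '.${word.toLowerCase()}.';
  // scores[j] is the gap in front of dotted[j].
  final scores = List<int>.filled(dotted.length + 1, 0);
  for (var start = 0; start < dotted.length; start++) {
    for (var end = start + 1; end <= dotted.length; end++) {
      final points = patterns[dotted.substring(start, end)];
      if (points == null) continue;
      for (var i = 0; i < points.length; i++) {
        // The highest score of all matching patterns wins.
        if (points[i] > scores[start + i]) scores[start + i] = points[i];
      }
    }
  }
  final syllables = <String>[];
  var last = 0;
  // The gap in front of word[k] is scores[k + 1] because of the leading dot.
  for (var k = leftMin; k <= word.length - rightMin; k++) {
    if (scores[k + 1].isOdd) {
      syllables.add(word.substring(last, k));
      last = k;
    }
  }
  syllables.add(word.substring(last));
  return syllables;
}

void main() {
  final patterns = parsePatterns(samplePatterns);
  print(sketchSyllables('hyphenation', patterns)); // [hy, phen, ation]
}
```

Competing patterns are resolved by taking the maximum score per gap, which is why a broad pattern can allow a break that a more specific pattern then forbids.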

Try the live demo at https://xerik.github.io/hyphenatorx/.

Wrapping and Scaling Text

The package text_wrap_auto_size uses hyphenatorx for wrapping and auto scaling text - with and without hyphenation.

Quickstart

import 'package:hyphenatorx/widget/texthyphenated.dart';

// sub-
// di-
// vi-
// sion
TextHyphenated('subdivision', 
  'en_us',
  style: TextStyle(fontSize: 56))

import 'package:hyphenatorx/hyphenatorx.dart';
import 'package:hyphenatorx/languages/language_en_us.dart';

final hyphenator = Hyphenator(Language_en_us(), symbol: '_');

// 'sub_di_vi_sion_ _sub_di_vi_sion'
print(
  hyphenator.hyphenateText('subdivision subdivision', 
    hyphenAtBoundaries: true));

// sub_di_vi_sion sub_di_vi_sion
print(hyphenator.hyphenateText('subdivision subdivision'));

// sub_di_vi_sion
print(hyphenator.hyphenateWord('subdivision'));

// ['sub', 'di', 'vi', 'sion']
print(hyphenator.syllablesWord('subdivision'));      

Usage

Hyphenator instantiates a language specific configuration:

As a Dart object

With synchronous loading. The data is compiled into your project. All 71 language files together have a size of 13.3 MB.

From JSON

With asynchronous loading. The data will be loaded as needed from the local assets folder. This option is less memory intensive.

Languages

Available languages are given by the enum Language.

Cache

Internally, hyphenated as well as non-hyphenated words are cached. Complete texts are not cached.
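The effect of such a per-word cache can be sketched in plain Dart. The class `WordCache`, the function `fakeHyphenate` and the sample map are hypothetical stand-ins, not the package's internals; the point is only that both changed and unchanged words are stored, so each distinct word is processed once.

```dart
// Hypothetical stand-in for a real hyphenation call.
const sampleResults = {'subdivision': 'sub_di_vi_sion'};
String fakeHyphenate(String word) => sampleResults[word] ?? word;

/// Memoizes the result for every word, hyphenated or not.
class WordCache {
  WordCache(this._hyphenate);

  final String Function(String word) _hyphenate;
  final Map<String, String> _cache = {};

  /// Number of words that had to be computed instead of read from the cache.
  int misses = 0;

  String lookup(String word) => _cache.putIfAbsent(word, () {
        misses++;
        return _hyphenate(word);
      });
}

void main() {
  final cache = WordCache(fakeHyphenate);
  final out = 'subdivision of a subdivision'.split(' ').map(cache.lookup);
  print(out.join(' ')); // sub_di_vi_sion of a sub_di_vi_sion
  print(cache.misses); // 3, the second 'subdivision' is a cache hit
}
```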

Hyphen symbol

The hyphen symbol is configurable; the default is the soft hyphen '\u{00AD}'.
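Because the soft hyphen is normally invisible, a hyphenated string can look unchanged when printed while no longer being equal to the input. The following plain-Dart sketch shows this; the helper `showHyphens` is a hypothetical name and the syllable list is written by hand instead of being produced by the package.

```dart
const softHyphen = '\u{00AD}';

/// Replaces every soft hyphen with a visible [marker] for debugging.
String showHyphens(String text, {String marker = '_'}) =>
    text.replaceAll(softHyphen, marker);

void main() {
  final hyphenated = ['sub', 'di', 'vi', 'sion'].join(softHyphen);
  print(hyphenated.length); // 14: 11 letters plus 3 soft hyphens
  print(hyphenated == 'subdivision'); // false
  print(showHyphens(hyphenated)); // sub_di_vi_sion
  print(showHyphens(hyphenated, marker: '') == 'subdivision'); // true
}
```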

Asynchronous Instantiation

Select the appropriate Language.language_XX value.

Or set a language code like en_us.

import 'package:hyphenatorx/hyphenatorx.dart';
import 'package:hyphenatorx/languages/language_en_us.dart';

// Option 1: select the language via the enum.
final hyphenator = await Hyphenator.loadAsync(
    Language.language_en_us, 
    symbol: '_');

// Option 2: select the language via its abbreviation.
final hyphenatorByAbbr = await Hyphenator.loadAsyncByAbbr(
    'en_us', 
    symbol: '_');

Synchronous Instantiation

Instantiate the appropriate Language_XX() object.

import 'package:hyphenatorx/hyphenatorx.dart';
import 'package:hyphenatorx/languages/language_en_us.dart';

final config = Language_en_us();

final hyphenator = Hyphenator(
  config,
  symbol: '_',
);

Widget Usage

This Widget outputs a Text. It hyphenates and wraps the input String depending on the available width.

import 'package:hyphenatorx/widget/texthyphenated.dart';

TextHyphenated('subdivision', 
  'en_us',
  style: TextStyle(fontSize: 56))

// Wraps the output according to the available width:
// 
// sub-
// di-
// vi-
// sion

Function Call

Inject the hyphenation symbol at all possible positions.

// The expectations below use '_' instead of the default soft hyphen.
final hyphenator = Hyphenator(Language_en_us(), symbol: '_');

expect(
  hyphenator.hyphenateText('subdivision subdivision', 
    hyphenAtBoundaries: true), 
  'sub_di_vi_sion_ _sub_di_vi_sion');

expect(
  hyphenator.hyphenateText('subdivision subdivision'), 
  'sub_di_vi_sion sub_di_vi_sion');

expect(
  hyphenator.hyphenateWord('subdivision'),
  'sub_di_vi_sion');

expect(
  hyphenator.syllablesWord('subdivision'),
  ['sub', 'di', 'vi', 'sion']);      

Access the wrapped hyphenation result respecting the given width.

final hyphenator = Hyphenator(Language_en_us());

// `text` (Text), `style` (TextStyle) and `maxWidth` are supplied
// by the caller.
WrapResult wrap = hyphenator.wrap(text, style, maxWidth);

// The hyphenated text with hyphens and newlines:
// 
// sub-
// di-
// vi-
// sion
print(wrap.textStr);

// Whether the wrapped text fits within maxWidth.
//
// If false, try a different font size.

print(wrap.isSizeMatching);
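The advice to try a different font size can be turned into a simple downward search. The sketch below is plain Dart and makes several assumptions: `largestFittingFontSize` is a hypothetical helper, the `fits` callback stands in for a check such as `hyphenator.wrap(text, style, maxWidth).isSizeMatching` with the candidate size applied to the style, and the glyph-width arithmetic in `main` is invented for the demonstration.

```dart
/// Returns the largest size between [max] and [min] (stepping down by
/// [step]) for which [fits] returns true, or null if none fits.
double? largestFittingFontSize(
  bool Function(double fontSize) fits, {
  double max = 56,
  double min = 8,
  double step = 2,
}) {
  for (var size = max; size >= min; size -= step) {
    if (fits(size)) return size;
  }
  return null;
}

void main() {
  // Invented stand-in: the longest line has 4 glyphs, each 0.6 * fontSize
  // wide, and the available width is 100.
  bool fits(double fontSize) => 4 * 0.6 * fontSize <= 100;

  print(largestFittingFontSize(fits)); // 40.0 on the Dart VM
}
```

A linear search keeps the sketch short; a binary search over the size range would need fewer layout passes if `fits` is expensive.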

Manual Hyphenation

Iterate through the token tree to implement a custom hyphenation approach. A hyphen can validly be placed before or after each token.

final text = """A vast subdivision of culture, 
    composed of many creative endeavors and disciplines.""";

final hyphenator = Hyphenator(Language_en_us());
final List<TextPartToken> tokens = hyphenator.hyphenateTextToTokens(text);

tokens.forEach((part) {
  if (part is NewlineToken) {
    print(part.text); // always a single newline
  } else if (part is TabsAndSpacesToken) {
    print(part.text); // tabs and spaces found in `text`
  } else if (part is WordToken) {
    part.parts.forEach((syllableAndSurrounding) {
      print(syllableAndSurrounding.text); // sub / di / vi / sion.
    });
  }
});

// A
// vast
// sub
// di
// vi
// sion
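As one example of such a custom approach, the sketch below breaks pre-split words greedily by character count. It is plain Dart and does not use the package's token classes: `wrapSyllables` is a hypothetical helper, the nested lists stand in for the syllables a WordToken would yield, and a real layout would measure rendered width instead of counting characters.

```dart
/// Greedy line breaking. Each entry of [words] is one word given as its
/// syllables; a '-' is appended wherever a word is split across lines.
List<String> wrapSyllables(List<List<String>> words, int maxChars) {
  final lines = <String>[];
  var line = '';
  for (final word in words) {
    var rest = word;
    while (rest.isNotEmpty) {
      final gap = line.isEmpty ? '' : ' ';
      // Find the largest number of leading syllables that still fits.
      var take = rest.length;
      while (take > 0) {
        final hyphen = take < rest.length ? '-' : '';
        final candidate = gap + rest.take(take).join() + hyphen;
        if (line.length + candidate.length <= maxChars) break;
        take--;
      }
      // A syllable longer than a whole line is emitted anyway.
      if (take == 0 && line.isEmpty) take = 1;
      if (take > 0) {
        final hyphen = take < rest.length ? '-' : '';
        line += gap + rest.take(take).join() + hyphen;
        rest = rest.sublist(take);
      }
      // Anything left over continues on a fresh line.
      if (rest.isNotEmpty) {
        lines.add(line);
        line = '';
      }
    }
  }
  if (line.isNotEmpty) lines.add(line);
  return lines;
}

void main() {
  final words = [
    ['A'],
    ['vast'],
    ['sub', 'di', 'vi', 'sion'],
  ];
  print(wrapSyllables(words, 10)); // [A vast, subdivi-, sion]
  print(wrapSyllables(words, 11)); // [A vast sub-, division]
}
```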

Languages and Abbreviations

The abbreviations correspond to the TeX file names found at tug.org.

Strings

List<String> abbr = Hyphenator.languageAbbr();

print(abbr); 

[af, as, bg, bn, ca, cop, cs, cy, da, de_1901, de_1996, de_ch_1901, el_monoton, el_polyton, en_gb, en_us, eo, es, et, eu, fi, fr, fur, ga, gl, grc, gu, hi, hr, hsb, hu, hy, ia, id, is, it, ka, kmr, kn, la_x_classic, la, lt, lv, ml, mn_cyrl_x_lmc, mn_cyrl, mr, mul_ethi, nb, nl, nn, or, pa, pl, pms, pt, rm, ro, ru, sa, sh_cyrl, sk, sl, sv, ta, te, th, tk, tr, uk, zh_latn_pinyin]

Enum

Because Icelandic is abbreviated "is", which is a Dart keyword, the prefix "language_" has been added to every enum value.

enum Language { language_af,language_as,language_bg,language_bn,
language_ca,language_cop,language_cs,language_cy,language_da,
language_de_1901,language_de_1996,language_de_ch_1901,
language_el_monoton,language_el_polyton,language_en_gb,
language_en_us,language_eo,language_es,language_et,
language_eu,language_fi,language_fr,language_fur,language_ga,
language_gl,language_grc,language_gu,language_hi,language_hr,
language_hsb,language_hu,language_hy,language_ia,language_id,
language_is,language_it,language_ka,language_kmr,language_kn,
language_la_x_classic,language_la,language_lt,language_lv,
language_ml,language_mn_cyrl_x_lmc,language_mn_cyrl,language_mr,
language_mul_ethi,language_nb,language_nl,language_nn,
language_or,language_pa,language_pl,language_pms,language_pt,
language_rm,language_ro,language_ru,language_sa,language_sh_cyrl,
language_sk,language_sl,language_sv,language_ta,language_te,
language_th,language_tk,language_tr,language_uk,
language_zh_latn_pinyin }
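The relationship between the string abbreviations and the enum values can be sketched as follows. `SketchLanguage` is a hypothetical three-value stand-in for the package's `Language` enum, and `languageFromAbbr` and `abbrOf` are illustrative helpers, not part of the package API.

```dart
// Hypothetical three-value stand-in for the package's `Language` enum.
enum SketchLanguage { language_en_us, language_is, language_de_1996 }

const prefix = 'language_';

/// Maps an abbreviation such as 'en_us' to its enum value, or null.
SketchLanguage? languageFromAbbr(String abbr) {
  for (final value in SketchLanguage.values) {
    if (value.name == '$prefix$abbr') return value;
  }
  return null;
}

/// Strips the prefix again, e.g. language_is -> 'is'.
String abbrOf(SketchLanguage language) =>
    language.name.substring(prefix.length);

void main() {
  print(languageFromAbbr('is')); // SketchLanguage.language_is
  print(languageFromAbbr('xx')); // null
  print(abbrOf(SketchLanguage.language_de_1996)); // de_1996
}
```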

Performance

Old machine:

  • Instantiation via Dart EN_US file: 30 milliseconds
  • Hyphenating text with 258 words: 46-56 milliseconds

The internal is-letter test impacts performance the most. At the moment, a binary search is performed over a combined set of (complete?) alphabets from various languages, plus an extra check for languages not included. This is not terribly efficient and needs improvement.
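For illustration, a binary-search letter test might look like the sketch below. Note the swapped-in variation: it searches sorted code point ranges, not a flat combined alphabet as described above. The table `letterRanges` is a small assumed subset (basic Latin, Latin-1 letters, basic Cyrillic) and `isLetter` is a hypothetical name.

```dart
// Sorted, non-overlapping, inclusive [start, end] code point ranges.
const letterRanges = [
  [0x41, 0x5A], // A-Z
  [0x61, 0x7A], // a-z
  [0xC0, 0xD6], // À-Ö
  [0xD8, 0xF6], // Ø-ö (skips the multiplication sign 0xD7)
  [0xF8, 0xFF], // ø-ÿ (skips the division sign 0xF7)
  [0x0410, 0x044F], // basic Cyrillic А-я
];

/// Binary search for the range containing [codePoint].
bool isLetter(int codePoint) {
  var low = 0;
  var high = letterRanges.length - 1;
  while (low <= high) {
    final mid = (low + high) >> 1;
    final range = letterRanges[mid];
    if (codePoint < range[0]) {
      high = mid - 1;
    } else if (codePoint > range[1]) {
      low = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}

void main() {
  print(isLetter('é'.codeUnitAt(0))); // true
  print(isLetter('ж'.codeUnitAt(0))); // true
  print(isLetter('1'.codeUnitAt(0))); // false
  print(isLetter(0xD7)); // false
}
```

Ranges keep the table short, so each lookup needs only a handful of comparisons even when many alphabets are combined.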

Generate JSON and Dart files

dart run ./tool/tex2dart.dart

The tool deletes the assets and lib/languages folders before generating new files. It processes the TeX files located in tool/tex/.

Source

This package is a copy and extension of hyphenator.

Issues

Given that this is a generic hyphenator, several issues are to be expected. Please report them on GitHub.

Libraries

hyphenatorx
languages/language_af
languages/language_as
languages/language_bg
languages/language_bn
languages/language_ca
languages/language_cop
languages/language_cs
languages/language_cy
languages/language_da
languages/language_de_1901
languages/language_de_1996
languages/language_de_ch_1901
languages/language_el_monoton
languages/language_el_polyton
languages/language_en_gb
languages/language_en_us
languages/language_eo
languages/language_es
languages/language_et
languages/language_eu
languages/language_fi
languages/language_fr
languages/language_fur
languages/language_ga
languages/language_gl
languages/language_grc
languages/language_gu
languages/language_hi
languages/language_hr
languages/language_hsb
languages/language_hu
languages/language_hy
languages/language_ia
languages/language_id
languages/language_is
languages/language_it
languages/language_ka
languages/language_kmr
languages/language_kn
languages/language_la
languages/language_la_x_classic
languages/language_lt
languages/language_lv
languages/language_ml
languages/language_mn_cyrl
languages/language_mn_cyrl_x_lmc
languages/language_mr
languages/language_mul_ethi
languages/language_nb
languages/language_nl
languages/language_nn
languages/language_or
languages/language_pa
languages/language_pl
languages/language_pms
languages/language_pt
languages/language_rm
languages/language_ro
languages/language_ru
languages/language_sa
languages/language_sh_cyrl
languages/language_sk
languages/language_sl
languages/language_sv
languages/language_ta
languages/language_te
languages/language_th
languages/language_tk
languages/language_tr
languages/language_uk
languages/language_zh_latn_pinyin
languages/languageconfig
token/linewrapperhyphen
token/linewrappernohyphen
widget/texthyphenated
token/tokens
token/wrapresult