text_counter
A lightweight Dart utility for accurately counting characters and words in over 100 languages, including CJK (Chinese, Japanese, Korean), RTL (Right-to-Left) scripts like Arabic and Hebrew, and mixed-language texts.
text_counter uses Microsoft Word-compatible word counting logic, ensuring consistent and familiar results across different writing systems. This makes it ideal for applications requiring accurate text metrics, such as content editors, writing tools, and input validation systems.
✨ Features
- ✅ Uses Microsoft Word's word counting rules:
  - Words are split by whitespace and common punctuation.
  - Hyphenated words (e.g., "state-of-the-art") are counted as a single word.
  - Numbers and symbols are treated appropriately based on context.
- ✅ Language-aware counting strategies:
  - CJK (Character-based): Each character is counted individually (used for Chinese, Japanese, Korean, etc.).
  - Latin & RTL scripts (Word-based): Standard word-based counting using appropriate delimiters and tokenization.
- Automatic language/script detection for mixed-language texts.
- ⚡ Lightweight and dependency-free: No external libraries required.
- Supports over 100 languages out of the box.
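
The word-based rules listed above can be sketched in a few lines of Dart. This is only an illustration of the idea (split on whitespace and punctuation, keep hyphenated compounds together), not the package's actual implementation; the function name `countWords` and the regular expression are assumptions made for this example.

```dart
/// Counts words by matching runs of letters, marks, and digits.
/// Hyphens and apostrophes inside a run keep it together, so
/// "state-of-the-art" is a single match. Illustrative sketch only.
int countWords(String text) {
  final word = RegExp(
    r"[\p{L}\p{M}\p{N}]+(?:[-'][\p{L}\p{M}\p{N}]+)*",
    unicode: true, // enables \p{...} Unicode property escapes
  );
  return word.allMatches(text).length;
}

void main() {
  print(countWords('Hello, world!')); // 2
  print(countWords('A state-of-the-art design.')); // 3
}
```

Matching words rather than splitting on delimiters avoids empty tokens when punctuation and whitespace appear side by side.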
📦 Installation
Add this to your package's `pubspec.yaml`:

    dependencies:
      text_counter: ^0.1.0

Then run:

    dart pub get

🧪 Usage
Basic Example

    import 'package:text_counter/text_counter.dart';

    void main() {
      print('Chinese: ${TextCounter.count("你好，世界", languageCode: "zh")}'); // 5
      print('Japanese: ${TextCounter.count("こんにちは世界", languageCode: "ja")}'); // 7
      print('Korean: ${TextCounter.count("안녕하세요 세상", languageCode: "ko")}'); // 7
      print('Arabic: ${TextCounter.count("مرحبا بالعالم", languageCode: "ar")}'); // 2
      print('Hebrew: ${TextCounter.count("שלום עולם", languageCode: "he")}'); // 2
      print('English: ${TextCounter.count("Hello world", languageCode: "en")}'); // 2

      const mixed = "Hello 你好 مرحبا こんにちは";
      print('Mixed Text "$mixed": ${TextCounter.count(mixed)}'); // 9
    }

🗺️ Supported Languages
| Script Type | Language Codes |
|---|---|
| CJK (Character-based) | zh, yue, ja, ko, th, hi, bn, ta, te, kn, ml, si, km, my, lo, tl, jw, su, bo, dz |
| RTL (Word-based) | ar, he |
| Latin (Word-based) | All other ISO 639-1 language codes not listed above, including: en, de, es, fr, it, pt, nl, tr, pl, ca, sv, id, fi, vi, uk, el, ms, cs, ro, da, hu, no... |
If no `languageCode` is provided, the library automatically detects script types and applies the appropriate counting rules.
🛠️ How It Works
- For CJK languages, each character (ideograph, kana, or Hangul syllable) is counted individually.
- For Latin and RTL scripts, standard word boundaries are detected using whitespace and punctuation patterns similar to those used by Microsoft Word.
- In mixed-language texts, the counter dynamically switches between counting methods depending on the script being used.
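
The switching behavior described above can be sketched as a single pass over the text. This is a simplified illustration, not the package's code: the helper names `isCjk` and `countMixed` and the code point ranges are assumptions chosen for the example, and punctuation handling for word-based scripts is reduced to whitespace splitting.

```dart
/// Returns true for code points in a few common CJK blocks.
/// These ranges are an illustrative assumption, not the package's table.
bool isCjk(int rune) =>
    (rune >= 0x3040 && rune <= 0x30FF) || // Hiragana and Katakana
    (rune >= 0x4E00 && rune <= 0x9FFF) || // CJK Unified Ideographs
    (rune >= 0xAC00 && rune <= 0xD7A3) || // Hangul syllables
    (rune >= 0xFF00 && rune <= 0xFFEF); // Full-width forms such as "，"

/// Counts CJK characters one by one and everything else per word.
int countMixed(String text) {
  var count = 0;
  var inWord = false; // true while inside a non-CJK word
  for (final rune in text.runes) {
    if (isCjk(rune)) {
      count++; // each CJK character counts on its own
      inWord = false;
    } else if (String.fromCharCode(rune).trim().isEmpty) {
      inWord = false; // whitespace ends the current word
    } else if (!inWord) {
      count++; // first character of a new non-CJK word
      inWord = true;
    }
  }
  return count;
}

void main() {
  print(countMixed('Hello 你好 مرحبا こんにちは')); // 9
  print(countMixed('你好，世界')); // 5
}
```

Iterating over `runes` instead of UTF-16 code units keeps the sketch correct for characters outside the Basic Multilingual Plane.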
🧩 Use Cases
- Content management systems
- Rich-text editors
- Writing apps with word limits
- Language learning platforms
- Analytics dashboards
- Form validation utilities
API Reference

    int TextCounter.count(String text, {String? languageCode});

- `text`: The input string to be analyzed.
- `languageCode`: Optional BCP 47 language code (e.g., `"en"` for English, `"zh"` for Chinese). If omitted, auto-detection is used.
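
As one possible way to use this signature for input validation, the sketch below wraps a counter in a length check. The names `Counter`, `validateLength`, and `whitespaceCounter` are hypothetical and exist only for this example; the stand-in counter lets the snippet run without the package, and in an application `TextCounter.count` could be passed instead, assuming it matches the signature documented above.

```dart
/// Function type matching the documented `TextCounter.count` signature.
typedef Counter = int Function(String text, {String? languageCode});

/// Returns null when [text] fits within [maxCount], otherwise a message.
String? validateLength(String text, int maxCount, Counter counter,
    {String? languageCode}) {
  final n = counter(text, languageCode: languageCode);
  return n <= maxCount ? null : 'Too long: $n of $maxCount allowed';
}

/// Stand-in counter (whitespace split only) so the sketch is runnable
/// on its own; it ignores [languageCode].
int whitespaceCounter(String text, {String? languageCode}) =>
    text.trim().isEmpty ? 0 : text.trim().split(RegExp(r'\s+')).length;

void main() {
  print(validateLength('Hello world', 5, whitespaceCounter)); // null
  print(validateLength('one two three', 2, whitespaceCounter));
  // Too long: 3 of 2 allowed
}
```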
License
MIT License. See LICENSE.