betto_charset_detector

A pure-Dart charset detection library that works on all Dart and Flutter target platforms — mobile, web, and desktop. No Flutter plugin infrastructure, no native code, no FFI.

Features

Detects UTF-8, UTF-16 BE/LE, UTF-32 BE/LE via byte-order mark (BOM)
Validates UTF-8 structural integrity without a BOM
Probes legacy 8-bit and CJK encodings via the charset package:
- Western: windows-1252, iso-8859-1, iso-8859-2, iso-8859-15
- CJK: shift-jis, euc-jp, euc-kr, gbk
Falls back to windows-1252 (the WHATWG Encoding spec default)
Returns lowercase IANA encoding labels consistent with dart:convert and the browser TextDecoder API
Works on web (no dart:io required), mobile, and desktop

Getting started

Add betto_charset_detector to your pubspec.yaml:

dependencies:
  betto_charset_detector: ^0.1.0-dev.1

Run dart pub get (or flutter pub get).

Usage

detectCharset accepts any Uint8List — bytes from an HTTP response, a dart:html FileReader, a Flutter file-picker result, or a local file. No dart:io import is required.

import 'dart:typed_data';
import 'package:betto_charset_detector/betto_charset_detector.dart';

void main() {
  // Works on web, mobile, and desktop — any source of bytes.
  final bytes = Uint8List.fromList([0xEF, 0xBB, 0xBF, 104, 101, 108, 108, 111]);
  final encoding = detectCharset(bytes);
  print('Detected encoding: $encoding'); // utf-8
}

On native platforms (server, mobile, desktop) you can also read bytes directly from the filesystem:

import 'dart:io';
import 'package:betto_charset_detector/betto_charset_detector.dart';

void main() {
  final bytes = File('my_file.txt').readAsBytesSync();
  final encoding = detectCharset(bytes);
  print('Detected encoding: $encoding'); // e.g. "utf-8" or "shift-jis"
}

Return values

detectCharset always returns a lowercase IANA label string:

Detected encoding	Returned label
UTF-8 (with BOM)	`utf-8`
UTF-8 (no BOM)	`utf-8`
UTF-16 big-endian	`utf-16be`
UTF-16 little-endian	`utf-16le`
UTF-32 big-endian	`utf-32be`
UTF-32 little-endian	`utf-32le`
Windows-1252	`windows-1252`
ISO-8859-1	`iso-8859-1`
ISO-8859-2	`iso-8859-2`
ISO-8859-15	`iso-8859-15`
Shift-JIS	`shift-jis`
EUC-JP	`euc-jp`
EUC-KR	`euc-kr`
GBK	`gbk`

Detection algorithm

Detection proceeds through three ordered stages, falling through only when necessary:

BOM inspection — deterministic; longest-match first (4-byte UTF-32 before 2-byte UTF-16).
UTF-8 structural validation — a leading 8 KB sample is decoded with utf8.decode(allowMalformed: false). Valid decode → utf-8.
Candidate probe — the sample is tested against each candidate encoding using Charset.canDecode from the charset package. CJK encodings are promoted when >15% of sample bytes are ≥ 0x80.

Fallback: windows-1252.

Only the first 8 KB of input is read; the boundary is a hard byte cut (not aligned to character boundaries). This is a documented trade-off: the edge case of a valid UTF-8 file whose multi-byte sequence spans the 8 KB boundary is vanishingly rare and the consequence (misdetected as windows-1252) is mild.

Empty input returns utf-8 (empty byte sequences are valid UTF-8).

betto_charset_detector

Features

Getting started

Usage

Return values

Detection algorithm

Additional information

Libraries

betto_charset_detector package