mlx_localllm 0.2.12
mlx_localllm: ^0.2.12 copied to clipboard
A high-performance local LLM plugin for macOS using Apple's MLX framework.
mlx_localllm #

English #
A Flutter plugin for high-performance localized LLM inference on macOS using Apple's MLX framework. It provides zero-latency interaction by loading models directly into the application process.
Features #
- 🚀 Native Acceleration: Built on Apple's MLX for optimal performance on Apple Silicon.
- 📦 In-Process Inference: No external sidecars or servers (like Ollama) required.
- 🌐 Robust Downloader: Built-in support for Hugging Face and mirrors with resilient chunk-based downloading.
- 🛠️ Full Lifecycle: From model discovery and download to stateful conversational inference.
Platform Support #
| Platform | Support | Architecture | Min OS |
|---|---|---|---|
| macOS | ✅ | Apple Silicon (ARM64) / Intel (x86_64 Mock) | 14.0+ |
| iOS | ✅ | ARM64 | 17.0+ |
| Android | ❌ | - | - |
| Windows | ❌ | - | - |
| Linux | ❌ | - | - |
| Web | ❌ | - | - |
Requirements #
- macOS: 14.0 or higher.
- iOS: 17.0 or higher.
- Hardware: Apple Silicon (M1, M2, M3, etc.) for inference.
- Flutter: 3.0.0 or higher.
Installation #
-
Add the dependency to your
pubspec.yaml:flutter pub add mlx_localllm -
Platform Setup: The plugin uses Swift Package Manager (SPM) natively (requires Flutter 3.24+).
- macOS/iOS: Dependencies are automatically resolved by the Flutter toolchain.
- Architecture:
- Apple Silicon (ARM64): Fully supported with native MLX acceleration.
- Intel (x86_64): Supported via Mocking. You can compile and link on Intel Macs, but inference calls will return an "Unsupported architecture" error. This allows for seamless development across different Mac architectures.
Usage #
Check Support
bool supported = await MlxLocalllm().isSupported();
Download Model
// Track progress via modelEvents stream
MlxLocalllm().modelEvents.listen((event) {
if (event['event'] == 'progress') {
print('Download progress: ${event['progress'] * 100}%');
} else if (event['event'] == 'complete') {
print('Download complete at ${event['path']}');
}
});
await MlxLocalllm().downloadModel('mlx-community/Qwen2.5-0.5B-Instruct-4bit');
API Reference #
MlxLocalllm Singleton
| Method | Description | Returns |
|---|---|---|
isSupported() |
Checks if hardware supports MLX (requires Apple Silicon GPU). | Future<bool> |
downloadModel(repoId) |
Starts background download from HF. Monitor via modelEvents. |
Future<bool> |
loadModel(path) |
Loads model into memory from absolute path or repoId. | Future<bool> |
unloadModel() |
Releases model and frees system memory. | Future<bool> |
generate(prompt, config) |
Generates complete response (blocking). | Future<String> |
generateStream(prompt, config) |
Yields text chunks in real-time. | Stream<String> |
checkModelExists(repoId) |
Checks if model folder exists locally. | Future<bool> |
deleteModel(repoId) |
Deletes model from disk. | Future<bool> |
setCustomStoragePath(path) |
Sets/Resets model storage root folder. | Future<void> |
GenerateConfig Parameters
Detailed control over model output:
| Parameter | Type | Default | Description |
|---|---|---|---|
temperature |
double |
0.0 |
Higher = creative, Lower = deterministic. |
maxTokens |
int |
1024 |
Maximum length of generated response. |
topP |
double |
0.95 |
Nucleus sampling probability threshold. |
presence_penalty |
double |
0.0 |
Positive values penalize tokens already present. |
stopSequences |
List |
[] |
Tokens that stop generation (e.g. ["\nUser:"]). |
extraBody |
String |
null |
A JSON string for advanced backend options. |
Advanced Configuration (extraBody)
Pass advanced parameters through extraBody as a JSON string:
GenerateConfig(
extraBody: jsonEncode({
"top_k": 50,
"repetition_penalty": 1.2,
"chat_template_kwargs": {
"enable_thinking": true // Required for reasoning models like Qwen3.5
}
})
)
Thinking Mode (Reasoning) Guide
Thinking models (like Qwen/Qwen3.5-7B-Instruct) use a special template logic to externalize their reasoning process.
- Enable Reasoning: Include
"enable_thinking": trueinchat_template_kwargs. - Handle Output: The model will output text enclosed in
<think>...</think>tags before the final answer. - Hide Reasoning: Set
"enable_thinking": false. Depending on the model's template, it may output no tags or empty tags.
Error Handling #
The plugin throws MlxEngineException for native errors. Always wrap calls in try-catch:
try {
await MlxLocalllm().generate(prompt: "...");
} on MlxEngineException catch (e) {
print("Code: ${e.code}, Message: ${e.message}");
}
简体中文 #
基于 Apple MLX 框架的 macOS/iOS 高性能本地大模型推理 Flutter 插件。通过将模型直接加载到应用进程中,提供零延迟的交互体验。
特性 #
- 🚀 原生加速: 基于 Apple MLX 针对 Apple Silicon 芯片深度优化。
- 📦 进程内推理: 无需安装 Ollama 等外部服务,零延迟通信。
- 🌐 鲁棒下载器: 内置分块下载机制,支持 Hugging Face 及其镜像站。
- 🛠️ 多平台支持: 同时支持 macOS 和 iOS 平台。
- 💻 架构兼容: 支持 x86 架构 Mock 编译,确保 Intel Mac 开发者也能正常运行项目。
平台支持 #
| 平台 | 支持 | 架构 | 最低系统版本 |
|---|---|---|---|
| macOS | ✅ | Apple Silicon (ARM64) / Intel (x86_64 Mock) | 14.0+ |
| iOS | ✅ | ARM64 | 17.0+ |
| Android | ❌ | - | - |
| Windows | ❌ | - | - |
| Linux | ❌ | - | - |
| Web | ❌ | - | - |
系统要求 #
- macOS: 14.0 及以上。
- iOS: 17.0 及以上。
- 硬件: Apple Silicon 系列芯片 (M1, M2, M3 等) 仅在这些芯片上支持推理。
- Flutter: 3.0.0 及以上。
安装指南 #
-
在
pubspec.yaml中添加依赖:flutter pub add mlx_localllm -
平台配置:
- macOS/iOS: 插件使用 Swift Package Manager (SPM) 管理 MLX 依赖。Flutter (3.24+) 会自动处理这些依赖。
- 架构要求:
- Apple Silicon (ARM64): 原生支持,具备 MLX 加速。
- Intel (x86_64): 支持 Mock 编译。您可以在 Intel Mac 上正常编译和链接项目,但推理调用将返回“不支持的架构”错误。这极大地方便了跨架构的协同开发。
API 参考 #
MlxLocalllm 单例方法
| 方法 | 描述 | 返回值 |
|---|---|---|
isSupported() |
检查硬件是否支持 MLX (需要 Apple Silicon GPU)。 | Future<bool> |
downloadModel(repoId) |
从 HuggingFace 开始后台下载。需通过 modelEvents 监控进度。 |
Future<bool> |
loadModel(path) |
从绝对路径或 repoId 将模型加载到内存。 | Future<bool> |
unloadModel() |
从内存中卸载模型,释放系统资源。 | Future<bool> |
generate(prompt, config) |
生成完整文本响应(阻塞式)。 | Future<String> |
generateStream(prompt, config) |
实时返回生成的文本片段。 | Stream<String> |
checkModelExists(repoId) |
检查模型文件夹是否已在本地存在。 | Future<bool> |
deleteModel(repoId) |
从磁盘删除模型。 | Future<bool> |
setCustomStoragePath(path) |
设置或重置模型的根存储目录。 | Future<void> |
GenerateConfig 配置参数
用于精确控制模型输出行为:
| 参数 | 类型 | 默认值 | 描述 |
|---|---|---|---|
temperature |
double |
0.0 |
越高越具创造性,越低越确定(Deterministic)。 |
maxTokens |
int |
1024 |
响应生成的最大 Token 数量。 |
topP |
double |
0.95 |
核采样(Nucleus sampling)概率阈值。 |
presence_penalty |
double |
0.0 |
正值会惩罚已出现的 Token,减少重复。 |
stopSequences |
List |
[] |
停止生成的序列(如 ["\nUser:"])。 |
extraBody |
String |
null |
传递给原生层的高级 JSON 配置字符串。 |
高级配置 (extraBody)
通过 extraBody 以 JSON 字符串形式传递底层高级参数:
GenerateConfig(
extraBody: jsonEncode({
"top_k": 50,
"repetition_penalty": 1.2,
"chat_template_kwargs": {
"enable_thinking": true // 针对 Qwen3.5 等推理模型必填
}
})
)
推理模式 (Thinking Mode) 指南
推理模型(如 Qwen/Qwen3.5-7B-Instruct)使用特殊的模板逻辑来外部化其思维链(CoT)。
- 启动推理: 在
chat_template_kwargs中包含"enable_thinking": true。 - 输出处理: 模型会在最终回答前输出被
<think>...</think>标签包裹的思维过程。 - 关闭推理: 设置
"enable_thinking": false。根据模型模板不同,可能不输出标签或输出空标签。
错误处理 #
插件在发生原生错误时会抛出 MlxEngineException。建议始终使用 try-catch 包裹调用:
try {
await MlxLocalllm().generate(prompt: "...");
} on MlxEngineException catch (e) {
print("错误码: ${e.code}, 消息: ${e.message}");
}
许可证 #
MIT License.