vision_flow 0.0.4+1
A Flutter plugin for real-time vision tasks: hand, face, and pose estimation, plus video classification. Built on top of MediaPipe and powered by ExecuTorch and TensorFlow Lite.
VisionFlow #
A modular, production-ready Flutter plugin for real-time sign language and gesture recognition. VisionFlow connects directly to the device camera stream to extract MediaPipe landmarks, normalizes them, and runs inference on custom PyTorch or TFLite models.
🌟 Key Features #
- Multi-Backend Inference: Load both PyTorch (.pte, run through ExecuTorch) and TensorFlow Lite (.tflite) models natively without rewriting plugin logic.
- Configurable Detection Modules: Individually toggle Hand, Face, and Pose detection to optimize performance based on your model's requirements.
- Dynamic Sequence Buffering: Define exactly how many frames (e.g., 30 frames) make up a gesture sequence. VisionFlow automatically buffers and processes these frames.
- Intelligent Normalization: Automatically centers coordinates on the nose landmark and applies dynamic min-max scaling, so the model receives inputs preprocessed the same way as in a typical ML training pipeline.
- Real-Time Streaming API: Directly pass YUV420 buffers from the camera plugin into VisionFlow for highly optimized, zero-copy native processing.
📦 Installation #
Add vision_flow to your pubspec.yaml:
dependencies:
  vision_flow: ^X.X.X # replace X with the latest version
Android Configuration #
Ensure your android/build.gradle has the required repositories, and that compileSdkVersion is set to 34 or higher (36 recommended).
android {
    compileSdk 36
    // ...
}
🚀 Quick Start Guide #
1. Load Your Model #
VisionFlow supports two model loading strategies. Choose the one that fits your use case:
Option A — Load from Flutter Assets (bundled with the app)
Pack the model into your app at build time. Best for a fixed, pre-trained model shipped with the release.
Important
VisionFlow uses the ExecuTorch runtime for PyTorch models. Models must be exported in the ExecuTorch portable format (.pte), not the old TorchScript (.pt) format.
Step 1 — Export your model to .pte (Python):
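import torch
from executorch.exir import to_edge_transform_and_lower

# `model` is your trained torch.nn.Module, assumed to already be in eval mode.
example_inputs = (torch.zeros(1, 30, 330),)  # batch=1, seq=30, features=330

# Export the graph, lower it to the Edge dialect, then serialize it as .pte.
exported = torch.export.export(model, example_inputs)
edge = to_edge_transform_and_lower(exported)
with open("model.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)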
Step 2 — Add to your Flutter project:
flutter:
  assets:
    - assets/models/my_model.pte
Step 3 — Load it at runtime:
import 'package:vision_flow/vision_flow.dart';

await VisionFlow.loadModel(
  path: 'assets/models/my_model.pte',
  backend: VisionFlowModelType.pytorch,
  isAsset: true, // default; can be omitted
);
Option B — Load from Device Storage (file picker)
Allow the user to supply their own model file at runtime. Useful for research tools or apps that let users swap models without a new release.
Add file_picker to your pubspec.yaml:
dependencies:
  file_picker: ^11.0.2
Then pick any .pte (ExecuTorch) or .tflite (TFLite) file:
import 'package:file_picker/file_picker.dart';
import 'package:vision_flow/vision_flow.dart';

final result = await FilePicker.pickFiles(
  type: FileType.custom,
  allowedExtensions: ['pte', 'tflite'],
);
if (result != null) {
  final filePath = result.files.single.path!;
  final backend = filePath.endsWith('.pte')
      ? VisionFlowModelType.pytorch // ExecuTorch
      : VisionFlowModelType.tflite;
  await VisionFlow.loadModel(
    path: filePath,
    backend: backend,
    isAsset: false, // absolute device path
  );
}
2. Configure the Pipeline #
Define what the pipeline should detect, and the sequence length expected by your model. The pipeline will automatically pad missing features with zeros.
await VisionFlow.configure(
  hands: true, // Enable MediaPipe Hand Tracking (extracts 2 hands)
  face: true, // Enable MediaPipe Face Mesh (extracts 68 key points)
  pose: false, // Enable Body Pose tracking (if required by your model)
  sequenceLength: 30, // The number of frames your model requires per prediction
);
3. Listen for Predictions #
VisionFlow provides a unified stream for predictions. Whenever the frame buffer reaches the configured sequenceLength, it runs inference and pushes a PredictionResult.
VisionFlow.predictions.listen((PredictionResult result) {
  print("Predicted Label: ${result.label}");
  print("Class Index: ${result.index}");
});
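The buffering behaviour described in this step can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's native code: the class name `SequenceBuffer` is hypothetical, and the sketch assumes the buffer is cleared after each full sequence (the plugin may instead slide the window by one frame; this README does not say).

```python
from typing import List, Optional

FEATURES_PER_FRAME = 330  # per-frame vector size described in this README


class SequenceBuffer:
    """Hypothetical buffer: collects per-frame vectors, emits full sequences.

    Assumption: the buffer is emptied after every emitted sequence.
    """

    def __init__(self, sequence_length: int) -> None:
        if sequence_length <= 0:
            raise ValueError("sequence_length must be positive")
        self.sequence_length = sequence_length
        self._frames: List[List[float]] = []

    def push(self, features: List[float]) -> Optional[List[List[float]]]:
        """Add one frame; return the sequence once enough frames are buffered."""
        if len(features) != FEATURES_PER_FRAME:
            raise ValueError("expected %d features" % FEATURES_PER_FRAME)
        self._frames.append(list(features))
        if len(self._frames) < self.sequence_length:
            return None  # still filling up, nothing to run inference on yet
        # Hand the full sequence to the caller and start a fresh buffer.
        sequence, self._frames = self._frames, []
        return sequence
```

With `sequence_length=30`, `push` returns `None` for the first 29 frames and a 30 × 330 nested list on the 30th, which corresponds to the point where a prediction would be emitted on the stream.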
4. Process Camera Frames #
Push raw YUV frames from the Flutter camera plugin directly into VisionFlow. This is highly optimized for Android.
import 'package:camera/camera.dart';

CameraController controller = CameraController(cameras[0], ResolutionPreset.medium);
await controller.initialize();
controller.startImageStream((CameraImage image) async {
  // Pass the raw image planes directly to native code
  await VisionFlow.processFrame(
    y: image.planes[0].bytes,
    u: image.planes[1].bytes,
    v: image.planes[2].bytes,
    width: image.width,
    height: image.height,
    yRowStride: image.planes[0].bytesPerRow,
    uvRowStride: image.planes[1].bytesPerRow,
    uvPixelStride: image.planes[1].bytesPerPixel!,
  );
});
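To make the stride parameters less abstract, here is a small Python sketch of how a single pixel is typically looked up in a YUV 4:2:0 image with separate Y, U, and V planes. It illustrates the general layout only and is not taken from the plugin's native code; the function name `yuv_at` is hypothetical, and the sketch assumes a Y-plane pixel stride of 1 and 2×2 chroma subsampling.

```python
def yuv_at(x, y, y_plane, u_plane, v_plane,
           y_row_stride, uv_row_stride, uv_pixel_stride):
    """Return the (Y, U, V) byte values for pixel (x, y).

    Assumptions: the Y plane has a pixel stride of 1, and chroma is
    subsampled by 2 horizontally and vertically (YUV 4:2:0).
    """
    # Rows may be padded, so step by the row stride rather than the width.
    luma = y_plane[y * y_row_stride + x]
    # One chroma sample covers a 2x2 block of pixels; the pixel stride
    # accounts for U and V bytes that are interleaved in memory.
    uv_index = (y // 2) * uv_row_stride + (x // 2) * uv_pixel_stride
    return luma, u_plane[uv_index], v_plane[uv_index]
```

This is why `processFrame` needs the strides in addition to `width` and `height`: with row padding or interleaved chroma, the byte offset of a pixel cannot be derived from the image dimensions alone.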
5. Cleanup #
Always dispose of the plugin resources when navigating away from the camera view to free up the native ML engines and the camera feed.
await VisionFlow.dispose();
🧠 Advanced: Understanding the Feature Vector #
VisionFlow extracts raw coordinates, normalizes them frame-by-frame, and constructs a 3D tensor expected by modern DNN/GRU/LSTM architectures.
Output Tensor Shape #
The exact sequence shape constructed in the FrameBuffer before being sent to your model is (1, SequenceLength, 330).
Feature Allocation (The "330" Vector) #
Every single frame outputs exactly 330 normalized float coordinates in the following order:
- Right Hand (63 features): 21 landmarks × 3 coordinates (X, Y, Z). If the right hand is not detected, it is padded with zeros.
- Left Hand (63 features): 21 landmarks × 3 coordinates (X, Y, Z). If the left hand is not detected, it is padded with zeros.
- Face (204 features): 68 specific key landmarks extracted from the 468-point MediaPipe face mesh × 3 coordinates (X, Y, Z).
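When preparing training data in Python, it helps to reproduce this layout exactly. The following sketch assembles one frame in the order listed above; the helper names are hypothetical, and zero-filling a missing face is an assumption based on the statement that missing features are padded with zeros.

```python
from typing import List, Optional, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, z)

HAND_LANDMARKS = 21
FACE_LANDMARKS = 68
COORDS = 3


def _flatten(landmarks: Optional[Sequence[Landmark]], count: int) -> List[float]:
    """Flatten `count` landmarks to x, y, z floats; zeros if nothing was detected."""
    if landmarks is None:
        return [0.0] * (count * COORDS)
    if len(landmarks) != count:
        raise ValueError(f"expected {count} landmarks, got {len(landmarks)}")
    return [float(c) for point in landmarks for c in point]


def build_frame_vector(right_hand: Optional[Sequence[Landmark]],
                       left_hand: Optional[Sequence[Landmark]],
                       face: Optional[Sequence[Landmark]]) -> List[float]:
    """Concatenate right hand (63), left hand (63), and face (204) features."""
    return (_flatten(right_hand, HAND_LANDMARKS)
            + _flatten(left_hand, HAND_LANDMARKS)
            + _flatten(face, FACE_LANDMARKS))
```

The resulting list always has 63 + 63 + 204 = 330 entries, so stacking `SequenceLength` of them and adding a batch dimension gives the (1, SequenceLength, 330) shape.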
Built-in Normalization #
To ensure the model is robust to different camera distances and angles, VisionFlow natively applies spatial normalization before inference:
- Nose Centering: The coordinate of the nose tip (Face Landmark Index 7) is subtracted from all X, Y, and Z coordinates across hands and face.
- Min-Max Scaling: The bounding box of the entire buffered sequence (30 frames in the example configuration) is calculated, and all coordinates are scaled to a [0, 1] range.
This mirrors the min-max scaling commonly used in Python-based training environments (for example, scikit-learn's MinMaxScaler), so exported models receive inputs preprocessed the same way as during training, provided the training data was normalized with the same steps.
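To apply matching preprocessing on the training side, the two steps can be approximated in plain Python. This is a sketch under stated assumptions, not the plugin's implementation: the function name is hypothetical, the bounding box is taken per axis (X, Y, Z) over the whole sequence, the nose coordinate is passed in separately for each frame, and zero-padded features are not special-cased.

```python
from typing import List, Tuple

COORDS = 3  # x, y, z


def normalize_sequence(frames: List[List[float]],
                       noses: List[Tuple[float, float, float]]) -> List[List[float]]:
    """Center each frame on its nose tip, then min-max scale per axis."""
    if not frames:
        return []
    # Step 1: subtract the frame's nose coordinate from every x, y, z value.
    centered = [
        [value - nose[i % COORDS] for i, value in enumerate(frame)]
        for frame, nose in zip(frames, noses)
    ]
    # Step 2: per-axis bounding box over the entire sequence.
    lows, highs = [], []
    for axis in range(COORDS):
        values = [f[i] for f in centered for i in range(axis, len(f), COORDS)]
        lows.append(min(values))
        highs.append(max(values))
    # Step 3: scale into [0, 1]; a degenerate axis (zero extent) maps to 0.0.
    scaled = []
    for frame in centered:
        row = []
        for i, value in enumerate(frame):
            axis = i % COORDS
            span = highs[axis] - lows[axis]
            row.append((value - lows[axis]) / span if span > 0 else 0.0)
        scaled.append(row)
    return scaled
```

If your training pipeline scales differently (for instance per feature column, as `MinMaxScaler` does by default), compare its output against the plugin's on a few recorded sequences before relying on either.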