vision_flow 0.0.2+1
A Flutter plugin for real-time vision tasks: hand, face, and pose estimation, and video classification. Built on top of MediaPipe, with inference powered by PyTorch and TensorFlow Lite.
VisionFlow #
A modular, production-ready Flutter plugin for real-time sign language and gesture recognition. VisionFlow connects directly to the device camera stream to extract MediaPipe landmarks, normalizes them, and runs inference on custom PyTorch or TFLite models.
🌟 Key Features #
- Multi-Backend Inference: Load both PyTorch (.pt) and TensorFlow Lite (.tflite) models natively without rewriting plugin logic.
- Configurable Detection Modules: Individually toggle Hand, Face, and Pose detection to optimize performance based on your model's requirements.
- Dynamic Sequence Buffering: Define exactly how many frames (e.g., 30 frames) make up a gesture sequence. VisionFlow automatically buffers and processes these frames.
- Intelligent Normalization: Automatically centers coordinates on the nose landmark and applies dynamic min-max scaling, so on-device inputs match the preprocessing of standard ML training pipelines.
- Real-Time Streaming API: Pass YUV420 buffers from the camera plugin directly into VisionFlow for highly optimized, zero-copy native processing.
📦 Installation #
Add vision_flow to your pubspec.yaml:
dependencies:
  vision_flow: ^X.X.X # replace X with the latest version
Android Configuration #
Ensure your android/build.gradle declares the required repositories, and that compileSdk (typically set in android/app/build.gradle) is 34 or higher (36 recommended).
android {
    compileSdk 36
    // ...
}
🚀 Quick Start Guide #
1. Load Your Model #
Before processing any frames, load your pre-trained model. Place your model (.pt or .tflite) inside the assets/ folder of your Flutter project and declare it in your pubspec.yaml.
flutter:
  assets:
    - assets/models/my_sign_language_model.pt
Initialize the model in your Dart code:
import 'package:vision_flow/vision_flow.dart';

// Initialize a PyTorch model
await VisionFlow.loadModel(
  path: "assets/models/my_sign_language_model.pt",
  backend: VisionFlowModelType.pytorch,
);

// OR initialize a TFLite model
await VisionFlow.loadModel(
  path: "assets/models/my_sign_language_model.tflite",
  backend: VisionFlowModelType.tflite,
);
2. Configure the Pipeline #
Define what the pipeline should detect, and the sequence length expected by your model. The pipeline will automatically pad missing features with zeros.
await VisionFlow.configure(
  hands: true, // Enable MediaPipe Hand Tracking (extracts 2 hands)
  face: true, // Enable MediaPipe Face Mesh (extracts 68 key points)
  pose: false, // Body Pose tracking (set to true if required by your model)
  sequenceLength: 30, // The number of frames your model requires per prediction
);
3. Listen for Predictions #
VisionFlow provides a unified stream for predictions. Whenever the frame buffer reaches the configured sequenceLength, it runs inference and pushes a PredictionResult.
VisionFlow.predictions.listen((PredictionResult result) {
  print("Predicted Label: ${result.label}");
  print("Class Index: ${result.index}");
});
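The buffering behavior described above can be pictured with a small pure-Dart sketch. `SequenceBuffer` is a hypothetical helper written for illustration, not part of the vision_flow API, and it assumes non-overlapping windows (the buffer is cleared after each full sequence); the plugin's native buffer may instead slide or behave differently.

```dart
/// Hypothetical helper for illustration; not part of the vision_flow API.
class SequenceBuffer {
  SequenceBuffer(this.sequenceLength) : assert(sequenceLength > 0);

  final int sequenceLength;
  final List<List<double>> _frames = [];

  /// Adds one frame's feature vector. Returns the full sequence once
  /// [sequenceLength] frames are buffered, otherwise null.
  List<List<double>>? add(List<double> frame) {
    _frames.add(frame);
    if (_frames.length < sequenceLength) return null;
    final sequence = List<List<double>>.of(_frames);
    _frames.clear(); // Assumption: windows do not overlap.
    return sequence;
  }
}

void main() {
  final buffer = SequenceBuffer(3);
  for (var i = 0; i < 7; i++) {
    final ready = buffer.add([i.toDouble()]);
    if (ready != null) {
      print('Sequence ready at frame $i: $ready');
    }
  }
}
```

With a `sequenceLength` of 30 and a camera delivering roughly 30 frames per second, a scheme like this would yield about one prediction per second.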
4. Process Camera Frames #
Push raw YUV frames from the Flutter camera plugin directly into VisionFlow. This is highly optimized for Android.
import 'package:camera/camera.dart';

// Query the available cameras and open the first one
final cameras = await availableCameras();
final controller = CameraController(cameras[0], ResolutionPreset.medium);
await controller.initialize();

controller.startImageStream((CameraImage image) async {
  // Pass the raw YUV420 image planes directly to native code
  await VisionFlow.processFrame(
    y: image.planes[0].bytes,
    u: image.planes[1].bytes,
    v: image.planes[2].bytes,
    width: image.width,
    height: image.height,
    yRowStride: image.planes[0].bytesPerRow,
    uvRowStride: image.planes[1].bytesPerRow,
    // bytesPerPixel is nullable in the camera plugin; it is set on Android
    uvPixelStride: image.planes[1].bytesPerPixel!,
  );
});
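Because the image stream callback is asynchronous, the camera can deliver a new frame before the previous `processFrame` call has completed. One possible way to avoid a backlog is to drop frames while a call is still in flight. The sketch below is an illustration only: `FrameGate` is a hypothetical helper, not part of vision_flow or the camera plugin, and the plugin may already handle this natively.

```dart
/// Hypothetical helper for illustration; not part of vision_flow.
class FrameGate {
  bool _busy = false;

  /// Runs [task] unless a previous task is still in flight.
  /// Returns true if the task ran, false if the frame was dropped.
  Future<bool> run(Future<void> Function() task) async {
    if (_busy) return false;
    _busy = true;
    try {
      await task();
      return true;
    } finally {
      _busy = false;
    }
  }
}

Future<void> main() async {
  final gate = FrameGate();
  // Simulate two frames arriving back to back while processing takes 50 ms.
  final first = gate.run(
    () => Future<void>.delayed(const Duration(milliseconds: 50)),
  );
  final second = gate.run(() async {});
  print('first frame processed: ${await first}');
  print('second frame processed: ${await second}');
}
```

In the stream callback, the call to `VisionFlow.processFrame(...)` would be wrapped as `gate.run(() => VisionFlow.processFrame(...))`.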
5. Cleanup #
Always dispose of the plugin resources when navigating away from the camera view to free up the native ML engines and the camera feed.
await VisionFlow.dispose();
🧠 Advanced: Understanding the Feature Vector #
VisionFlow extracts raw coordinates, normalizes them frame-by-frame, and constructs a 3D tensor expected by modern DNN/GRU/LSTM architectures.
Output Tensor Shape #
The exact sequence shape constructed in the FrameBuffer before being sent to your model is (1, SequenceLength, 330).
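To make the shape concrete, the following sketch flattens a list of per-frame vectors into a single buffer whose logical shape is (1, SequenceLength, 330). The row-major layout and the `flattenSequence` name are assumptions made for illustration; the plugin's native FrameBuffer is not exposed here and may be laid out differently.

```dart
import 'dart:typed_data';

const int featuresPerFrame = 330;

/// Flattens [frames] (sequenceLength x 330) into a row-major buffer whose
/// logical shape is (1, sequenceLength, 330). Illustrative sketch only.
Float32List flattenSequence(List<List<double>> frames) {
  final out = Float32List(frames.length * featuresPerFrame);
  for (var t = 0; t < frames.length; t++) {
    final frame = frames[t];
    if (frame.length != featuresPerFrame) {
      throw ArgumentError(
        'Frame $t has ${frame.length} features, expected $featuresPerFrame',
      );
    }
    // Feature f of frame t lands at index t * 330 + f.
    out.setRange(t * featuresPerFrame, (t + 1) * featuresPerFrame, frame);
  }
  return out;
}

void main() {
  final frames = List.generate(
    30,
    (t) => List<double>.filled(featuresPerFrame, t.toDouble()),
  );
  final tensor = flattenSequence(frames);
  print(tensor.length); // 30 * 330 = 9900
  print(tensor[2 * featuresPerFrame + 5]); // feature 5 of frame 2
}
```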
Feature Allocation (The "330" Vector) #
Every single frame outputs exactly 330 normalized float coordinates in the following order:
- Right Hand (63 features): 21 landmarks × 3 coordinates (X, Y, Z). If the right hand is not detected, it is padded with zeros.
- Left Hand (63 features): 21 landmarks × 3 coordinates (X, Y, Z). If the left hand is not detected, it is padded with zeros.
- Face (204 features): 68 specific key landmarks extracted from the 468-point MediaPipe face mesh × 3 coordinates (X, Y, Z).
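The layout above can be sketched in pure Dart, which may be useful when checking that a training pipeline produces vectors in the same order. `buildFrameVector` and its parameters are hypothetical names chosen for illustration; they are not part of the vision_flow API. Each landmark is assumed to be given as an `[x, y, z]` list.

```dart
const int handFeatures = 21 * 3; // 63
const int faceFeatures = 68 * 3; // 204
const int frameFeatures = 2 * handFeatures + faceFeatures; // 330

/// Writes [landmarks] (each [x, y, z]) into [out] starting at [offset].
/// A null list leaves the slot zero-filled.
void _write(
  List<double> out,
  int offset,
  List<List<double>>? landmarks,
  int expected,
) {
  if (landmarks == null) return;
  if (landmarks.length != expected) {
    throw ArgumentError('Expected $expected landmarks, got ${landmarks.length}');
  }
  for (var i = 0; i < landmarks.length; i++) {
    out.setRange(offset + i * 3, offset + i * 3 + 3, landmarks[i]);
  }
}

/// Builds one 330-value frame: right hand, then left hand, then face.
List<double> buildFrameVector({
  List<List<double>>? rightHand,
  List<List<double>>? leftHand,
  List<List<double>>? face,
}) {
  final out = List<double>.filled(frameFeatures, 0.0);
  _write(out, 0, rightHand, 21);
  _write(out, handFeatures, leftHand, 21);
  _write(out, 2 * handFeatures, face, 68);
  return out;
}

void main() {
  // Only the right hand is detected; the other slots stay zero-padded.
  final right = List.generate(21, (i) => [i + 0.1, i + 0.2, i + 0.3]);
  final vector = buildFrameVector(rightHand: right);
  print(vector.length);
  print(vector.sublist(63, 66)); // start of the (missing) left hand slot
}
```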
Built-in Normalization #
To ensure the model is robust to different camera distances and angles, VisionFlow natively applies spatial normalization before inference:
- Nose Centering: The nose tip coordinate (Face Landmark Index 7) is subtracted from every X, Y, and Z coordinate across hands and face.
- Min-Max Scaling: The bounding box of the entire sequence (30 frames in the configuration above) is calculated, and all coordinates are scaled to a [0, 1] range.
This is intended to mirror the preprocessing used in Python-based training environments (such as scikit-learn's MinMaxScaler), so that the inputs your exported model sees on-device match what it saw during training.
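The two normalization steps can be sketched as follows. This is an illustration under stated assumptions, not the plugin's native implementation: `normalizeSequence` is a hypothetical name, the bounding box is assumed to be computed per axis (X, Y, Z) over the whole sequence, a zero-width axis is mapped to 0, and zero-padded (undetected) landmarks receive no special treatment.

```dart
import 'dart:math' as math;

/// Normalizes [sequence] in place. Each frame is a flat list of
/// x, y, z triples; [noseIndex] is the landmark index of the nose
/// within that frame. Illustrative sketch only.
void normalizeSequence(List<List<double>> sequence, int noseIndex) {
  // Step 1: nose centering, frame by frame.
  for (final frame in sequence) {
    final nose = frame.sublist(noseIndex * 3, noseIndex * 3 + 3);
    for (var i = 0; i < frame.length; i++) {
      frame[i] -= nose[i % 3];
    }
  }

  // Step 2: per-axis min and max over the whole sequence (the bounding box).
  final lo = List<double>.filled(3, double.infinity);
  final hi = List<double>.filled(3, double.negativeInfinity);
  for (final frame in sequence) {
    for (var i = 0; i < frame.length; i++) {
      lo[i % 3] = math.min(lo[i % 3], frame[i]);
      hi[i % 3] = math.max(hi[i % 3], frame[i]);
    }
  }

  // Step 3: scale each axis to [0, 1]; a zero-width axis maps to 0.
  for (final frame in sequence) {
    for (var i = 0; i < frame.length; i++) {
      final range = hi[i % 3] - lo[i % 3];
      frame[i] = range == 0 ? 0.0 : (frame[i] - lo[i % 3]) / range;
    }
  }
}

void main() {
  // Two frames, three landmarks each; landmark 0 plays the role of the nose.
  final sequence = [
    [1.0, 1.0, 0.0, 2.0, 3.0, 0.0, 0.0, 1.0, 0.0],
    [2.0, 2.0, 0.0, 4.0, 2.0, 0.0, 2.0, 0.0, 0.0],
  ];
  normalizeSequence(sequence, 0);
  print(sequence);
}
```

Note that scikit-learn's MinMaxScaler scales each feature column independently, so a training pipeline that relies on it should be checked against whichever scaling the plugin actually applies.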