vision_ai

On-device hand gesture recognition and facial emotion detection for Flutter. Runs at 25-30 FPS with zero cloud dependencies.

What You Can Build

Sign language interpreter — map 13+ gestures to words with custom finger patterns
Driver drowsiness alert — blink detection + attention scoring triggers warnings
Touchless kiosk — hand motion direction controls UI without touching screen
Online exam proctoring — attention score + face tracking + head nod/shake
Fitness rep counter — track hand landmarks in world coordinates (meters)
Interactive children's game — emotion-driven characters + clap/pinch detection
Accessibility controller — custom gestures → app actions, blink-to-click
Live stream reactions — real-time emotion overlay on broadcaster's face
AR filter trigger — face contours + landmarks drive filter positioning
Social distance monitor — face distance estimation in cm

Platform Support

Platform	Status	Min Version	Notes
Android	Stable	API 24 (Android 7.0)	Tested on Samsung Galaxy A15 and other devices
iOS	Beta	iOS 12.0	Implementation complete — community testing welcome (report issues)

Installation

dependencies:
  vision_ai: ^0.1.0
  vision_ai_flutter: ^0.1.0  # optional: pre-built camera overlay widgets

Android

Add camera permission to android/app/src/main/AndroidManifest.xml:

<uses-permission android:name="android.permission.CAMERA" />

iOS

Add camera usage description to ios/Runner/Info.plist:

<key>NSCameraUsageDescription</key>
<string>Camera access is needed for hand gesture and face detection.</string>

Core API

VisionAi

The main controller. Create it, start the camera, listen to results, dispose when done.

// Hand + face combined
final vision = VisionAi(
  hand: HandConfig(maxHands: 2),
  face: FaceConfig(detectEmotion: true),
  camera: CameraConfig(facing: CameraFacing.front),
);

// Or use factory constructors for single-mode:
final handOnly = VisionAi.hand();
final faceOnly = VisionAi.face();

Method	Returns	Description
`start()`	`Future<int>`	Starts camera + ML processing. Returns texture ID for Flutter's `Texture` widget.
`stop()`	`Future<void>`	Stops processing, releases camera. Can `start()` again after.
`dispose()`	`Future<void>`	Releases everything. Instance is unusable after this.
`results`	`Stream<VisionResult>`	Per-frame detection results. Active between `start()` and `stop()`.
`updateHandConfig(config)`	`Future<void>`	Hot-swap hand settings while running. Requires restart for some changes.
`updateFaceConfig(config)`	`Future<void>`	Hot-swap face settings while running.
`switchCamera(facing)`	`Future<void>`	Switch front/back. Requires stop+start to take effect.
`isRunning`	`bool`	Whether the camera is actively processing frames.

VisionResult

Every frame produces one of these. Contains all detected hands and faces for that frame.

vision.results.listen((result) {
  print('Hands: ${result.hands.length}, Faces: ${result.faces.length}');
  print('Frame size: ${result.imageSize}');
  print('ML took: ${result.inferenceTimeMs}ms');
});

Property	Type	Description
`hands`	`List<HandResult>`	All detected hands (0, 1, or 2 depending on `maxHands`)
`faces`	`List<FaceResult>`	All detected faces
`timestampMs`	`int`	Milliseconds since device boot
`imageSize`	`Size`	Camera frame dimensions (for scaling overlays)
`inferenceTimeMs`	`int`	Combined hand + face ML processing time
`primaryHand`	`HandResult?`	Hand with highest gesture confidence, or null
`primaryFace`	`FaceResult?`	Face with highest emotion confidence, or null
`hasHands`	`bool`	`hands.isNotEmpty`
`hasFaces`	`bool`	`faces.isNotEmpty`

Hand Detection

HandConfig

HandConfig(
  maxHands: 2,                    // 1 or 2 hands to detect
  minDetectionConfidence: 0.5,    // [0.0, 1.0] — lower = more detections, more false positives
  minPresenceConfidence: 0.5,     // [0.0, 1.0] — confidence hand is still present between frames
  minTrackingConfidence: 0.5,     // [0.0, 1.0] — landmark tracking quality threshold
  customGestures: [...],          // your own finger patterns (see below)
  allowedGestures: {Gesture.peace, Gesture.thumbsUp},  // only report these (null = all)
  deniedGestures: {Gesture.fist},                       // block these (null = none)
  gestureThresholds: {Gesture.thumbsUp: 0.8},           // per-gesture min confidence
)

HandResult

Each detected hand has landmarks, gesture, finger states, and a bounding box.

final hand = result.primaryHand;
if (hand != null) {
  print(hand.gesture);            // Gesture.peace
  print(hand.gestureConfidence);  // 0.95
  print(hand.isLeftHand);         // true/false (from camera's perspective)
  print(hand.customGestureName);  // "rock" (only for user-defined gestures)
  print(hand.boundingBox);        // Rect in normalized [0,1] coords
}

Property	Type	Description
`gesture`	`Gesture`	Detected gesture enum (fist, peace, thumbsUp, etc.)
`gestureConfidence`	`double`	Confidence for the gesture, in `[0.0, 1.0]`
`customGestureName`	`String?`	Non-null only for user-defined custom gestures
`landmarks`	`List<NormalizedLandmark>`	21 points in normalized `[0.0, 1.0]` image coordinates
`worldLandmarks`	`List<WorldLandmark>`	21 points in meters (real-world scale)
`isLeftHand`	`bool`	Handedness from camera's perspective
`handednessConfidence`	`double`	How confident the L/R classification is
`fingerStates`	`Map<Finger, FingerState>`	Extended/closed for each finger
`boundingBox`	`Rect?`	Normalized bounding box from landmark min/max. Null if no landmarks.

Finger States

Check which fingers are extended:

final fingers = hand.fingerStates;
if (fingers[Finger.indexFinger] == FingerState.extended &&
    fingers[Finger.middle] == FingerState.extended) {
  print('Peace sign!');
}

// Count extended fingers
final count = fingers.values.where((s) => s == FingerState.extended).length;
print('$count fingers up');

21 Hand Landmarks

Each hand has 21 3D landmarks. Use HandLandmarkIndex constants to access specific joints:

final wrist = hand.landmarks[HandLandmarkIndex.wrist];         // index 0
final thumbTip = hand.landmarks[HandLandmarkIndex.thumbTip];   // index 4
final indexTip = hand.landmarks[HandLandmarkIndex.indexTip];   // index 8
final middleTip = hand.landmarks[HandLandmarkIndex.middleTip]; // index 12
final pinkyTip = hand.landmarks[HandLandmarkIndex.pinkyTip];   // index 20

// Convert to pixel coordinates for drawing
final pixelPos = wrist.toOffset(screenWidth, screenHeight);

// All 23 bone connections for skeleton rendering:
for (final bone in HandLandmarkIndex.connections) {
  final from = hand.landmarks[bone[0]];
  final to = hand.landmarks[bone[1]];
  // draw line from → to
}

Landmark indices: 0=wrist, 1-4=thumb (CMC→tip), 5-8=index (MCP→tip), 9-12=middle, 13-16=ring, 17-20=pinky.
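
The pixel conversion is presumably a plain scale by the view size. A minimal sketch in pure Dart, assuming normalized x/y in [0, 1] with the origin at the top-left. `Point2` and `toPixels` are hypothetical stand-ins, not package API, and the `mirror` flag is an assumption for selfie-style previews that are flipped horizontally.

```dart
/// Hypothetical stand-in for a 2D pixel position (not a package type).
class Point2 {
  final double x;
  final double y;
  const Point2(this.x, this.y);
}

/// Scales a normalized [0, 1] landmark to pixel coordinates.
/// Set [mirror] when the preview is flipped horizontally.
Point2 toPixels(double nx, double ny, double width, double height,
    {bool mirror = false}) {
  final x = mirror ? 1.0 - nx : nx;
  return Point2(x * width, ny * height);
}
```

For example, `toPixels(0.25, 0.5, 400, 800)` lands at (100, 400) in a 400x800 view.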

World Coordinates (Meters)

worldLandmarks give real-world 3D positions relative to the hand's center. Use them to measure actual distances:

// Pinch distance in centimeters
final thumbTip = hand.worldLandmarks[HandLandmarkIndex.thumbTip];
final indexTip = hand.worldLandmarks[HandLandmarkIndex.indexTip];
final pinchCm = thumbTip.distanceTo(indexTip) * 100;
print('Pinch gap: ${pinchCm.toStringAsFixed(1)}cm');

// Hand span (thumb to pinky)
final pinkyTip = hand.worldLandmarks[HandLandmarkIndex.pinkyTip];
final spanCm = thumbTip.distanceTo(pinkyTip) * 100;
print('Hand span: ${spanCm.toStringAsFixed(1)}cm');
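
The distance used above is most likely the ordinary 3D Euclidean distance. A minimal sketch under that assumption: `Point3` and `isPinching` are hypothetical names, and the 2 cm default gap is an illustrative value, not a package default.

```dart
import 'dart:math' as math;

/// Hypothetical stand-in for a world landmark; coordinates in meters.
class Point3 {
  final double x, y, z;
  const Point3(this.x, this.y, this.z);

  /// Straight-line (Euclidean) distance to [other], in meters.
  double distanceTo(Point3 other) {
    final dx = x - other.x;
    final dy = y - other.y;
    final dz = z - other.z;
    return math.sqrt(dx * dx + dy * dy + dz * dz);
  }
}

/// True when thumb and index tips are closer than [maxGapCm] centimeters.
bool isPinching(Point3 thumbTip, Point3 indexTip, {double maxGapCm = 2.0}) =>
    thumbTip.distanceTo(indexTip) * 100 < maxGapCm;
```

Because world landmarks are in meters, a physical threshold like this does not depend on how far the hand is from the camera, unlike a threshold on normalized image coordinates.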

Custom Gestures

Define finger patterns. Fingers not in the map act as wildcards (any state matches):

HandConfig(
  customGestures: [
    // Rock sign: index + pinky up, others down
    CustomGesture(
      name: 'rock',
      fingerStates: {
        Finger.thumb: FingerState.closed,
        Finger.indexFinger: FingerState.extended,
        Finger.middle: FingerState.closed,
        Finger.ring: FingerState.closed,
        Finger.pinky: FingerState.extended,
      },
    ),
    // Gun: thumb + index up (other fingers are wildcards)
    CustomGesture(
      name: 'gun',
      fingerStates: {
        Finger.thumb: FingerState.extended,
        Finger.indexFinger: FingerState.extended,
      },
    ),
  ],
)

Custom gestures are checked only when no built-in MediaPipe gesture is recognized. The fallback order is OK, then counting 1-5, then your patterns (first match wins).

When a custom gesture matches, hand.gesture == Gesture.custom and hand.customGestureName == "rock".
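
The wildcard and first-match rules can be sketched in a few lines. This is an illustration of the matching logic as described, not the package's implementation: the enums are local stand-ins, and `FingerPattern` and `firstMatch` are hypothetical names.

```dart
enum Finger { thumb, indexFinger, middle, ring, pinky }

enum FingerState { extended, closed }

/// Hypothetical pattern type mirroring the shape of the config above.
class FingerPattern {
  final String name;
  final Map<Finger, FingerState> required;
  FingerPattern(this.name, this.required);

  /// Fingers absent from [required] are wildcards and always match.
  bool matches(Map<Finger, FingerState> observed) =>
      required.entries.every((e) => observed[e.key] == e.value);
}

/// Returns the name of the first matching pattern, or null if none match.
String? firstMatch(
    List<FingerPattern> patterns, Map<Finger, FingerState> observed) {
  for (final pattern in patterns) {
    if (pattern.matches(observed)) return pattern.name;
  }
  return null;
}
```

Since the first match wins, a pattern with wildcards (like 'gun') can shadow a stricter pattern listed after it, so it is safer to list the most specific patterns first.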

Gesture Filtering

Control which gestures are reported:

HandConfig(
  // Only report these (everything else becomes Gesture.none)
  allowedGestures: {Gesture.thumbsUp, Gesture.peace, Gesture.fist},
  
  // OR block specific ones (everything else passes through)
  deniedGestures: {Gesture.fist, Gesture.openHand},
  
  // Raise the bar for specific gestures
  gestureThresholds: {
    Gesture.thumbsUp: 0.8,  // must be 80%+ confident
    Gesture.peace: 0.7,
  },
)

Filtering happens after MediaPipe classification but before custom gesture fallback. So if fist is denied and the user makes a fist, the custom gesture classifier still gets a chance.
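
The ordering described above (filter the built-in result first, then fall back to custom patterns) can be sketched as follows. The enum is a trimmed local stand-in, `applyFilters` and `classify` are hypothetical names, and the order in which allow, deny, and threshold checks run is an assumption.

```dart
enum Gesture { none, fist, openHand, peace, thumbsUp, custom }

/// Returns [g] if it passes the allow/deny/threshold checks,
/// otherwise Gesture.none.
Gesture applyFilters(
  Gesture g,
  double confidence, {
  Set<Gesture>? allowed,
  Set<Gesture>? denied,
  Map<Gesture, double> thresholds = const {},
}) {
  if (allowed != null && !allowed.contains(g)) return Gesture.none;
  if (denied != null && denied.contains(g)) return Gesture.none;
  final minConfidence = thresholds[g] ?? 0.0;
  return confidence >= minConfidence ? g : Gesture.none;
}

/// Filters the built-in result first; the custom matcher only gets a
/// chance when nothing survives the filters.
Gesture classify(Gesture builtIn, double confidence, bool customMatches,
    {Set<Gesture>? denied}) {
  final filtered = applyFilters(builtIn, confidence, denied: denied);
  if (filtered != Gesture.none) return filtered;
  return customMatches ? Gesture.custom : Gesture.none;
}
```

With `denied: {Gesture.fist}`, a recognized fist is dropped by the filter and a matching custom pattern is reported instead.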

Supported Gestures

Gesture	Enum	Source	When detected
Fist	`Gesture.fist`	MediaPipe	All fingers closed
Open Hand	`Gesture.openHand`	MediaPipe	All fingers spread
Peace	`Gesture.peace`	MediaPipe	Index + middle up
Thumbs Up	`Gesture.thumbsUp`	MediaPipe	Thumb up, others closed
Thumbs Down	`Gesture.thumbsDown`	MediaPipe	Thumb down, others closed
Pointing Up	`Gesture.pointingUp`	MediaPipe	Index up, others closed
I Love You	`Gesture.iLoveYou`	MediaPipe	Thumb + index + pinky
OK	`Gesture.ok`	Custom	Thumb-index pinch, others extended
One–Five	`Gesture.one`–`Gesture.five`	Custom	Counting patterns
User-defined	`Gesture.custom`	Your config	Check `customGestureName`

Face Detection

FaceConfig

FaceConfig(
  detectEmotion: true,       // run TFLite emotion classifier (~5-15ms extra)
  detectLandmarks: false,    // 10 face landmark points (eyes, nose, mouth, ears, cheeks)
  detectContours: false,     // 15 face contour types (detailed mesh)
  minFaceSize: 0.1,          // [0.0, 1.0] — fraction of image width; smaller = slower
  enableTracking: true,      // stable face IDs across frames (can't use with contours)
  minEmotionConfidence: 0.4, // stored for future filtering
  accurateMode: false,       // ML Kit ACCURATE mode — better for distant faces, ~2-3x slower
)

Note: Contour mode and face tracking are mutually exclusive (ML Kit limitation on both platforms). Enabling contours automatically disables tracking.

FaceResult

final face = result.primaryFace;
if (face != null) {
  print(face.emotion);              // Emotion.happy
  print(face.emotionConfidence);    // 0.98
  print(face.smilingProbability);   // 0.95 (null if not available)
  print(face.leftEyeOpenProbability);  // 0.92
  print(face.rightEyeOpenProbability); // 0.88
  print(face.trackingId);           // 42 (-1 when tracking disabled)
  print(face.boundingBox);          // Rect in pixel coordinates
  
  // Euler angles (degrees)
  print(face.headEulerAngleX);  // pitch: positive = looking up
  print(face.headEulerAngleY);  // yaw: positive = turned right  
  print(face.headEulerAngleZ);  // roll: positive = head tilted right
  
  // Emotion scores for all 7 classes
  face.emotionScores.forEach((emotion, score) {
    print('$emotion: ${(score * 100).toStringAsFixed(0)}%');
  });
}

Property	Type	Description
`emotion`	`Emotion`	Highest-scoring emotion
`emotionConfidence`	`double`	Score for the top emotion, in `[0.0, 1.0]`
`emotionScores`	`Map<Emotion, double>`	All 7 class probabilities
`boundingBox`	`Rect`	Face position in pixel coordinates
`headEulerAngleX`	`double`	Pitch in degrees (+ = looking up)
`headEulerAngleY`	`double`	Yaw in degrees (+ = turned right)
`headEulerAngleZ`	`double`	Roll in degrees (+ = tilted right)
`smilingProbability`	`double?`	`[0.0, 1.0]`, or null if unavailable
`leftEyeOpenProbability`	`double?`	`[0.0, 1.0]`, or null if unavailable
`rightEyeOpenProbability`	`double?`	`[0.0, 1.0]`, or null if unavailable
`trackingId`	`int`	Stable ID across frames (-1 when tracking off)
`landmarks`	`List<Offset>?`	10 points in pixel coords (null when `detectLandmarks: false`)
`contours`	`List<List<Offset>>?`	15 contour polylines (null when `detectContours: false`)
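
Face boxes, landmarks, and contours are in camera-frame pixels, so drawing them over a preview of a different size needs a scale step based on `imageSize`. A minimal sketch, assuming the preview is stretched to fill the view (no letterboxing or cropping). `Box` and `scaleToView` are hypothetical helpers, not package API.

```dart
/// Hypothetical stand-in for a rectangle in pixel coordinates.
class Box {
  final double left, top, right, bottom;
  const Box(this.left, this.top, this.right, this.bottom);
}

/// Maps a box from camera-frame pixels to view pixels by scaling
/// x and y independently.
Box scaleToView(
    Box box, double imageW, double imageH, double viewW, double viewH) {
  final sx = viewW / imageW;
  final sy = viewH / imageH;
  return Box(box.left * sx, box.top * sy, box.right * sx, box.bottom * sy);
}
```

If the preview keeps its aspect ratio instead, a single scale factor plus an offset would be needed in place of the independent `sx` and `sy`.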

Supported Emotions

Emotion	Enum	Reliability	Notes
Happy	`Emotion.happy`	High	Smiles detected very reliably
Neutral	`Emotion.neutral`	High	Default resting face
Surprised	`Emotion.surprised`	High	Wide eyes + open mouth
Sad	`Emotion.sad`	Medium	Works with exaggerated expressions
Angry	`Emotion.angry`	Medium	Furrowed brows help
Disgusted	`Emotion.disgusted`	Low	Often confused with angry
Fearful	`Emotion.fearful`	Low	Often confused with surprised

Face Landmarks (10 points)

When detectLandmarks: true, pixel-coordinate positions for:

Index	Point	Use case
0	Left eye center	Gaze direction, blink
1	Right eye center	Gaze direction, blink
2	Nose base	Face center reference
3	Mouth left corner	Smile detection
4	Mouth right corner	Smile width
5	Mouth bottom	Mouth open detection
6	Left ear	Face width
7	Right ear	Face width
8	Left cheek	Face shape
9	Right cheek	Face shape

Missing points (face turned away) return Offset(-1, -1).
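
Any measurement built on these points should skip the sentinel first. A small sketch using `Point` from `dart:math` as a stand-in for `Offset`; `landmarkDistance` is a hypothetical helper, and treating any negative coordinate as missing is an assumption based on the sentinel described above.

```dart
import 'dart:math' as math;

/// Returns the pixel distance between two landmark points, or null when
/// either one is the (-1, -1) "missing" sentinel.
double? landmarkDistance(math.Point<double> a, math.Point<double> b) {
  bool missing(math.Point<double> p) => p.x < 0 || p.y < 0;
  if (missing(a) || missing(b)) return null;
  return a.distanceTo(b);
}
```

Passing the two mouth corners (indices 3 and 4) gives a mouth width in pixels, or null while the face is turned away.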

Face Contours (15 types)

When detectContours: true, detailed polylines for face mesh rendering:

Face outline, left/right eyebrow (top + bottom), left/right eye, upper/lower lip (top + bottom), nose bridge, nose bottom, left/right cheek center.

Each contour is a List<Offset> of connected points in pixel coordinates.

Dart-Only Detectors

These run entirely in Dart — no native code, no extra ML models. They consume FaceResult or HandResult from the stream and compute higher-level events. All are stateful: create once, feed every frame, call reset() when switching subjects.

BlinkDetector

Detects eye blinks from open/close probability transitions.

final blinkDetector = BlinkDetector(
  openThreshold: 0.7,       // above this = "eyes open"
  closedThreshold: 0.3,     // below this = "eyes closed"
  maxBlinkDurationMs: 500,  // longer closures are ignored (not a blink)
);

vision.results.listen((result) {
  final face = result.primaryFace;
  if (face != null) {
    final blink = blinkDetector.update(face, result.timestampMs);
    if (blink != null) {
      print('${blink.eye} blink, ${blink.durationMs}ms'); // BlinkEye.left, .right, or .both
    }
  }
});

Use cases: Blink-to-click for accessibility, drowsiness detection (slow/frequent blinks), liveness check for authentication.
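
The two thresholds form a hysteresis band: values between them do not change state, which keeps noisy probabilities from producing false blinks. A single-eye sketch of that idea, not the package's implementation; `SimpleBlinkDetector` is a hypothetical class that takes a raw probability rather than a `FaceResult`.

```dart
/// Hypothetical single-eye blink detector using two thresholds.
class SimpleBlinkDetector {
  final double openThreshold;
  final double closedThreshold;
  final int maxBlinkDurationMs;
  int? _closedAtMs; // timestamp when the eye was first seen closed

  SimpleBlinkDetector({
    this.openThreshold = 0.7,
    this.closedThreshold = 0.3,
    this.maxBlinkDurationMs = 500,
  });

  /// Feed one frame. Returns the closure duration in ms when a closed eye
  /// reopens within the allowed time, otherwise null.
  int? update(double eyeOpenProbability, int timestampMs) {
    if (eyeOpenProbability < closedThreshold) {
      _closedAtMs ??= timestampMs; // remember only the first closed frame
      return null;
    }
    if (eyeOpenProbability > openThreshold && _closedAtMs != null) {
      final duration = timestampMs - _closedAtMs!;
      _closedAtMs = null;
      return duration <= maxBlinkDurationMs ? duration : null;
    }
    return null; // between thresholds: keep the current state
  }

  /// Clears state, e.g. when the tracked face changes.
  void reset() => _closedAtMs = null;
}
```

A closure that lasts longer than `maxBlinkDurationMs` is discarded here; a drowsiness monitor would want to report those long closures rather than drop them.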

HeadGestureDetector

Detects head nod (yes) and shake (no) from Euler angle oscillations.

final headDetector = HeadGestureDetector(
  nodAngleThreshold: 8.0,      // degrees of pitch change to count as a nod movement
  shakeAngleThreshold: 10.0,   // degrees of yaw change to count as a shake movement
  minOscillations: 3,          // direction changes needed (3 = 1.5 back-and-forth cycles)
  windowMs: 1000,              // oscillations must happen within this time window
  cooldownMs: 1500,            // wait after detection before allowing another
);

vision.results.listen((result) {
  final face = result.primaryFace;
  if (face != null) {
    final gesture = headDetector.update(face, result.timestampMs);
    if (gesture != null) {
      print(gesture.gesture == HeadGesture.nod ? 'YES' : 'NO');
    }
  }
});

Use cases: Hands-free yes/no input, survey responses, accessibility confirmation.
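
One way to detect such oscillations is to count alternating swings of a single angle inside a sliding time window. The sketch below is a simplified illustration, not the package's algorithm: `OscillationDetector` is a hypothetical class, it counts swings (including the first one) rather than strict reversals, and it omits the cooldown.

```dart
/// Hypothetical oscillation counter for one Euler angle
/// (pitch for nods, yaw for shakes).
class OscillationDetector {
  final double angleThreshold;
  final int minOscillations;
  final int windowMs;
  double? _anchor; // angle at the last counted movement (or first sample)
  int _lastDirection = 0; // +1 or -1 once a swing has been seen
  final List<int> _swingTimes = [];

  OscillationDetector({
    this.angleThreshold = 8.0,
    this.minOscillations = 3,
    this.windowMs = 1000,
  });

  /// Returns true when enough alternating swings fall inside the window.
  bool update(double angleDegrees, int timestampMs) {
    final anchor = _anchor ??= angleDegrees;
    final delta = angleDegrees - anchor;
    if (delta.abs() < angleThreshold) return false;

    final direction = delta > 0 ? 1 : -1;
    _anchor = angleDegrees;
    if (direction != _lastDirection) {
      // Only a change of direction counts as a new swing.
      _lastDirection = direction;
      _swingTimes.add(timestampMs);
    }
    _swingTimes.removeWhere((t) => timestampMs - t > windowMs);
    if (_swingTimes.length >= minOscillations) {
      _swingTimes.clear();
      _lastDirection = 0;
      return true;
    }
    return false;
  }
}
```

Feeding pitch into one instance and yaw into another separates nods from shakes. A steady turn in one direction never fires, because it produces only a single swing.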

FaceDistanceEstimator

Estimates camera-to-face distance using the pinhole camera model.

final distanceEstimator = FaceDistanceEstimator(
  assumedFaceWidthCm: 15.0,  // average adult face ~14-16cm
  cameraFovDegrees: 75.0,    // most phone front cameras are 70-80 degrees
);

vision.results.listen((result) {
  final face = result.primaryFace;
  if (face != null) {
    final estimate = distanceEstimator.estimate(face, result.imageSize);
    if (estimate != null) {
      print('${estimate.distanceCm.toStringAsFixed(0)}cm — ${estimate.zone.name}');
      // Zones: veryClose (<30cm), close (30-60cm), medium (60-120cm), far (>120cm)
    }
  }
});

Use cases: Screen distance warnings, social distancing, zoom-based UI scaling. Expect an error of roughly 20-30%: good enough for zone detection, not for precise measurement.
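
The pinhole model relates real width, width in pixels, and focal length in pixels. A minimal sketch of that formula, assuming `cameraFovDegrees` is the horizontal field of view and the face width is taken from the bounding box. `estimateDistanceCm` and `zoneFor` are hypothetical helpers; the zone boundaries come from the comment above.

```dart
import 'dart:math' as math;

/// Estimates distance in cm from the face's width in pixels.
///   focalLengthPx = (imageWidth / 2) / tan(fov / 2)
///   distance      = realWidth * focalLengthPx / widthInPixels
double? estimateDistanceCm({
  required double faceWidthPx,
  required double imageWidthPx,
  double assumedFaceWidthCm = 15.0,
  double cameraFovDegrees = 75.0,
}) {
  if (faceWidthPx <= 0 || imageWidthPx <= 0) return null;
  final halfFovRad = cameraFovDegrees * math.pi / 360.0;
  final focalLengthPx = (imageWidthPx / 2) / math.tan(halfFovRad);
  return assumedFaceWidthCm * focalLengthPx / faceWidthPx;
}

/// Maps a distance to one of the four zone names.
String zoneFor(double distanceCm) {
  if (distanceCm < 30) return 'veryClose';
  if (distanceCm < 60) return 'close';
  if (distanceCm <= 120) return 'medium';
  return 'far';
}
```

The estimate scales linearly with `assumedFaceWidthCm`, so a real face of 14 cm measured with the 15 cm default is already off by about 7% before any FOV or bounding-box error is added.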

AttentionScorer

Combines three signals into a single 0-100% attention/engagement score:

Eye openness (40% weight) — average of both eyes
Face orientation (40% weight) — pitch + yaw distance from center
Head stability (20% weight) — inverse of angular velocity over 500ms

final scorer = AttentionScorer(
  eyeWeight: 0.4,
  orientationWeight: 0.4,
  stabilityWeight: 0.2,
  maxPitchDegrees: 45.0,       // beyond this angle, orientation score = 0
  maxYawDegrees: 45.0,
  stabilityWindowMs: 500,
  maxAngularVelocity: 60.0,    // degrees/sec above which stability = 0
);

vision.results.listen((result) {
  final face = result.primaryFace;
  if (face != null) {
    final attention = scorer.update(face, result.timestampMs);
    if (attention != null) {
      print('Attention: ${(attention.score * 100).toStringAsFixed(0)}% (${attention.level.name})');
      print('  Eye: ${(attention.eyeScore * 100).toStringAsFixed(0)}%');
      print('  Orientation: ${(attention.orientationScore * 100).toStringAsFixed(0)}%');
      print('  Stability: ${(attention.stabilityScore * 100).toStringAsFixed(0)}%');
      // AttentionLevel: high (>=75%), medium (45-75%), low (15-45%), none (<15%)
    }
  }
});

Use cases: E-learning engagement tracking, proctoring, driver monitoring, meeting participation.
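
The weighted combination can be written out directly. This is a sketch under assumptions, not the package's formula: the orientation score uses the larger of the normalized pitch and yaw, the stability score falls off linearly with angular velocity, and all function names are hypothetical.

```dart
import 'dart:math' as math;

double _clamp01(double v) => v.clamp(0.0, 1.0).toDouble();

/// 1 when facing the camera, 0 at or beyond either angle limit.
double orientationScore(double pitch, double yaw,
    {double maxPitch = 45.0, double maxYaw = 45.0}) {
  final p = _clamp01(pitch.abs() / maxPitch);
  final y = _clamp01(yaw.abs() / maxYaw);
  return 1.0 - math.max(p, y);
}

/// 1 when still, 0 at or above [maxAngularVelocity] degrees per second.
double stabilityScore(double degreesPerSecond,
        {double maxAngularVelocity = 60.0}) =>
    1.0 - _clamp01(degreesPerSecond.abs() / maxAngularVelocity);

/// Weighted sum of the three signals, in [0, 1] when weights sum to 1.
double attentionScore({
  required double leftEyeOpen,
  required double rightEyeOpen,
  required double pitch,
  required double yaw,
  required double degreesPerSecond,
  double eyeWeight = 0.4,
  double orientationWeight = 0.4,
  double stabilityWeight = 0.2,
}) {
  final eye = (leftEyeOpen + rightEyeOpen) / 2;
  return eyeWeight * eye +
      orientationWeight * orientationScore(pitch, yaw) +
      stabilityWeight * stabilityScore(degreesPerSecond);
}

/// Level boundaries taken from the comment above.
String levelFor(double score) {
  if (score >= 0.75) return 'high';
  if (score >= 0.45) return 'medium';
  if (score >= 0.15) return 'low';
  return 'none';
}
```

With half-open eyes, the head pitched to the 45 degree limit, and 30 deg/s of movement, the parts contribute 0.2, 0.0, and 0.1, for a score of 0.3 ('low').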

HandMotionTracker

Tracks hand velocity and movement direction across frames.

final tracker = HandMotionTracker(
  windowMs: 200,                // velocity averaged over this window
  stillThreshold: 0.02,         // below this speed = still
  trackingLandmarkIndex: 0,     // 0 = wrist (default), or any landmark index
);

vision.results.listen((result) {
  final hand = result.primaryHand;
  if (hand != null) {
    final motion = tracker.update(hand, result.timestampMs);
    if (motion != null) {
      print('Speed: ${motion.speed.toStringAsFixed(2)}/s');  // normalized units/sec
      print('Direction: ${motion.direction.name}');          // up, upRight, right, etc.
      print('State: ${motion.state.name}');                  // still, slow, moderate, fast
      print('Velocity: (${motion.velocityX}, ${motion.velocityY})');
    }
  }
});

Directions: up, upRight, right, downRight, down, downLeft, left, upLeft (8 compass points).

States: still (<0.02), slow (0.02-0.15), moderate (0.15-0.5), fast (>0.5 normalized units/sec).

Use cases: Swipe gesture recognition, wave detection, touchless scrolling direction.
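
Mapping a velocity vector to one of the eight directions and four states comes down to an `atan2` call and a few comparisons. A sketch assuming image coordinates where y grows downward; `directionOf`, `speedOf`, and `stateOf` are hypothetical helpers, and the state boundaries come from the list above.

```dart
import 'dart:math' as math;

const _directions = [
  'right', 'upRight', 'up', 'upLeft',
  'left', 'downLeft', 'down', 'downRight',
];

/// Maps a velocity to one of 8 compass directions.
/// Image y grows downward, so vy is negated to make "up" positive.
String directionOf(double vx, double vy) {
  final angle = math.atan2(-vy, vx); // radians, counter-clockwise from right
  final sector = ((angle / (math.pi / 4)).round() + 8) % 8;
  return _directions[sector];
}

/// Magnitude of the velocity in normalized units per second.
double speedOf(double vx, double vy) => math.sqrt(vx * vx + vy * vy);

/// Classifies a speed using the documented boundaries.
String stateOf(double speed) {
  if (speed < 0.02) return 'still';
  if (speed < 0.15) return 'slow';
  if (speed <= 0.5) return 'moderate';
  return 'fast';
}
```

A swipe recognizer could then fire when the state is 'fast' and the direction stays the same for a few consecutive frames.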

TwoHandInteractionDetector

Detects interactions between two hands.

final twoHand = TwoHandInteractionDetector(
  pinchThreshold: 0.06,          // index tips within 6% of image width
  touchThreshold: 0.08,          // any fingertips within 8%
  clapVelocityThreshold: 0.3,    // wrist approach speed for clap
  cooldownMs: 500,               // ms between detections
);

vision.results.listen((result) {
  final event = twoHand.update(result);  // takes full VisionResult, not single hand
  if (event != null) {
    print('${event.gesture.name} at distance ${event.distance.toStringAsFixed(3)}');
    // TwoHandGesture: pinch, clap, touching
  }
});

Requires HandConfig(maxHands: 2). Detection priority: pinch (most specific) → clap (velocity-based) → touching (fallback).

Use cases: Zoom gestures, clap-to-action, collaborative interactions.
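
The priority order amounts to an if-chain that tests the most specific condition first. A sketch with precomputed inputs; `classifyTwoHands` is a hypothetical function, and using approach speed alone for a clap is a simplification (a real detector would likely also require the hands to be close).

```dart
/// Returns 'pinch', 'clap', 'touching', or null, checked in that order.
/// Distances are in normalized image units, speed in units per second.
String? classifyTwoHands({
  required double indexTipDistance,
  required double minFingertipDistance,
  required double wristApproachSpeed,
  double pinchThreshold = 0.06,
  double touchThreshold = 0.08,
  double clapVelocityThreshold = 0.3,
}) {
  if (indexTipDistance < pinchThreshold) return 'pinch';
  if (wristApproachSpeed > clapVelocityThreshold) return 'clap';
  if (minFingertipDistance < touchThreshold) return 'touching';
  return null;
}
```

Because pinch is checked first, two index tips that meet during a fast approach are reported as a pinch, not a clap.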

Camera Configuration

CameraConfig(
  facing: CameraFacing.front,           // .front or .back
  resolution: AnalysisResolution.medium, // .low (320x240), .medium (640x480), .high (1280x720)
  maxResultsPerSecond: 0,               // 0 = no throttle (every frame)
)

Emission Throttling

Control how many results per second reach Dart. The ML pipeline still runs at full speed; throttling only skips emissions, so each result that does arrive comes from the latest frame.

Value	Effect	Best for
`0`	Every frame (~20-30 FPS)	Smooth hand skeleton drawing
`10-15`	Balanced	Gesture/emotion labels with acceptable landmark lag
`5`	Labels only	Minimal CPU, choppy skeletons

CameraConfig(maxResultsPerSecond: 10)
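
Throttling of this kind is usually a minimum-interval check on timestamps. A sketch of the idea, not the package's native implementation; `EmissionThrottle` is a hypothetical class.

```dart
/// Hypothetical emission throttle based on a minimum interval.
class EmissionThrottle {
  final int maxPerSecond; // 0 disables throttling
  int? _lastEmitMs;

  EmissionThrottle(this.maxPerSecond);

  /// Returns true if a result with this timestamp should be emitted.
  bool shouldEmit(int timestampMs) {
    if (maxPerSecond <= 0) return true;
    final minIntervalMs = 1000 / maxPerSecond;
    final last = _lastEmitMs;
    if (last != null && timestampMs - last < minIntervalMs) return false;
    _lastEmitMs = timestampMs;
    return true;
  }
}
```

At a limit of 10 with frames arriving every 33 ms, roughly every third or fourth frame is emitted and the ones in between are dropped.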

Camera Preview

start() returns a Future that resolves to a texture ID. Render it with Flutter's Texture widget:

final textureId = await vision.start();
// In your build:
Texture(textureId: textureId)

Or use VisionAiCameraView from vision_ai_flutter for a complete solution with overlays.

Architecture

All ML inference runs on-device:

Hand gestures: MediaPipe Gesture Recognizer (~8MB model, GPU delegate with CPU fallback)
Face detection: Google ML Kit Face Detection (bundled per-platform)
Emotion: TFLite CNN trained on FER2013 (~2MB model, 2 inference threads)

Camera frames are processed natively (CameraX on Android, AVFoundation on iOS). Only lightweight results cross the platform channel — raw frame data never leaves the native side.

Example App

The package ships with a full-featured demo app that lets you test every feature before writing any code. It includes a settings panel with per-feature toggles organized into cards, so you can enable or disable individual capabilities and see the results in real time.

cd example
flutter run

What you can test:

Toggle hand detection, face detection, or both simultaneously
Switch between front/back camera and low/medium/high resolution
Enable hand motion tracking, two-hand interaction, gesture filtering
Enable blink detection, head nod/shake, face distance, attention scoring
Toggle individual overlays: hand skeleton, hand bounding box, face box, face contours, gesture label, emotion label, world coordinates
Adjust detection confidence, min face size, max results/sec with sliders
Try accurate mode for face detection
Define a custom "rock" gesture out of the box

Overlay toggles apply instantly. Detection and camera changes require a restart (tap Stop, then Start). When you disable hand or face detection, all related sub-settings and overlay options disappear automatically and reset to their defaults.

The example also serves as a reference implementation showing how to use ValueNotifier + ValueListenableBuilder instead of setState for reactive state management with this package.

iOS Beta

The iOS implementation is complete and mirrors the Android architecture (AVFoundation + MediaPipe + ML Kit + TFLite), but has not been extensively tested on physical devices. If you have a Mac + iPhone/iPad:

Run the example app: cd example && flutter run
Test hand gestures, face detection, emotion classification
Share crash logs or issues at GitHub Issues with the ios label

Your testing helps us move iOS from Beta to Stable.

License

Apache 2.0 — see LICENSE and NOTICE. Forks must retain attribution and state changes.