axiom-vision

Vision Framework Computer Vision

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "axiom-vision" with this command: npx skills add fotescodev/ios-agent-skills/fotescodev-ios-agent-skills-axiom-vision

Vision Framework Computer Vision

Guides you through implementing computer vision: subject segmentation, hand/body pose detection, person detection, text recognition, barcode detection, document scanning, and combining Vision APIs to solve complex problems.

When to Use This Skill

Use when you need to:

  • ☑ Isolate subjects from backgrounds (subject lifting)

  • ☑ Detect and track hand poses for gestures

  • ☑ Detect and track body poses for fitness/action classification

  • ☑ Segment multiple people separately

  • ☑ Exclude hands from object bounding boxes (combining APIs)

  • ☑ Choose between VisionKit and Vision framework

  • ☑ Combine Vision with CoreImage for compositing

  • ☑ Decide which Vision API solves your problem

  • ☑ Recognize text in images (OCR)

  • ☑ Detect barcodes and QR codes

  • ☑ Scan documents with perspective correction

  • ☑ Extract structured data from documents (iOS 26+)

  • ☑ Build live scanning experiences (DataScannerViewController)

Example Prompts

"How do I isolate a subject from the background?" "I need to detect hand gestures like pinch" "How can I get a bounding box around an object without including the hand holding it?" "Should I use VisionKit or Vision framework for subject lifting?" "How do I segment multiple people separately?" "I need to detect body poses for a fitness app" "How do I preserve HDR when compositing subjects on new backgrounds?" "How do I recognize text in an image?" "I need to scan QR codes from camera" "How do I extract data from a receipt?" "Should I use DataScannerViewController or Vision directly?" "How do I scan documents and correct perspective?" "I need to extract table data from a document"

Red Flags

Signs you're making this harder than it needs to be:

  • ❌ Manually implementing subject segmentation with CoreML models

  • ❌ Using ARKit just for body pose (Vision works offline)

  • ❌ Writing gesture recognition from scratch (use hand pose + simple distance checks)

  • ❌ Processing on main thread (blocks UI - Vision is resource intensive)

  • ❌ Training custom models when Vision APIs already exist

  • ❌ Not checking confidence scores (low confidence = unreliable landmarks)

  • ❌ Forgetting to convert coordinates (lower-left origin vs UIKit top-left)

  • ❌ Building custom text recognizer when VNRecognizeTextRequest exists

  • ❌ Using AVFoundation + Vision when DataScannerViewController suffices

  • ❌ Processing every camera frame for scanning (skip frames, use region of interest)

  • ❌ Enabling all barcode symbologies when you only need one (performance hit)

  • ❌ Ignoring RecognizeDocumentsRequest when you need table/list structure (iOS 26+)

Mandatory First Steps

Before implementing any Vision feature:

  1. Choose the Right API (Decision Tree)

What do you need to do?

┌─ Isolate subject(s) from background?
│  ├─ Need system UI + out-of-process → VisionKit
│  │  ├─ ImageAnalysisInteraction (iOS/iPadOS)
│  │  └─ ImageAnalysisOverlayView (macOS)
│  ├─ Need custom pipeline / HDR / large images → Vision
│  │  └─ VNGenerateForegroundInstanceMaskRequest
│  └─ Need to EXCLUDE hands from object → Combine APIs
│     └─ Subject mask + Hand pose + custom masking (see Pattern 1)
│
├─ Segment people?
│  ├─ All people in one mask → VNGeneratePersonSegmentationRequest
│  └─ Separate mask per person (up to 4) → VNGeneratePersonInstanceMaskRequest
│
├─ Detect hand pose/gestures?
│  ├─ Just hand location → VNDetectHumanRectanglesRequest
│  └─ 21 hand landmarks → VNDetectHumanHandPoseRequest
│     └─ Gesture recognition → Hand pose + distance checks
│
├─ Detect body pose?
│  ├─ 2D normalized landmarks → VNDetectHumanBodyPoseRequest
│  ├─ 3D real-world coordinates → VNDetectHumanBodyPose3DRequest
│  └─ Action classification → Body pose + CreateML model
│
├─ Face detection?
│  ├─ Just bounding boxes → VNDetectFaceRectanglesRequest
│  └─ Detailed landmarks → VNDetectFaceLandmarksRequest
│
├─ Person detection (location only)?
│  └─ VNDetectHumanRectanglesRequest
│
├─ Recognize text in images?
│  ├─ Real-time from camera + need UI → DataScannerViewController (iOS 16+)
│  ├─ Processing captured image → VNRecognizeTextRequest
│  │  ├─ Need speed (real-time camera) → recognitionLevel = .fast
│  │  └─ Need accuracy (documents) → recognitionLevel = .accurate
│  └─ Need structured documents (iOS 26+) → RecognizeDocumentsRequest
│
├─ Detect barcodes/QR codes?
│  ├─ Real-time camera + need UI → DataScannerViewController (iOS 16+)
│  └─ Processing image → VNDetectBarcodesRequest
│
└─ Scan documents?
   ├─ Need built-in UI + perspective correction → VNDocumentCameraViewController
   ├─ Need structured data (tables, lists) → RecognizeDocumentsRequest (iOS 26+)
   └─ Custom pipeline → VNDetectDocumentSegmentationRequest + perspective correction

  2. Set Up Background Processing

NEVER run Vision on main thread:

let processingQueue = DispatchQueue(label: "com.yourapp.vision", qos: .userInitiated)

processingQueue.async {
    do {
        let request = VNGenerateForegroundInstanceMaskRequest()
        let handler = VNImageRequestHandler(cgImage: image)
        try handler.perform([request])

        // Process observations...

        DispatchQueue.main.async {
            // Update UI
        }
    } catch {
        // Handle error
    }
}
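If the call site already uses Swift concurrency, the same pattern can be expressed with a detached task instead of GCD. A minimal sketch, assuming a CGImage input; the function name and return handling are illustrative, not part of Vision:

import Vision

// Sketch: run the synchronous Vision call off the main thread and hand
// back the typed observation (or nil) to the caller.
func detectSubjectMask(in image: CGImage) async throws -> VNInstanceMaskObservation? {
    try await Task.detached(priority: .userInitiated) {
        let request = VNGenerateForegroundInstanceMaskRequest()
        let handler = VNImageRequestHandler(cgImage: image)
        try handler.perform([request])
        return request.results?.first
    }.value
}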

  3. Verify Platform Availability

API: Minimum Version

Subject segmentation (instance masks): iOS 17+

VisionKit subject lifting: iOS 16+

Hand pose: iOS 14+

Body pose (2D): iOS 14+

Body pose (3D): iOS 17+

Person instance segmentation: iOS 17+

VNRecognizeTextRequest (basic): iOS 13+

VNRecognizeTextRequest (accurate, multi-lang): iOS 14+

VNDetectBarcodesRequest: iOS 11+

VNDetectBarcodesRequest (revision 2: Codabar, MicroQR): iOS 15+

VNDetectBarcodesRequest (revision 3: ML-based): iOS 16+

DataScannerViewController: iOS 16+

VNDocumentCameraViewController: iOS 13+

VNDetectDocumentSegmentationRequest: iOS 15+

RecognizeDocumentsRequest: iOS 26+
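These cutoffs can be honored at runtime with #available. A minimal gating sketch; the fallback choice (single-mask person segmentation on iOS 15-16) is one option, not a requirement:

let handler = VNImageRequestHandler(cgImage: image)

if #available(iOS 17.0, *) {
    // Per-instance masks (individual subjects / people)
    let request = VNGenerateForegroundInstanceMaskRequest()
    try handler.perform([request])
} else {
    // iOS 15-16 fallback: one mask covering all people
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .balanced
    try handler.perform([request])
}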

Common Patterns

Pattern 1: Isolate Object While Excluding Hand

User's original problem: Getting a bounding box around an object held in hand, without including the hand.

Root cause: VNGenerateForegroundInstanceMaskRequest is class-agnostic and treats hand+object as one subject.

Solution: Combine subject mask with hand pose to create exclusion mask.

// 1. Get subject instance mask
let subjectRequest = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([subjectRequest])

guard let subjectObservation = subjectRequest.results?.first as? VNInstanceMaskObservation else {
    fatalError("No subject detected")
}

// 2. Get hand pose landmarks
let handRequest = VNDetectHumanHandPoseRequest()
handRequest.maximumHandCount = 2
try handler.perform([handRequest])

guard let handObservation = handRequest.results?.first as? VNHumanHandPoseObservation else {
    // No hand detected - use full subject mask
    let mask = try subjectObservation.createScaledMask(
        for: subjectObservation.allInstances,
        croppedToInstancesContent: false
    )
    return mask
}

// 3. Create hand exclusion region from landmarks
let handPoints = try handObservation.recognizedPoints(.all)
let handBounds = calculateConvexHull(from: handPoints) // Your implementation

// 4. Subtract hand region from subject mask using CoreImage
let subjectMask = try subjectObservation.createScaledMask(
    for: subjectObservation.allInstances,
    croppedToInstancesContent: false
)

let subjectCIMask = CIImage(cvPixelBuffer: subjectMask)
let handMask = createMaskFromRegion(handBounds, size: sourceImage.size)
let finalMask = subtractMasks(handMask: handMask, from: subjectCIMask)

// 5. Calculate bounding box from final mask
let objectBounds = calculateBoundingBox(from: finalMask)

Helper: Convex Hull

func calculateConvexHull(from points: [VNHumanHandPoseObservation.JointName: VNRecognizedPoint]) -> CGRect {
    // Get high-confidence points
    let validPoints = points.values.filter { $0.confidence > 0.5 }

    guard !validPoints.isEmpty else { return .zero }

    // Simple bounding rect (for more accuracy, use an actual convex hull algorithm)
    let xs = validPoints.map { $0.location.x }
    let ys = validPoints.map { $0.location.y }

    let minX = xs.min()!
    let maxX = xs.max()!
    let minY = ys.min()!
    let maxY = ys.max()!

    return CGRect(
        x: minX,
        y: minY,
        width: maxX - minX,
        height: maxY - minY
    )
}
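createMaskFromRegion and subtractMasks in Pattern 1 are app-side helpers, not Vision APIs. One possible sketch of the subtraction step using built-in CoreImage filters (the specific filter chain is an assumption; any mask-compositing approach works):

import CoreImage

// Subtract the hand mask from the subject mask: invert the hand mask
// (hand pixels become black), then multiply it with the subject mask so
// hand pixels are zeroed out while the rest of the subject is preserved.
func subtractMasks(handMask: CIImage, from subjectMask: CIImage) -> CIImage {
    let invertedHand = handMask.applyingFilter("CIColorInvert")
    return subjectMask.applyingFilter("CIMultiplyCompositing", parameters: [
        kCIInputBackgroundImageKey: invertedHand
    ])
}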

Cost: 2-5 hours initial implementation, 30 min ongoing maintenance

Pattern 2: VisionKit Simple Subject Lifting

Use case: Add system-like subject lifting UI with minimal code.

// iOS
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject
imageView.addInteraction(interaction)

// macOS
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)

When to use:

  • ✓ Want system behavior (long-press to select, drag to share)

  • ✓ Don't need custom processing pipeline

  • ✓ Image size within VisionKit limits (out-of-process)

Cost: 15 min implementation, 5 min ongoing

Pattern 3: Programmatic Subject Access (VisionKit)

Use case: Need subject images/bounds without UI interaction.

let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])

let analysis = try await analyzer.analyze(sourceImage, configuration: configuration)

// Get all subjects
for subject in await analysis.subjects {
    let subjectImage = try await subject.image
    let subjectBounds = subject.bounds

    // Process subject...
}

// Tap-based lookup
if let subject = try await analysis.subject(at: tapPoint) {
    let compositeImage = try await analysis.image(for: [subject])
}

Cost: 30 min implementation, 10 min ongoing

Pattern 4: Vision Instance Mask for Custom Pipeline

Use case: HDR preservation, large images, custom compositing.

let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else { return }

// Get soft segmentation mask
let mask = try observation.createScaledMask(
    for: observation.allInstances,
    croppedToInstancesContent: false // Full resolution for compositing
)

// Use with CoreImage for HDR preservation
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(CIImage(cgImage: sourceImage), forKey: kCIInputImageKey)
filter.setValue(CIImage(cvPixelBuffer: mask), forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)

let compositedImage = filter.outputImage

Cost: 1 hour implementation, 15 min ongoing

Pattern 5: Tap-to-Select Instance

Use case: User taps to select which subject/person to lift.

// Get instance at tap point
let instance = observation.instanceAtPoint(tapPoint)

if instance == 0 {
    // Background tapped - select all instances
    let mask = try observation.createScaledMask(
        for: observation.allInstances,
        croppedToInstancesContent: false
    )
} else {
    // Specific instance tapped
    let mask = try observation.createScaledMask(
        for: IndexSet(integer: instance),
        croppedToInstancesContent: true
    )
}

Alternative: Raw pixel buffer access

let instanceMask = observation.instanceMask

CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }

let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)

// Convert normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(tapPoint, imageWidth, imageHeight)

let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
let label = UnsafeRawPointer(baseAddress!).load(fromByteOffset: offset, as: UInt8.self)

Cost: 45 min implementation, 10 min ongoing

Pattern 6: Hand Gesture Recognition (Pinch)

Use case: Detect pinch gesture for custom camera trigger or UI control.

let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 1

try handler.perform([request])

guard let observation = request.results?.first as? VNHumanHandPoseObservation else { return }

let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)

// Check confidence
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else { return }

// Calculate distance (normalized coordinates)
let dx = thumbTip.location.x - indexTip.location.x
let dy = thumbTip.location.y - indexTip.location.y
let distance = sqrt(dx * dx + dy * dy)

let isPinching = distance < 0.05 // Adjust threshold

// State machine for evidence accumulation
if isPinching {
    pinchFrameCount += 1
    if pinchFrameCount >= 3 {
        state = .pinched
    }
} else {
    pinchFrameCount = max(0, pinchFrameCount - 1)
    if pinchFrameCount == 0 {
        state = .apart
    }
}

Cost: 2 hours implementation, 20 min ongoing

Pattern 7: Separate Multiple People

Use case: Apply different effects to each person or count people.

let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else { return }

let peopleCount = observation.allInstances.count // Up to 4

for personIndex in observation.allInstances {
    let personMask = try observation.createScaledMask(
        for: IndexSet(integer: personIndex),
        croppedToInstancesContent: false
    )

    // Apply effect to this person only
    applyEffect(to: personMask, personIndex: personIndex)
}

Crowded scenes (>4 people):

// Count faces to detect crowding
let faceRequest = VNDetectFaceRectanglesRequest()
try handler.perform([faceRequest])

let faceCount = faceRequest.results?.count ?? 0

if faceCount > 4 {
    // Fallback: Use single mask for all people
    let singleMaskRequest = VNGeneratePersonSegmentationRequest()
    try handler.perform([singleMaskRequest])
}

Cost: 1.5 hours implementation, 15 min ongoing

Pattern 8: Body Pose for Action Classification

Use case: Fitness app that recognizes exercises (jumping jacks, squats, etc.)

// 1. Collect body pose observations
var poseObservations: [VNHumanBodyPoseObservation] = []

let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])

if let observation = request.results?.first as? VNHumanBodyPoseObservation {
    poseObservations.append(observation)
}

// 2. When you have 60 frames of poses, prepare for CreateML model
if poseObservations.count == 60 {
    var multiArray = try MLMultiArray(
        shape: [60, 18, 3], // 60 frames, 18 joints, (x, y, confidence)
        dataType: .double
    )

for (frameIndex, observation) in poseObservations.enumerated() {
    let allPoints = try observation.recognizedPoints(.all)

    // Caution: dictionary iteration order is not stable; a CreateML action
    // classifier expects joints in a fixed order, so map joint names to
    // fixed indices in production code.
    for (jointIndex, (_, point)) in allPoints.enumerated() {
        multiArray[[frameIndex, jointIndex, 0] as [NSNumber]] = NSNumber(value: point.location.x)
        multiArray[[frameIndex, jointIndex, 1] as [NSNumber]] = NSNumber(value: point.location.y)
        multiArray[[frameIndex, jointIndex, 2] as [NSNumber]] = NSNumber(value: point.confidence)
    }
}

// 3. Run inference with CreateML model
let input = YourActionClassifierInput(poses: multiArray)
let output = try actionClassifier.prediction(input: input)

let action = output.label  // "jumping_jacks", "squats", etc.

}

Cost: 3-4 hours implementation, 1 hour ongoing

Pattern 9: Text Recognition (OCR)

Use case: Extract text from images, receipts, signs, documents.

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate // Or .fast for real-time
request.recognitionLanguages = ["en-US"] // Specify known languages
request.usesLanguageCorrection = true // Helps accuracy

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

for observation in observations {
    // Get top candidate (most likely)
    guard let candidate = observation.topCandidates(1).first else { continue }

let text = candidate.string
let confidence = candidate.confidence

// Get bounding box for specific substring
if let range = text.range(of: searchTerm) {
    if let boundingBox = try? candidate.boundingBox(for: range) {
        // Use for highlighting
    }
}

}

Fast vs Accurate:

  • Fast: Real-time camera, large legible text (signs, billboards), character-by-character

  • Accurate: Documents, receipts, small text, handwriting, ML-based word/line recognition

Language tips:

  • Order matters: first language determines ML model for accurate path

  • Use automaticallyDetectsLanguage = true only when language unknown

  • Query supportedRecognitionLanguages for the current revision (see the configuration sketch below)
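A small configuration sketch applying these tips; the language pair is only an example, and automaticallyDetectsLanguage requires iOS 16+:

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate

// Query what the current request revision supports before hard-coding languages
let supported = try request.supportedRecognitionLanguages()
if supported.contains("de-DE") {
    // Order matters: the first language selects the model on the accurate path
    request.recognitionLanguages = ["de-DE", "en-US"]
} else {
    // Language unknown or unsupported: let Vision detect it
    request.automaticallyDetectsLanguage = true
}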

Cost: 30 min basic implementation, 2 hours with language handling

Pattern 10: Barcode/QR Code Detection

Use case: Scan product barcodes, QR codes, healthcare codes.

let request = VNDetectBarcodesRequest()
request.revision = VNDetectBarcodesRequestRevision3 // ML-based, iOS 16+
request.symbologies = [.qr, .ean13] // Specify only what you need!

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observations = request.results as? [VNBarcodeObservation] else { return }

for barcode in observations {
    let payload = barcode.payloadStringValue // Decoded content
    let symbology = barcode.symbology // Type of barcode
    let bounds = barcode.boundingBox // Location (normalized)

print("Found \(symbology): \(payload ?? "no string")")

}

Performance tip: Specifying fewer symbologies = faster scanning

Revision differences:

  • Revision 1: One code at a time, 1D codes return lines

  • Revision 2: Codabar, GS1Databar, MicroPDF, MicroQR, better with ROI

  • Revision 3: ML-based, multiple codes at once, better bounding boxes, fewer duplicates (see the revision-selection sketch below)
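A minimal sketch of picking the revision at runtime, following the availability table above:

let request = VNDetectBarcodesRequest()
request.symbologies = [.qr]

if #available(iOS 16.0, *) {
    request.revision = VNDetectBarcodesRequestRevision3 // ML-based, multiple codes per frame
} else if #available(iOS 15.0, *) {
    request.revision = VNDetectBarcodesRequestRevision2 // Adds Codabar, MicroQR, better ROI handling
} else {
    request.revision = VNDetectBarcodesRequestRevision1
}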

Cost: 15 min implementation

Pattern 11: DataScannerViewController (Live Scanning)

Use case: Camera-based text/barcode scanning with built-in UI (iOS 16+).

import VisionKit

// Check support
guard DataScannerViewController.isSupported,
      DataScannerViewController.isAvailable else {
    // Not supported or camera access denied
    return
}

// Configure what to scan
let recognizedDataTypes: Set<DataScannerViewController.RecognizedDataType> = [
    .barcode(symbologies: [.qr]),
    .text(textContentType: .URL) // Or nil for all text
]

// Create and present
let scanner = DataScannerViewController(
    recognizedDataTypes: recognizedDataTypes,
    qualityLevel: .balanced, // Or .fast, .accurate
    recognizesMultipleItems: false, // Center-most if false
    isHighFrameRateTrackingEnabled: true, // For smooth highlights
    isPinchToZoomEnabled: true,
    isGuidanceEnabled: true,
    isHighlightingEnabled: true
)

scanner.delegate = self
present(scanner, animated: true) {
    try? scanner.startScanning()
}

Delegate methods:

func dataScanner(_ scanner: DataScannerViewController, didTapOn item: RecognizedItem) {
    switch item {
    case .text(let text):
        print("Tapped text: \(text.transcript)")
    case .barcode(let barcode):
        print("Tapped barcode: \(barcode.payloadStringValue ?? "")")
    @unknown default:
        break
    }
}

// For custom highlights
func dataScanner(_ scanner: DataScannerViewController, didAdd addedItems: [RecognizedItem], allItems: [RecognizedItem]) {
    for item in addedItems {
        let highlight = createHighlight(for: item)
        scanner.overlayContainerView.addSubview(highlight)
    }
}

Async stream alternative:

for await items in scanner.recognizedItems {
    // Process current items
}

Cost: 45 min implementation with custom highlights

Pattern 12: Document Scanning with VNDocumentCameraViewController

Use case: Scan paper documents with automatic edge detection and perspective correction.

import VisionKit

let documentCamera = VNDocumentCameraViewController()
documentCamera.delegate = self
present(documentCamera, animated: true)

// In delegate
func documentCameraViewController(_ controller: VNDocumentCameraViewController, didFinishWith scan: VNDocumentCameraScan) {
    controller.dismiss(animated: true)

// Process each page
for pageIndex in 0..<scan.pageCount {
    let image = scan.imageOfPage(at: pageIndex)

    // Now run text recognition on the corrected image
    let handler = VNImageRequestHandler(cgImage: image.cgImage!)
    let textRequest = VNRecognizeTextRequest()
    try? handler.perform([textRequest])
}

}

Cost: 30 min implementation

Pattern 13: Document Segmentation (Custom Pipeline)

Use case: Detect document edges programmatically for custom camera UI.

let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: inputImage)
try handler.perform([request])

guard let observation = request.results?.first,
      let document = observation as? VNRectangleObservation else { return }

// Get corner points (normalized coordinates)
let topLeft = document.topLeft
let topRight = document.topRight
let bottomLeft = document.bottomLeft
let bottomRight = document.bottomRight

// Apply perspective correction with CoreImage
let correctedImage = inputImage
    .cropped(to: document.boundingBox.scaled(to: imageSize))
    .applyingFilter("CIPerspectiveCorrection", parameters: [
        "inputTopLeft": CIVector(cgPoint: topLeft.scaled(to: imageSize)),
        "inputTopRight": CIVector(cgPoint: topRight.scaled(to: imageSize)),
        "inputBottomLeft": CIVector(cgPoint: bottomLeft.scaled(to: imageSize)),
        "inputBottomRight": CIVector(cgPoint: bottomRight.scaled(to: imageSize))
    ])
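scaled(to:) is not a CoreGraphics API; it stands in for the usual normalized-to-pixel conversion. A possible sketch of those helpers (no Y-flip is needed here because CoreImage shares Vision's lower-left origin):

import CoreGraphics

// Hypothetical helpers: convert Vision's normalized (0-1) geometry into
// pixel coordinates for CoreImage.
extension CGPoint {
    func scaled(to size: CGSize) -> CGPoint {
        CGPoint(x: x * size.width, y: y * size.height)
    }
}

extension CGRect {
    func scaled(to size: CGSize) -> CGRect {
        CGRect(x: origin.x * size.width,
               y: origin.y * size.height,
               width: width * size.width,
               height: height * size.height)
    }
}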

VNDetectDocumentSegmentationRequest vs VNDetectRectanglesRequest:

  • Document: ML-based, trained on documents, handles non-rectangles, returns one document

  • Rectangle: Edge-based, finds any quadrilateral, returns multiple, CPU-only

Cost: 1-2 hours implementation

Pattern 14: Structured Document Extraction (iOS 26+)

Use case: Extract tables, lists, paragraphs with semantic understanding.

// iOS 26+
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)

guard let document = observations.first?.document else { return }

// Extract tables
for table in document.tables {
    for row in table.rows {
        for cell in row {
            let text = cell.content.text.transcript
            print("Cell: \(text)")
        }
    }
}

// Get detected data (emails, phones, URLs, dates)
let allDetectedData = document.text.detectedData
for data in allDetectedData {
    switch data.match.details {
    case .emailAddress(let email):
        print("Email: \(email.emailAddress)")
    case .phoneNumber(let phone):
        print("Phone: \(phone.phoneNumber)")
    case .link(let url):
        print("URL: \(url)")
    default:
        break
    }
}

Document hierarchy:

  • Document → containers (text, tables, lists, barcodes)

  • Table → rows → cells → content

  • Content → text (transcript, lines, paragraphs, words, detectedData)

Cost: 1 hour implementation

Pattern 15: Real-time Phone Number Scanner

Use case: Scan phone numbers from camera like barcode scanner (from WWDC 2019).

// 1. Use region of interest to guide user
let textRequest = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

for observation in observations {
    guard let candidate = observation.topCandidates(1).first else { continue }

    // Use domain knowledge to filter
    if let phoneNumber = self.extractPhoneNumber(from: candidate.string) {
        self.stringTracker.add(phoneNumber)
    }
}

// Build evidence over frames
if let stableNumber = self.stringTracker.getStableString(threshold: 10) {
    self.foundPhoneNumber(stableNumber)
}

}

textRequest.recognitionLevel = .fast // Real-time
textRequest.usesLanguageCorrection = false // Codes, not natural text
textRequest.regionOfInterest = guidanceBox // Crop to user's focus area

// 2. String tracker for stability
class StringTracker {
    private var seenStrings: [String: Int] = [:]

func add(_ string: String) {
    seenStrings[string, default: 0] += 1
}

func getStableString(threshold: Int) -> String? {
    seenStrings.first { $0.value >= threshold }?.key
}

}
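extractPhoneNumber above is app code, not a Vision API. One possible sketch using Foundation's NSDataDetector in place of a hand-written regex:

import Foundation

// Hypothetical helper: return the first phone number found in a recognized string.
func extractPhoneNumber(from text: String) -> String? {
    guard let detector = try? NSDataDetector(
        types: NSTextCheckingResult.CheckingType.phoneNumber.rawValue
    ) else { return nil }

    let range = NSRange(text.startIndex..., in: text)
    return detector.matches(in: text, options: [], range: range)
        .first?
        .phoneNumber
}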

Key techniques from WWDC 2019:

  • Use .fast recognition level for real-time

  • Disable language correction for codes/numbers

  • Use region of interest to improve speed and focus

  • Build evidence over multiple frames (string tracker)

  • Apply domain knowledge (phone number regex)

Cost: 2 hours implementation

Anti-Patterns

Anti-Pattern 1: Processing on Main Thread

Wrong:

let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request]) // Blocks UI!

Right:

DispatchQueue.global(qos: .userInitiated).async {
    let request = VNGenerateForegroundInstanceMaskRequest()
    let handler = VNImageRequestHandler(cgImage: image)
    try? handler.perform([request]) // Handle errors in real code

    DispatchQueue.main.async {
        // Update UI
    }
}

Why it matters: Vision is resource-intensive; blocking the main thread freezes the UI.

Anti-Pattern 2: Ignoring Confidence Scores

Wrong:

let thumbTip = try observation.recognizedPoint(.thumbTip)
let location = thumbTip.location // May be unreliable!

Right:

let thumbTip = try observation.recognizedPoint(.thumbTip)
guard thumbTip.confidence > 0.5 else {
    // Low confidence - landmark unreliable
    return
}
let location = thumbTip.location

Why it matters: Low confidence points are inaccurate (occlusion, blur, edge of frame).

Anti-Pattern 3: Forgetting Coordinate Conversion

Wrong (mixing coordinate systems):

// Vision uses lower-left origin
let visionPoint = recognizedPoint.location // (0, 0) = bottom-left

// UIKit uses top-left origin
let uiPoint = CGPoint(x: visionPoint.x, y: visionPoint.y) // WRONG!

Right:

let visionPoint = recognizedPoint.location

// Convert to UIKit coordinates
let uiPoint = CGPoint(
    x: visionPoint.x * imageWidth,
    y: (1 - visionPoint.y) * imageHeight // Flip Y axis
)

Why it matters: Mismatched origins cause UI overlays to appear in wrong positions.

Anti-Pattern 4: Setting maximumHandCount Too High

Wrong:

let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 10 // "Just in case"

Right:

let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2 // Only compute what you need

Why it matters: Performance scales with maximumHandCount; hand pose is computed for every detected hand up to that limit.

Anti-Pattern 5: Using ARKit When Vision Suffices

Wrong (if you don't need AR):

// Requires an AR session just for body pose
let configuration = ARBodyTrackingConfiguration()

Right:

// Vision works offline on still images
let request = VNDetectHumanBodyPoseRequest()

Why it matters: ARKit body pose requires rear camera, AR session, supported devices. Vision works everywhere (even offline).

Pressure Scenarios

Scenario 1: "Just Ship the Feature"

Context: Product manager wants subject lifting "like in Photos app" by Friday. You're considering skipping background processing.

Pressure: "It's working on my iPhone 15 Pro, let's ship it."

Reality: Vision blocks the UI on older devices. Users on an iPhone 12 will see a frozen app.

Correct action:

  • Implement background queue (15 min)

  • Add loading indicator (10 min)

  • Test on iPhone 12 or earlier (5 min)

Push-back template: "Subject lifting works, but it freezes the UI on older devices. I need 30 minutes to add background processing and prevent 1-star reviews."

Scenario 2: "Training Our Own Model"

Context: Designer wants to exclude hands from subject bounding box. Engineer suggests training custom CoreML model for specific object detection.

Pressure: "We need perfect bounds, let's train a model."

Reality: Training requires a labeled dataset (weeks), ongoing maintenance, and still won't generalize to new objects. Built-in Vision APIs + hand pose solve it in 2-5 hours.

Correct action:

  • Explain Pattern 1 (combine subject mask + hand pose)

  • Prototype in 1 hour to demonstrate

  • Compare against training timeline (weeks vs hours)

Push-back template: "Training a model takes weeks and only works for specific objects. I can combine Vision APIs to solve this in a few hours and it'll work for any object."

Scenario 3: "We Can't Wait for iOS 17"

Context: You need instance masks but app supports iOS 15+.

Pressure: "Just use iOS 15 person segmentation and ship it."

Reality: VNGeneratePersonSegmentationRequest (iOS 15) returns a single mask covering all people, so it doesn't solve the multi-person use case.

Correct action:

  • Raise minimum deployment target to iOS 17 (best UX)

  • OR implement fallback: use iOS 15 API but disable multi-person features

  • OR use @available to conditionally enable features

Push-back template: "Person segmentation on iOS 15 combines all people into one mask. We can either require iOS 17 for the best experience, or disable multi-person features on older OS versions. Which do you prefer?"

Checklist

Before shipping Vision features:

Performance:

  • ☑ All Vision requests run on background queue

  • ☑ UI shows loading indicator during processing

  • ☑ Tested on iPhone 12 or earlier (not just latest devices)

  • ☑ maximumHandCount set to minimum needed value

Accuracy:

  • ☑ Confidence scores checked before using landmarks

  • ☑ Fallback behavior for low confidence observations

  • ☑ Handles case where no subjects/hands/people detected

Coordinates:

  • ☑ Vision coordinates (lower-left origin) converted to UIKit (top-left)

  • ☑ Normalized coordinates scaled to pixel dimensions

  • ☑ UI overlays aligned correctly with image

Platform Support:

  • ☑ @available checks for iOS 17+ APIs (instance masks)

  • ☑ Fallback for iOS 14-16 (or raised deployment target)

  • ☑ Tested on actual devices, not just simulator

Edge Cases:

  • ☑ Handles images with no detectable subjects

  • ☑ Handles partially occluded hands/bodies

  • ☑ Handles hands/bodies near image edges

  • ☑ Handles >4 people for person instance segmentation

CoreImage Integration (if applicable):

  • ☑ HDR preservation verified with high dynamic range images

  • ☑ Mask resolution matches source image

  • ☑ croppedToInstancesContent set appropriately (false for compositing)

Text/Barcode Recognition (if applicable):

  • ☑ Recognition level matches use case (fast for real-time, accurate for documents)

  • ☑ Language correction disabled for codes/serial numbers

  • ☑ Barcode symbologies limited to actual needs (performance)

  • ☑ Region of interest used to focus scanning area

  • ☑ Multiple candidates checked (not just top candidate)

  • ☑ Evidence accumulated over frames for real-time (string tracker)

  • ☑ DataScannerViewController availability checked before presenting

Resources

WWDC: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2020-10653

Docs: /vision, /visionkit, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest

Skills: axiom-vision-ref, axiom-vision-diag

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • axiom-swiftui-architecture (Coding, Repository Source, Needs Review)

  • axiom-testflight-triage (Coding, Repository Source, Needs Review)

  • axiom-avfoundation-ref (Coding, Repository Source, Needs Review)

  • axiom-realm-migration-ref (Coding, Repository Source, Needs Review)