Human Extractor Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "human-extractor" with this command: npx skills add yousufjoyian/claude-skills/yousufjoyian-claude-skills-human-extractor

Description

GPU-accelerated pipeline for detecting, tracking, and classifying humans in dashcam footage. Processes MP4 videos to extract human crops with optional CLIP-based head covering classification, saving all outputs to a unified directory with comprehensive indexing.

Purpose

Extract visual evidence of human presence from dashcam recordings for investigative analysis. Optimized for high throughput using NVDEC decoding, batched YOLOv8 detection, ByteTrack multi-object tracking, and optional CLIP classification.

Usage

Basic Invocation

Extract humans from Park_R videos on October 6, 2025

Advanced Invocation

Scan Park_R\20251006 and 20251007, keep only frames with people, save all outputs in one folder, add one full-frame per timestamp with boxes, use my GPU at max, filter for head-covered individuals at 80% confidence

Input Parameters

Required

  • roots (list[str]): One or more source directories containing MP4 files

  • Example: ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"] (backslashes escaped for JSON)

Core Detection

  • confidence (float, default: 0.35): YOLOv8 detection confidence threshold (0.0-1.0)

  • iou (float, default: 0.50): IoU threshold for NMS (0.0-1.0)

  • yolo_batch (int, default: 64): YOLOv8 batch size (32-128 depending on VRAM)

CLIP Filtering (Optional)

  • clip_filter.enabled (bool, default: false): Enable head covering classification

  • clip_filter.threshold (float, default: 0.80): CLIP confidence threshold

  • clip_filter.batch (int, default: 384): CLIP batch size (256-512)

Hardware Acceleration

  • nvdec (bool, default: true): Use NVIDIA hardware video decoding

  • gpu_id (int, default: 0): CUDA device ID

Output Control

  • single_output_dir (str, default: "parsed\ALL_CROPS"): Unified output directory

  • save_full_frame (bool, default: false): Save one annotated full-frame per timestamp

  • full_frame_maxw (int, default: 1280): Max width for full-frame saves

  • draw_boxes (bool, default: true): Annotate boxes on full-frames

Filename Convention

  • filename_version (str, default: "v1"): Version tag for output filenames

Deduplication

  • dedup.enabled (bool, default: true): Enable similarity deduplication

  • dedup.ssim (float, default: 0.92): SSIM threshold (0.0-1.0)

  • dedup.rate_cap_per_track_per_min (int, default: 12): Max crops per track per minute
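The rate cap above can be sketched as a rolling 60-second window per track. This is an illustrative helper, not the pipeline's actual code; the class name and interface are assumptions.

```python
from collections import defaultdict, deque

class RateCap:
    """Allow at most `cap` crops per track within any rolling window.

    Hypothetical sketch of dedup.rate_cap_per_track_per_min.
    """
    def __init__(self, cap=12, window_ms=60_000):
        self.cap = cap
        self.window_ms = window_ms
        self.history = defaultdict(deque)  # track_id -> recent timestamps (ms)

    def allow(self, track_id, ts_ms):
        q = self.history[track_id]
        # Drop timestamps that have fallen out of the rolling window.
        while q and ts_ms - q[0] >= self.window_ms:
            q.popleft()
        if len(q) < self.cap:
            q.append(ts_ms)
            return True
        return False
```

With the default cap of 12, a track producing one candidate crop per second would have only its first 12 crops of each minute saved.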

Parallel Processing

  • parallel.dates (list[str], optional): Process multiple dates concurrently

  • parallel.max_workers (int, default: 3): Max parallel date workers

Output Format

Success Response

{
  "status": "ok",
  "summary": {
    "videos_processed": 142,
    "crops_saved": 4414,
    "frames_saved": 728,
    "gpu_util_avg": 0.85,
    "processing_time_sec": 2847,
    "errors": 0
  },
  "artifacts": {
    "index_csv": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS\\INDEX.csv",
    "output_dir": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS",
    "log_file": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS\\run_20251006_143022.log"
  },
  "performance": {
    "nvdec_active": true,
    "yolo_batch": 64,
    "clip_batch": 384,
    "avg_fps": 48.3,
    "vram_peak_gb": 9.2
  },
  "notes": [
    "NVDEC hardware decoding active",
    "Batched YOLO=64, CLIP=384",
    "GPU utilization: 85%"
  ]
}

Error Response

{
  "status": "error",
  "error": "CUDA out of memory",
  "suggestion": "Reduce batch sizes: yolo_batch=48, clip_batch=256",
  "partial_results": {
    "videos_processed": 67,
    "crops_saved": 2103
  }
}
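The OOM fallback suggested in the error response can be automated as a retry ladder. This is a sketch: `run` is a hypothetical callable that invokes the pipeline with a parameter dict and returns a response dict shaped like the ones above.

```python
def run_with_backoff(run, params, ladder=((64, 384), (48, 256), (32, 128))):
    """Retry extraction with progressively smaller batch sizes on CUDA OOM,
    mirroring the 'suggestion' field in the error response."""
    for yolo_batch, clip_batch in ladder:
        attempt = {**params, "yolo_batch": yolo_batch, "clip_batch": clip_batch}
        resp = run(attempt)
        if resp["status"] == "ok":
            return resp
        if "out of memory" not in resp.get("error", "").lower():
            return resp  # non-OOM errors are not retried
    return resp  # last OOM response after exhausting the ladder
```

The ladder values follow the document's own tuning advice (64/384 → 48/256 → 32/128).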

Output Structure

Directory Layout

parsed\ALL_CROPS
├── INDEX.csv                          # Global master index
├── INDEX.20251006_pid1234.csv         # Shard (pre-merge)
├── run_20251006_143022.log            # Execution log
│
│   # Crop files (one per person detection)
├── 20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp
├── 20251006__20251006143844_070787B__t8420__f202__trk003__x234y567w180h420__c92__v1.webp
│
│   # Full-frame files (optional, one per timestamp)
├── 20251006__20251006142644_070785B__t15234__FRAME__v1.webp
└── 20251006__20251006143844_070787B__t8420__FRAME__v1.webp

Filename Convention

Crop Format:

<date>__<video_stem>__t<ts_ms>__f<frame_idx>__trk<track_id>__x<x1>y<y1>w<w>h<h>__c<covered_0to100>__v<ver>.webp

Example: 20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp

Decoded:

  • Date: 2025-10-06
  • Video: 20251006142644_070785B.MP4
  • Timestamp: 15234 ms
  • Frame: 365
  • Track: 17
  • BBox: x=1014, y=46, w=266, h=659
  • CLIP confidence: 85% (head covering)
  • Version: v1
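The crop filename can be decoded programmatically. Below is a minimal parser for the format above; the regex and helper are illustrative, not part of the pipeline.

```python
import re

# One named group per field of the crop filename convention.
CROP_RE = re.compile(
    r"^(?P<date>\d{8})__(?P<stem>.+?)__t(?P<ts_ms>\d+)__f(?P<frame>\d+)"
    r"__trk(?P<track>\d+)__x(?P<x>\d+)y(?P<y>\d+)w(?P<w>\d+)h(?P<h>\d+)"
    r"__c(?P<covered>\d+)__v(?P<ver>\w+)\.webp$"
)

def parse_crop_name(name):
    """Return the decoded fields of a crop filename, or None if it
    does not match the convention."""
    m = CROP_RE.match(name)
    if not m:
        return None
    d = m.groupdict()
    for k in ("ts_ms", "frame", "track", "x", "y", "w", "h", "covered"):
        d[k] = int(d[k])
    return d
```

Applied to the example filename, this yields track 17, timestamp 15234 ms, and the bounding box x=1014, y=46, w=266, h=659.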

Full-Frame Format:

<date>__<video_stem>__t<ts_ms>__FRAME__v<ver>.webp

Example: 20251006__20251006142644_070785B__t15234__FRAME__v1.webp

INDEX.csv Schema

dataset,date,video_rel,video_stem,frame_idx,ts_ms,track_id,x1,y1,w,h,person_conf,covered_conf,file_type,crop_file,sha1,bboxes_json,annotated,pipeline_ver,yolo_batch,clip_batch,nvdec,created_utc

Example rows:

Park_R,20251006,20251006\20251006142644_070785B.MP4,20251006142644_070785B,365,15234,17,1014,46,266,659,0.92,0.85,crop,20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp,a3f2c8b9...,,,v1,64,384,1,2025-10-06T14:30:22Z
Park_R,20251006,20251006\20251006142644_070785B.MP4,20251006142644_070785B,365,15234,,,,,,,frame,20251006__20251006142644_070785B__t15234__FRAME__v1.webp,d4e1a2c7...,"[{""x1"":1014,""y1"":46,""w"":266,""h"":659,""conf"":0.92,""track"":17}]",1,v1,64,384,1,2025-10-06T14:30:22Z

Column Definitions:

  • dataset: Source camera (Park_R, Park_F, Movie_F, Movie_R)

  • date: YYYYMMDD

  • video_rel: Relative path from dataset root

  • video_stem: Filename without .MP4 extension

  • frame_idx: Frame number in video

  • ts_ms: Timestamp in milliseconds

  • track_id: ByteTrack ID (empty for FRAME rows)

  • x1,y1,w,h: Bounding box (empty for FRAME rows)

  • person_conf: YOLOv8 detection confidence

  • covered_conf: CLIP head covering confidence (stored as 0.0-1.0 in the CSV, as in the example rows; filenames encode it on a 0-100 scale. Empty if CLIP is disabled)

  • file_type: "crop" or "frame"

  • crop_file: Relative filename

  • sha1: File hash for integrity

  • bboxes_json: All detections in frame (FRAME rows only)

  • annotated: 1 if boxes drawn on frame, 0 otherwise

  • pipeline_ver: Semantic version tag

  • yolo_batch: YOLO batch size used

  • clip_batch: CLIP batch size used (0 if disabled)

  • nvdec: 1 if NVDEC used, 0 otherwise

  • created_utc: ISO 8601 timestamp
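Rows matching the CLIP filter can be pulled from INDEX.csv with the standard csv module. This is a sketch against the schema above; covered_conf is read on the 0-1 scale used in the example rows.

```python
import csv
import io

def covered_crops(index_csv_text, threshold=0.80):
    """Return crop rows whose CLIP head-covering confidence meets
    the threshold. FRAME rows and rows with an empty covered_conf
    (CLIP disabled) are skipped."""
    rows = csv.DictReader(io.StringIO(index_csv_text))
    return [
        r for r in rows
        if r["file_type"] == "crop"
        and r["covered_conf"]  # empty for FRAME rows / CLIP disabled
        and float(r["covered_conf"]) >= threshold
    ]
```

For real use, replace the in-memory text with `open("parsed/ALL_CROPS/INDEX.csv", newline="")`.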

Implementation Details

Processing Pipeline

[MP4 Videos]
    │
    ▼
[NVDEC Decoder (GPU)]       RGB tensor → CUDA Stream A
    │
    ▼
[YOLOv8s Detection]         Batched (64 frames), FP16, conf=0.35
    │
    ▼
[ByteTrack Tracking]        IoU=0.5, max_age=10
    │
    ├──────────────────────► [Full-Frame Saver]
    │                        (optional, downscaled, annotated)
    ▼
[ROI Align (GPU)]           Extract crops on GPU
    │
    ▼
[CLIP Classification] ◄───── (optional) Batched (384 crops), FP16, threshold=0.80
    │
    ▼
[Deduplication Filter]      SSIM ≥ 0.92, rate cap: 12/min/track
    │
    ▼
[Async I/O Thread Pool]     WebP encode (q=85), shard INDEX writes
    │
    ▼
[Final Merge]               INDEX.csv

GPU Optimization Strategy

Dual CUDA Streams:

  • Stream A: YOLOv8 detection

  • Stream B: CLIP classification

  • Overlap compute + memory transfers

Dynamic Batching:

  • Accumulate frames until batch size reached

  • Process immediately on timeout (100ms)

  • Keep GPU pipeline full (80-90% utilization)
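The accumulate-until-full-or-timeout policy above can be sketched as follows; the class name and structure are illustrative assumptions, not the pipeline's code.

```python
import time

class DynamicBatcher:
    """Emit a batch when `batch_size` items have accumulated, or when
    `timeout_s` has elapsed since the first pending item arrived."""
    def __init__(self, batch_size=64, timeout_s=0.1, clock=time.monotonic):
        self.batch_size = batch_size
        self.timeout_s = timeout_s
        self.clock = clock          # injectable for testing
        self.pending = []
        self.first_ts = None

    def add(self, frame):
        """Queue a frame; return a full batch if one is ready, else None."""
        if not self.pending:
            self.first_ts = self.clock()
        self.pending.append(frame)
        if (len(self.pending) >= self.batch_size
                or self.clock() - self.first_ts >= self.timeout_s):
            batch, self.pending = self.pending, []
            return batch  # ready to run through the detector
        return None
```

The 100 ms timeout keeps latency bounded at the tail of a video, when fewer than `batch_size` frames remain.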

Memory Management:

  • Pinned memory for faster CPU↔GPU transfers

  • Pre-allocated tensor buffers

  • Stream-ordered operations

Decoder Priority:

  • NVDEC (GPU hardware decoder) - 5-10x faster

  • CPU fallback (OpenCV) if NVDEC unavailable

  • Multi-threaded DataLoader (8-12 workers)

Performance Targets (RTX 4080 16GB)

Metric          | Target          | Notes
----------------|-----------------|-------------------------------------------
GPU Utilization | 80-90%          | NVDEC + dual streams + large batches
Throughput      | 3-4 videos/min  | Parking videos (2 FPS sampling)
VRAM Usage      | 6-10 GB         | YOLO=64, CLIP=384
Latency         | <30 s per video | Including decode, detect, track, classify

Configuration Tuning

If GPU util < 70%:

  • Increase batch sizes: yolo_batch=80, clip_batch=448

  • Verify NVDEC active (check nvdec_active in response)

  • Increase parallel workers: max_workers=4

If CUDA OOM:

  • Reduce CLIP batch first: clip_batch=256

  • Then reduce YOLO batch: yolo_batch=48

  • Disable full-frame saves: save_full_frame=false

If disk I/O bottleneck:

  • Disable full-frame: save_full_frame=false

  • Reduce quality: full_frame_maxw=960, WebP q=75

  • Use faster storage (NVMe SSD)

CLI Equivalent

Basic usage

python -m src.cli.run_multi_dates ^
  --root "G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Park_R" ^
  --out parsed\ALL_CROPS ^
  --dates 20251006 20251007 ^
  --use-nvdec --conf 0.35 --iou 0.5

(The trailing ^ continues the command across lines in Windows cmd.)

Advanced usage with CLIP filtering

python -m src.cli.run_multi_dates ^
  --root "G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Park_R" ^
  --out parsed\ALL_CROPS ^
  --dates 20251006 20251007 20251008 ^
  --use-nvdec ^
  --yolo-batch 64 ^
  --clip-batch 384 ^
  --clip-threshold 0.80 ^
  --conf 0.35 ^
  --iou 0.5 ^
  --save-full-frame ^
  --draw-boxes ^
  --parallel 3

Example Interactions

Example 1: Basic Detection

User: "Extract all humans from Park_R videos on October 6"

Skill invokes:

{
  "mode": "extract_humans",
  "roots": ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"],
  "confidence": 0.35,
  "single_output_dir": "parsed\\ALL_CROPS",
  "nvdec": true
}

Example 2: Advanced with CLIP

User: "Scan Park_R for October 6-8, filter for people with head coverings at 80% confidence, save annotated frames, max GPU usage"

Skill invokes:

{
  "mode": "extract_humans",
  "roots": [
    "G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006",
    "G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251007",
    "G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251008"
  ],
  "confidence": 0.35,
  "iou": 0.50,
  "yolo_batch": 64,
  "clip_filter": { "enabled": true, "threshold": 0.80, "batch": 384 },
  "nvdec": true,
  "save_full_frame": true,
  "draw_boxes": true,
  "single_output_dir": "parsed\\ALL_CROPS",
  "parallel": { "max_workers": 3 }
}

Example 3: Low-Resource Mode

User: "Process Park_R October 6 with minimal GPU memory"

Skill invokes:

{
  "mode": "extract_humans",
  "roots": ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"],
  "confidence": 0.35,
  "yolo_batch": 32,
  "clip_filter": { "enabled": false },
  "nvdec": false,
  "save_full_frame": false,
  "single_output_dir": "parsed\\ALL_CROPS"
}

Safety & Guardrails

Do Not

  • ❌ Move or delete source MP4 files

  • ❌ Infer gender unless explicitly enabled (sensitive, noisy)

  • ❌ Process videos without user consent

  • ❌ Share outputs containing identifiable persons

Do

  • ✅ Verify GPU availability before processing

  • ✅ Enforce longitude sign corrections for GPS overlays

  • ✅ Maintain audit trail in INDEX.csv

  • ✅ Log versions, batches, NVDEC usage

  • ✅ Handle OOM gracefully with suggestions

Resume Safety

  • Idempotent: skip already-processed crops by filename

  • Shard-based: partial runs can resume

  • Index integrity: SHA1 hashes verify file correctness
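The skip-by-filename and SHA1 integrity checks can be sketched as below; these are illustrative helpers, not the pipeline's code.

```python
import hashlib
from pathlib import Path

def should_write(crop_path: Path) -> bool:
    # Idempotent resume: a crop whose deterministic filename already
    # exists on disk was produced by an earlier run and is skipped.
    return not crop_path.exists()

def verify_sha1(path: Path, expected: str) -> bool:
    # Integrity check against the sha1 column in INDEX.csv.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == expected
```

Because every crop filename fully encodes its source video, timestamp, track, and box, the existence check alone is enough to make reruns idempotent.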

Testing & Verification

Pre-Run Checks

GPU availability

import torch

assert torch.cuda.is_available(), "CUDA required"
assert torch.cuda.device_count() > 0, "No GPU found"

Model files

from pathlib import Path

assert Path("models/yolov8s.pt").exists(), "YOLOv8 model missing"

Output directory writable

import os

output_dir = Path("parsed/ALL_CROPS")
output_dir.mkdir(parents=True, exist_ok=True)
assert os.access(output_dir, os.W_OK), "Output dir not writable"

Post-Run Verification

Check outputs exist

from pathlib import Path

assert Path("parsed/ALL_CROPS/INDEX.csv").exists()
assert len(list(Path("parsed/ALL_CROPS").glob("*.webp"))) > 0

Validate INDEX.csv

import pandas as pd

df = pd.read_csv("parsed/ALL_CROPS/INDEX.csv")
assert df['crop_file'].notna().all()
assert df['person_conf'].between(0, 1).all()

Sample roundtrip

sample = df.sample(1).iloc[0]
assert Path(f"parsed/ALL_CROPS/{sample['crop_file']}").exists()

GPU utilization check (gpu_util_avg is taken from the run summary in the response)

assert gpu_util_avg > 0.70, f"Low GPU util: {gpu_util_avg}"

Dependencies

Required

  • Python 3.10+

  • PyTorch 2.0+ with CUDA 11.8+

  • ultralytics (YOLOv8)

  • transformers (CLIP)

  • opencv-python

  • pillow

  • pandas

  • numpy

Optional (Performance)

  • NVIDIA Video Codec SDK (NVDEC)

  • TensorRT (future optimization)

  • nvJPEG (GPU JPEG encoding)

Installation

cd "G:\My Drive\PROJECTS\APPS\Human_Detection"
pip install -r requirements.txt

Troubleshooting

Common Issues

  1. CUDA Out of Memory

Error: CUDA out of memory. Tried to allocate 2.50 GiB
Solution: reduce batch sizes
yolo_batch: 64 → 48 → 32
clip_batch: 384 → 256 → 128

  2. NVDEC Not Available

Warning: NVDEC unavailable, falling back to CPU decode
Solution: check the NVIDIA driver version (≥525.60); the GPU must support the Video Codec SDK.
Verify with: nvidia-smi --query-gpu=name --format=csv

  3. Low GPU Utilization

Warning: GPU util only 45%
Solutions:

  1. Increase batch sizes (if VRAM allows)

  2. Enable NVDEC: nvdec=true

  3. Increase parallel workers: max_workers=4

  4. Check for a CPU bottleneck (use more DataLoader workers)

  4. Slow Processing

Performance: 0.8 videos/min (expected 3-4)
Diagnostics:

  1. Check disk I/O (use an NVMe SSD)
  2. Verify NVDEC is active (5-10x faster than CPU decode)
  3. Profile with: python -m torch.utils.bottleneck script.py

Future Enhancements

Planned

  • TensorRT optimization (2-4x CLIP speedup)

  • Multi-GPU sharding (process different dates on different GPUs)

  • GPU JPEG/WebP encoding (nvJPEG)

  • Real-time streaming mode

Under Consideration

  • Face recognition integration

  • Gender classification (opt-in only, with warnings)

  • Action recognition (walking, standing, etc.)

  • Multi-camera fusion (correlate detections across cameras)

Version History

v1.0 (Current)

  • Initial release

  • YOLOv8s + ByteTrack + CLIP

  • NVDEC support

  • Unified output directory

  • Global INDEX.csv

References

  • YOLOv8 Documentation

  • ByteTrack Paper

  • CLIP Paper

  • NVIDIA Video Codec SDK

Contact & Support

For issues or questions:

  • Check parsed/ALL_CROPS/run_*.log for error details

  • Review GPU diagnostics: nvidia-smi

  • Validate input paths exist and are readable

  • Verify CUDA/PyTorch installation: python -c "import torch; print(torch.cuda.is_available())"

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
