perceptron

Image and video analysis powered by Isaac vision models. Capabilities include visual Q&A, object detection, OCR, captioning, counting, and grounded spatial reasoning with bounding boxes, points, and polygons. Supports streaming, structured outputs, DSL composition, and in-context learning. NOT for: generating images (analysis only).

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "perceptron" with this command: npx skills add Subraiz/perceptron

Perceptron — Vision SDK

Docs: https://docs.perceptron.inc/

Image and video analysis via the Perceptron Python SDK. Pass file paths or URLs directly — the SDK handles base64 conversion automatically.

Setup

pip install perceptron
export PERCEPTRON_API_KEY=ak_...

Quick Reference

TaskFunctionExample
Describe / Q&Aquestion()question("photo.jpg", "What's in this image?")
Grounded Q&Aquestion()question("photo.jpg", "Where is the cat?", expects="box")
Object detectiondetect()detect("photo.jpg", classes=["person", "car"])
OCRocr()ocr("document.png")
OCR (markdown)ocr_markdown()ocr_markdown("document.png")
Captioncaption()caption("photo.jpg", style="detailed")
Countingquestion()question("photo.jpg", "How many dogs?", expects="point")
Custom workflow@perceiveSee DSL composition below

Python SDK

from perceptron import configure, detect, caption, ocr, ocr_markdown, question

# Configuration (or set PERCEPTRON_API_KEY env var)
configure(provider="perceptron", api_key="ak_...")

# Visual Q&A — the most common operation
result = question("photo.jpg", "What's happening in this image?")
print(result.text)

# Grounded Q&A — get bounding boxes with answers
result = question("photo.jpg", "Where is the damage?", expects="box")
for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x},{box.top_left.y}) → ({box.bottom_right.x},{box.bottom_right.y})")

# Object detection
result = detect("warehouse.jpg", classes=["forklift", "person"])
for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x},{box.top_left.y}) → ({box.bottom_right.x},{box.bottom_right.y})")

# OCR
result = ocr("receipt.jpg", prompt="Extract the total amount")
print(result.text)

result = ocr_markdown("document.png")  # structured markdown output
print(result.text)

# Captioning
result = caption("scene.png", style="detailed")
print(result.text)

DSL Composition (Advanced)

Build custom multimodal workflows:

from perceptron import perceive, image, text, system

@perceive(expects="box", model="isaac-0.2-2b-preview")
def find_hazards(img_path):
    return [system("<hint>BOX</hint>"), image(img_path), text("Locate all safety hazards")]

result = find_hazards("factory.jpg")

Structured Outputs

Constrain responses to Pydantic models, JSON schemas, or regex:

from perceptron import perceive, image, text, pydantic_format
from pydantic import BaseModel

class Scene(BaseModel):
    objects: list[str]
    count: int

@perceive(response_format=pydantic_format(Scene))
def count_objects(path):
    return image(path) + text("List all objects and count them. Return JSON.")

result = count_objects("photo.jpg")
scene = Scene.model_validate_json(result.text)

Pixel Coordinate Conversion

All spatial outputs use normalized coordinates (0–1000). Convert to pixels:

pixel_boxes = result.points_to_pixels(width=1920, height=1080)

# Or standalone:
from perceptron import scale_points_to_pixels
pixel_pts = scale_points_to_pixels(result.points, width=1920, height=1080)

CLI Script

Located at: <skill-dir>/scripts/perceptron_cli.py

Requires PERCEPTRON_API_KEY environment variable. The provider is always perceptron.

P=<skill-dir>/scripts/perceptron_cli.py

# Visual Q&A
python3 $P question photo.jpg "What do you see?"
python3 $P question photo.jpg "Where is the car?" --expects box

# Object detection
python3 $P detect photo.jpg --classes person,car
python3 $P detect photo.jpg --classes forklift --format json --pixels
python3 $P detect ./frames/ --classes defect  # batch directory

# OCR
python3 $P ocr document.png
python3 $P ocr receipt.jpg --output markdown

# Captioning
python3 $P caption scene.png --style detailed

# Custom perceive
python3 $P perceive frame.png --prompt "Describe this scene" --expects box

# Batch processing
python3 $P batch --images img1.jpg img2.jpg --prompt "Describe" --output results.json

# Parse raw model output
python3 $P parse "<point_box ...>" --mode points

# List models
python3 $P models

Models

ModelBest forSpeedTemp
isaac-0.2-2b-preview (default)General use, detection, OCRFast0.0
isaac-0.2-1bQuick/simple tasksFastest0.0

Override with model="..." in any SDK call or --model ... in CLI.

Grounding (expects parameter)

ValueReturnsUse case
text (default)Plain textQ&A, descriptions, OCR
boxBounding boxesDetection, localization
pointPoint coordinatesCounting, pointing
polygonPolygon verticesSegmentation

Video Analysis

Extract frames with ffmpeg, then analyze:

# Single frame at 5 seconds
ffmpeg -ss 5 -i video.mp4 -frames:v 1 -q:v 2 /tmp/frame.jpg

# Then analyze
python3 $P question /tmp/frame.jpg "What's happening?"

For continuous monitoring, extract multiple frames and batch process.

Reference Files

For deeper SDK usage, consult these when needed:

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

Ai Task Hub

AI task hub for image analysis, background removal, speech-to-text, text-to-speech, markdown conversion, points balance/ledger lookup, and async execute/poll...

Registry SourceRecently Updated
Research

Ontology 1.0.4

Typed knowledge graph for structured agent memory and composable skills. Use when creating/querying entities (Person, Project, Task, Event, Document), linkin...

Registry SourceRecently Updated
Research

Arxiv Reader

Read and analyze arXiv papers by fetching LaTeX source, listing sections, or extracting abstracts

Registry SourceRecently Updated
072
Profile unavailable