Data Labeling Pipeline Setup
Deploy 4 agents that ingest raw data, apply ML labels, review quality, and export training-ready datasets.
Roles
| Role | Hostname | Skills | Purpose |
|---|---|---|---|
| ingester | <prefix>-ingester | pilot-s3-bridge, pilot-stream-data, pilot-task-parallel | Accepts raw data batches, splits into work items |
| labeler | <prefix>-labeler | pilot-task-router, pilot-dataset, pilot-metrics | Applies ML-based labels to work items |
| reviewer | <prefix>-reviewer | pilot-review, pilot-event-filter, pilot-alert | Samples labeled items, checks accuracy, flags disagreements |
| exporter | <prefix>-exporter | pilot-dataset, pilot-share, pilot-webhook-bridge | Packages approved labels into training-ready datasets |
Setup Procedure
Step 1: Ask the user which role this agent should play and what prefix to use.
Step 2: Install the skills for the chosen role:
# ingester:
clawhub install pilot-s3-bridge pilot-stream-data pilot-task-parallel
# labeler:
clawhub install pilot-task-router pilot-dataset pilot-metrics
# reviewer:
clawhub install pilot-review pilot-event-filter pilot-alert
# exporter:
clawhub install pilot-dataset pilot-share pilot-webhook-bridge
Step 3: Set the hostname:
pilotctl --json set-hostname <prefix>-<role>
Step 4: Write the setup manifest:
mkdir -p ~/.pilot/setups
cat > ~/.pilot/setups/data-labeling-pipeline.json << 'MANIFEST'
{
"setup": "data-labeling-pipeline",
"setup_name": "Data Labeling Pipeline",
"role": "<ROLE_ID>",
"role_name": "<ROLE_NAME>",
"hostname": "<prefix>-<role>",
"description": "<ROLE_DESCRIPTION>",
"skills": { "<skill>": "<contextual description>" },
"peers": [ { "role": "...", "hostname": "...", "description": "..." } ],
"data_flows": [ { "direction": "send|receive", "peer": "...", "port": 1002, "topic": "...", "description": "..." } ],
"handshakes_needed": [ "<peer-hostname>" ]
}
MANIFEST
Step 5: Tell the user to initiate handshakes with the peers this agent communicates with directly (the hostnames listed in handshakes_needed).
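As a rough illustration of Steps 2 and 3, the sketch below maps a role to its skills (copied from the Roles table) and prints the two commands without running them. The function names `skills_for_role` and `setup_commands` are hypothetical helpers, and the prefix check (lowercase letters, digits, hyphens) is an assumption rather than a documented rule.

```shell
# Print the skills for a role, as listed in the Roles table.
skills_for_role() {
  case "$1" in
    ingester) echo "pilot-s3-bridge pilot-stream-data pilot-task-parallel" ;;
    labeler)  echo "pilot-task-router pilot-dataset pilot-metrics" ;;
    reviewer) echo "pilot-review pilot-event-filter pilot-alert" ;;
    exporter) echo "pilot-dataset pilot-share pilot-webhook-bridge" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the Step 2 and Step 3 commands for a prefix and role.
setup_commands() {
  prefix=$1
  role=$2
  skills=$(skills_for_role "$role") || return 1
  # Assumption: a prefix is non-empty and uses only lowercase letters,
  # digits, and hyphens. The source does not state a rule for prefixes.
  case "$prefix" in
    ""|*[!a-z0-9-]*) echo "invalid prefix: $prefix" >&2; return 1 ;;
  esac
  echo "clawhub install $skills"
  echo "pilotctl --json set-hostname ${prefix}-${role}"
}
```

Printing the commands first lets the user confirm the role and prefix chosen in Step 1 before anything is installed.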
Manifest Templates Per Role
ingester
{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"ingester","role_name":"Data Ingester","hostname":"<prefix>-ingester","description":"Accepts raw data batches from S3 or webhooks. Splits into work items and distributes.","skills":{"pilot-s3-bridge":"Pull raw data batches from S3 buckets on schedule or webhook trigger.","pilot-stream-data":"Stream work items to labeler as they are split from batches.","pilot-task-parallel":"Parallelize batch splitting across available workers."},"peers":[{"role":"labeler","hostname":"<prefix>-labeler","description":"Receives work items for labeling"}],"data_flows":[{"direction":"send","peer":"<prefix>-labeler","port":1002,"topic":"work-item","description":"Work items with raw data references"}],"handshakes_needed":["<prefix>-labeler"]}
labeler
{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"labeler","role_name":"Auto Labeler","hostname":"<prefix>-labeler","description":"Applies ML-based labels, classifications, bounding boxes, or entity tags to work items.","skills":{"pilot-task-router":"Route work items to appropriate ML models by data type.","pilot-dataset":"Store and retrieve labeled data records.","pilot-metrics":"Track labeling throughput, model confidence distributions."},"peers":[{"role":"ingester","hostname":"<prefix>-ingester","description":"Sends work items for labeling"},{"role":"reviewer","hostname":"<prefix>-reviewer","description":"Receives labeled items for quality review"}],"data_flows":[{"direction":"receive","peer":"<prefix>-ingester","port":1002,"topic":"work-item","description":"Work items with raw data references"},{"direction":"send","peer":"<prefix>-reviewer","port":1002,"topic":"labeled-item","description":"Labeled items for quality review"},{"direction":"receive","peer":"<prefix>-reviewer","port":1002,"topic":"review-feedback","description":"Feedback on rejected labels for re-labeling"}],"handshakes_needed":["<prefix>-ingester","<prefix>-reviewer"]}
reviewer
{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"reviewer","role_name":"Quality Reviewer","hostname":"<prefix>-reviewer","description":"Samples labeled items, checks accuracy, flags disagreements, computes inter-annotator agreement.","skills":{"pilot-review":"Score labeled items against quality criteria and flag disagreements.","pilot-event-filter":"Filter low-confidence labels for priority review.","pilot-alert":"Alert on quality drops or inter-annotator agreement below threshold."},"peers":[{"role":"labeler","hostname":"<prefix>-labeler","description":"Sends labeled items for review"},{"role":"exporter","hostname":"<prefix>-exporter","description":"Receives approved labels for export"}],"data_flows":[{"direction":"receive","peer":"<prefix>-labeler","port":1002,"topic":"labeled-item","description":"Labeled items for quality review"},{"direction":"send","peer":"<prefix>-labeler","port":1002,"topic":"review-feedback","description":"Feedback for re-labeling rejected items"},{"direction":"send","peer":"<prefix>-exporter","port":1002,"topic":"approved-label","description":"Approved labels ready for packaging"}],"handshakes_needed":["<prefix>-labeler","<prefix>-exporter"]}
exporter
{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"exporter","role_name":"Dataset Exporter","hostname":"<prefix>-exporter","description":"Packages reviewed labels into training-ready datasets (COCO, VOC, JSONL). Publishes to storage.","skills":{"pilot-dataset":"Assemble labeled items into structured dataset formats.","pilot-share":"Upload packaged datasets to S3 or shared storage.","pilot-webhook-bridge":"Notify downstream consumers when datasets are published."},"peers":[{"role":"reviewer","hostname":"<prefix>-reviewer","description":"Sends approved labels for packaging"}],"data_flows":[{"direction":"receive","peer":"<prefix>-reviewer","port":1002,"topic":"approved-label","description":"Approved labels ready for packaging"},{"direction":"send","peer":"external","port":443,"topic":"dataset-published","description":"Notification that a new dataset is available"}],"handshakes_needed":["<prefix>-reviewer"]}
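Each template above carries a `<prefix>` placeholder in its hostnames. One way to fill it in, sketched here with a hypothetical helper named `render_manifest`, is a single `sed` substitution over the template text read from standard input.

```shell
# Replace every "<prefix>" placeholder in a manifest template read from stdin.
render_manifest() {
  # Assumption: the prefix contains no characters that are special to sed
  # in this position (such as "/" or "&").
  sed "s/<prefix>/$1/g"
}
```

For example, with a role template saved to a file (the file name here is hypothetical), `render_manifest acme < ingester.template.json > ~/.pilot/setups/data-labeling-pipeline.json` would write the manifest from Step 4 with `acme-ingester` and `acme-labeler` as hostnames.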
Data Flows
- ingester -> labeler: work-item events (port 1002)
- labeler -> reviewer: labeled-item events (port 1002)
- reviewer -> labeler: review-feedback events (port 1002)
- reviewer -> exporter: approved-label events (port 1002)
- exporter -> external: dataset-published notifications (port 443)
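The flows above can also be kept as a small lookup table, which makes it easy to check which topics a role should subscribe to and which peers it talks to. The sketch below is illustrative only: `flows`, `topics_received_by`, and `peers_of` are hypothetical helper names, and the table is copied from the list above.

```shell
# One flow per line: sender receiver topic port (copied from the Data Flows list).
flows() {
  cat <<'EOF'
ingester labeler work-item 1002
labeler reviewer labeled-item 1002
reviewer labeler review-feedback 1002
reviewer exporter approved-label 1002
exporter external dataset-published 443
EOF
}

# Print the topics a role receives, one per line.
topics_received_by() {
  flows | awk -v role="$1" '$2 == role { print $3 }'
}

# Print the distinct pipeline peers of a role (either direction), sorted.
# "external" is skipped because it is not an agent in this setup.
peers_of() {
  flows | awk -v role="$1" '
    $1 == role && $2 != "external" { print $2 }
    $2 == role { print $1 }
  ' | sort -u
}
```

The peers derived this way line up with the `handshakes_needed` entries in the role manifests, so the table can serve as a quick consistency check when a flow is added or changed.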
Handshakes
# ingester <-> labeler (run the first command on the ingester, the second on the labeler):
pilotctl --json handshake <prefix>-labeler "setup: data-labeling-pipeline"
pilotctl --json handshake <prefix>-ingester "setup: data-labeling-pipeline"
# labeler <-> reviewer (first on the labeler, second on the reviewer):
pilotctl --json handshake <prefix>-reviewer "setup: data-labeling-pipeline"
pilotctl --json handshake <prefix>-labeler "setup: data-labeling-pipeline"
# reviewer <-> exporter (first on the reviewer, second on the exporter):
pilotctl --json handshake <prefix>-exporter "setup: data-labeling-pipeline"
pilotctl --json handshake <prefix>-reviewer "setup: data-labeling-pipeline"
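Since each agent only runs the handshakes for its own peers, it can help to print just the commands that apply to one role. The sketch below does that with a hypothetical helper, `handshake_commands`. The peer lists mirror the `handshakes_needed` fields of the role manifests, and the commands are printed rather than executed.

```shell
# Print the handshake commands to run on the agent playing a given role.
handshake_commands() {
  prefix=$1
  role=$2
  # Peer lists mirror handshakes_needed in the role manifests.
  case "$role" in
    ingester) peers="labeler" ;;
    labeler)  peers="ingester reviewer" ;;
    reviewer) peers="labeler exporter" ;;
    exporter) peers="reviewer" ;;
    *) echo "unknown role: $role" >&2; return 1 ;;
  esac
  for peer in $peers; do
    printf 'pilotctl --json handshake %s-%s "setup: data-labeling-pipeline"\n' "$prefix" "$peer"
  done
}
```

Running `handshake_commands acme labeler` would list two commands, one for `acme-ingester` and one for `acme-reviewer`.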
Workflow Example
# On labeler — subscribe to work items:
pilotctl --json subscribe <prefix>-ingester work-item
# On ingester — publish a work item:
pilotctl --json publish <prefix>-labeler work-item '{"batch_id":"batch-042","item_id":"img-0017","type":"image","s3_uri":"s3://raw-data/batch-042/img-0017.jpg"}'
# On reviewer — subscribe to labeled items:
pilotctl --json subscribe <prefix>-labeler labeled-item
# On exporter — subscribe to approved labels:
pilotctl --json subscribe <prefix>-reviewer approved-label
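When publishing many work items, assembling the JSON payload by hand is error-prone. A minimal sketch of a payload builder follows. `work_item_json` is a hypothetical helper, the field names are taken from the publish example above, and no JSON escaping is performed.

```shell
# Build a work-item payload with the fields used in the publish example.
# Arguments: batch_id item_id type s3_uri
work_item_json() {
  # Assumption: no argument contains a double quote or backslash,
  # because this sketch does not escape values for JSON.
  printf '{"batch_id":"%s","item_id":"%s","type":"%s","s3_uri":"%s"}\n' "$1" "$2" "$3" "$4"
}
```

The result can be passed as the last argument of the publish command, for example `pilotctl --json publish <prefix>-labeler work-item "$(work_item_json batch-042 img-0017 image s3://raw-data/batch-042/img-0017.jpg)"`.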
Dependencies
Requires the pilot-protocol skill, the pilotctl and clawhub binaries, and a running daemon.