ai-native-product

Build AI-native products with agency-control tradeoffs, calibration loops, and eval strategies. Use when building AI agents, LLM features, or products where AI handles user tasks autonomously. Part of the Modern Product Operating Model collection.


Install skill "ai-native-product" with this command: npx skills add yannickyamo/skills/yannickyamo-skills-ai-native-product

AI-Native Product Development

"AI products aren't deterministic. They require continuous calibration, not just A/B tests."

This skill covers AI-Native Product Development — the overlay that modifies discovery, architecture, and delivery when AI is at the core. It addresses the unique challenges of building products where AI agents perform tasks autonomously.

Part of: Modern Product Operating Model — a collection of composable product skills.

Related skills: product-strategy, product-discovery, product-architecture, product-delivery, product-leadership


When to Use This Skill

Use this skill when:

  • Building AI agents that act on behalf of users
  • Adding LLM-powered features to existing products
  • Designing human-AI interaction patterns
  • Deciding how much autonomy to give AI
  • Setting up eval strategies and calibration loops
  • Managing the "agency-control tradeoff"

Not needed for: Traditional software products, ML models used only for backend optimization (no user-facing autonomy)


What Makes AI Products Different

Traditional Software vs. AI Products

| Dimension | Traditional Software | AI-Native Products |
|---|---|---|
| Behavior | Deterministic | Probabilistic |
| Testing | Unit tests, QA | Evals, calibration |
| Correctness | Binary (works or doesn't) | Spectrum (good enough?) |
| User role | Operator | Delegator + Reviewer |
| Failure mode | Error messages | Plausible but wrong outputs |
| Iteration | Ship → Measure → Iterate | Ship → Observe → Calibrate |
| Trust building | Feature completeness | Demonstrated reliability |

The Core Challenge

AI products must navigate a fundamental tension:

More autonomy = More value (fewer steps, faster outcomes)
More autonomy = More risk (errors affect real work)

This is the Agency-Control Tradeoff.


Framework: The CCCD Loop

Credit: Aishwarya Goel & Kiriti Gavini

AI products require a Continuous Calibration and Confidence Development (CCCD) loop:

┌─────────────────────────────────────────────────────────────────┐
│                        CCCD LOOP                                │
│                                                                 │
│    CALIBRATE → CONFIDENCE → CONTINUOUS DISCOVERY → CALIBRATE   │
│         ↓           ↓              ↓                 ↓         │
│     Eval and    Build user    Observe AI       Update evals    │
│     adjust AI    trust over   interactions     and models      │
│     behavior     time         at scale                         │
└─────────────────────────────────────────────────────────────────┘

CCCD Components:

| Component | Purpose | Activities |
|---|---|---|
| Calibrate | Tune AI behavior to match user expectations | Run evals, adjust prompts/models, set guardrails |
| Confidence | Build appropriate user trust | Show AI reasoning, enable verification, demonstrate reliability |
| Continuous Discovery | Observe AI-user interactions at scale | Log interactions, identify failure patterns, surface edge cases |
| → Back to Calibrate | Update based on learnings | Improve evals, retrain, adjust prompts |

The Agency-Control Progression

Five Levels of AI Agency

| Level | Description | AI Does | User Does | Example |
|---|---|---|---|---|
| 1. Assist | AI suggests, user executes | Generates options | Chooses and acts | Autocomplete, suggestions |
| 2. Recommend | AI ranks, user approves | Analyzes and recommends | Reviews and approves | "AI recommends these 3 actions" |
| 3. Execute with confirmation | AI acts after approval | Prepares action | Confirms before execution | "Send this email?" → Yes/No |
| 4. Execute with notification | AI acts, notifies after | Acts autonomously | Reviews outcomes | "I scheduled the meeting and sent invites" |
| 5. Fully autonomous | AI acts without notification | Handles end-to-end | Sets goals, reviews exceptions | AI handles routine tasks silently |
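The five levels map naturally onto a small enum. A minimal sketch — the names and helper functions below are illustrative assumptions, not part of any particular framework:

```python
from enum import IntEnum

class AgencyLevel(IntEnum):
    ASSIST = 1
    RECOMMEND = 2
    EXECUTE_WITH_CONFIRMATION = 3
    EXECUTE_WITH_NOTIFICATION = 4
    FULLY_AUTONOMOUS = 5

def requires_user_approval(level: AgencyLevel) -> bool:
    # Levels 1-3 keep a human in the loop before anything happens;
    # levels 4-5 act first and surface results afterwards (or not at all).
    return level <= AgencyLevel.EXECUTE_WITH_CONFIRMATION

def notifies_user(level: AgencyLevel) -> bool:
    # Only level 5 acts silently.
    return level < AgencyLevel.FULLY_AUTONOMOUS
```

Encoding the level as an ordered value makes the "earn higher autonomy" progression a simple comparison rather than scattered conditionals.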

Progression Strategy

Start lower, earn higher:

Level 1 → Build trust → Level 2 → Demonstrate reliability → Level 3 → ...

Graduation Criteria:

| From Level | To Level | Requires |
|---|---|---|
| 1 → 2 | Assist → Recommend | User accepts suggestions > 70% |
| 2 → 3 | Recommend → Execute with confirm | User approves recommendations > 80% |
| 3 → 4 | Execute+confirm → Execute+notify | User confirms without edit > 90% |
| 4 → 5 | Execute+notify → Autonomous | User overrides < 5%, high-stakes scenarios excluded |

Never fully autonomous for:

  • Irreversible actions (delete, send, purchase)
  • High-stakes decisions (financial, legal, health)
  • Novel situations outside training distribution
  • Actions affecting third parties
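The graduation criteria and the high-stakes exclusion can be combined into a single gate check. A sketch, assuming hypothetical metric names and a boolean flag for the excluded categories above (irreversible, high-stakes, novel, third-party actions would all set it):

```python
# Thresholds from the graduation criteria table; metric names are
# illustrative, not tied to any particular analytics stack.
GRADUATION_CRITERIA = {
    (1, 2): ("suggestion_acceptance_rate", lambda r: r > 0.70),
    (2, 3): ("recommendation_approval_rate", lambda r: r > 0.80),
    (3, 4): ("confirm_without_edit_rate", lambda r: r > 0.90),
    (4, 5): ("override_rate", lambda r: r < 0.05),
}

def may_graduate(current_level: int, metrics: dict, high_stakes: bool) -> bool:
    """Check whether a feature may move from current_level to the next."""
    target = current_level + 1
    if target == 5 and high_stakes:
        return False  # never fully autonomous for excluded scenarios
    criterion = GRADUATION_CRITERIA.get((current_level, target))
    if criterion is None:
        return False  # already at level 5, or an invalid level
    metric_name, passes = criterion
    value = metrics.get(metric_name)
    return value is not None and passes(value)  # missing data never graduates
```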

AI-Native Discovery

Standard discovery practices need adaptation for AI products.

Modified Discovery Focus

| Standard Discovery | AI-Native Adaptation |
|---|---|
| "What job are you trying to do?" | + "How much do you want to delegate?" |
| "What's your current workflow?" | + "Which steps are you comfortable AI handling?" |
| "What would success look like?" | + "What errors would be unacceptable?" |
| "Show me how you do this today" | + "Show me how you verify AI work today" |

AI-Specific Discovery Questions

Delegation appetite:

  • "Which parts of this task feel tedious vs. require your judgment?"
  • "If AI made an error here, what would the consequences be?"
  • "How would you want to verify AI's work?"

Trust calibration:

  • "What would AI need to demonstrate before you'd trust it to [action]?"
  • "Have you used AI tools before? What built or broke your trust?"
  • "Would you prefer AI to do more but occasionally err, or do less perfectly?"

Failure tolerance:

  • "What kinds of errors are annoying vs. damaging?"
  • "How quickly do you need to catch and fix AI mistakes?"
  • "What's your 'undo' option if AI gets it wrong?"

Observing AI Interactions

In addition to interviews, AI discovery includes:

| Method | What to Look For |
|---|---|
| Session recordings | Where do users override AI? Where do they accept blindly? |
| Interaction logs | Patterns in edits, rejections, corrections |
| Feedback analysis | Explicit signals (thumbs down, ratings) |
| Support tickets | AI-related complaints and confusion |
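As one illustration of interaction-log analysis, here is a sketch that aggregates a hypothetical log schema (the `user_action` field is an assumption) into outcome rates — the kind of summary that surfaces override and blind-acceptance patterns:

```python
from collections import Counter

def summarize_interactions(log_entries: list[dict]) -> dict:
    """Turn raw interaction logs into per-outcome rates."""
    # Each entry records how the user responded to an AI output:
    # "accepted", "edited", "rejected", or "overridden".
    outcomes = Counter(entry["user_action"] for entry in log_entries)
    total = sum(outcomes.values()) or 1  # avoid division by zero
    return {action: count / total for action, count in outcomes.items()}

logs = [
    {"user_action": "accepted"},
    {"user_action": "accepted"},
    {"user_action": "edited"},
    {"user_action": "overridden"},
]
summary = summarize_interactions(logs)  # summary["accepted"] -> 0.5
```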

AI-Native Architecture

Solution Brief Additions

For AI features, add to standard solution brief:

AI-SPECIFIC SECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

AGENCY LEVEL
Target: [Level 1-5]
Graduation path: [How might this evolve?]

FAILURE MODES
• [Failure mode 1]: [Consequence] → [Mitigation]
• [Failure mode 2]: [Consequence] → [Mitigation]

EVAL STRATEGY
• [Eval type 1]: [What we measure, how often]
• [Eval type 2]: [What we measure, how often]

CALIBRATION PLAN
• Initial calibration: [Approach]
• Ongoing calibration: [Cadence, triggers]

CONFIDENCE BUILDING
• How AI explains itself: [Approach]
• How users verify: [Mechanisms]
• Trust-building milestones: [Progression]

AI Bet Categories

In addition to standard bet categories:

| Category | Description | Example |
|---|---|---|
| Capability expansion | AI can handle new task types | "AI can now summarize documents" |
| Agency graduation | Move to higher autonomy level | "AI sends emails without confirmation" |
| Calibration improvement | Better accuracy/reliability | "Reduce hallucination rate from 5% to 2%" |
| Confidence building | Better user trust | "Show AI reasoning before action" |
| Guardrail strengthening | Prevent harmful outputs | "Add content policy enforcement" |

AI-Native Delivery

Eval Strategy (Replaces Traditional Testing)

Eval Types:

| Eval Type | Purpose | When to Run |
|---|---|---|
| Unit evals | Test specific capabilities | Every code change |
| Behavioral evals | Test end-to-end flows | Daily/weekly |
| Adversarial evals | Test edge cases and attacks | Before major releases |
| Human evals | Test subjective quality | Weekly sample |
| Production evals | Test on real traffic | Continuous |

Eval Metrics:

| Metric | What It Measures | Target |
|---|---|---|
| Task success rate | Does AI complete the intended task? | > 95% |
| Factual accuracy | Is output factually correct? | > 98% |
| Hallucination rate | Does AI make things up? | < 2% |
| Harmful output rate | Does AI produce unsafe content? | < 0.1% |
| User acceptance rate | Do users accept AI output? | > 80% |
| Override rate | How often do users correct AI? | < 15% |
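The targets above translate directly into an automated threshold check. A minimal sketch, assuming a `measured` dict produced by whichever eval harness you run:

```python
# Targets from the eval metrics table; "min" metrics must stay above the
# threshold, "max" metrics must stay below it.
EVAL_TARGETS = {
    "task_success_rate":    (0.95,  "min"),
    "factual_accuracy":     (0.98,  "min"),
    "hallucination_rate":   (0.02,  "max"),
    "harmful_output_rate":  (0.001, "max"),
    "user_acceptance_rate": (0.80,  "min"),
    "override_rate":        (0.15,  "max"),
}

def failing_metrics(measured: dict) -> list:
    """Return the metrics that miss their target (a calibration trigger)."""
    failures = []
    for name, (threshold, direction) in EVAL_TARGETS.items():
        value = measured.get(name)
        if value is None:
            continue  # metric not collected this run
        if direction == "min" and value < threshold:
            failures.append(name)
        elif direction == "max" and value > threshold:
            failures.append(name)
    return failures
```

A non-empty return value is exactly the "eval metrics below threshold" calibration trigger described later.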

Eval Cadence:

Code change → Unit evals (automated)
Daily → Behavioral evals (automated)
Weekly → Human evals (sample)
Release → Adversarial evals (red team)
Continuous → Production evals (monitoring)

Staged Rollout for AI Features

AI features require a more cautious rollout than traditional features:

| Stage | Audience | Focus | Duration |
|---|---|---|---|
| Internal | Team | Find obvious failures | 1 week |
| Alpha | 5-10 trusted users | Qualitative feedback on AI behavior | 2 weeks |
| Beta | 5% of users | Quantitative eval metrics | 2-4 weeks |
| Gradual GA | 5% → 25% → 50% → 100% | Monitor at each stage | 4+ weeks |

AI-Specific Rollout Gates:

| Gate | Criteria to Proceed |
|---|---|
| Alpha → Beta | Eval metrics above threshold, no harmful outputs |
| Beta → Gradual GA | User acceptance > 80%, override rate < 15% |
| Each GA increment | Metrics stable, no new failure modes |
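The GA increments and their per-increment gate can be expressed as a simple state check. A sketch, with the stage fractions taken from the rollout table and the gate inputs reduced to two assumed booleans:

```python
GA_STAGES = [0.05, 0.25, 0.50, 1.00]  # fraction of users at each GA increment

def next_rollout_stage(current: float, metrics_stable: bool,
                       new_failure_modes: bool) -> float:
    """Advance to the next GA increment only if the gate criteria hold."""
    if not metrics_stable or new_failure_modes:
        return current  # hold (or roll back) rather than expand exposure
    later = [stage for stage in GA_STAGES if stage > current]
    return later[0] if later else current  # already at 100%
```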

Calibration Loop

Continuous calibration process:

OBSERVE → IDENTIFY → CALIBRATE → VALIDATE → DEPLOY
   ↑                                           │
   └───────────────────────────────────────────┘

| Step | Activities | Cadence |
|---|---|---|
| Observe | Monitor production interactions, logs, feedback | Continuous |
| Identify | Surface failure patterns, edge cases, drift | Daily/weekly |
| Calibrate | Adjust prompts, fine-tune, add guardrails | As needed |
| Validate | Run evals on calibrated version | Before deploy |
| Deploy | Ship updates, continue observing | Staged |

Calibration Triggers:

  • Eval metrics below threshold
  • New failure pattern identified
  • User feedback trend (negative)
  • Model update available
  • New use case discovered

AI Metrics Hierarchy

LAGGING
├── User retention (AI users vs. non-AI users)
├── Task completion rate (with AI assist)
└── Revenue from AI features

CORE
├── User acceptance rate
├── Override rate
├── Time-to-completion (with AI)
└── User-reported satisfaction

LEADING
├── Eval metrics (accuracy, hallucination, etc.)
├── Interaction volume
├── Feature discovery rate
└── Feedback sentiment

GUARDRAILS
├── Harmful output rate
├── Latency P95
├── Error rate
└── Cost per interaction

AI-Specific Anti-Patterns

| Anti-Pattern | Why It Fails | Instead |
|---|---|---|
| Ship and hope | AI behavior drifts without monitoring | Continuous calibration |
| Autonomous by default | Users don't trust, don't adopt | Earn autonomy progressively |
| Black box AI | Users can't verify, won't trust | Show reasoning, enable verification |
| No evals | Quality degrades silently | Comprehensive eval strategy |
| Ignore overrides | Miss calibration signals | Override patterns inform calibration |
| One-size-fits-all agency | Different tasks need different levels | Task-specific agency levels |

Templates

This skill includes templates in the templates/ directory:

  • agency-assessment.md — Determine appropriate agency level
  • eval-strategy.md — Design eval suite for AI feature
  • calibration-plan.md — Set up continuous calibration

Using This Skill with Claude

Ask Claude to:

  1. Assess agency level: "What agency level should [AI feature] have?"
  2. Design agency progression: "Create a graduation path from assist to autonomous for [feature]"
  3. Identify failure modes: "What could go wrong with [AI feature]? How do we mitigate?"
  4. Design eval strategy: "Design an eval suite for [AI feature]"
  5. Plan calibration: "Create a calibration plan for [AI feature]"
  6. Adapt discovery: "What AI-specific questions should I ask in discovery for [use case]?"
  7. Design confidence building: "How should [AI feature] show its reasoning?"
  8. Plan AI rollout: "Create a staged rollout plan for [AI feature]"
  9. Set AI metrics: "What metrics should we track for [AI feature]?"
  10. Review AI brief: "Critique this solution brief for AI considerations"

Connection to Other Skills

| When you need to... | Use skill |
|---|---|
| Define overall product strategy | product-strategy |
| Run discovery (with AI adaptations) | product-discovery |
| Structure bets and roadmap | product-architecture |
| Plan rollout and metrics | product-delivery |
| Scale AI products across teams | product-leadership |

Quick Reference: AI Product Checklist

Before shipping AI features:

  • Agency level defined — Clear level for this feature
  • Graduation criteria set — How we'll earn higher autonomy
  • Failure modes mapped — Know what can go wrong
  • Evals in place — Automated quality checks
  • Human evals scheduled — Subjective quality review
  • Calibration loop running — Continuous improvement process
  • Confidence mechanisms built — Users can verify AI work
  • Guardrails active — Prevent harmful outputs
  • Rollout staged — More cautious than traditional features
  • Override tracking — Learning from user corrections

Sources & Influences

  • Aishwarya Goel & Kiriti Gavini — CCCD Loop, Agency-Control Trade-off
  • Anthropic — Constitutional AI, RLHF approaches
  • OpenAI — Eval best practices
  • Google DeepMind — AI safety frameworks

Part of the Modern Product Operating Model by Yannick Maurice
