ai-native-product

Build AI-native products with agency-control tradeoffs, calibration loops, and eval strategies. Use when building AI agents, LLM features, or products where AI handles user tasks autonomously. Part of the Modern Product Operating Model collection.


Install skill "ai-native-product" with this command: npx skills add yannickyamo/skills/yannickyamo-skills-ai-native-product

AI-Native Product Development

"AI products aren't deterministic. They require continuous calibration, not just A/B tests."

This skill covers AI-Native Product Development — the overlay that modifies discovery, architecture, and delivery when AI is at the core. It addresses the unique challenges of building products where AI agents perform tasks autonomously.

Part of: Modern Product Operating Model — a collection of composable product skills.

Related skills: product-strategy, product-discovery, product-architecture, product-delivery, product-leadership


When to Use This Skill

Use this skill when:

  • Building AI agents that act on behalf of users
  • Adding LLM-powered features to existing products
  • Designing human-AI interaction patterns
  • Deciding how much autonomy to give AI
  • Setting up eval strategies and calibration loops
  • Managing the "agency-control tradeoff"

Not needed for: Traditional software products, ML models used only for backend optimization (no user-facing autonomy)


What Makes AI Products Different

Traditional Software vs. AI Products

| Dimension | Traditional Software | AI-Native Products |
|---|---|---|
| Behavior | Deterministic | Probabilistic |
| Testing | Unit tests, QA | Evals, calibration |
| Correctness | Binary (works or doesn't) | Spectrum (good enough?) |
| User role | Operator | Delegator + Reviewer |
| Failure mode | Error messages | Plausible but wrong outputs |
| Iteration | Ship → Measure → Iterate | Ship → Observe → Calibrate |
| Trust building | Feature completeness | Demonstrated reliability |

The Core Challenge

AI products must navigate a fundamental tension:

More autonomy = More value (fewer steps, faster outcomes)
More autonomy = More risk (errors affect real work)

This is the Agency-Control Tradeoff.


Framework: The CCCD Loop

Credit: Aishwarya Goel & Kiriti Gavini

AI products require a Continuous Calibration and Confidence Development (CCCD) loop:

┌─────────────────────────────────────────────────────────────────┐
│                        CCCD LOOP                                │
│                                                                 │
│    CALIBRATE → CONFIDENCE → CONTINUOUS DISCOVERY → CALIBRATE   │
│         ↓           ↓              ↓                 ↓         │
│     Eval and    Build user    Observe AI       Update evals    │
│     adjust AI    trust over   interactions     and models      │
│     behavior     time         at scale                         │
└─────────────────────────────────────────────────────────────────┘

CCCD Components:

| Component | Purpose | Activities |
|---|---|---|
| Calibrate | Tune AI behavior to match user expectations | Run evals, adjust prompts/models, set guardrails |
| Confidence | Build appropriate user trust | Show AI reasoning, enable verification, demonstrate reliability |
| Continuous Discovery | Observe AI-user interactions at scale | Log interactions, identify failure patterns, surface edge cases |
| → Back to Calibrate | Update based on learnings | Improve evals, retrain, adjust prompts |

The Agency-Control Progression

Five Levels of AI Agency

| Level | Description | AI Does | User Does | Example |
|---|---|---|---|---|
| 1. Assist | AI suggests, user executes | Generates options | Chooses and acts | Autocomplete, suggestions |
| 2. Recommend | AI ranks, user approves | Analyzes and recommends | Reviews and approves | "AI recommends these 3 actions" |
| 3. Execute with confirmation | AI acts after approval | Prepares action | Confirms before execution | "Send this email?" → Yes/No |
| 4. Execute with notification | AI acts, notifies after | Acts autonomously | Reviews outcomes | "I scheduled the meeting and sent invites" |
| 5. Fully autonomous | AI acts without notification | Handles end-to-end | Sets goals, reviews exceptions | AI handles routine tasks silently |
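The five levels map naturally onto a small enum. A minimal sketch — the names and helper functions below are illustrative assumptions, not part of any particular framework:

```python
from enum import IntEnum

class AgencyLevel(IntEnum):
    ASSIST = 1
    RECOMMEND = 2
    EXECUTE_WITH_CONFIRMATION = 3
    EXECUTE_WITH_NOTIFICATION = 4
    FULLY_AUTONOMOUS = 5

def requires_user_approval(level: AgencyLevel) -> bool:
    # Levels 1-3 keep a human in the loop before anything happens;
    # levels 4-5 act first and surface results afterwards (or not at all).
    return level <= AgencyLevel.EXECUTE_WITH_CONFIRMATION

def notifies_user(level: AgencyLevel) -> bool:
    # Only level 5 acts silently.
    return level < AgencyLevel.FULLY_AUTONOMOUS
```

Encoding the level as an ordered value makes the "earn higher autonomy" progression a simple comparison rather than scattered conditionals.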

Progression Strategy

Start lower, earn higher:

Level 1 → Build trust → Level 2 → Demonstrate reliability → Level 3 → ...

Graduation Criteria:

| From Level | To Level | Requires |
|---|---|---|
| 1 → 2 | Assist → Recommend | User accepts suggestions > 70% |
| 2 → 3 | Recommend → Execute with confirm | User approves recommendations > 80% |
| 3 → 4 | Execute+confirm → Execute+notify | User confirms without edit > 90% |
| 4 → 5 | Execute+notify → Autonomous | User overrides < 5%, high-stakes scenarios excluded |

Never fully autonomous for:

  • Irreversible actions (delete, send, purchase)
  • High-stakes decisions (financial, legal, health)
  • Novel situations outside training distribution
  • Actions affecting third parties
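The graduation criteria and the high-stakes exclusion can be combined into a single gate check. A sketch, assuming hypothetical metric names and a boolean flag for the excluded categories above (irreversible, high-stakes, novel, third-party actions would all set it):

```python
# Thresholds from the graduation criteria table; metric names are
# illustrative, not tied to any particular analytics stack.
GRADUATION_CRITERIA = {
    (1, 2): ("suggestion_acceptance_rate", lambda r: r > 0.70),
    (2, 3): ("recommendation_approval_rate", lambda r: r > 0.80),
    (3, 4): ("confirm_without_edit_rate", lambda r: r > 0.90),
    (4, 5): ("override_rate", lambda r: r < 0.05),
}

def may_graduate(current_level: int, metrics: dict, high_stakes: bool) -> bool:
    """Check whether a feature may move from current_level to the next."""
    target = current_level + 1
    if target == 5 and high_stakes:
        return False  # never fully autonomous for excluded scenarios
    criterion = GRADUATION_CRITERIA.get((current_level, target))
    if criterion is None:
        return False  # already at level 5, or an invalid level
    metric_name, passes = criterion
    value = metrics.get(metric_name)
    return value is not None and passes(value)  # missing data never graduates
```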

AI-Native Discovery

Standard discovery practices need adaptation for AI products.

Modified Discovery Focus

| Standard Discovery | AI-Native Adaptation |
|---|---|
| "What job are you trying to do?" | + "How much do you want to delegate?" |
| "What's your current workflow?" | + "Which steps are you comfortable AI handling?" |
| "What would success look like?" | + "What errors would be unacceptable?" |
| "Show me how you do this today" | + "Show me how you verify AI work today" |

AI-Specific Discovery Questions

Delegation appetite:

  • "Which parts of this task feel tedious vs. require your judgment?"
  • "If AI made an error here, what would the consequences be?"
  • "How would you want to verify AI's work?"

Trust calibration:

  • "What would AI need to demonstrate before you'd trust it to [action]?"
  • "Have you used AI tools before? What built or broke your trust?"
  • "Would you prefer AI to do more but occasionally err, or do less perfectly?"

Failure tolerance:

  • "What kinds of errors are annoying vs. damaging?"
  • "How quickly do you need to catch and fix AI mistakes?"
  • "What's your 'undo' option if AI gets it wrong?"

Observing AI Interactions

In addition to interviews, AI discovery includes:

| Method | What to Look For |
|---|---|
| Session recordings | Where do users override AI? Where do they accept blindly? |
| Interaction logs | Patterns in edits, rejections, corrections |
| Feedback analysis | Explicit signals (thumbs down, ratings) |
| Support tickets | AI-related complaints and confusion |
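As one illustration of interaction-log analysis, here is a sketch that aggregates a hypothetical log schema (the `user_action` field is an assumption) into outcome rates — the kind of summary that surfaces override and blind-acceptance patterns:

```python
from collections import Counter

def summarize_interactions(log_entries: list[dict]) -> dict:
    """Turn raw interaction logs into per-outcome rates."""
    # Each entry records how the user responded to an AI output:
    # "accepted", "edited", "rejected", or "overridden".
    outcomes = Counter(entry["user_action"] for entry in log_entries)
    total = sum(outcomes.values()) or 1  # avoid division by zero
    return {action: count / total for action, count in outcomes.items()}

logs = [
    {"user_action": "accepted"},
    {"user_action": "accepted"},
    {"user_action": "edited"},
    {"user_action": "overridden"},
]
summary = summarize_interactions(logs)  # summary["accepted"] -> 0.5
```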

AI-Native Architecture

Solution Brief Additions

For AI features, add to standard solution brief:

AI-SPECIFIC SECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

AGENCY LEVEL
Target: [Level 1-5]
Graduation path: [How might this evolve?]

FAILURE MODES
• [Failure mode 1]: [Consequence] → [Mitigation]
• [Failure mode 2]: [Consequence] → [Mitigation]

EVAL STRATEGY
• [Eval type 1]: [What we measure, how often]
• [Eval type 2]: [What we measure, how often]

CALIBRATION PLAN
• Initial calibration: [Approach]
• Ongoing calibration: [Cadence, triggers]

CONFIDENCE BUILDING
• How AI explains itself: [Approach]
• How users verify: [Mechanisms]
• Trust-building milestones: [Progression]

AI Bet Categories

In addition to standard bet categories:

| Category | Description | Example |
|---|---|---|
| Capability expansion | AI can handle new task types | "AI can now summarize documents" |
| Agency graduation | Move to higher autonomy level | "AI sends emails without confirmation" |
| Calibration improvement | Better accuracy/reliability | "Reduce hallucination rate from 5% to 2%" |
| Confidence building | Better user trust | "Show AI reasoning before action" |
| Guardrail strengthening | Prevent harmful outputs | "Add content policy enforcement" |

AI-Native Delivery

Eval Strategy (Replaces Traditional Testing)

Eval Types:

| Eval Type | Purpose | When to Run |
|---|---|---|
| Unit evals | Test specific capabilities | Every code change |
| Behavioral evals | Test end-to-end flows | Daily/weekly |
| Adversarial evals | Test edge cases and attacks | Before major releases |
| Human evals | Test subjective quality | Weekly sample |
| Production evals | Test on real traffic | Continuous |

Eval Metrics:

| Metric | What It Measures | Target |
|---|---|---|
| Task success rate | Does AI complete the intended task? | > 95% |
| Factual accuracy | Is output factually correct? | > 98% |
| Hallucination rate | Does AI make things up? | < 2% |
| Harmful output rate | Does AI produce unsafe content? | < 0.1% |
| User acceptance rate | Do users accept AI output? | > 80% |
| Override rate | How often do users correct AI? | < 15% |
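The targets above translate directly into an automated threshold check. A minimal sketch, assuming a `measured` dict produced by whichever eval harness you run:

```python
# Targets from the eval metrics table; "min" metrics must stay above the
# threshold, "max" metrics must stay below it.
EVAL_TARGETS = {
    "task_success_rate":    (0.95,  "min"),
    "factual_accuracy":     (0.98,  "min"),
    "hallucination_rate":   (0.02,  "max"),
    "harmful_output_rate":  (0.001, "max"),
    "user_acceptance_rate": (0.80,  "min"),
    "override_rate":        (0.15,  "max"),
}

def failing_metrics(measured: dict) -> list:
    """Return the metrics that miss their target (a calibration trigger)."""
    failures = []
    for name, (threshold, direction) in EVAL_TARGETS.items():
        value = measured.get(name)
        if value is None:
            continue  # metric not collected this run
        if direction == "min" and value < threshold:
            failures.append(name)
        elif direction == "max" and value > threshold:
            failures.append(name)
    return failures
```

A non-empty return value is exactly the "eval metrics below threshold" calibration trigger described later.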

Eval Cadence:

Code change → Unit evals (automated)
Daily → Behavioral evals (automated)
Weekly → Human evals (sample)
Release → Adversarial evals (red team)
Continuous → Production evals (monitoring)

Staged Rollout for AI Features

AI features require a more cautious rollout than traditional features:

| Stage | Audience | Focus | Duration |
|---|---|---|---|
| Internal | Team | Find obvious failures | 1 week |
| Alpha | 5-10 trusted users | Qualitative feedback on AI behavior | 2 weeks |
| Beta | 5% of users | Quantitative eval metrics | 2-4 weeks |
| Gradual GA | 5% → 25% → 50% → 100% | Monitor at each stage | 4+ weeks |

AI-Specific Rollout Gates:

| Gate | Criteria to Proceed |
|---|---|
| Alpha → Beta | Eval metrics above threshold, no harmful outputs |
| Beta → Gradual GA | User acceptance > 80%, override rate < 15% |
| Each GA increment | Metrics stable, no new failure modes |
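The GA increments and their per-increment gate can be expressed as a simple state check. A sketch, with the stage fractions taken from the rollout table and the gate inputs reduced to two assumed booleans:

```python
GA_STAGES = [0.05, 0.25, 0.50, 1.00]  # fraction of users at each GA increment

def next_rollout_stage(current: float, metrics_stable: bool,
                       new_failure_modes: bool) -> float:
    """Advance to the next GA increment only if the gate criteria hold."""
    if not metrics_stable or new_failure_modes:
        return current  # hold (or roll back) rather than expand exposure
    later = [stage for stage in GA_STAGES if stage > current]
    return later[0] if later else current  # already at 100%
```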

Calibration Loop

Continuous calibration process:

OBSERVE → IDENTIFY → CALIBRATE → VALIDATE → DEPLOY
   ↑                                           │
   └───────────────────────────────────────────┘

| Step | Activities | Cadence |
|---|---|---|
| Observe | Monitor production interactions, logs, feedback | Continuous |
| Identify | Surface failure patterns, edge cases, drift | Daily/weekly |
| Calibrate | Adjust prompts, fine-tune, add guardrails | As needed |
| Validate | Run evals on calibrated version | Before deploy |
| Deploy | Ship updates, continue observing | Staged |

Calibration Triggers:

  • Eval metrics below threshold
  • New failure pattern identified
  • User feedback trend (negative)
  • Model update available
  • New use case discovered

AI Metrics Hierarchy

LAGGING
├── User retention (AI users vs. non-AI users)
├── Task completion rate (with AI assist)
└── Revenue from AI features

CORE
├── User acceptance rate
├── Override rate
├── Time-to-completion (with AI)
└── User-reported satisfaction

LEADING
├── Eval metrics (accuracy, hallucination, etc.)
├── Interaction volume
├── Feature discovery rate
└── Feedback sentiment

GUARDRAILS
├── Harmful output rate
├── Latency P95
├── Error rate
└── Cost per interaction

AI-Specific Anti-Patterns

| Anti-Pattern | Why It Fails | Instead |
|---|---|---|
| Ship and hope | AI behavior drifts without monitoring | Continuous calibration |
| Autonomous by default | Users don't trust, don't adopt | Earn autonomy progressively |
| Black box AI | Users can't verify, won't trust | Show reasoning, enable verification |
| No evals | Quality degrades silently | Comprehensive eval strategy |
| Ignore overrides | Miss calibration signals | Override patterns inform calibration |
| One-size-fits-all agency | Different tasks need different levels | Task-specific agency levels |

Templates

This skill includes templates in the templates/ directory:

  • agency-assessment.md — Determine appropriate agency level
  • eval-strategy.md — Design eval suite for AI feature
  • calibration-plan.md — Set up continuous calibration

Using This Skill with Claude

Ask Claude to:

  1. Assess agency level: "What agency level should [AI feature] have?"
  2. Design agency progression: "Create a graduation path from assist to autonomous for [feature]"
  3. Identify failure modes: "What could go wrong with [AI feature]? How do we mitigate?"
  4. Design eval strategy: "Design an eval suite for [AI feature]"
  5. Plan calibration: "Create a calibration plan for [AI feature]"
  6. Adapt discovery: "What AI-specific questions should I ask in discovery for [use case]?"
  7. Design confidence building: "How should [AI feature] show its reasoning?"
  8. Plan AI rollout: "Create a staged rollout plan for [AI feature]"
  9. Set AI metrics: "What metrics should we track for [AI feature]?"
  10. Review AI brief: "Critique this solution brief for AI considerations"

Connection to Other Skills

| When you need to... | Use skill |
|---|---|
| Define overall product strategy | product-strategy |
| Run discovery (with AI adaptations) | product-discovery |
| Structure bets and roadmap | product-architecture |
| Plan rollout and metrics | product-delivery |
| Scale AI products across teams | product-leadership |

Quick Reference: AI Product Checklist

Before shipping AI features:

  • Agency level defined — Clear level for this feature
  • Graduation criteria set — How we'll earn higher autonomy
  • Failure modes mapped — Know what can go wrong
  • Evals in place — Automated quality checks
  • Human evals scheduled — Subjective quality review
  • Calibration loop running — Continuous improvement process
  • Confidence mechanisms built — Users can verify AI work
  • Guardrails active — Prevent harmful outputs
  • Rollout staged — More cautious than traditional features
  • Override tracking — Learning from user corrections

Sources & Influences

  • Aishwarya Goel & Kiriti Gavini — CCCD Loop, Agency-Control Trade-off
  • Anthropic — Constitutional AI, RLHF approaches
  • OpenAI — Eval best practices
  • Google DeepMind — AI safety frameworks

Part of the Modern Product Operating Model by Yannick Maurice
