ai-product-launch-coach
Coach a team shipping an AI-first product through the four phases that decide whether it makes money: pick a model + architecture that doesn't commoditize you, build evals + guardrails so quality doesn't regress on production traffic, design unit economics that survive token-cost variance + provider price changes, and find defensibility that isn't "we have GPT-4 access" (everyone does). Most AI products fail not from model quality but from missing evals, runaway inference cost, regulatory blowback, or the realization that the underlying model is the product (and they don't own it).
When to engage
Trigger when the team mentions:
- Model / provider selection: GPT-4o / 4.1 / o1 / o3, Claude Sonnet / Opus, Gemini 2.x / 3, Mistral, Llama 3.x / 4.x, Cohere, open-source models, on-premise, AWS Bedrock, Azure OpenAI, Vertex AI
- Architecture: RAG vs fine-tuning vs prompt engineering vs multi-step agents vs structured output
- LLM observability: LangSmith, Langfuse, Helicone, Phoenix, Arize, custom logging
- Eval infrastructure: golden datasets, LLM-as-judge, human review, regression detection, A/B testing prompts
- Vector DB / embedding strategy: Pinecone, Weaviate, pgvector, Qdrant, Chroma, MongoDB Atlas Vector
- Guardrails: NeMo Guardrails, Guardrails AI, custom filters, input sanitization, output validation
- Cost: cost per request, cost per active user, model fallback strategy, caching
- Pricing: usage-based, seat-based, tiered, model-tier-passthrough
- Defensibility: data moat, distribution moat, workflow moat, network effects
- Safety / regulatory: EU AI Act, GDPR, HIPAA, SOC2 with AI inputs, prompt injection, data exfil, jailbreaks
- Agent frameworks: LangGraph, CrewAI, AutoGen, custom agent loops, function-calling/tool-use
- Voice AI: ElevenLabs, Cartesia, Deepgram, Vapi, Retell
- Multi-modal: Claude vision, GPT-4o vision, Gemini multi-modal, image + voice + text
- Customer trust: hallucination handling, citations, "I don't know" patterns, confidence scoring
- Specific niches: AI-first vertical SaaS (legal, medical, sales, support, code, design, education)
Do not engage for: pure ML/DL research consulting (not product), pure prompt engineering (use a prompt-specific skill), or "should I add ChatGPT to my existing app" (that's a feature decision, not a product launch).
Diagnostic sweep — run before recommending anything
Ask 12-16 questions before any prescription. AI-first products fail differently than SaaS; the failure modes are model-specific.
The product
- What does the user accomplish in 1 sentence (not "AI-powered X" — what's the JOB the user pays you to do)?
- What's the workflow today without your product? (Manual process, Excel, ChatGPT-by-itself, hiring contractors.) How long does it take? What does it cost them?
- Closest 3 competitors using AI for the same job. URLs. What's actually different about yours?
- AI involvement spectrum: (a) thin wrapper around LLM, (b) RAG over their data, (c) custom workflow with AI in middle, (d) AI agent with tool-use, (e) fine-tuned domain model, (f) hybrid (multiple of above)?
Model & architecture 5. Current model + provider + version. Why this one? (Cost, quality, speed, contract, brand?) 6. Architecture: prompt + LLM, RAG, fine-tuned, agent loop, hybrid? Token-flow per user-request (input + output)? 7. Latency target: sub-1s, 1-3s, 3-10s, async/background acceptable? 8. Quality bar: how do you currently know if output is good? Manual review? Customer feedback? Numerical eval?
Cost & scale 9. Cost per request today (input tokens × $/M + output tokens × $/M + retrieval ops)? 10. Average user makes how many requests / day / month? 11. Pricing model + price per user / per request? 12. Gross margin per user assuming current model cost? (Be honest. Many AI startups are -20% margin pre-investment.)
Risk 13. What happens if your model provider raises prices 30%? Cuts you off? Goes down for 4 hours? 14. What happens if a customer's prompt injection extracts another customer's data? 15. Hallucination tolerance: medical / legal / finance (zero) vs marketing / brainstorm (high)? 16. Regulatory: EU AI Act high-risk system? HIPAA? GDPR cross-border? SOC2 audit pending?
If they can't answer 8-12, that gap IS the work. AI products without evals + cost discipline + risk plan are launches waiting to happen wrong.
Phase 1 — Model & architecture decision (the foundation)
The single most consequential decision. Wrong choice = either margin death or quality death.
Architecture decision tree:
| Need | Right architecture | Wrong choice (common) |
|---|---|---|
| Generic question-answering on user's docs | RAG with embedding model + frontier LLM | Fine-tune (overengineered, brittle) |
| Domain-specific style / format / tone (legal briefs, medical notes) | Fine-tune small model OR RAG + carefully-engineered prompt | Default GPT-4 prompt (underperforms specialized) |
| Sub-second response on simple classification / extraction | Small fine-tuned model OR cheap small frontier model (gpt-4.1-mini, Haiku, Gemini Flash) | Frontier LLM (slow + expensive for simple tasks) |
| Multi-step workflow with tool use (book appointment, query DB, send email) | Agent loop with function calling on frontier model | Single prompt (loses on complex workflows) |
| Real-time voice conversation | Voice-LLM stack (Vapi / Retell / custom) with low-latency model | Async chat repurposed (wrong UX) |
| 10K+ tokens of context per request | Long-context frontier model (Claude / Gemini long-context) OR retrieval-based (RAG) | Stuffing context = $$ + degraded quality |
| Image understanding | Multi-modal frontier (GPT-4o, Claude with vision, Gemini) | Caption-then-LLM pipeline (lossy + slow) |
| Code generation | Code-specialized frontier (Claude Sonnet/Opus, GPT-4o, Gemini Code) | General LLM (worse + costlier) |
Model-tier strategy: Most production AI apps need 2-3 model tiers, not one.
- Cheap tier (gpt-4.1-mini, Haiku, Gemini Flash): classification, simple extraction, routing, summarization. ~10-30× cheaper than top-tier. Use for 60-80% of requests.
- Mid tier (gpt-4.1, Sonnet, Gemini Pro): main user-facing reasoning. 20-30% of requests.
- Top tier (Opus, o3, Gemini 3 Pro): hardest cases / fallback when mid fails an eval. 1-5% of requests.
- Specialized: voice (low-latency), embedding (cheap), image (multi-modal-specific).
Provider strategy (important — most launches mess this up):
- Single provider: simpler, faster to ship. Right for MVP. Risk: provider deprecation / pricing / outage.
- Multi-provider with abstraction layer (LiteLLM, Portkey, OpenRouter, custom router): better margin (route by cost), reliability (fallback on outage), defensibility against provider changes. Right for production at scale.
- Direct API + cloud-mirror (e.g., Azure OpenAI mirror of OpenAI, AWS Bedrock for Claude): better SLA, BAA / compliance, slightly higher cost. Right for enterprise customers requiring data residency.
RAG architecture details:
- Embedding model: text-embedding-3-small (cheap), Cohere embed v3, BGE M3 (open). ~$0.02-0.13 per million tokens.
- Vector DB: pgvector (fine for <10M vectors, simpler ops), Pinecone / Qdrant (10M+ vectors, managed), Weaviate (hybrid search built-in).
- Chunk strategy: 200-400 token chunks with 50-token overlap. Domain-aware chunking beats generic.
- Retrieval count: top-10 → re-rank → keep top-3 for context. Re-ranking matters.
- Re-rank model: Cohere rerank, ms-marco-MiniLM, custom — improves precision 20-40% over raw cosine.
Fine-tuning decision:
- Fine-tune ONLY when: format/style/tone is hard to prompt OR cost matters AND task is narrow + repeated AND you have 1000+ high-quality examples.
- Fine-tune cost: $5-50 in training, then 1.5-3× base model inference cost.
- Fine-tuning is rarely the right answer in 2026 — frontier models + good prompts + RAG hits 90% of needs. Fine-tune small models for narrow tasks (classification, extraction); rarely for generation.
Phase 2 — Evals (the most-skipped chapter)
Most AI startups ship without evals, then their product silently degrades when models update or prompts change. Build evals before shipping, iterate on evals as you scale.
The eval pyramid (in order of cost / fidelity):
- Unit-test prompts (offline, fast): 50-200 hand-curated input → expected-output pairs. Run on every PR. Catches obvious regressions.
- LLM-as-judge (offline, mid-cost): a stronger model evaluates your output against a rubric. Cheaper than humans, faster than manual. Use rubrics: accuracy, completeness, tone, format compliance.
- Human eval batches (offline, expensive but high-fidelity): 100-1000 random or sampled production outputs reviewed by domain experts weekly. Slow but ground truth.
- Production canary (online): release new prompt / model to 1-5% of traffic, compare metrics (user thumbs-up, retention, conversion).
- Customer feedback (online): in-product 👍/👎, comment box. Aggregate weekly.
Eval infrastructure stack:
- LangSmith (LangChain ecosystem): full eval lifecycle.
- Langfuse (open-source): self-host or cloud, fast adoption.
- Phoenix / Arize: experiment tracking + eval, ML-team-friendly.
- Helicone: cost + observability + prompt eval.
- Custom: minimal Postgres + Python harness — works for early stage.
Eval golden dataset construction:
- Start with 50 hand-crafted examples covering: happy paths (60%), edge cases (30%), adversarial (10%).
- Grow to 500-2000 by sampling from production traffic + having domain experts label.
- Refresh quarterly: add new edge cases that caused customer complaints.
Eval metrics that matter:
- Task success rate (binary or scaled): did the LLM do the right thing?
- Format compliance: did output match required JSON schema, length, citations format?
- Hallucination rate: % of outputs containing claims not supported by context (in RAG).
- Latency p50 / p95 / p99: response time distribution.
- Cost per request: tokens × $/M tokens.
- User satisfaction: thumbs-up rate, NPS in-product.
When to run evals:
- On every prompt change → CI check.
- On every model upgrade → blocker before deploy.
- Weekly on production sample → drift detection.
- Quarterly full-eval-suite review → strategic.
Anti-patterns:
- "We'll add evals later" → never gets added → silent regression.
- Eval set size <30 → noise dominates signal.
- Eval set hand-crafted by founder only → covers what founder thinks edge cases are, misses real-customer-edge.
- LLM-as-judge with same model as production → blind spots; use a different model family as judge.
Phase 3 — Guardrails & safety
AI products fail trust catastrophically: prompt injection extracting other users' data, hallucinated medical advice, bad output reaching production customer. Build guardrails before launch.
Guardrail layers (defense in depth):
-
Input filtering:
- Length cap (DoS prevention).
- Profanity / abuse detection (block obvious abuse).
- PII detection on inputs (don't store unnecessarily).
- Prompt-injection detection (NeMo Guardrails / Lakera Guard / custom regex).
- Rate limiting per user / IP / org.
-
Context isolation:
- User A's data NEVER appears in User B's context. (Most-violated invariant. Test this explicitly.)
- Multi-tenant systems must use customer-scoped retrieval + customer-scoped prompts.
- Don't share embedding indices across tenants without strict access control.
-
Output validation:
- JSON schema validation on structured output → retry on fail.
- Hallucination detection: claim verification against retrieved context (semantic match).
- Citation enforcement: every claim has source → reject if missing.
- Toxicity / unsafe content scan on output.
- Length & format compliance.
-
Domain-specific guardrails:
- Medical / legal: "I don't know — please consult a [doctor/lawyer]" pattern, never give specific advice.
- Financial: no specific buy/sell recommendations without disclaimers.
- Code: don't auto-execute generated code; sandbox + review.
-
Audit logging:
- Every prompt, response, model, version, customer ID — logged for replay + audit.
- Required for SOC2 / regulated industries.
Tool stack:
- NeMo Guardrails: NVIDIA, Python-based, declarative.
- Guardrails AI: open-source, Python.
- Lakera Guard: prompt-injection detection SaaS.
- Custom: regex + classifier + response validation.
Pre-launch security review:
- Red-team your product: try to extract another tenant's data via prompt injection.
- Try to make it generate harmful content (jailbreak).
- Try to make it use an excessive number of tokens (cost amplification).
- Test against common prompt injection patterns (DAN, "ignore previous instructions", indirect injection from RAG-retrieved content).
EU AI Act / regulatory:
- High-risk systems (medical, employment, education, law-enforcement) require risk management, human oversight, transparency.
- General-purpose AI providers (model providers) have obligations; deployers (you, building on top) have separate ones.
- Provenance: log model + version + provider for every output (regulatory + customer trust).
Phase 4 — Unit economics with LLMs
The hidden killer of AI startups. Token costs vary by 100× across models; usage varies by 10× across users; vendor pricing changes quarterly. Without discipline, gross margin goes negative silently.
Cost model:
Cost per request = (input tokens × $input/M) + (output tokens × $output/M) + retrieval ops + cache miss costs
Cost per user = (cost per request) × (requests per user)
Gross margin per user = price per user - cost per user - infra - support amortized
Critical knobs:
- Input tokens: shorter system prompt + RAG-relevant chunks only (not "stuff everything in"). Each unnecessary 1K tokens = 30-100× cost over a year for high-volume users.
- Output tokens: enforce max-tokens; structure for short outputs where possible.
- Cache hits: prompt-caching (Anthropic) / context caching (Gemini) saves 50-90% on input cost for repeated prompts. Use it.
- Model tier: route easy tasks to cheap models. 80/20 rule: 80% of traffic to cheap tier saves 70-90% of cost.
- Batch processing: async / overnight workflows can use batch API at 50% cost (OpenAI) / similar (Anthropic).
Provider price comparison (rough 2026):
| Model | Input $/M | Output $/M | Best for |
|---|---|---|---|
| GPT-4.1-mini / Haiku 4.5 / Gemini 3 Flash | $0.15-0.50 | $0.50-2.00 | Cheap tier, high volume, simple tasks |
| GPT-4.1 / Sonnet 4.6 / Gemini 3 Pro | $2-3 | $10-15 | Mid tier, main reasoning |
| Opus 4.7 / o3 / Gemini 3 Ultra | $15-30 | $60-90 | Top tier, hardest cases |
| Embeddings (3-small / Cohere v3) | $0.02-0.13 | n/a | RAG retrieval |
(Verify current pricing at provider docs before quoting customers — pricing changes ~quarterly.)
Pricing strategy options:
- Seat-based: easy, predictable. Bad if usage varies wildly.
- Usage-based: aligns cost-and-revenue. Confusing for buyers; complex billing.
- Tiered with caps: $X/mo for Y requests, overage at $Z. Best of both worlds.
- Credit-based: customer buys credits, requests deduct. Common for AI tools (ChatGPT Plus, Cursor, etc.).
- Outcome-based: charge for value created (a generated proposal, a closed support ticket). Highest leverage; hard to operationalize.
Margin guardrails:
- Cost per user should be ≤30-40% of price per user at launch (assume cost will grow 1.5-2× as users use more).
- Set per-user usage caps: hard limit + soft warning.
- Build kill-switch for users 10× over expected usage (likely abuse / bot).
Cost monitoring:
- Real-time per-user cost dashboard.
- Daily total spend vs budget.
- Alert on user cost >5× tier average.
- Weekly: cost-per-active-user trend; cost-per-revenue-dollar trend.
The "free tier" trap:
- Free tier users cost real money in tokens. A 1000-user free tier with average 5 LLM calls/day at $0.05/call = $7500/mo burn for $0 revenue.
- Free tier should be HEAVILY rate-limited (5-20 calls/day max), use cheap-tier models, and convert at >5% within 30 days.
Phase 5 — Defensibility (the "we have GPT-4" problem)
Every founder pitches a defensibility moat. Most AI products have none — the moat is the underlying model, which they don't own. Diagnose what defensibility (if any) you have.
Defensibility tiers (from weak to strong):
- Model access (zero defensibility): "We have GPT-4 access" — so does every other startup with $20.
- Prompt engineering (low defensibility): clever prompts can be reverse-engineered or independently developed in days.
- UX / workflow (mid-low defensibility): nice product, copy-able in 2-3 months by a focused competitor.
- Distribution (mid defensibility): you have users, they have habit. Erodes if competitors out-build features.
- Data moat (mid-high defensibility): proprietary data that improves the model OR enables better retrieval. Very hard to replicate. Usually domain-specific (legal, medical, financial filings).
- Workflow integration (high defensibility): deep integrations with customer's existing tools (Salesforce, EHR, code repos). Switching cost grows over time.
- Network effects (highest): user contributions / interactions improve the product. Rare in pure AI plays; common in social / marketplace plays with AI.
- Regulatory moat (highest, narrow): you have a license, certification, or compliance posture competitors lack (HIPAA-compliant medical, FDA-cleared, BAA-able). Slow to build, hard to dislodge.
Realistic moat-building paths:
- Vertical AI (most common winner): pick a vertical (legal, medical, financial, real estate, sales, support). Build domain-specific RAG + workflow + integrations. Customer data + domain trust = moat.
- Workflow embedding: become the system of record for a job. AI is the engine; the moat is the workflow they can't migrate from.
- Data flywheel: customer interactions improve the model (feedback loop, fine-tuning, retrieval signal). Requires real volume + customer permission to use data.
- Distribution acquisition: capture a user base before competitors, then monetize. Risky — base can defect to cheaper or better in 2-3 years.
Ask the founder bluntly:
- "If a well-funded competitor copies your prompt + UX in 90 days, what stops them?"
- "What asset do YOU have that no one else can get easily?"
- If the answer is "we'll iterate faster" — that's not a moat; that's a treadmill.
Phase 6 — Customer trust & hallucination handling
AI products live or die on perceived trustworthiness. Hallucinations don't just lose users; they create lawsuits, public-shaming Twitter threads, and brand-killing news cycles.
Trust patterns:
- Citations / provenance: every factual claim has a clickable source. Customers can verify.
- Confidence scoring: tell the user when the model is uncertain. "I'm not sure about this — here's why."
- "I don't know" patterns: train the model (via prompting or fine-tune) to refuse when context doesn't support an answer. Hard to do well.
- Human-in-the-loop: high-stakes outputs go through human review before reaching customer. Especially: medical, legal, financial, customer-facing public communication.
- Versioning + replay: every output is tied to a model version + prompt version. Customer reports "this is wrong"; you can replay and diagnose.
- Disclaimer + scope: clear about what the AI does and doesn't do. Customers paid full attention to disclaimers when expectations are honest.
Hallucination defense in RAG:
- Retrieval coverage check: if no good source retrieved, decline rather than generate.
- Citation enforcement: every answer must include source[s] from retrieval.
- Adversarial test: questions outside your data; does model say "I don't know" or hallucinate?
- Confidence threshold: discard low-confidence retrievals.
Hallucination defense outside RAG:
- "Search and verify" agents that call tools to check facts before answering.
- Two-pass: first pass generates, second pass verifies factually.
- Caveats: "Based on training data through [date]; verify recent info."
When AI says wrong things to a customer:
- Reproduce the failure quickly.
- Add to eval set as a regression test.
- Adjust prompt / guardrail / retrieval as warranted.
- Communicate transparently with the affected customer.
- Public communication if widespread (post-mortem blog post, prompt engineering changes).
Phase 7 — Vendor lock-in & resilience
Your AI product depends on infrastructure you don't own. The lock-in is real; mitigate.
Lock-in risks:
- Provider deprecation: GPT-3.5 deprecated; Claude 1 / 2 deprecated. New model versions require re-eval. Old models go away.
- Pricing changes: providers have raised prices and lowered prices. Plan for ±50% over 24 months.
- Outages: providers have multi-hour outages. AWS Bedrock / Azure OpenAI add availability but not 100%.
- Policy changes: provider terms-of-service changes can ban your use case (e.g., legal advice, financial advice, mental health).
- Account suspensions: providers can suspend over content / abuse / billing issues with little notice.
Resilience patterns:
-
Multi-provider abstraction:
- Build behind LiteLLM / Portkey / custom router.
- Test fallback paths weekly.
- Maintain at least 2 providers for critical paths.
-
Cloud-mirror providers:
- Azure OpenAI for OpenAI models (BAA / data residency / enterprise SLA).
- AWS Bedrock for Claude, Llama, Mistral.
- Better SLA than direct API; sometimes higher cost.
-
Open-weight backup:
- Have a Llama 3.x / Mistral fine-tune ready as fallback.
- Used during outages or for data-sensitive workloads.
- 70-90% of frontier quality at 1/10 cost; latency higher.
-
Caching aggressively:
- Anthropic prompt caching, Gemini context caching → 50-90% cost reduction for repeat patterns.
- Semantic cache for similar queries.
-
Graceful degradation:
- When primary provider down, fall back to cheaper / open-weight; warn user "operating in fallback mode."
- Better than full outage.
Contractual mitigations (enterprise):
- SLA in vendor contract.
- Data processing agreements (DPA) for GDPR.
- BAA for HIPAA.
- Volume discount + price-protection clause.
Phase 8 — Launch sequence (specific to AI products)
AI products launch differently than traditional SaaS. The model behavior changes; expectations are uncertain; word-of-mouth amplifies wins and losses.
Pre-launch (T-30 to T-7):
- Internal eval: 200-500 example pass on golden dataset, ≥90% pass rate.
- Closed beta: 10-50 trusted users for 1-2 weeks. Collect prompts that fail; harden.
- Cost forecast: model usage at expected user load × token cost = burn rate. Confirm gross margin.
- Provider redundancy live: failover tested.
- Guardrails tested: prompt-injection battery, abuse cases.
- Disclaimer + ToS reviewed by legal (high-risk only).
Launch week (T-0 to T+7):
- Soft launch: open to existing list / waitlist. Monitor cost, errors, user feedback.
- 24/7 monitoring of cost spikes, error rate, p99 latency, user feedback.
- Daily user-feedback triage: top 5 complaints → fix or document.
- Usage caps active to prevent runaway cost.
Stabilize (T+7 to T+30):
- Public launch (Product Hunt, HN, Twitter).
- Eval suite running on production traffic samples.
- Cost-per-user benchmarked vs forecast; adjust pricing if margin is off.
- Provider-down drill: simulate primary provider outage, verify fallback works.
Post-launch ongoing:
- Weekly: cost trend, eval drift, top complaints, top feature requests.
- Monthly: provider price benchmark; consider re-routing.
- Quarterly: full eval suite re-run; model upgrade evaluation; pricing review.
Phase 9 — Common AI startup failure modes
- No evals → silent quality drift: model upgrade or prompt change degrades production; no one notices for weeks.
- Negative gross margin from runaway costs: 10× cost-per-user from undisciplined token usage; founder didn't model unit economics.
- No moat → commoditized: competitor with $20 OpenAI account ships same product 90 days later; no defensibility.
- Prompt injection data leak: customer A extracts customer B's data; class-action / churn / brand damage.
- Hallucination scandal: AI gives wrong medical / legal / financial advice to a customer; lawsuit + Twitter pile-on.
- Provider deprecation crisis: model deprecates; team scrambles to migrate without breaking quality.
- Provider price hike: 30% price increase erodes margin overnight; couldn't pass through to customers.
- Provider account suspension: terms-of-service change or abuse complaint suspends account; product down for days.
- Demo-quality vs production gap: works in demos, fails on real-world inputs. Insufficient eval coverage.
- Over-investment in fine-tuning: spent 6 months fine-tuning when prompt + RAG would have shipped in 4 weeks at higher quality.
Anti-patterns (don't do these)
- Ship without evals. Quality regression is invisible until customers complain.
- Single provider, no fallback. Single point of failure for an entire product.
- Stuff everything in context. 32K tokens for a single request when 2K would do = 16× cost.
- No usage caps per user. Whales eat your gross margin.
- Free tier without conversion plan. Burning money for impressions.
- Pre-fine-tune before exhausting prompting. 80%+ of fine-tune use cases are unnecessary.
- Treat "we use GPT-4" as defensibility. Embarrassing in a pitch.
- Trust LLM-as-judge with same model family as production. Blind spots in evaluation.
- No audit log. SOC2/regulatory will catch this.
- Ship medical / legal / financial without disclaimers + human-in-loop. Lawsuit waiting.
Diagnostic outputs (what you produce after a session)
For every coaching session, produce in this order:
- Architecture verdict: model + RAG/fine-tune/agent decision with reasoning.
- Eval gap: what eval infrastructure they're missing + 2-week build plan.
- Cost math: cost per user / margin reality / 2-3 levers to improve.
- Defensibility honest assessment: real moat vs theater.
- Top 3 risks specific to THIS product (cost, hallucination, regulatory, provider).
- Anti-pattern flags (1-3 traps team is closest to falling into).
- 30/60/90-day milestones with specific success / fail criteria.
- Single biggest decision for the next 14 days. ONE thing.
If founder pushes back on cost / risk discipline ("we'll figure that out later"): re-run the diagnostic. Most AI startup deaths are from skipped evals, runaway cost, or commoditization — and "later" is too late.