cohort-analysis-builder
Build cohort analyses that hold up under scrutiny. Most cohort charts presented in board decks are wrong — bad cohort definitions, mid-period accounting drift, survivorship bias, or comparing cohorts of different ages. This skill walks through cohort definition, building the analysis, validating it, and interpreting the resulting retention shapes correctly.
The work order is always: define → query → validate → interpret. Skipping validation is how "looking good in the deck" diverges from "actually good for the business".
When to engage
Trigger when the user asks:
- For a cohort analysis: "build a cohort", "retention cohort", "M1 retention", "M3 / M6 / M12", "weekly cohorts", "monthly cohorts"
- About retention shapes: "is this normal?", "smile curve", "cliff at week 4", "retention is flatlining"
- For specific tools: SQL cohort query, Mode / Looker / Tableau cohort, Amplitude / Mixpanel / Heap cohort, Excel / Google Sheets cohort
- About cohort comparisons: "are paid users retaining better than free?", "how does v2 compare to v1?", "geo cohort", "channel cohort"
- For NRR / GRR analysis: "net revenue retention by cohort", "expansion cohort", "logo retention by cohort"
- For investor / board prep: "investor wants M12 retention", "QofE cohort review", "due diligence cohort question"
Do not engage for: pure churn-prediction ML models (different skill), customer segmentation analyses (different — use customer-segmentation-coach), or LTV calculation alone (related but distinct).
Define the cohort first — this is half the work
Most analytical mistakes are at the definition step.
Cohort axis: what binds the cohort together
Different products demand different cohort definitions:
- Signup cohort: all users who signed up in period X. Most common; useful for top-of-funnel evaluation. Risk: signup is not activation; many signups never use the product.
- Activation cohort: all users who completed an "activation event" (e.g., first save, first invite, first purchase) in period X. Better for product-led products; aligns with "real" usage.
- First-purchase cohort: all users who made their first paid transaction in period X. Best for paid SaaS / subscription / e-commerce.
- First-paid-month cohort: all customers whose first paid invoice was in period X. Standard for SaaS where free trial precedes paid conversion.
- Feature-adoption cohort: all users who first used feature F in period X. For feature-specific retention analysis.
- Geography / segment cohort: all customers from segment S in period X. Combine with one of the above.
Default: for SaaS, use first-paid-month cohort (matches investor / board / financial framing). For product analytics, use activation cohort (matches actual user behavior).
Time grain: how to slice the cohort
- Daily cohorts: useful only for very-high-volume consumer products (>1K signups/day) where daily-grain noise averages out
- Weekly cohorts: standard for consumer products and PLG SaaS in early days
- Monthly cohorts: standard for B2B SaaS and most subscription products at scale
- Quarterly cohorts: for enterprise / contract-based businesses where customers come in seasonal waves
Don't mix grains in one chart. If you have weekly cohorts and only 8 weeks of data, that's an 8-week weekly chart — don't relabel it "two months of monthly retention".
Retention period grain — DIFFERENT from cohort grain
You can have monthly cohorts measured at weekly retention granularity (W1, W2, W3...). The cohort axis (how cohorts are formed) and the retention axis (how often you measure them after) are independent decisions.
For SaaS with monthly billing: monthly cohorts × monthly retention measure (M1, M2, M3, ...) = standard. For consumer apps: weekly cohorts × daily retention measure (D1, D7, D14, D30) = standard.
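The retention index itself is just a date difference at the chosen grain. A minimal sketch in Python, assuming monthly cohorts and a monthly retention measure:

```python
from datetime import date

def month_index(cohort_month: date, activity_month: date) -> int:
    """Whole-calendar-month difference between the cohort's month and the
    activity month (M0 = the cohort's own month)."""
    return (activity_month.year - cohort_month.year) * 12 + (
        activity_month.month - cohort_month.month
    )

# A customer who first paid in Jan 2024 and was active in Apr 2024 is in M3
print(month_index(date(2024, 1, 1), date(2024, 4, 1)))  # 3
```

The same function with a week-truncation step in place of the month arithmetic gives you weekly retention indexes without changing how cohorts are formed.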
Retention event: what does "retained" mean
- Logo retention: customer is still paying. Easiest to measure. Misses expansion / contraction.
- Revenue retention: dollars from cohort N at time T / dollars from cohort N at time 0. Captures expansion + churn + contraction.
- Net revenue retention (NRR): cohort's current revenue / cohort's original revenue. Includes upsell, churn, and contraction.
- Gross revenue retention (GRR): same ratio, but each customer's current revenue is capped at their original revenue (no upsell counted). Measures churn / contraction only.
- Activity retention: did the user perform action X in week N? For consumer products / engagement metrics.
- Per-user revenue retention: ARPU evolution per cohort.
Choose one and stick with it. Don't switch mid-analysis. If the audience needs both NRR and GRR, build both as separate cohort tables and present together.
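The NRR / GRR distinction comes down to one per-customer cap. A sketch with hypothetical (original, current) revenue pairs, not a real schema:

```python
def nrr(pairs):
    """Net revenue retention: total current revenue / total original revenue.
    Expansion above a customer's original spend counts in full."""
    orig = sum(o for o, _ in pairs)
    curr = sum(c for _, c in pairs)
    return curr / orig

def grr(pairs):
    """Gross revenue retention: each customer's current revenue is capped at
    their original spend, so upsell can never mask churn."""
    orig = sum(o for o, _ in pairs)
    capped = sum(min(o, c) for o, c in pairs)
    return capped / orig

# (original, current) revenue per customer: one expanded, one flat, one churned
cohort = [(100, 150), (100, 100), (100, 0)]
print(round(nrr(cohort), 3))  # 0.833 — expansion hides the churned logo
print(round(grr(cohort), 3))  # 0.667 — churn / contraction only
```

Note how one expanding customer pulls NRR above GRR even with a full logo churn in the cohort — exactly why boards ask for both.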
Building the cohort table — SQL pattern
The canonical cohort retention SQL (DATE_TRUNC / DATE_DIFF syntax varies by warehouse — adjust for your dialect):

-- Step 1: define cohort assignment
WITH cohort_assignment AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', first_paid_date) AS cohort_month,
        first_paid_revenue
    FROM customers
    WHERE first_paid_date IS NOT NULL
),

-- Step 2: define monthly active / paying state
monthly_state AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', invoice_date) AS month,
        SUM(invoice_amount) AS revenue_in_month,
        1 AS is_active
    FROM invoices
    WHERE status = 'paid'
    GROUP BY 1, 2
),

-- Step 3: compute cohort × retention period
cohort_retention AS (
    SELECT
        c.cohort_month,
        DATE_DIFF('month', c.cohort_month, m.month) AS month_index,
        COUNT(DISTINCT c.customer_id) AS active_customers,
        SUM(m.revenue_in_month) AS cohort_revenue
    FROM cohort_assignment c
    LEFT JOIN monthly_state m
        ON c.customer_id = m.customer_id
        AND m.month >= c.cohort_month
    GROUP BY 1, 2
)

-- Step 4: pivot month_index into columns for the chart
SELECT
    cohort_month,
    MAX(CASE WHEN month_index = 0 THEN active_customers END) AS cohort_size,
    MAX(CASE WHEN month_index = 1 THEN active_customers END) AS m1_active,
    MAX(CASE WHEN month_index = 3 THEN active_customers END) AS m3_active
    -- ...
FROM cohort_retention
GROUP BY 1
ORDER BY 1;
Adapt for: weekly cohorts (use DATE_TRUNC('week', ...)), feature retention (replace is_active with did_feature_X), revenue retention (use revenue_in_month instead of distinct count).
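To sanity-check the pattern end to end on a toy dataset, the same logic can be run in-memory with SQLite. This is a sketch — table and column names mirror the canonical query, but SQLite lacks DATE_TRUNC / DATE_DIFF, so months are bucketed with strftime and the month index computed by hand:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INT, first_paid_date TEXT);
    CREATE TABLE invoices  (customer_id INT, invoice_date TEXT,
                            invoice_amount REAL, status TEXT);
    INSERT INTO customers VALUES
        (1, '2024-01-05'), (2, '2024-01-20'), (3, '2024-02-03');
    INSERT INTO invoices VALUES
        (1, '2024-01-05', 100, 'paid'), (1, '2024-02-05', 100, 'paid'),
        (2, '2024-01-20', 100, 'paid'),  -- customer 2 churns after M0
        (3, '2024-02-03', 100, 'paid'), (3, '2024-03-03', 100, 'paid');
""")

rows = con.execute("""
    WITH cohort_assignment AS (
        SELECT customer_id, strftime('%Y-%m', first_paid_date) AS cohort_month
        FROM customers
    ),
    monthly_state AS (
        SELECT customer_id, strftime('%Y-%m', invoice_date) AS month
        FROM invoices WHERE status = 'paid'
        GROUP BY 1, 2
    )
    SELECT c.cohort_month,
           -- month index = whole-month difference from the 'YYYY-MM' strings
           (CAST(substr(m.month, 1, 4) AS INT)
              - CAST(substr(c.cohort_month, 1, 4) AS INT)) * 12
             + CAST(substr(m.month, 6, 2) AS INT)
             - CAST(substr(c.cohort_month, 6, 2) AS INT) AS month_index,
           COUNT(DISTINCT c.customer_id) AS active_customers
    FROM cohort_assignment c
    LEFT JOIN monthly_state m
      ON c.customer_id = m.customer_id AND m.month >= c.cohort_month
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()

# Each cohort's M0 equals its full size; customer 2 drops out of Jan's M1
print(rows)  # [('2024-01', 0, 2), ('2024-01', 1, 1), ('2024-02', 0, 1), ('2024-02', 1, 1)]
```

Running the warehouse query and a toy reproduction like this on the same few customers is a cheap way to catch join and bucketing bugs before they hit a deck.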
Common SQL mistakes
- Using LEFT JOIN then WHERE m.month IS NOT NULL: turns it into an INNER JOIN and drops cohorts with zero retention
- Forgetting to align cohort_month: comparing the customer's first-paid date to a fiscal calendar that doesn't match
- Counting customer_id from invoices, not from cohort_assignment: misses customers who churned in M1
- Using last_invoice_date instead of monthly invoice presence: misclassifies long-term customers who paid in M1 and skipped M2
Validation — before showing anyone
Run these checks. Most cohort errors are caught here, not at presentation.
Cohort-size sanity check
- Each cohort row should have a customer count that matches your CRM / billing-system count for that period
- If your "Jan 2024 cohort" has 47 customers but your finance system shows 50 paid customers acquired in Jan 2024, find the 3 missing — partial-month signups, returns, refunds, edge cases
Retention-vs-aggregate sanity check
- M0 of every cohort = 100% by definition — a customer is in the cohort precisely because they were active in M0
- Sum across cohorts should match overall active customer count at each calendar month
- If overall active customers = 1000 in Jan 2025, sum of all (cohort, month_index) where calendar = Jan 2025 should also = 1000
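The reconciliation in the last bullet is easy to automate. A sketch, assuming a mapping of (cohort, calendar month) to active counts — names and numbers are hypothetical:

```python
from collections import defaultdict

def actives_by_calendar_month(cohort_actives):
    """Roll cohort-level active counts back up to calendar months so they can
    be reconciled against the overall active-customer number per month."""
    totals = defaultdict(int)
    for (cohort_month, calendar_month), n in cohort_actives.items():
        totals[calendar_month] += n
    return dict(totals)

cohort_actives = {
    ("2024-12", "2025-01"): 600,
    ("2025-01", "2025-01"): 400,
}
# If the billing system says 1000 actives in Jan 2025, the cohort table agrees
print(actives_by_calendar_month(cohort_actives)["2025-01"])  # 1000
```

Any mismatch between this rollup and the billing system's number means customers are being dropped or double-counted somewhere in the cohort logic.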
Monotonicity check (for logo retention)
- Logo retention curve should be monotonically non-increasing — if M2 retention > M1 retention for the same cohort, you have a re-engagement counted as new retention (data leakage)
- If you see "smile curve" (M3 < M6 < M9 retention), 95% of the time it's an analytical bug, not a real product trend
- Common cause: re-subscribing after churn is being counted as "still retained" — fix the SQL to count only continuous retention
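The monotonicity check is a one-liner worth wiring into whatever produces the table. A sketch:

```python
def violates_monotonicity(curve):
    """Return the indexes where a logo-retention curve goes UP — which, for
    continuous retention, should never happen (likely re-engagement leakage)."""
    return [i for i in range(1, len(curve)) if curve[i] > curve[i - 1]]

print(violates_monotonicity([1.00, 0.50, 0.30, 0.28, 0.27]))        # []
print(violates_monotonicity([1.00, 0.50, 0.30, 0.25, 0.30, 0.35]))  # [4, 5] — smile, validate first
```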
Edge case: customer left and returned
- Should they be counted as retained in months they were paused / churned?
- Default: NO. They're a NEW cohort if they came back.
- If you do count them as retained-throughout, surface this explicitly.
Survivorship bias check
- Does the cohort table include cohorts whose retention you can't yet measure to the same depth?
- Example: if you're computing M12 retention and your most recent cohort is 6 months old, that cohort has incomplete M12 data. Either: drop incomplete cohorts, or clearly mark them.
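Marking incomplete cells explicitly is cheap. A sketch that blanks out retention periods a cohort hasn't yet lived through, assuming month-index positions and a known age per cohort:

```python
def mask_incomplete(cohort_age_months, curve):
    """Replace retention values beyond the cohort's observed age with None, so
    a 6-month-old cohort can't masquerade as having M12 data."""
    return [v if i <= cohort_age_months else None for i, v in enumerate(curve)]

# A 6-month-old cohort must not show M7+ retention, let alone M12
print(mask_incomplete(6, [1.0, 0.8, 0.7, 0.65, 0.6, 0.58, 0.57] + [0.5] * 6))
```

Whatever renders the heatmap should then show those None cells as blank, not as zero — a zero reads as total churn.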
Cross-validation
- Run the same cohort logic in a different tool (e.g., compute in SQL, validate in spreadsheet for one cohort)
- Compare to your finance / accounting system — total revenue per cohort over time should match invoice totals
Interpretation — what retention shapes mean
Cliff retention (sharp drop in early periods, then flat)
- Pattern: 100% → 50% → 30% → 28% → 27% → 27%...
- Typical of: products with a strong "evergreen user" segment after early activation
- Healthy when: the flat tail is at a high enough percentage to support unit economics
- Concerning when: cliff is too steep (e.g., 50% drop in M1 = poor activation / onboarding)
Linear-decline retention
- Pattern: 100% → 90% → 80% → 70% → 60%...
- Typical of: subscription products with predictable annual / multi-year churn
- Concerning if: the slope steepens over time (cohorts are getting worse)
- Healthy if: the slope is flat across cohorts (consistent retention) and acceptable level (e.g., 90% per year)
S-curve / leveled retention
- Pattern: 100% → 60% → 40% → 35% → 33% → 32% → 32%...
- Typical of: products with strong product-market fit; customers either churn early or stay forever
- Healthy when: tail level is high enough; cohort sizes are growing
Smile curve (re-engagement)
- Pattern: 100% → 50% → 30% → 25% → 30% → 35%...
- Almost always a data error: re-subscribers being counted as continuously retained
- Real smile curves exist but are rare — typically only on consumer products with seasonal / lifecycle reasons (fitness apps in January, tax software in April)
- ALWAYS validate the smile curve before presenting
Aging / cohort-version retention
- Plot multiple cohorts on the same chart (Jan 2024, Feb 2024, Mar 2024, ...)
- Healthy: more recent cohorts retain at least as well as older ones (slope of retention curves is consistent)
- Concerning: more recent cohorts retain WORSE than older cohorts ("retention is degrading") — often a sign of growth-driven loosening of ICP, or competitor pressure
- Healthy: more recent cohorts retain BETTER than older cohorts (often a sign of product improvement)
Cohort layering — the high-leverage analysis
Single-cohort analysis answers "what's our retention". Layered cohort analysis answers "what's CAUSING our retention".
Layer by acquisition channel
- Compare retention curves of customers from organic, paid Google, paid Meta, content, referrals, sales-led
- High-leverage finding: paid channels often retain worse than organic; if so, CAC payback math should adjust
Layer by pricing tier
- Compare retention of Starter, Pro, Business, Enterprise
- Common: lower tiers churn more; helps justify investment in upmarket motion
Layer by product version / feature adoption
- Compare cohorts who adopted Feature X vs cohorts who didn't
- Finding: "users who adopted Feature X in M1 retain at 90% in M6 vs 50% for those who didn't"
- Caution: correlation, not causation — Feature X adoption may be a proxy for engagement, not a driver
Layer by sales rep / customer success owner
- Compare retention of customers across reps / CSMs
- Surfaces individual performance differences
Layer by industry / company size
- Compare SMB / mid-market / enterprise cohorts
- Surfaces ICP fit gaps
Layer by geography
- Compare US / EU / APAC cohorts
- Surfaces market-fit differences
Layer by signup-date period
- Compare cohorts pre-/post- a major product change
- Cleanest test of whether a feature / pricing / messaging change actually improved retention
Common pitfalls
Mid-period accounting drift
- Cohort defined by start-of-month signup but customer actually started mid-month (free trial period)
- Their "M1" really only includes 15 days of post-signup activity
- Fix: define cohorts by exact dates, then bucket; or accept the drift as systematic
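The size of the drift is easy to quantify per customer. A sketch with a hypothetical helper: the fraction of the signup's first calendar month that falls after signup:

```python
import calendar
from datetime import date

def m0_exposure_fraction(signup: date) -> float:
    """Fraction of the signup's first calendar month the customer actually
    experienced — a Jan 16 signup only lived about half of its 'M0'."""
    days_in_month = calendar.monthrange(signup.year, signup.month)[1]
    return (days_in_month - signup.day + 1) / days_in_month

print(round(m0_exposure_fraction(date(2024, 1, 16)), 2))  # 0.52
```

If the average exposure fraction differs materially across cohorts (e.g., a launch-day spike mid-month), the drift isn't systematic and exact-date cohorting is the safer fix.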
Survivorship bias
- An M12 retention chart can only include cohorts that are ≥12 months old, so it necessarily omits every recent cohort
- The audience reads it as current state when it actually describes only older cohorts — a historic-state view
- Fix: clearly label "cohorts with sufficient age" + footnote
Currency / pricing changes
- If you charge in local currency and report in USD, FX swings show up as "expansion" or "contraction" that isn't real
- Fix: report constant-currency NRR / GRR
Plan migrations
- Customer downgrades from Pro to Starter, or moves between plans
- Counted as expansion / contraction depending on direction
- Fix: track plan changes separately; surface migration as its own metric
Refunds / chargebacks
- Customer paid in M1, refunded in M2 — should they count as M1 retained?
- Default: count their full original cohort revenue, then subtract in subsequent months as a contraction
- Fix: be explicit about treatment
Payment method changes
- Customer's credit card expires, fails to renew, then re-enters payment 30 days later
- Counted as churn-and-return or as continuous retention?
- Default: continuous if return is < 30 days (grace period)
- Fix: define grace period explicitly
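The grace-period rule can be encoded directly so it's applied consistently everywhere. A sketch using the 30-day default from above; dates are hypothetical:

```python
from datetime import date

def is_continuous(expected_renewal: date, recovered_payment: date,
                  grace_days: int = 30) -> bool:
    """Treat a payment gap within the grace period as continuous retention;
    anything longer is churn-and-return (a new cohort under the default)."""
    return (recovered_payment - expected_renewal).days <= grace_days

print(is_continuous(date(2024, 3, 1), date(2024, 3, 25)))  # True  — card retry inside grace
print(is_continuous(date(2024, 3, 1), date(2024, 6, 1)))   # False — churned; re-entry starts a new cohort
```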
Zombie customers
- Customers paying $0 (free / trialing forever) included in cohort
- Distorts revenue retention but not logo retention
- Fix: define paying customer explicitly (e.g., "any customer with > $0 invoice in last 90 days")
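The "paying customer" test in the fix can likewise be made explicit in code. A sketch — the 90-day window is the example threshold from the text, and the invoice tuples are hypothetical:

```python
from datetime import date, timedelta

def is_paying(invoices, as_of: date, window_days: int = 90) -> bool:
    """True if the customer has any > $0 paid invoice in the trailing window —
    the explicit test that keeps zombies out of revenue cohorts."""
    cutoff = as_of - timedelta(days=window_days)
    return any(amt > 0 and cutoff <= d <= as_of for d, amt in invoices)

print(is_paying([(date(2025, 1, 10), 0.0), (date(2025, 2, 1), 49.0)], date(2025, 3, 1)))  # True
print(is_paying([(date(2024, 1, 10), 49.0)], date(2025, 3, 1)))                           # False
```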
Tool-specific cohort building
SQL (Postgres / Snowflake / BigQuery / Redshift)
- Use the canonical pattern above
- Window functions help: ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY month) for picking each customer's first month during cohort assignment
Amplitude / Mixpanel
- Built-in cohort visualization; risk is overconfidence in the tool's defaults
- Always check: cohort definition (event-based vs property-based), retention type (Returning / Bracket / Unbounded)
Looker
- LookML-based; can build retention as a named metric
- Watch for: persistent derived tables that don't refresh; aggregate awareness conflicts
dbt + Mode / Tableau / Metabase
- Build the cohort table in dbt as a materialized model
- Visualize in any BI tool
- Best for reproducibility and audit trail
Spreadsheet (Excel / Google Sheets)
- For one-off analysis or first cohort
- Use INDEX/MATCH or pivot table; risky for production / repeated analysis
Output to user after diagnostic
After understanding the user's product, scale, and goal:
- Cohort definition recommendation: cohort axis (signup / activation / first-paid), cohort grain (weekly / monthly / quarterly), retention period grain
- Retention metric: logo / GRR / NRR / activity — match to the question being asked
- SQL or tool template: canonical query adapted to the user's schema
- Validation checklist: specific checks to run before presenting (cohort sizes, monotonicity, aggregate match)
- Interpretation framework: what retention shape to expect, what would be concerning vs healthy
- Layering plan: which 2-3 dimensions to layer first for maximum insight
- Pitfalls to avoid: tied to the user's specific business model (e.g., currency for international, plan migration for tiered SaaS)
- Output format: heatmap (cohort × period), unsmoothed curve overlay, table for board
A solid cohort analysis is one of the highest-leverage diagnostic tools in B2B and consumer subscription. A flawed cohort analysis is one of the most common ways founders / ops teams convince themselves of false retention narratives. This skill builds the discipline to do it right, every time.