Feature Flag Cleanup
Audit and retire stale feature flags across a codebase and a flag service (LaunchDarkly, Unleash, Flagsmith, GrowthBook, Split, custom). Produces a ranked removal plan, owner-tagged tickets, removal PRs grouped by risk, and a four-week cleanup sprint. Acts as a platform engineer who has decommissioned thousands of flags without breaking production.
Usage
Invoke this skill when feature flag debt is slowing engineering down: builds carry stale toggles, code paths exist for experiments that ended a year ago, the LaunchDarkly bill is climbing, or a refactor is blocked by flags that are never read.
Basic invocation:
- "Audit our LaunchDarkly account and our monorepo for stale flags"
- "We have 1,200 flags in Unleash — find the dead ones"
- "Write a removal PR for the `new-checkout-v2` flag"
With context:
- "Here's our LD export and the code grep — order removals by risk"
- "We're moving off Split.io to GrowthBook — what flags die in the migration?"
- "Audit only the flags owned by my team (CODEOWNERS: payments)"
The agent produces a stale-flag inventory, a four-week cleanup schedule, removal PRs in safety order, and a per-owner ticket list.
How It Works
Step 1: Inventory The Flag Estate
The agent first builds a complete inventory across four surfaces and joins them:
| Surface | What It Holds | How To Pull |
|---|---|---|
| Flag service | Targeting rules, rollout %, last-modified date, evaluation count | LaunchDarkly REST API /api/v2/flags, Unleash /api/admin/features, Flagsmith /api/v1/features/, GrowthBook /api/v1/features, Split /internal/api/v2/splits |
| Codebase | Flag references in source, configs, tests | git grep, AST parse, language-specific clients (useFeatureFlag, client.boolVariation, unleash.isEnabled) |
| Telemetry | Production evaluation counts per flag per day | Datadog, Honeycomb, vendor evaluation logs, OpenTelemetry traces |
| Ticket system | Originating ticket, intended sunset date | Jira flag: label, Linear cycle search |
The join key is the flag's key (string id). Any flag that exists in only one surface is suspect — code references with no service definition are dead code; service definitions with no code references are dead config.
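A minimal sketch of that join, assuming the service export has already been saved as a CSV with a `key` column and code references were collected with `git grep -n`; the file names, column names, and key-extraction regex are illustrative, not a fixed interface:

```python
import csv
import re
from collections import defaultdict

def load_service_flags(path="flags_export.csv"):
    # Service export with at least a "key" column (format is an assumption).
    with open(path, newline="") as f:
        return {row["key"]: row for row in csv.DictReader(f)}

def load_code_refs(path="code_refs.txt"):
    # Each grep line looks like "src/checkout/route.tsx:42:useFeatureFlag('new_checkout_v2')".
    refs = defaultdict(list)
    key_pattern = re.compile(r"""["']([a-z0-9_.-]+)["']""")  # crude key extractor
    with open(path) as f:
        for line in f:
            file_path, _, rest = line.partition(":")
            for key in key_pattern.findall(rest):
                refs[key].append(file_path)
    return refs

def join(service_flags, code_refs):
    all_keys = set(service_flags) | set(code_refs)
    for key in sorted(all_keys):
        yield {
            "key": key,
            "in_service": key in service_flags,
            "code_ref_count": len(code_refs.get(key, [])),
            # One-surface flags are the first suspects (dead code or dead config).
            "suspect": (key in service_flags) != (key in code_refs),
        }

if __name__ == "__main__":
    rows = list(join(load_service_flags(), load_code_refs()))
    print(f"{sum(r['suspect'] for r in rows)} one-surface flags out of {len(rows)}")
```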
Step 2: Classify Every Flag
Not all flags retire the same way. The agent assigns one of five types before deciding what to do:
| Type | Lifetime Expectation | Cleanup Default |
|---|---|---|
| Release toggle | Days to weeks during a rollout | Remove once 100% on for 30 days |
| Experiment | One experiment cycle (2-8 weeks) | Remove once analysis is shipped, winner picked |
| Permission / entitlement | Permanent | Migrate to authz system, then remove |
| Ops / kill switch | Permanent | Keep but document; review yearly |
| Config / parameter | Permanent | Migrate to config service if static, remove if dynamic decision is gone |
A flag's type is rarely tagged at creation — the agent infers from name patterns (enable_, kill_, experiment_, tier_), targeting rules (percentage rollout vs user-list vs segment), and evaluation patterns (steady-state vs spiking on deploy days).
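A hedged sketch of that inference using name-pattern heuristics alone (targeting rules and evaluation patterns would refine the guess); the prefixes and the `FlagRecord` shape are assumptions, not a vendor schema:

```python
import re
from dataclasses import dataclass

@dataclass
class FlagRecord:
    key: str
    rollout_percent: float | None = None   # None when targeting is user-list based
    variant_count: int = 2

NAME_HINTS = [
    (re.compile(r"^(kill_|ops_)|_kill_switch$"), "ops/kill-switch"),
    (re.compile(r"^(experiment_|exp_|ab_)"), "experiment"),
    (re.compile(r"^(tier_|plan_|entitle_)"), "permission/entitlement"),
    (re.compile(r"^(config_|param_)"), "config/parameter"),
    (re.compile(r"^(enable_|release_)"), "release toggle"),
]

def infer_type(flag: FlagRecord) -> str:
    for pattern, flag_type in NAME_HINTS:
        if pattern.search(flag.key):
            return flag_type
    # Fallback: percentage rollouts with many variants look like experiments,
    # plain percentage rollouts look like release toggles.
    if flag.rollout_percent is not None and flag.variant_count > 2:
        return "experiment"
    if flag.rollout_percent is not None:
        return "release toggle"
    return "unclassified"

print(infer_type(FlagRecord("kill_payments_api")))                              # ops/kill-switch
print(infer_type(FlagRecord("enable_new_checkout", rollout_percent=100.0)))     # release toggle
```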
Step 3: Detect Staleness
The agent applies a layered staleness ruleset. A flag must trip at least two rules to be marked stale (single-rule failures produce a "watch list", not a removal).
Time rules:
R1. created_at older than 90 days
R2. last_modified older than 60 days (rules untouched)
R3. last_evaluation older than 30 days (no traffic)
R4. originating Jira ticket closed >180 days ago
State rules:
R5. served value is 100% (or 0%) for 30+ consecutive days
R6. all targeting rules collapse to a single variant (no branching)
R7. zero overrides, zero environment-specific differences
R8. variant served matches default for environment
Code rules:
R9. flag key not present in main branch (only in deleted branches / archived dirs)
R10. all code references are inside a single if branch with no else
R11. all references are tests (no production callsite)
R12. references exist only in disabled feature folders
Service rules:
R13. flag is archived in service but still referenced in code
R14. flag is in service but never linked to code (orphan)
R15. flag references a deleted segment / user list
R16. flag's prerequisites form a cycle or reference deleted flags
A removal candidate scores count(rules_tripped) plus a risk modifier (see Step 4). Anything ≥ 4 is a strong removal; 2-3 is staged with extra verification.
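A minimal scoring sketch over a joined inventory record; the field names are assumptions standing in for the Step 1 columns, and only a few of the sixteen rules are shown:

```python
from datetime import datetime, timedelta, timezone

NOW = datetime.now(timezone.utc)

def days_ago(ts):
    return (NOW - ts).days if ts else None

def staleness_rules(flag: dict) -> list[str]:
    tripped = []
    if (d := days_ago(flag.get("created_at"))) and d > 90:
        tripped.append("R1")
    if (d := days_ago(flag.get("last_modified"))) and d > 60:
        tripped.append("R2")
    if (d := days_ago(flag.get("last_evaluation"))) and d > 30:
        tripped.append("R3")
    if flag.get("serving_single_variant_days", 0) >= 30:
        tripped.append("R5")
    if flag.get("in_service") and flag.get("code_ref_count", 0) == 0:
        tripped.append("R14")
    # ...remaining rules follow the same shape
    return tripped

def verdict(flag: dict) -> str:
    score = len(staleness_rules(flag)) + flag.get("risk_modifier", 0)
    if score >= 4:
        return "strong removal"
    if score >= 2:
        return "staged removal (extra verification)"
    if score == 1:
        return "watch list"
    return "keep"

example = {
    "created_at": NOW - timedelta(days=200),
    "last_modified": NOW - timedelta(days=180),
    "last_evaluation": NOW - timedelta(days=45),
    "serving_single_variant_days": 60,
    "in_service": True,
    "code_ref_count": 3,
}
print(staleness_rules(example), "->", verdict(example))  # ['R1', 'R2', 'R3', 'R5'] -> strong removal
```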
Step 4: Risk Classification (Severity Matrix)
Each flag gets a removal-risk grade. Risk is independent of staleness — a stale flag can still be high-risk to remove if the wrong default ships.
| Grade | Criteria | Removal Approach |
|---|---|---|
| R0 — Trivial | Test-only references, dev-only flag, zero prod traffic in 90d | Single PR, batch with siblings |
| R1 — Low | Release toggle 100% on 60+ days, simple boolean, single owner | One PR per flag, standard review |
| R2 — Medium | Touches a paid feature, multiple owners, has variants beyond on/off, evaluated >1k/day | One PR per flag, two reviewers, deploy in own release window |
| R3 — High | Permission/entitlement, billing-adjacent, payment path, auth path | Migrate before remove. Owner sign-off required, integration tests, dark-launch the removal |
| R4 — Critical | Kill switch, regulatory, data residency, encryption toggle | Keep. Document and review yearly. Removal blocked unless replacement is in place. |
The agent never auto-generates a removal PR for R3 or R4 — those produce migration tickets instead.
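A hedged sketch of the grading as ordered checks, most restrictive first; the attribute names are assumptions about the inventory record, not vendor fields:

```python
def risk_grade(flag: dict) -> str:
    # Most restrictive grade wins; a flag only falls through to R1 if nothing above applies.
    if flag.get("is_kill_switch") or flag.get("is_regulatory"):
        return "R4"
    if flag.get("type") == "permission/entitlement" or flag.get("touches_payment_or_auth"):
        return "R3"
    if (flag.get("touches_paid_feature") or flag.get("owner_count", 1) > 1
            or flag.get("variant_count", 2) > 2 or flag.get("evals_per_day", 0) > 1_000):
        return "R2"
    if flag.get("test_only_refs") or flag.get("prod_evals_last_90d", 0) == 0:
        return "R0"
    return "R1"

def auto_pr_allowed(grade: str) -> bool:
    # R3 and R4 never get an auto-generated removal PR; they get migration tickets instead.
    return grade in ("R0", "R1", "R2")
```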
Step 5: Owner Tagging
Removals stall when nobody is on the hook. The agent assigns each flag exactly one owner using a fallback chain:
1. Service-side `tags` or `maintainer` field (LaunchDarkly tags, Unleash project)
2. Originating Jira/Linear ticket assignee
3. CODEOWNERS for the file containing the most references
4. `git log --follow` first author of the introducing commit
5. `git blame` on the line of the most recent flag check
6. Team channel mapping (#payments → @payments-leads)
The agent emits a per-owner queue so each engineer sees only their flags. Bulk emails to "engineering@" produce zero cleanup; per-owner tickets with a 2-week SLA produce 70%+ completion.
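A sketch of the fallback chain; each resolver returns an owner or `None` and the first hit wins. The record fields, the CODEOWNERS mapping, and the `git log -S` pickaxe search are assumptions about how the upstream data was gathered, and only part of the six-step chain is shown:

```python
import subprocess

def from_service_tags(flag):
    return flag.get("maintainer")            # 1. service-side maintainer/tag field

def from_ticket(flag):
    return flag.get("ticket_assignee")       # 2. originating Jira/Linear assignee

def from_codeowners(flag, codeowners_rules):
    # 3. CODEOWNERS entry for the file with the most references (parsing assumed done upstream).
    refs_by_file = flag.get("refs_by_file", {})
    if not refs_by_file:
        return None
    top_file = max(refs_by_file, key=refs_by_file.get)
    return codeowners_rules.get(top_file)

def from_git_first_author(flag):
    # 4. author of the oldest commit that touched the key (git log is newest-first).
    authors = subprocess.run(
        ["git", "log", "--format=%ae", "-S", flag["key"]],
        capture_output=True, text=True).stdout.splitlines()
    return authors[-1] if authors else None

def resolve_owner(flag, codeowners_rules, team_channel_map):
    for candidate in (
        from_service_tags(flag),
        from_ticket(flag),
        from_codeowners(flag, codeowners_rules),
        from_git_first_author(flag),
    ):
        if candidate:
            # 6. map a team handle to its channel/leads when such a mapping exists
            return team_channel_map.get(candidate, candidate)
    return "unowned-triage"   # last resort: a shared triage queue, never a broadcast email
```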
Step 6: Generate Removal Recipes
For each removable flag the agent emits a deterministic recipe by language and client. Every recipe follows the same pattern:
- Replace the `if (flag) { A } else { B }` check with the winning branch
- Delete the dead branch and any helpers it called
- Remove the flag-client import if it was the last reference in the file
- Delete the flag's service definition (or archive it)
- Drop tests for the dead branch; re-baseline snapshot tests
- Remove documentation references (runbooks, ADRs, feature lists)
Example — TypeScript / LaunchDarkly:
// before
import { useFlags } from 'launchdarkly-react-client-sdk';
const { newCheckout } = useFlags();
return newCheckout ? <CheckoutV2 /> : <CheckoutV1 />;
// after (newCheckout was 100% on for 60 days)
return <CheckoutV2 />;
// then delete CheckoutV1.tsx, its tests, its CSS, and remove from routing
Example — Go / Unleash:
// before
if unleash.IsEnabled("ff_async_export") {
go runAsyncExport(ctx, req)
} else {
runSyncExport(ctx, req)
}
// after
go runAsyncExport(ctx, req)
// then delete runSyncExport, delete its mock, prune the unleash strategy
Example — Python / Flagsmith:
# before
if flagsmith.has_feature("legacy_pricing", default=False):
price = legacy_price_engine(cart)
else:
price = price_engine_v2(cart)
# after
price = price_engine_v2(cart)
# then drop legacy_price_engine module and its 14 fixture files
Example — Custom DB-backed flag:
# before
if FeatureToggle.objects.get(key="multi_currency").enabled:
currency = detect_user_currency(user)
else:
currency = "USD"
# after
currency = detect_user_currency(user)
# then drop the FeatureToggle row, the migration to delete is M0142
Step 7: Pull Request Strategy
The agent organizes PRs to balance reviewer load and blast radius:
- Batch R0 trivial — one PR per service or per directory, up to 20 flags. Single reviewer.
- One PR per R1 — small, mechanical, easy revert. Title format: `chore(flags): remove ff_X (100% on since YYYY-MM-DD)`.
- One PR per R2 with deploy gate — merge during a low-traffic window, watch the error budget for 24h, document the rollback flag default.
- R3 splits into two PRs:
- PR-1: Land the replacement (authz rule, config value, kill switch in new system) without touching the flag.
- PR-2: Remove the flag once PR-1 has been in production for 7 days clean.
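The sketch below turns a list of graded removal candidates into that PR plan: R0 batched per directory (up to 20 flags), one PR per R1 and R2, a two-PR split for R3, and nothing for R4. The candidate dict shape is an assumption:

```python
from collections import defaultdict
from itertools import islice

BATCH_LIMIT = 20

def plan_prs(candidates):
    prs = []
    r0_by_dir = defaultdict(list)
    for flag in candidates:
        grade = flag["risk"]
        if grade == "R0":
            r0_by_dir[flag["top_dir"]].append(flag["key"])
        elif grade in ("R1", "R2"):
            prs.append({"title": f"chore(flags): remove {flag['key']} (100% on since {flag['since']})",
                        "flags": [flag["key"]],
                        "deploy_gate": grade == "R2"})      # R2 merges in its own release window
        elif grade == "R3":
            prs.append({"title": f"feat: land replacement for {flag['key']} (PR-1 of 2)",
                        "flags": []})
            prs.append({"title": f"chore(flags): remove {flag['key']} (PR-2 of 2)",
                        "flags": [flag["key"]],
                        "blocked_on": "PR-1 in production for 7 clean days"})
        # R4: no PR at all; documented and parked
    for directory, keys in r0_by_dir.items():
        it = iter(keys)
        while batch := list(islice(it, BATCH_LIMIT)):
            prs.append({"title": f"chore(flags): batch-remove {len(batch)} dead flags in {directory}",
                        "flags": batch})
    return prs
```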
Every PR includes a Rollback Plan section. The agent generates it from the flag definition:
## Rollback Plan
If incident: revert this commit. The previous behavior was the
`disabled` branch which called `legacy_pricing_engine`. That code
is preserved in commit abc1234 of branch `archive/ff-legacy-pricing`
for 90 days post-merge. After 90 days, recover from git history
via `git log --all --source -- legacy_pricing_engine.py`.
Step 8: Gradual Rollback Plan
Even after removal, the agent leaves a 30-60-90 day safety net:
| Day | Action |
|---|---|
| 0 | Merge removal PR. Tag the commit flag-removed/ff_xxx. Open a 30-day calendar reminder for the owner. |
| 7 | Verify error rate, latency, conversion metric vs baseline. If regression, the agent generates a re-introduction PR that restores the flag and pins it to the previous default. |
| 30 | Delete the flag definition in the service (was archived at PR merge). Drop the archive branch if no rollback called. |
| 60 | Review the removed-flags log; mark "permanent" in MEMORY. |
| 90 | Drop the flag from runbooks, dashboards, alerts. |
For R3 / R4 retentions, extend the windows: 14 / 60 / 120 / 180.
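A small sketch that turns a merge date into those checkpoints, switching to the extended 14/60/120/180 windows for R3/R4 retentions; nothing here is provider-specific:

```python
from datetime import date, timedelta

STANDARD_WINDOWS = {"verify_metrics": 7, "service_delete": 30, "memory_mark": 60, "docs_sweep": 90}
EXTENDED_WINDOWS = {"verify_metrics": 14, "service_delete": 60, "memory_mark": 120, "docs_sweep": 180}

def follow_up_dates(merge_date: date, high_risk: bool = False) -> dict[str, date]:
    windows = EXTENDED_WINDOWS if high_risk else STANDARD_WINDOWS
    return {step: merge_date + timedelta(days=days) for step, days in windows.items()}

print(follow_up_dates(date(2025, 5, 9)))
```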
Step 9: Service Provider Specifics
Each provider has its own retirement workflow. The agent uses the right API and resource hierarchy for each, and accounts for each provider's billing model.
LaunchDarkly:
- API: `DELETE /api/v2/flags/{projKey}/{flagKey}` (archives by default; pass `?archived=true` to unarchive)
- Use flag tags liberally — `temporary`, `experiment`, `kill-switch` make Step 2 automatic going forward
- LD's Code references integration (via GitHub/GitLab) is the single highest-leverage tool — install it before doing the audit
- LD bills per MAU evaluated, not per flag — removing a low-traffic flag saves nothing on the bill; removing a high-traffic experiment that's still 50/50 on a logged-in segment saves the most
- Use Workflows to schedule the staged rollback (e.g. archive after 30 days)
Unleash:
- Open-source, often self-hosted; cleanup recovers DB rows but not license cost
- Built-in stale flag UI under Project → Reports
- API: `DELETE /api/admin/features/{name}` (archives), then `DELETE /api/admin/archive/{name}` (hard delete)
- Strategies are independent objects — verify no other flag references a strategy before deleting it
- Unleash 5+ supports Dependencies between flags — break dependencies before deletion
Flagsmith:
- Per-environment overrides are common — verify all environments have collapsed to a single value before removal
- API: `DELETE /api/v1/projects/{id}/features/{id}/`
- Segments are project-scoped; orphaned segments accumulate fast — sweep them in the same audit
- Self-hosted Flagsmith persists evaluation logs only if the influxdb integration is enabled — without it, R3 (last_evaluation) is unreliable
GrowthBook:
- Experiment-first model; many flags are tied to a running experiment doc
- Closing the experiment doesn't delete the flag — explicitly archive after analysis
- API: `DELETE /api/v1/features/{id}` (requires admin role)
- The code-references scan in the GrowthBook proxy is opt-in; turn it on before the audit
Split.io:
- "Splits" are the flag; "Treatments" are the variants
- Killing a split via UI sets it to a single treatment but does not remove from code — agent must still produce the code-removal PR
- API: `DELETE /internal/api/v2/splits/ws/{wsId}/{splitName}` (workspaces matter; it is easy to delete from the wrong one)
- Split's dynamic configurations are JSON-typed — removal recipes for these inline the JSON value, not a boolean
Custom / DB-backed flags:
- Hardest case — no audit UI, no API. The agent generates SQL to find dead flags: `SELECT key, MAX(updated_at) FROM feature_toggles WHERE key NOT IN ($referenced_keys_from_grep) OR updated_at < NOW() - INTERVAL '180 days' GROUP BY key;`
- Removal is a database migration plus a code change in the same PR
- Add a flag definition test that fails CI when a code reference exists without a DB row, and vice versa (see the sketch below)
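A hedged pytest-style sketch of that consistency check, assuming a Django-like callsite pattern and a `feature_toggles` table; the grep pattern, table name, and the `db_connection` fixture are assumptions about a home-grown setup, not a known library API:

```python
import re
import subprocess

FLAG_CALL = r"""FeatureToggle\.objects\.get\(key=["']([a-z0-9_.-]+)["']"""

def keys_in_code():
    # git grep over the working tree; -h drops file names so we can regex the matches directly.
    out = subprocess.run(["git", "grep", "-h", "-E", FLAG_CALL],
                         capture_output=True, text=True).stdout
    return set(re.findall(FLAG_CALL, out))

def keys_in_db(connection):
    with connection.cursor() as cur:
        cur.execute("SELECT key FROM feature_toggles")
        return {row[0] for row in cur.fetchall()}

def test_flags_are_consistent(db_connection):   # db_connection: assumed pytest fixture
    code, db = keys_in_code(), keys_in_db(db_connection)
    assert not code - db, f"code references flags missing from the DB: {code - db}"
    assert not db - code, f"DB rows with no code reference (dead config): {db - code}"
```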
Step 10: Four-Week Cleanup Playbook
Cleanup as a one-off audit dies. Cleanup as a recurring sprint sticks. The agent emits a four-week schedule:
WEEK 1 — INVENTORY & BASELINE
Mon Pull flag list from service API → CSV
Tue git grep code references → join to CSV
Wed Pull last 30d evaluation telemetry → join to CSV
Thu Classify by type, grade removal risk (R0-R4), tag owners, generate per-owner queue
Fri Publish dashboard: total flags, removable count, debt $$ estimate
WEEK 2 — TRIVIAL & LOW (R0/R1)
Mon Bulk-PR all R0 flags (test-only, dev-only)
Tue Open R1 PRs in batches of 10 per team
Wed Merge R0 batch (single reviewer), monitor CI
Thu Merge R1 batch on standard review cadence
Fri Service-side: archive the merged flags, refresh dashboard
WEEK 3 — MEDIUM (R2)
Mon Per-flag PRs for R2; pair with feature owner
Tue Schedule R2 deploys to low-traffic windows
Wed Deploy first half, watch error budget for 24h
Thu Deploy second half if green; revert plan exercised on any regression
Fri Service-side cleanup; mark watch-period start date
WEEK 4 — HIGH (R3) MIGRATION + REPORT
Mon R3 migration PRs land (NOT removals — replacements)
Tue Replacement code burns in for the 7-day rule
Wed Generate the 30/60/90 calendar reminders
Thu Update runbooks, ADRs, onboarding docs to match new state
Fri Post-mortem write-up: flags removed, $$ saved, debt remaining,
and a date for the next quarter's audit
Step 11: Prevention — Stop The Debt At The Source
Cleanup without prevention runs the same audit again next quarter. The agent emits a prevention pack:
- Mandatory expiry date at flag creation. Service-level enforcement: LaunchDarkly Workflows, Unleash flag templates, a custom DB column `expires_at NOT NULL`.
- PR template for new flags: type, owner, expected sunset, kill-switch criteria, removal ticket linked.
- CI lint: a check that fails when a flag is created without an `expires_at`, or when the introducing PR has no linked removal ticket (a sketch follows this list).
- Quarterly audit cron (the agent ships the script): runs the staleness rules and opens a tracking ticket per owner.
- Flag SLO: max 50 active flags per service (or N per 1k LOC). Breach blocks new flag creation in CI.
- Onboarding doc: every new engineer reads the "flags are debt by default" page before getting service-side write access.
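A sketch of that CI lint, assuming flags are declared in an in-repo `flags.yaml` with per-flag metadata; the file layout and field names are assumptions, and with a vendor-hosted service the same check would run against the Step 1 export instead:

```python
import sys
from datetime import date
import yaml  # pyyaml

REQUIRED_FIELDS = {"owner", "type", "expires_at", "removal_ticket"}

def lint(path="flags.yaml") -> list[str]:
    errors = []
    flags = yaml.safe_load(open(path)) or {}
    for key, meta in flags.items():
        missing = REQUIRED_FIELDS - set(meta or {})
        if missing:
            errors.append(f"{key}: missing {sorted(missing)}")
        elif date.fromisoformat(str(meta["expires_at"])) < date.today():
            errors.append(f"{key}: expired on {meta['expires_at']}; remove it or extend with sign-off")
    return errors

if __name__ == "__main__":
    problems = lint()
    for p in problems:
        print(f"flag-lint: {p}")
    sys.exit(1 if problems else 0)
```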
Worked Examples
Example 1: 1,200-Flag LaunchDarkly Account Cleanup
Inventory results:
Total flags 1,212
- Active 743
- Archived (still in code) 204
- Code-only orphans 265
By type:
- Release toggle 612 (50%)
- Experiment 188 (15%)
- Permission/entitlement 121 (10%)
- Ops/kill 94 (8%)
- Unclassified 197 (16%)
Staleness (≥2 rules tripped): 687 candidates (57%)
By risk:
- R0 trivial 142
- R1 low 312
- R2 medium 178
- R3 high 49
- R4 critical 6
Plan:
- Week 2: 142 R0 flags removed in 8 batched PRs. Engineering hours: ~12.
- Week 3: 312 R1 flags removed across 14 teams. Per-team queue, average 22 flags. Engineering hours: ~120 (org-wide, not per team).
- Week 4: 178 R2 staged across two release windows. Owner pairing required. Engineering hours: ~80.
- Quarter 2: 49 R3 migrations begin (not full removals).
- 6 R4 documented and parked.
Estimated savings: 38% reduction in flag count, ~$22k/yr LD cost reduction (based on MAU savings from the 89 high-traffic experiments retired), and ~3,400 LOC removed.
Example 2: Single Flag Removal — new_checkout_v2
Audit:
Key: new_checkout_v2
Service: LaunchDarkly (project: web-app)
Created: 2025-01-12 (110 days ago)
Last modified: 2025-02-14 (rules frozen)
Rollout: 100% production for 84 days
Variants: on / off
Code refs: 3 files, 7 callsites
- src/checkout/route.tsx (4)
- src/checkout/__tests__/route.test.tsx (2)
- src/analytics/checkout.ts (1)
Telemetry: 184k evals/day, all → "on"
Owner (CODEOWNERS): @payments-team
Type: Release toggle
Risk: R1 (low)
Generated PR:
Title: chore(flags): remove new_checkout_v2 (100% on since 2025-02-14)
Summary
Flag has served `on` to 100% of production for 84 days with zero
toggle activity. Removing both the flag check and the dead v1
checkout component.
Changes
- src/checkout/route.tsx : -28 +4
- src/checkout/CheckoutV1.tsx : deleted (-512)
- src/checkout/CheckoutV1.css : deleted (-180)
- src/checkout/__tests__/... : -1 file, -94 lines
- src/analytics/checkout.ts : -6 +1
Rollback Plan
Revert this commit. v1 component preserved on branch
archive/ff-new-checkout-v2 for 90 days. Re-enable flag in LD
via the saved config in the linked ticket.
Service-side follow-up
- Archive flag in LD on merge (Workflow scheduled)
- Hard-delete flag 2025-08-01 (90 days post-merge)
Tickets
PROJ-4421 (close on merge)
Example 3: Custom DB Flag Audit
A team running a home-grown feature_toggles table:
-- Step 1: code-referenced keys (output of grep)
WITH code_refs(key) AS (VALUES
('multi_currency'), ('beta_dashboard'), ('legacy_pricing'),
('async_export'), ('new_search_v2')
)
SELECT ft.key,
ft.enabled,
ft.updated_at,
(SELECT 1 FROM code_refs c WHERE c.key = ft.key) AS in_code
FROM feature_toggles ft
ORDER BY ft.updated_at;
The agent emits the migration plus the code PR in one bundle:
PR-1 (DB migration M0142_drop_dead_toggles.sql)
DELETE FROM feature_toggles
WHERE key IN ('legacy_pricing', 'beta_dashboard', 'old_v1');
PR-2 (code)
- drop legacy_price_engine.py
- drop beta_dashboard route
- prune calls in 4 files
PR-2 merges first; PR-1 deploys in the next migration window.
Output
The agent produces:
- Inventory CSV — every flag with service, code-refs, telemetry, owner, type, risk, recommendation
- Per-owner ticket queue — Jira/Linear-ready with one ticket per flag, grouped by owner
- Removal PRs — actual diffs, batched by R-grade
- Rollback plan — per-PR rollback section + 30/60/90 calendar reminders
- Cleanup dashboard — flag count, removable count, $/yr savings estimate, weekly burndown
- Provider-specific scripts — API delete commands, archive commands, segment cleanup
- Prevention pack — PR template, CI lint, expiry-required policy, onboarding doc
- Four-week sprint plan — daily checklist, owners, definition of done
Common Scenarios
"We just acquired a company with 400 flags in a tool we don't use"
The agent maps each flag to one of: keep-and-migrate, remove-with-default, remove-as-dead-code. Migration target is your incumbent flag service. Output is two PR streams: code PRs (own repo) and import scripts (for the incumbent service).
"An R2 removal caused a P1 incident"
The agent generates the immediate-revert PR and a postmortem template. Then it adds the missing detection: which staleness rule should have caught this, and patches the ruleset (e.g. add "no diurnal pattern in evaluations" as R17).
"How much is flag debt actually costing?"
The agent estimates: vendor cost (MAU × $rate × removable_traffic_share), engineering time tax (avg 6 min per stale flag in PR review × refs/quarter), and incident risk (count of incidents in the last 12 months that mention a flag). One-page CFO-ready memo.
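A back-of-envelope sketch of that model; every number in the example call is an illustrative input, not a benchmark:

```python
def flag_debt_cost(mau, rate_per_mau_yr, removable_traffic_share,
                   stale_flags, review_touches_per_flag_qtr, eng_cost_per_min=2.0,
                   flag_incidents_12mo=0, avg_incident_cost=25_000):
    vendor = mau * rate_per_mau_yr * removable_traffic_share
    # 6 minutes of review tax per stale-flag touch, four quarters per year
    eng_tax = stale_flags * review_touches_per_flag_qtr * 6 * eng_cost_per_min * 4
    incident_risk = flag_incidents_12mo * avg_incident_cost
    return {"vendor_$yr": vendor, "eng_tax_$yr": eng_tax,
            "incident_risk_$yr": incident_risk,
            "total_$yr": vendor + eng_tax + incident_risk}

print(flag_debt_cost(mau=500_000, rate_per_mau_yr=0.12, removable_traffic_share=0.3,
                     stale_flags=687, review_touches_per_flag_qtr=2, flag_incidents_12mo=1))
```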
"Our team owns 80 flags and the rest of the org owns 1,200 — should we wait?"
No. The agent ships the team's queue independently. Cross-team coordination kills cleanups. Each team should own its flag retirement budget.
Tips for Best Results
- Provide both the service export and a code-references grep — single-source audits miss orphans
- Share 30+ days of evaluation telemetry, not a snapshot — variance reveals diurnal experiments
- Include CODEOWNERS or an equivalent ownership map — owner-less queues stall
- Specify which environments matter (prod only, or prod+staging) — staging-only flags often look stale but aren't
- State your incumbent flag service if doing a migration — it changes the migration target, not just the cleanup logic
- Mention regulatory constraints (PCI, HIPAA, GDPR) before starting — kill switches in those domains move from R4 to "do not touch"
When NOT To Use
- Brand-new product with <30 flags — the audit overhead is larger than the debt; instead apply the prevention pack at flag #1.
- Active migration between flag vendors — finish the migration first; mid-migration audits produce false positives because both systems are partially live.
- Codebase under active rewrite — the flags are being removed by the rewrite anyway; wait until the rewrite ships, then audit what survived.
- Compliance-driven flags only (banking kill switches, regulatory toggles) — these are R4 by default; they require legal/compliance review, not engineering cleanup.
- You don't have evaluation telemetry — without last-evaluation data, the staleness rules collapse to "old" which produces too many false positives. Wire up telemetry first, audit second.