Feature Flag Cleanup
Audit and retire stale feature flags across a codebase and a flag service (LaunchDarkly, Unleash, Flagsmith, GrowthBook, Split, custom). Produces a ranked removal plan, owner-tagged tickets, removal PRs grouped by risk, and a four-week cleanup sprint. Acts as a platform engineer who has decommissioned thousands of flags without breaking production.
Usage
Invoke this skill when feature flag debt is slowing engineering down: builds carry stale toggles, code paths exist for experiments that ended a year ago, the LaunchDarkly bill is climbing, or a refactor is blocked by flags that are never read.
Basic invocation:
- "Audit our LaunchDarkly account and our monorepo for stale flags"
- "We have 1,200 flags in Unleash — find the dead ones"
- "Write a removal PR for the `new-checkout-v2` flag"
With context:
- "Here's our LD export and the code grep — order removals by risk"
- "We're moving off Split.io to GrowthBook — what flags die in the migration?"
- "Audit only the flags owned by my team (CODEOWNERS: payments)"
The agent produces a stale-flag inventory, a four-week cleanup schedule, removal PRs in safety order, and a per-owner ticket list.
How It Works
Step 1: Inventory The Flag Estate
The agent first builds a complete inventory across four surfaces and joins them:
| Surface | What It Holds | How To Pull |
|---|---|---|
| Flag service | Targeting rules, rollout %, last-modified date, evaluation count | LaunchDarkly REST API /api/v2/flags, Unleash /api/admin/features, Flagsmith /api/v1/features/, GrowthBook /api/v1/features, Split /internal/api/v2/splits |
| Codebase | Flag references in source, configs, tests | git grep, AST parse, language-specific clients (useFeatureFlag, client.boolVariation, unleash.isEnabled) |
| Telemetry | Production evaluation counts per flag per day | Datadog, Honeycomb, vendor evaluation logs, OpenTelemetry traces |
| Ticket system | Originating ticket, intended sunset date | Jira flag: label, Linear cycle search |
The join key is the flag's key (string id). Any flag that exists in only one surface is suspect — code references with no service definition are dead code; service definitions with no code references are dead config.
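A minimal sketch of that join, assuming the service export has already been saved as a CSV with a `key` column and code references were collected with `git grep -n`; the file names, column names, and key-extraction regex are illustrative, not a fixed interface:

```python
import csv
import re
from collections import defaultdict

def load_service_flags(path="flags_export.csv"):
    # Service export with at least a "key" column (format is an assumption).
    with open(path, newline="") as f:
        return {row["key"]: row for row in csv.DictReader(f)}

def load_code_refs(path="code_refs.txt"):
    # Each grep line looks like "src/checkout/route.tsx:42:useFeatureFlag('new_checkout_v2')".
    refs = defaultdict(list)
    key_pattern = re.compile(r"""["']([a-z0-9_.-]+)["']""")  # crude key extractor
    with open(path) as f:
        for line in f:
            file_path, _, rest = line.partition(":")
            for key in key_pattern.findall(rest):
                refs[key].append(file_path)
    return refs

def join(service_flags, code_refs):
    all_keys = set(service_flags) | set(code_refs)
    for key in sorted(all_keys):
        yield {
            "key": key,
            "in_service": key in service_flags,
            "code_ref_count": len(code_refs.get(key, [])),
            # One-surface flags are the first suspects (dead code or dead config).
            "suspect": (key in service_flags) != (key in code_refs),
        }

if __name__ == "__main__":
    rows = list(join(load_service_flags(), load_code_refs()))
    print(f"{sum(r['suspect'] for r in rows)} one-surface flags out of {len(rows)}")
```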
Step 2: Classify Every Flag
Not all flags retire the same way. The agent assigns one of five types before deciding what to do:
| Type | Lifetime Expectation | Cleanup Default |
|---|---|---|
| Release toggle | Days to weeks during a rollout | Remove once 100% on for 30 days |
| Experiment | One experiment cycle (2-8 weeks) | Remove once analysis is shipped, winner picked |
| Permission / entitlement | Permanent | Migrate to authz system, then remove |
| Ops / kill switch | Permanent | Keep but document; review yearly |
| Config / parameter | Permanent | Migrate to config service if static, remove if dynamic decision is gone |
A flag's type is rarely tagged at creation — the agent infers from name patterns (enable_, kill_, experiment_, tier_), targeting rules (percentage rollout vs user-list vs segment), and evaluation patterns (steady-state vs spiking on deploy days).
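A hedged sketch of that inference using name-pattern heuristics alone (targeting rules and evaluation patterns would refine the guess); the prefixes and the `FlagRecord` shape are assumptions, not a vendor schema:

```python
import re
from dataclasses import dataclass

@dataclass
class FlagRecord:
    key: str
    rollout_percent: float | None = None   # None when targeting is user-list based
    variant_count: int = 2

NAME_HINTS = [
    (re.compile(r"^(kill_|ops_)|_kill_switch$"), "ops/kill-switch"),
    (re.compile(r"^(experiment_|exp_|ab_)"), "experiment"),
    (re.compile(r"^(tier_|plan_|entitle_)"), "permission/entitlement"),
    (re.compile(r"^(config_|param_)"), "config/parameter"),
    (re.compile(r"^(enable_|release_)"), "release toggle"),
]

def infer_type(flag: FlagRecord) -> str:
    for pattern, flag_type in NAME_HINTS:
        if pattern.search(flag.key):
            return flag_type
    # Fallback: percentage rollouts with many variants look like experiments,
    # plain percentage rollouts look like release toggles.
    if flag.rollout_percent is not None and flag.variant_count > 2:
        return "experiment"
    if flag.rollout_percent is not None:
        return "release toggle"
    return "unclassified"

print(infer_type(FlagRecord("kill_payments_api")))                              # ops/kill-switch
print(infer_type(FlagRecord("enable_new_checkout", rollout_percent=100.0)))     # release toggle
```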
Step 3: Detect Staleness
The agent applies a layered staleness ruleset. A flag must trip at least two rules to be marked stale (single-rule failures produce a "watch list", not a removal).
Time rules:
R1. created_at older than 90 days
R2. last_modified older than 60 days (rules untouched)
R3. last_evaluation older than 30 days (no traffic)
R4. originating Jira ticket closed >180 days ago
State rules:
R5. served value is 100% (or 0%) for 30+ consecutive days
R6. all targeting rules collapse to a single variant (no branching)
R7. zero overrides, zero environment-specific differences
R8. variant served matches default for environment
Code rules:
R9. flag key not present in main branch (only in deleted branches / archived dirs)
R10. all code references are inside a single if branch with no else
R11. all references are tests (no production callsite)
R12. references exist only in disabled feature folders
Service rules:
R13. flag is archived in service but still referenced in code
R14. flag is in service but never linked to code (orphan)
R15. flag references a deleted segment / user list
R16. flag's prerequisites form a cycle or reference deleted flags
A removal candidate scores count(rules_tripped) plus a risk modifier (see Step 4). Anything ≥ 4 is a strong removal; 2-3 is staged with extra verification.
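A minimal scoring sketch over a joined inventory record; the field names are assumptions standing in for the Step 1 columns, and only a few of the sixteen rules are shown:

```python
from datetime import datetime, timedelta, timezone

NOW = datetime.now(timezone.utc)

def days_ago(ts):
    return (NOW - ts).days if ts else None

def staleness_rules(flag: dict) -> list[str]:
    tripped = []
    if (d := days_ago(flag.get("created_at"))) and d > 90:
        tripped.append("R1")
    if (d := days_ago(flag.get("last_modified"))) and d > 60:
        tripped.append("R2")
    if (d := days_ago(flag.get("last_evaluation"))) and d > 30:
        tripped.append("R3")
    if flag.get("serving_single_variant_days", 0) >= 30:
        tripped.append("R5")
    if flag.get("in_service") and flag.get("code_ref_count", 0) == 0:
        tripped.append("R14")
    # ...remaining rules follow the same shape
    return tripped

def verdict(flag: dict) -> str:
    score = len(staleness_rules(flag)) + flag.get("risk_modifier", 0)
    if score >= 4:
        return "strong removal"
    if score >= 2:
        return "staged removal (extra verification)"
    if score == 1:
        return "watch list"
    return "keep"

example = {
    "created_at": NOW - timedelta(days=200),
    "last_modified": NOW - timedelta(days=180),
    "last_evaluation": NOW - timedelta(days=45),
    "serving_single_variant_days": 60,
    "in_service": True,
    "code_ref_count": 3,
}
print(staleness_rules(example), "->", verdict(example))  # ['R1', 'R2', 'R3', 'R5'] -> strong removal
```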
Step 4: Risk Classification (Severity Matrix)
Each flag gets a removal-risk grade. Risk is independent of staleness — a stale flag can still be high-risk to remove if the wrong default ships.
| Grade | Criteria | Removal Approach |
|---|---|---|
| R0 — Trivial | Test-only references, dev-only flag, zero prod traffic in 90d | Single PR, batch with siblings |
| R1 — Low | Release toggle 100% on 60+ days, simple boolean, single owner | One PR per flag, standard review |
| R2 — Medium | Touches a paid feature, multiple owners, has variants beyond on/off, evaluated >1k/day | One PR per flag, two reviewers, deploy in own release window |
| R3 — High | Permission/entitlement, billing-adjacent, payment path, auth path | Migrate before remove. Owner sign-off required, integration tests, dark-launch the removal |
| R4 — Critical | Kill switch, regulatory, data residency, encryption toggle | Keep. Document and review yearly. Removal blocked unless replacement is in place. |
The agent never auto-generates a removal PR for R3 or R4 — those produce migration tickets instead.
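A hedged sketch of the grading as ordered checks, most restrictive first; the attribute names are assumptions about the inventory record, not vendor fields:

```python
def risk_grade(flag: dict) -> str:
    # Most restrictive grade wins; a flag only falls through to R1 if nothing above applies.
    if flag.get("is_kill_switch") or flag.get("is_regulatory"):
        return "R4"
    if flag.get("type") == "permission/entitlement" or flag.get("touches_payment_or_auth"):
        return "R3"
    if (flag.get("touches_paid_feature") or flag.get("owner_count", 1) > 1
            or flag.get("variant_count", 2) > 2 or flag.get("evals_per_day", 0) > 1_000):
        return "R2"
    if flag.get("test_only_refs") or flag.get("prod_evals_last_90d", 0) == 0:
        return "R0"
    return "R1"

def auto_pr_allowed(grade: str) -> bool:
    # R3 and R4 never get an auto-generated removal PR; they get migration tickets instead.
    return grade in ("R0", "R1", "R2")
```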
Step 5: Owner Tagging
Removals stall when nobody is on the hook. The agent assigns each flag exactly one owner using a fallback chain:
1. Service-side `tags` or `maintainer` field (LaunchDarkly tags, Unleash project)
2. Originating Jira/Linear ticket assignee
3. CODEOWNERS for the file containing the most references
4. `git log --follow` first author of the introducing commit
5. `git blame` on the line of the most recent flag check
6. Team channel mapping (#payments → @payments-leads)
The agent emits a per-owner queue so each engineer sees only their flags. Bulk emails to "engineering@" produce zero cleanup; per-owner tickets with a 2-week SLA produce 70%+ completion.
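A sketch of the fallback chain; each resolver returns an owner or `None` and the first hit wins. The record fields, the CODEOWNERS mapping, and the `git log -S` pickaxe search are assumptions about how the upstream data was gathered, and only part of the six-step chain is shown:

```python
import subprocess

def from_service_tags(flag):
    return flag.get("maintainer")            # 1. service-side maintainer/tag field

def from_ticket(flag):
    return flag.get("ticket_assignee")       # 2. originating Jira/Linear assignee

def from_codeowners(flag, codeowners_rules):
    # 3. CODEOWNERS entry for the file with the most references (parsing assumed done upstream).
    refs_by_file = flag.get("refs_by_file", {})
    if not refs_by_file:
        return None
    top_file = max(refs_by_file, key=refs_by_file.get)
    return codeowners_rules.get(top_file)

def from_git_first_author(flag):
    # 4. author of the oldest commit that touched the key (git log is newest-first).
    authors = subprocess.run(
        ["git", "log", "--format=%ae", "-S", flag["key"]],
        capture_output=True, text=True).stdout.splitlines()
    return authors[-1] if authors else None

def resolve_owner(flag, codeowners_rules, team_channel_map):
    for candidate in (
        from_service_tags(flag),
        from_ticket(flag),
        from_codeowners(flag, codeowners_rules),
        from_git_first_author(flag),
    ):
        if candidate:
            # 6. map a team handle to its channel/leads when such a mapping exists
            return team_channel_map.get(candidate, candidate)
    return "unowned-triage"   # last resort: a shared triage queue, never a broadcast email
```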
Step 6: Generate Removal Recipes
For each removable flag the agent emits a deterministic recipe by language and client. Every recipe follows the same pattern:
- Replace the `if (flag) { A } else { B }` check with the winning branch
- Delete the dead branch and any helpers it called
- Remove the flag-client import if it was the last reference in the file
- Delete the flag's service definition (or archive it)
- Drop tests for the dead branch; re-baseline snapshot tests
- Remove documentation references (runbooks, ADRs, feature lists)
Example — TypeScript / LaunchDarkly:
// before
import { useFlags } from 'launchdarkly-react-client-sdk';
const { newCheckout } = useFlags();
return newCheckout ? <CheckoutV2 /> : <CheckoutV1 />;
// after (newCheckout was 100% on for 60 days)
return <CheckoutV2 />;
// then delete CheckoutV1.tsx, its tests, its CSS, and remove from routing
Example — Go / Unleash:
// before
if unleash.IsEnabled("ff_async_export") {
go runAsyncExport(ctx, req)
} else {
runSyncExport(ctx, req)
}
// after
go runAsyncExport(ctx, req)
// then delete runSyncExport, delete its mock, prune the unleash strategy
Example — Python / Flagsmith:
# before
if flagsmith.has_feature("legacy_pricing", default=False):
price = legacy_price_engine(cart)
else:
price = price_engine_v2(cart)
# after
price = price_engine_v2(cart)
# then drop legacy_price_engine module and its 14 fixture files
Example — Custom DB-backed flag:
# before
if FeatureToggle.objects.get(key="multi_currency").enabled:
currency = detect_user_currency(user)
else:
currency = "USD"
# after
currency = detect_user_currency(user)
# then drop the FeatureToggle row, the migration to delete is M0142
Step 7: Pull Request Strategy
The agent organizes PRs to balance reviewer load and blast radius:
- Batch R0 trivial — one PR per service or per directory, up to 20 flags. Single reviewer.
- One PR per R1 — small, mechanical, easy revert. Title format: `chore(flags): remove ff_X (100% on since YYYY-MM-DD)`.
- One PR per R2 with deploy gate — merge during a low-traffic window, watch the error budget for 24h, document the rollback flag default.
- R3 splits into two PRs:
- PR-1: Land the replacement (authz rule, config value, kill switch in new system) without touching the flag.
- PR-2: Remove the flag once PR-1 has been in production for 7 days clean.
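The sketch below turns a list of graded removal candidates into that PR plan: R0 batched per directory (up to 20 flags), one PR per R1 and R2, a two-PR split for R3, and nothing for R4. The candidate dict shape is an assumption:

```python
from collections import defaultdict
from itertools import islice

BATCH_LIMIT = 20

def plan_prs(candidates):
    prs = []
    r0_by_dir = defaultdict(list)
    for flag in candidates:
        grade = flag["risk"]
        if grade == "R0":
            r0_by_dir[flag["top_dir"]].append(flag["key"])
        elif grade in ("R1", "R2"):
            prs.append({"title": f"chore(flags): remove {flag['key']} (100% on since {flag['since']})",
                        "flags": [flag["key"]],
                        "deploy_gate": grade == "R2"})      # R2 merges in its own release window
        elif grade == "R3":
            prs.append({"title": f"feat: land replacement for {flag['key']} (PR-1 of 2)",
                        "flags": []})
            prs.append({"title": f"chore(flags): remove {flag['key']} (PR-2 of 2)",
                        "flags": [flag["key"]],
                        "blocked_on": "PR-1 in production for 7 clean days"})
        # R4: no PR at all; documented and parked
    for directory, keys in r0_by_dir.items():
        it = iter(keys)
        while batch := list(islice(it, BATCH_LIMIT)):
            prs.append({"title": f"chore(flags): batch-remove {len(batch)} dead flags in {directory}",
                        "flags": batch})
    return prs
```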
Every PR includes a Rollback Plan section. The agent generates it from the flag definition:
## Rollback Plan
If incident: revert this commit. The previous behavior was the
`disabled` branch which called `legacy_pricing_engine`. That code
is preserved in commit abc1234 of branch `archive/ff-legacy-pricing`
for 90 days post-merge. After 90 days, recover from git history
via `git log --all --source -- legacy_pricing_engine.py`.
Step 8: Gradual Rollback Plan
Even after removal, the agent leaves a 30-60-90 day safety net:
| Day | Action |
|---|---|
| 0 | Merge removal PR. Tag the commit flag-removed/ff_xxx. Open a 30-day calendar reminder for the owner. |
| 7 | Verify error rate, latency, conversion metric vs baseline. If regression, the agent generates a re-introduction PR that restores the flag and pins it to the previous default. |
| 30 | Delete the flag definition in the service (was archived at PR merge). Drop the archive branch if no rollback called. |
| 60 | Review the removed-flags log; mark "permanent" in MEMORY. |
| 90 | Drop the flag from runbooks, dashboards, alerts. |
For R3 / R4 retentions, extend the windows: 14 / 60 / 120 / 180.
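A small sketch that turns a merge date into those checkpoints, switching to the extended 14/60/120/180 windows for R3/R4 retentions; nothing here is provider-specific:

```python
from datetime import date, timedelta

STANDARD_WINDOWS = {"verify_metrics": 7, "service_delete": 30, "memory_mark": 60, "docs_sweep": 90}
EXTENDED_WINDOWS = {"verify_metrics": 14, "service_delete": 60, "memory_mark": 120, "docs_sweep": 180}

def follow_up_dates(merge_date: date, high_risk: bool = False) -> dict[str, date]:
    windows = EXTENDED_WINDOWS if high_risk else STANDARD_WINDOWS
    return {step: merge_date + timedelta(days=days) for step, days in windows.items()}

print(follow_up_dates(date(2025, 5, 9)))
```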
Step 9: Service Provider Specifics
Each provider has its own retirement workflow. The agent uses the right API and resource hierarchy for each, and accounts for each provider's billing model.
LaunchDarkly:
- API: `DELETE /api/v2/flags/{projKey}/{flagKey}` (archives by default; pass `?archived=true` to unarchive)
- Use flag tags liberally — `temporary`, `experiment`, `kill-switch` make Step 2 automatic going forward
- LD's Code references integration (via GitHub/GitLab) is the single highest-leverage tool — install it before doing the audit
- LD bills per MAU evaluated, not per flag — removing a low-traffic flag saves nothing on the bill; removing a high-traffic experiment that's still 50/50 on a logged-in segment saves the most
- Use Workflows to schedule the staged rollback (e.g. archive after 30 days)
Unleash:
- Open-source, often self-hosted; cleanup recovers DB rows but not license cost
- Built-in stale flag UI under Project → Reports
- API: `DELETE /api/admin/features/{name}` (archives), then `DELETE /api/admin/archive/{name}` (hard delete)
- Strategies are independent objects — verify no other flag references a strategy before deleting it
- Unleash 5+ supports Dependencies between flags — break dependencies before deletion
Flagsmith:
- Per-environment overrides are common — verify all environments have collapsed to a single value before removal
- API: `DELETE /api/v1/projects/{id}/features/{id}/`
- Segments are project-scoped; orphaned segments accumulate fast — sweep them in the same audit
- Self-hosted Flagsmith persists evaluation logs only if the influxdb integration is enabled — without it, R3 (last_evaluation) is unreliable
GrowthBook:
- Experiment-first model; many flags are tied to a running experiment doc
- Closing the experiment doesn't delete the flag — explicitly archive after analysis
- API: `DELETE /api/v1/features/{id}` (requires admin role)
- The code-references scan in the GrowthBook proxy is opt-in; turn it on before the audit
Split.io:
- "Splits" are the flag; "Treatments" are the variants
- Killing a split via UI sets it to a single treatment but does not remove from code — agent must still produce the code-removal PR
- API: `DELETE /internal/api/v2/splits/ws/{wsId}/{splitName}` (workspaces matter; it is easy to delete from the wrong one)
- Split's dynamic configurations are JSON-typed — removal recipes for these inline the JSON value, not a boolean
Custom / DB-backed flags:
- Hardest case — no audit UI, no API. The agent generates SQL to find dead flags: `SELECT key, MAX(updated_at) FROM feature_toggles WHERE key NOT IN ($referenced_keys_from_grep) OR updated_at < NOW() - INTERVAL '180 days' GROUP BY key;`
- Removal is a database migration plus a code change in the same PR
- Add a flag definition test that fails CI when a code reference exists without a DB row, and vice versa (see the sketch below)
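A hedged pytest-style sketch of that consistency check, assuming a Django-like callsite pattern and a `feature_toggles` table; the grep pattern, table name, and the `db_connection` fixture are assumptions about a home-grown setup, not a known library API:

```python
import re
import subprocess

FLAG_CALL = r"""FeatureToggle\.objects\.get\(key=["']([a-z0-9_.-]+)["']"""

def keys_in_code():
    # git grep over the working tree; -h drops file names so we can regex the matches directly.
    out = subprocess.run(["git", "grep", "-h", "-E", FLAG_CALL],
                         capture_output=True, text=True).stdout
    return set(re.findall(FLAG_CALL, out))

def keys_in_db(connection):
    with connection.cursor() as cur:
        cur.execute("SELECT key FROM feature_toggles")
        return {row[0] for row in cur.fetchall()}

def test_flags_are_consistent(db_connection):   # db_connection: assumed pytest fixture
    code, db = keys_in_code(), keys_in_db(db_connection)
    assert not code - db, f"code references flags missing from the DB: {code - db}"
    assert not db - code, f"DB rows with no code reference (dead config): {db - code}"
```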
Step 10: Four-Week Cleanup Playbook
Cleanup as a one-off audit dies. Cleanup as a recurring sprint sticks. The agent emits a four-week schedule:
WEEK 1 — INVENTORY & BASELINE
Mon Pull flag list from service API → CSV
Tue git grep code references → join to CSV
Wed Pull last 30d evaluation telemetry → join to CSV
Thu Classify by type, grade removal risk (R0-R4), tag owners, generate per-owner queue
Fri Publish dashboard: total flags, removable count, debt $$ estimate
WEEK 2 — TRIVIAL & LOW (R0/R1)
Mon Bulk-PR all R0 flags (test-only, dev-only)
Tue Open R1 PRs in batches of 10 per team
Wed Merge R0 batch (single reviewer), monitor CI
Thu Merge R1 batch on standard review cadence
Fri Service-side: archive the merged flags, refresh dashboard
WEEK 3 — MEDIUM (R2)
Mon Per-flag PRs for R2; pair with feature owner
Tue Schedule R2 deploys to low-traffic windows
Wed Deploy first half, watch error budget for 24h
Thu Deploy second half if green; revert plan exercised on any regression
Fri Service-side cleanup; mark watch-period start date
WEEK 4 — HIGH (R3) MIGRATION + REPORT
Mon R3 migration PRs land (NOT removals — replacements)
Tue Replacement code burns in for the 7-day rule
Wed Generate the 30/60/90 calendar reminders
Thu Update runbooks, ADRs, onboarding docs to match new state
Fri Post-mortem write-up: flags removed, $$ saved, debt remaining,
and a date for the next quarter's audit
Step 11: Prevention — Stop The Debt At The Source
Cleanup without prevention runs the same audit again next quarter. The agent emits a prevention pack:
- Mandatory expiry date at flag creation. Service-level enforcement: LaunchDarkly Workflows, Unleash flag templates, a custom DB column `expires_at NOT NULL`.
- PR template for new flags: type, owner, expected sunset, kill-switch criteria, removal ticket linked.
- CI lint: a check that fails when a flag is created without an `expires_at`, or when the introducing PR has no linked removal ticket (a sketch follows this list).
- Quarterly audit cron (the agent ships the script): runs the staleness rules and opens a tracking ticket per owner.
- Flag SLO: max 50 active flags per service (or N per 1k LOC). Breach blocks new flag creation in CI.
- Onboarding doc: every new engineer reads the "flags are debt by default" page before getting service-side write access.
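A sketch of that CI lint, assuming flags are declared in an in-repo `flags.yaml` with per-flag metadata; the file layout and field names are assumptions, and with a vendor-hosted service the same check would run against the Step 1 export instead:

```python
import sys
from datetime import date
import yaml  # pyyaml

REQUIRED_FIELDS = {"owner", "type", "expires_at", "removal_ticket"}

def lint(path="flags.yaml") -> list[str]:
    errors = []
    flags = yaml.safe_load(open(path)) or {}
    for key, meta in flags.items():
        missing = REQUIRED_FIELDS - set(meta or {})
        if missing:
            errors.append(f"{key}: missing {sorted(missing)}")
        elif date.fromisoformat(str(meta["expires_at"])) < date.today():
            errors.append(f"{key}: expired on {meta['expires_at']}; remove it or extend with sign-off")
    return errors

if __name__ == "__main__":
    problems = lint()
    for p in problems:
        print(f"flag-lint: {p}")
    sys.exit(1 if problems else 0)
```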
Worked Examples
Example 1: 1,200-Flag LaunchDarkly Account Cleanup
Inventory results:
Total flags 1,212
- Active 743
- Archived (still in code) 204
- Code-only orphans 265
By type:
- Release toggle 612 (50%)
- Experiment 188 (15%)
- Permission/entitlement 121 (10%)
- Ops/kill 94 (8%)
- Unclassified 197 (16%)
Staleness (≥2 rules tripped): 687 candidates (57%)
By risk:
- R0 trivial 142
- R1 low 312
- R2 medium 178
- R3 high 49
- R4 critical 6
Plan:
- Week 2: 142 R0 flags removed in 8 batched PRs. Engineering hours: ~12.
- Week 3: 312 R1 flags removed across 14 teams. Per-team queue, average 22 flags. Engineering hours: ~120 (org-wide, not per team).
- Week 4: 178 R2 staged across two release windows. Owner pairing required. Engineering hours: ~80.
- Quarter 2: 49 R3 migrations begin (not full removals).
- 6 R4 documented and parked.
Estimated savings: 38% reduction in flag count, ~$22k/yr LD cost reduction (based on MAU savings from the 89 high-traffic experiments retired), and ~3,400 LOC removed.
Example 2: Single Flag Removal — new_checkout_v2
Audit:
Key: new_checkout_v2
Service: LaunchDarkly (project: web-app)
Created: 2025-01-12 (110 days ago)
Last modified: 2025-02-14 (rules frozen)
Rollout: 100% production for 84 days
Variants: on / off
Code refs: 3 files, 7 callsites
- src/checkout/route.tsx (4)
- src/checkout/__tests__/route.test.tsx (2)
- src/analytics/checkout.ts (1)
Telemetry: 184k evals/day, all → "on"
Owner (CODEOWNERS): @payments-team
Type: Release toggle
Risk: R1 (low)
Generated PR:
Title: chore(flags): remove new_checkout_v2 (100% on since 2025-02-14)
Summary
Flag has served `on` to 100% of production for 84 days with zero
toggle activity. Removing both the flag check and the dead v1
checkout component.
Changes
- src/checkout/route.tsx : -28 +4
- src/checkout/CheckoutV1.tsx : deleted (-512)
- src/checkout/CheckoutV1.css : deleted (-180)
- src/checkout/__tests__/... : -1 file, -94 lines
- src/analytics/checkout.ts : -6 +1
Rollback Plan
Revert this commit. v1 component preserved on branch
archive/ff-new-checkout-v2 for 90 days. Re-enable flag in LD
via the saved config in the linked ticket.
Service-side follow-up
- Archive flag in LD on merge (Workflow scheduled)
- Hard-delete flag 2025-08-01 (90 days post-merge)
Tickets
PROJ-4421 (close on merge)
Example 3: Custom DB Flag Audit
A team running a home-grown feature_toggles table:
-- Step 1: code-referenced keys (output of grep)
WITH code_refs(key) AS (VALUES
('multi_currency'), ('beta_dashboard'), ('legacy_pricing'),
('async_export'), ('new_search_v2')
)
SELECT ft.key,
ft.enabled,
ft.updated_at,
(SELECT 1 FROM code_refs c WHERE c.key = ft.key) AS in_code
FROM feature_toggles ft
ORDER BY ft.updated_at;
The agent emits the migration plus the code PR in one bundle:
PR-1 (DB migration M0142_drop_dead_toggles.sql)
DELETE FROM feature_toggles
WHERE key IN ('legacy_pricing', 'beta_dashboard', 'old_v1');
PR-2 (code)
- drop legacy_price_engine.py
- drop beta_dashboard route
- prune calls in 4 files
PR-2 merges first; PR-1 deploys in the next migration window.
Output
The agent produces:
- Inventory CSV — every flag with service, code-refs, telemetry, owner, type, risk, recommendation
- Per-owner ticket queue — Jira/Linear-ready with one ticket per flag, grouped by owner
- Removal PRs — actual diffs, batched by R-grade
- Rollback plan — per-PR rollback section + 30/60/90 calendar reminders
- Cleanup dashboard — flag count, removable count, $/yr savings estimate, weekly burndown
- Provider-specific scripts — API delete commands, archive commands, segment cleanup
- Prevention pack — PR template, CI lint, expiry-required policy, onboarding doc
- Four-week sprint plan — daily checklist, owners, definition of done
Common Scenarios
"We just acquired a company with 400 flags in a tool we don't use"
The agent maps each flag to one of: keep-and-migrate, remove-with-default, remove-as-dead-code. Migration target is your incumbent flag service. Output is two PR streams: code PRs (own repo) and import scripts (for the incumbent service).
"An R2 removal caused a P1 incident"
The agent generates the immediate-revert PR and a postmortem template. Then it adds the missing detection: which staleness rule should have caught this, and patches the ruleset (e.g. add "no diurnal pattern in evaluations" as R17).
"How much is flag debt actually costing?"
The agent estimates: vendor cost (MAU × $rate × removable_traffic_share), engineering time tax (avg 6 min per stale flag in PR review × refs/quarter), and incident risk (count of incidents in the last 12 months that mention a flag). One-page CFO-ready memo.
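A back-of-envelope sketch of that model; every number in the example call is an illustrative input, not a benchmark:

```python
def flag_debt_cost(mau, rate_per_mau_yr, removable_traffic_share,
                   stale_flags, review_touches_per_flag_qtr, eng_cost_per_min=2.0,
                   flag_incidents_12mo=0, avg_incident_cost=25_000):
    vendor = mau * rate_per_mau_yr * removable_traffic_share
    # 6 minutes of review tax per stale-flag touch, four quarters per year
    eng_tax = stale_flags * review_touches_per_flag_qtr * 6 * eng_cost_per_min * 4
    incident_risk = flag_incidents_12mo * avg_incident_cost
    return {"vendor_$yr": vendor, "eng_tax_$yr": eng_tax,
            "incident_risk_$yr": incident_risk,
            "total_$yr": vendor + eng_tax + incident_risk}

print(flag_debt_cost(mau=500_000, rate_per_mau_yr=0.12, removable_traffic_share=0.3,
                     stale_flags=687, review_touches_per_flag_qtr=2, flag_incidents_12mo=1))
```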
"Our team owns 80 flags and the rest of the org owns 1,200 — should we wait?"
No. The agent ships the team's queue independently. Cross-team coordination kills cleanups. Each team should own its flag retirement budget.
Tips for Best Results
- Provide both the service export and a code-references grep — single-source audits miss orphans
- Share 30+ days of evaluation telemetry, not a snapshot — variance reveals diurnal experiments
- Include CODEOWNERS or an equivalent ownership map — owner-less queues stall
- Specify which environments matter (prod only, or prod+staging) — staging-only flags often look stale but aren't
- State your incumbent flag service if doing a migration — it changes the migration target, not just the cleanup logic
- Mention regulatory constraints (PCI, HIPAA, GDPR) before starting — kill switches in those domains move from R4 to "do not touch"
When NOT To Use
- Brand-new product with <30 flags — the audit overhead is larger than the debt; instead apply the prevention pack at flag #1.
- Active migration between flag vendors — finish the migration first; mid-migration audits produce false positives because both systems are partially live.
- Codebase under active rewrite — the flags are being removed by the rewrite anyway; wait until the rewrite ships, then audit what survived.
- Compliance-driven flags only (banking kill switches, regulatory toggles) — these are R4 by default; they require legal/compliance review, not engineering cleanup.
- You don't have evaluation telemetry — without last-evaluation data, the staleness rules collapse to "old" which produces too many false positives. Wire up telemetry first, audit second.