dbt Refactoring
Find ALL downstream dependencies before changing. Refactor in small steps. Verify output after each change.
Workflow
- Analyze Current Model
cat models/<path>/<model_name>.sql
Identify refactoring opportunities:
-
CTEs longer than 50 lines → extract to intermediate model
-
Logic repeated across models → extract to macro
-
Multiple joins in sequence → split into steps
-
Complex WHERE clauses → extract to staging filter
- Find All Downstream Dependencies
CRITICAL: Never refactor without knowing impact.
Get full dependency tree (model and all its children)
dbt ls --select model_name+ --output list
Find all models referencing this one
grep -r "ref('model_name')" models/ --include="*.sql"
Report to user: "Found X downstream models: [list]. These will be affected by changes."
- Check What Columns Downstream Models Use
BEFORE changing any columns, check what downstream models reference:
For each downstream model, check what columns it uses
cat models/<path>/<downstream_model>.sql | grep -E "model_name.\w+|alias.\w+"
If downstream models reference specific columns, you MUST ensure those columns remain available after refactoring.
- Plan Refactoring Strategy
Opportunity Strategy
Long CTE Extract to intermediate model
Repeated logic Create macro in macros/
Complex join Split into intermediate models
Multiple concerns Separate into focused models
- Execute Refactoring
Pattern: Extract CTE to Model
Before:
-- orders.sql (200 lines) with customer_metrics as ( -- 50 lines of complex logic ), order_enriched as ( select ... from orders join customer_metrics on ... ) select * from order_enriched
After:
-- customer_metrics.sql (new file) select customer_id, -- complex logic here from {{ ref('customers') }}
-- orders.sql (simplified) with order_enriched as ( select ... from {{ ref('raw_orders') }} orders join {{ ref('customer_metrics') }} cm on ... ) select * from order_enriched
Pattern: Extract to Macro
Before (repeated in multiple models):
case when amount < 0 then 'refund' when amount = 0 then 'zero' else 'positive' end as amount_category
After:
-- macros/categorize_amount.sql {% macro categorize_amount(column_name) %} case when {{ column_name }} < 0 then 'refund' when {{ column_name }} = 0 then 'zero' else 'positive' end {% endmacro %}
-- In models: {{ categorize_amount('amount') }} as amount_category
- Validate Changes
Compile to check syntax
dbt compile --select +model_name+
Build entire lineage
dbt build --select +model_name+
Check row counts (manual)
Before: Record expected counts
After: Verify counts match
- Verify Output Matches Original
CRITICAL: Refactoring should not change output.
Compare row counts before and after
dbt show --inline "select count(*) from {{ ref('model_name') }}"
Spot check key values
dbt show --select <model_name> --limit 10
- Update Downstream Models
If changing output columns:
-
Update all downstream refs
-
Update schema.yml documentation
-
Re-run downstream tests
Refactoring Checklist
-
All downstream dependencies identified
-
User informed of impact scope
-
One change at a time
-
Compile passes after each change
-
Build passes after each change
-
Output validated (row counts match)
-
Documentation updated
-
Tests still pass
Common Refactoring Triggers
Symptom Refactoring
Model > 200 lines Extract CTEs to models
Same logic in 3+ models Extract to macro
5+ joins in one model Create intermediate models
Hard to understand Add CTEs with clear names
Slow performance Split to allow parallelization
Anti-Patterns
-
Refactoring without checking downstream impact
-
Making multiple changes at once
-
Not validating output matches after refactoring
-
Extracting prematurely (wait for 3+ uses)
-
Breaking existing tests without updating them