Agent Regression Check
What this skill does
Use this skill to evaluate whether an agent change introduced regressions.
It compares before vs after behavior across a defined case suite, identifies what got worse, applies deterministic release gates, and returns a clear verdict:
- go
- conditional_go
- no_go
- rollback
This is an offline regression-check skill. It does not replace production monitoring or live experiments.
Best use cases
Use this skill for:
- prompt updates
- model switches
- tool integration changes
- retrieval changes
- orchestration changes
- hotfix verification
- pre-release quality checks
Typical user requests:
- "Did this update break anything?"
- "Compare before and after."
- "Is this safe to deploy?"
- "Check for regressions after the model change."
- "Should we roll this back?"
- "Which failures got worse?"
When not to use
Do not use this skill if:
- there is no comparable before/after evidence
- the case sets differ and cannot be matched reliably
- the suite is too small to support a release decision
- the task requires online experimentation
- the user wants brainstorming instead of deterministic assessment
If evidence quality is weak, say so explicitly and lower confidence.
Required inputs
Provide these whenever possible:
- change_summary
- before_cases
- after_cases
- risk_level (low, medium, high)
Example input:
{
"cases": [
{
"id": "case_001",
"task_type": "faq",
"critical": true,
"input": "How do I reset my password?",
"expected_behavior": "Provides reset steps and fallback.",
"before_output": "...",
"after_output": "...",
"before_tools": [],
"after_tools": []
}
]
}
Optional inputs
These improve evaluation quality:
- suite_manifest
- thresholds
- strict_mode
- output_path
Matching rules
Cases must match by stable id.
If IDs do not match:
- flag suite inconsistency
- reduce confidence
- avoid aligning by position unless explicitly requested
Deterministic scoring rubric
Each case is scored across four dimensions.
Correctness
2 = correct
1 = partially correct
0 = incorrect
Relevance
2 = fully relevant
1 = somewhat relevant
0 = off-target
Actionability
2 = actionable
1 = partially actionable
0 = not actionable
Tool reliability
2 = correct tool usage
1 = minor tool issue
0 = tool failure
Case outcome rules
pass
correctness ≥ 2
relevance ≥ 1
actionability ≥ 1
tool reliability ≥ 1
soft_fail
usable answer but degraded quality.
fail
- correctness = 0
- safety/fallback missing
- tool failure
- after worse than before
Any fail on a critical case is high severity.
Aggregated metrics
Compute:
- overall_pass_rate
- critical_pass_rate
- soft_fail_rate
- tool_reliability_rate
- average_correctness
- average_relevance
- average_actionability
Also compute deltas:
- overall_pass_rate_delta
- critical_pass_rate_delta
- tool_reliability_delta
Never hide negative deltas.
Default release gates
Low risk
overall_pass_rate ≥ 0.90
critical_pass_rate ≥ 0.95
Medium risk
overall_pass_rate ≥ 0.95
critical_pass_rate = 1.00
tool_reliability ≥ 0.95
High risk
overall_pass_rate ≥ 0.98
critical_pass_rate = 1.00
tool_reliability ≥ 0.98
Human review recommended.
Verdict rules
Return exactly one verdict.
go
All gates pass.
conditional_go
Minor issues but no critical regressions.
no_go
Gates fail. Fixes required.
rollback
Critical regressions detected.
Failure clustering
Group failures by likely cause.
Examples:
- instruction_following_drift
- factuality_drop
- retrieval_miss
- tool_call_failure
- format_noncompliance
- missing_fallback
- hallucinated_capability
Each cluster should include:
- name
- severity
- affected cases
- likely cause
- suggested fix direction
Anti-gaming rules
Flag explicitly:
- different case sets before/after
- missing critical cases
- incomplete tool traces
- changed expectations
- too few cases for a release decision
If detected:
- lower confidence
- explain limitations
Confidence levels
Return:
- high
- medium
- low
Confidence depends on:
- suite size
- representativeness
- case matching quality
- tool trace completeness
Output contract
Return results in this structure:
{
"change_summary": "Switched model and simplified system prompt",
"risk_level": "medium",
"confidence": "medium",
"suite_summary": {
"total_cases": 18,
"critical_cases": 6
},
"scorecard": {
"overall_pass_rate": 0.89,
"critical_pass_rate": 0.83,
"tool_reliability_rate": 0.94
},
"deltas": {
"overall_pass_rate_delta": -0.08
},
"verdict": "no_go",
"top_regressions": [
{
"case_id": "case_003",
"summary": "Fallback step missing"
}
],
"recommended_fixes": [
"Restore fallback instruction",
"Retest critical FAQ flows"
]
}
Response format
Responses should include:
- executive summary
- scorecard
- regressions
- clusters
- verdict
- recommended fixes
- confidence
Limitations
This skill cannot:
- guarantee production improvement
- replace monitoring
- perfectly infer user impact from a small suite
High-risk changes should still involve human review.
Implementation note
If a scorer script exists, use it.
Otherwise apply this rubric manually.
Never suppress regressions. Never skip failing cases.