# Databricks Skills Testing Framework
Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.
## Quick References

- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis
## /skill-test Command

The `/skill-test` command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

### Basic Usage

```
/skill-test <skill-name> [subcommand]
```
Subcommands
Subcommand Description
run
Run evaluation against ground truth (default)
regression
Compare current results against baseline
init
Initialize test scaffolding for a new skill
add
Interactive: prompt -> invoke skill -> test -> save
add --trace
Add test case with trace evaluation
review
Review pending candidates interactively
review --batch
Batch approve all pending candidates
baseline
Save current results as regression baseline
mlflow
Run full MLflow evaluation with LLM judges
trace-eval
Evaluate traces against skill expectations
list-traces
List available traces (MLflow or local)
scorers
List configured scorers for a skill
scorers update
Add/remove scorers or update default guidelines
sync
Sync YAML to Unity Catalog (Phase 2)
### Quick Examples

```
/skill-test databricks-spark-declarative-pipelines run
/skill-test databricks-spark-declarative-pipelines add --trace
/skill-test databricks-spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
```
See Workflows for detailed examples of each subcommand.
## Execution Instructions

### Environment Setup

```
uv pip install -e .test/
```
Environment variables for Databricks MLflow:

- `DATABRICKS_CONFIG_PROFILE` - Databricks CLI profile (default: "DEFAULT")
- `MLFLOW_TRACKING_URI` - Set to "databricks" for Databricks MLflow
- `MLFLOW_EXPERIMENT_NAME` - Experiment path (e.g., "/Users/{user}/skill-test")
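If you prefer to set these from Python rather than the shell, a minimal sketch (values mirror the defaults and example above; substitute your own user in the experiment path):

```python
import os

# Placeholder values: keep "DEFAULT" / "databricks" as documented above and
# replace {user} with your own workspace user.
os.environ.setdefault("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
os.environ.setdefault("MLFLOW_TRACKING_URI", "databricks")
os.environ.setdefault("MLFLOW_EXPERIMENT_NAME", "/Users/{user}/skill-test")
```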
### Running Scripts

All subcommands have corresponding scripts in `.test/scripts/`:

```
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
```
| Subcommand | Script |
|---|---|
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `review` | `review.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `scorers` | `scorers.py` |
| `scorers update` | `scorers_update.py` |
| `sync` | `sync.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `_routing mlflow` | `routing_eval.py` |
Use `--help` on any script for available options.
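As a concrete instance of the invocation pattern above, a hedged sketch of calling one wrapper script programmatically (the skill name is the example from Quick Examples; from a shell you would run the same command directly):

```python
import subprocess

# Show the available options for the `run` wrapper script.
subprocess.run(
    [
        "uv", "run", "python", ".test/scripts/run_eval.py",
        "databricks-spark-declarative-pipelines", "--help",
    ],
    check=True,
)
```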
## Command Handler

When `/skill-test` is invoked, parse arguments and execute the appropriate command.

### Argument Parsing

- `args[0]` = skill_name (required)
- `args[1]` = subcommand (optional, default: "run")
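A minimal sketch of that parsing step (the function name and error message are illustrative, not part of the framework API):

```python
def parse_args(args: list[str]) -> tuple[str, str]:
    """Split the /skill-test argument list into (skill_name, subcommand)."""
    if not args:
        raise ValueError("skill_name is required: /skill-test <skill-name> [subcommand]")
    skill_name = args[0]
    # Everything after the skill name is the subcommand, e.g. "run",
    # "scorers update", or "review --batch"; default to "run" when omitted.
    subcommand = " ".join(args[1:]) or "run"
    return skill_name, subcommand
```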
### Subcommand Routing

| Subcommand | Action |
|---|---|
| `run` | Execute `run(skill_name, ctx)` and display results |
| `regression` | Execute `regression(skill_name, ctx)` and display comparison |
| `init` | Execute `init(skill_name, ctx)` to create scaffolding |
| `add` | Prompt for test input, invoke skill, run `interactive()` |
| `review` | Execute `review(skill_name, ctx)` to review pending candidates |
| `baseline` | Execute `baseline(skill_name, ctx)` to save as regression baseline |
| `mlflow` | Execute `mlflow_eval(skill_name, ctx)` with MLflow logging |
| `scorers` | Execute `scorers(skill_name, ctx)` to list configured scorers |
| `scorers update` | Execute `scorers_update(skill_name, ctx, ...)` to modify scorers |
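A sketch of the routing step, assuming the command functions named above can be imported from the package's CLI module (the import path is an assumption; `add` and `scorers update` are omitted here because they take extra interactive input):

```python
# Import path is an assumption for illustration; see the Python API reference.
from skill_test.cli import run, regression, init, review, baseline, mlflow_eval, scorers

HANDLERS = {
    "run": run,
    "regression": regression,
    "init": init,
    "review": review,
    "baseline": baseline,
    "mlflow": mlflow_eval,
    "scorers": scorers,
}

def dispatch(skill_name: str, subcommand: str, ctx) -> None:
    """Route a parsed subcommand to its handler, passing the CLIContext."""
    handler = HANDLERS.get(subcommand)
    if handler is None:
        raise ValueError(f"Unsupported subcommand: {subcommand}")
    handler(skill_name, ctx)
```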
### init Behavior

When running `/skill-test <skill-name> init`:

- Read the skill's SKILL.md to understand its purpose
- Create manifest.yaml with appropriate scorers and trace_expectations
- Create empty ground_truth.yaml and candidates.yaml templates
- Recommend test prompts based on documentation examples

Follow with `/skill-test <skill-name> add` using the recommended prompts.
### Context Setup

Create a CLIContext with MCP tools before calling any command. See Python API for details.
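A heavily hedged sketch of that setup step; the import path and constructor argument below are assumptions for illustration only (the real interface is documented in the Python API reference):

```python
from skill_test.cli import CLIContext  # module path is an assumption

mcp_tools = []  # placeholder: MCP tool handles provided by the host environment
ctx = CLIContext(mcp_tools=mcp_tools)  # keyword name is hypothetical
```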
## File Locations

**Important:** All test files are stored at the repository root level, not relative to this skill's directory.

| File Type | Path |
|---|---|
| Ground truth | `{repo_root}/.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `{repo_root}/.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `{repo_root}/.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `{repo_root}/.test/skills/_routing/ground_truth.yaml` |
| Baselines | `{repo_root}/.test/baselines/{skill-name}/baseline.yaml` |

For example, to test databricks-spark-declarative-pipelines in this repository:

```
/Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml
```

Not relative to the skill definition:

```
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...   # WRONG
```
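A small path-construction sketch that mirrors the table above (assumes the code runs from the repository root; the skill name is the same example):

```python
from pathlib import Path

repo_root = Path.cwd()  # assumption: invoked from the repository root
skill_name = "databricks-spark-declarative-pipelines"

ground_truth = repo_root / ".test" / "skills" / skill_name / "ground_truth.yaml"
candidates = repo_root / ".test" / "skills" / skill_name / "candidates.yaml"
manifest = repo_root / ".test" / "skills" / skill_name / "manifest.yaml"
baseline = repo_root / ".test" / "baselines" / skill_name / "baseline.yaml"
```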
## Directory Structure

```
.test/                        # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml            # Package config (pip install -e ".test/")
├── README.md                 # Contributor documentation
├── SKILL.md                  # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh     # Sync script
├── scripts/                  # Wrapper scripts
│   ├── _common.py            # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py         # Trace evaluation
│   ├── list_traces.py        # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/           # Python package
│       ├── cli/              # CLI commands module
│       ├── fixtures/         # Test fixture setup
│       ├── scorers/          # Evaluation scorers
│       ├── grp/              # Generate-Review-Promote pipeline
│       └── runners/          # Evaluation runners
├── skills/                   # Per-skill test definitions
│   ├── _routing/             # Routing test cases
│   └── {skill-name}/         # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                    # Unit tests
├── references/               # Documentation references
└── baselines/                # Regression baselines
```
## References

- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis