skill-test

Databricks Skills Testing Framework


Install the skill with: npx skills add databricks-solutions/ai-dev-kit/databricks-solutions-ai-dev-kit-skill-test


Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

Quick References

  • Scorers - Available scorers and quality gates

  • YAML Schemas - Manifest and ground truth formats

  • Python API - Programmatic usage examples

  • Workflows - Detailed example workflows

  • Trace Evaluation - Session trace analysis

/skill-test Command

The /skill-test command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

Basic Usage

/skill-test <skill-name> [subcommand]

Subcommands

  • run - Run evaluation against ground truth (default)

  • regression - Compare current results against baseline

  • init - Initialize test scaffolding for a new skill

  • add - Interactive: prompt -> invoke skill -> test -> save

  • add --trace - Add test case with trace evaluation

  • review - Review pending candidates interactively

  • review --batch - Batch approve all pending candidates

  • baseline - Save current results as regression baseline

  • mlflow - Run full MLflow evaluation with LLM judges

  • trace-eval - Evaluate traces against skill expectations

  • list-traces - List available traces (MLflow or local)

  • scorers - List configured scorers for a skill

  • scorers update - Add/remove scorers or update default guidelines

  • sync - Sync YAML to Unity Catalog (Phase 2)

Quick Examples

/skill-test databricks-spark-declarative-pipelines run
/skill-test databricks-spark-declarative-pipelines add --trace
/skill-test databricks-spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init

See Workflows for detailed examples of each subcommand.

Execution Instructions

Environment Setup

uv pip install -e .test/

Environment variables for Databricks MLflow:

  • DATABRICKS_CONFIG_PROFILE - Databricks CLI profile (default: "DEFAULT")

  • MLFLOW_TRACKING_URI - Set to "databricks" for Databricks MLflow

  • MLFLOW_EXPERIMENT_NAME - Experiment path (e.g., "/Users/{user}/skill-test")
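Since the wrapper scripts are Python, one way to set these variables is from Python itself before anything imports MLflow. This is a minimal sketch; the experiment path shown is a placeholder you would replace with your own user path:

```python
import os

# Illustrative setup only -- adjust the profile name and experiment path
# for your own workspace. The experiment path below is a made-up example.
os.environ["DATABRICKS_CONFIG_PROFILE"] = "DEFAULT"
os.environ["MLFLOW_TRACKING_URI"] = "databricks"
os.environ["MLFLOW_EXPERIMENT_NAME"] = "/Users/someone@example.com/skill-test"

print(os.environ["MLFLOW_TRACKING_URI"])  # databricks
```

Exporting the same variables in your shell profile works equally well.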

Running Scripts

All subcommands have corresponding scripts in .test/scripts/:

uv run python .test/scripts/{subcommand}.py {skill_name} [options]

  • run - run_eval.py

  • regression - regression.py

  • init - init_skill.py

  • add - add.py

  • review - review.py

  • baseline - baseline.py

  • mlflow - mlflow_eval.py

  • scorers - scorers.py

  • scorers update - scorers_update.py

  • sync - sync.py

  • trace-eval - trace_eval.py

  • list-traces - list_traces.py

  • _routing mlflow - routing_eval.py

Use --help on any script for available options.
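The subcommand-to-script mapping above can be captured in a small lookup table. The sketch below is illustrative (the `build_command` helper is not part of the framework); it only shows how a wrapper might assemble the `uv run` invocation:

```python
import shlex

# Subcommand -> wrapper script, mirroring the table above
# (the _routing variant is omitted for brevity).
SCRIPTS = {
    "run": "run_eval.py",
    "regression": "regression.py",
    "init": "init_skill.py",
    "add": "add.py",
    "review": "review.py",
    "baseline": "baseline.py",
    "mlflow": "mlflow_eval.py",
    "scorers": "scorers.py",
    "scorers update": "scorers_update.py",
    "sync": "sync.py",
    "trace-eval": "trace_eval.py",
    "list-traces": "list_traces.py",
}

def build_command(subcommand: str, skill_name: str, *options: str) -> str:
    """Assemble the shell command for a subcommand (hypothetical helper)."""
    script = SCRIPTS[subcommand]
    parts = ["uv", "run", "python", f".test/scripts/{script}", skill_name, *options]
    return " ".join(shlex.quote(p) for p in parts)

print(build_command("run", "my-skill"))
# uv run python .test/scripts/run_eval.py my-skill
```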

Command Handler

When /skill-test is invoked, parse arguments and execute the appropriate command.

Argument Parsing

  • args[0] = skill_name (required)

  • args[1] = subcommand (optional, default: "run")

Subcommand Routing

  • run - Execute run(skill_name, ctx) and display results

  • regression - Execute regression(skill_name, ctx) and display comparison

  • init - Execute init(skill_name, ctx) to create scaffolding

  • add - Prompt for test input, invoke skill, run interactive()

  • review - Execute review(skill_name, ctx) to review pending candidates

  • baseline - Execute baseline(skill_name, ctx) to save as regression baseline

  • mlflow - Execute mlflow_eval(skill_name, ctx) with MLflow logging

  • scorers - Execute scorers(skill_name, ctx) to list configured scorers

  • scorers update - Execute scorers_update(skill_name, ctx, ...) to modify scorers
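The parsing rules above can be sketched in a few lines. This is an assumption about the handler's shape, not the framework's actual code; note that joining the trailing arguments lets two-word subcommands like "scorers update" route correctly:

```python
def parse_skill_test_args(args: list[str]) -> tuple[str, str]:
    """Parse /skill-test arguments: args[0] is the skill name (required),
    the rest form the subcommand (default "run"). Illustrative sketch only."""
    if not args:
        raise ValueError("skill_name is required")
    skill_name = args[0]
    # Join the remainder so "scorers update" arrives as one subcommand.
    subcommand = " ".join(args[1:]) or "run"
    return skill_name, subcommand

print(parse_skill_test_args(["my-skill"]))                      # ('my-skill', 'run')
print(parse_skill_test_args(["my-skill", "scorers", "update"])) # ('my-skill', 'scorers update')
```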

init Behavior

When running /skill-test <skill-name> init :

  • Read the skill's SKILL.md to understand its purpose

  • Create manifest.yaml with appropriate scorers and trace_expectations

  • Create empty ground_truth.yaml and candidates.yaml templates

  • Recommend test prompts based on documentation examples

Follow with /skill-test <skill-name> add using recommended prompts.
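A scaffolding step like the one described could look roughly like this. The function and the YAML field names are hypothetical stand-ins (the real templates come from the framework); the sketch only shows the file layout that init produces:

```python
from pathlib import Path
import tempfile

def init_scaffolding(repo_root: Path, skill_name: str) -> Path:
    """Hypothetical sketch: create manifest plus empty templates under
    {repo_root}/.test/skills/{skill-name}/. Field names are illustrative."""
    skill_dir = repo_root / ".test" / "skills" / skill_name
    skill_dir.mkdir(parents=True, exist_ok=True)
    (skill_dir / "manifest.yaml").write_text(
        f"skill: {skill_name}\nscorers: []\ntrace_expectations: []\n"
    )
    (skill_dir / "ground_truth.yaml").write_text("cases: []\n")
    (skill_dir / "candidates.yaml").write_text("candidates: []\n")
    return skill_dir

out = init_scaffolding(Path(tempfile.mkdtemp()), "my-new-skill")
print(sorted(p.name for p in out.iterdir()))
# ['candidates.yaml', 'ground_truth.yaml', 'manifest.yaml']
```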

Context Setup

Create CLIContext with MCP tools before calling any command. See Python API for details.

File Locations

Important: All test files are stored at the repository root level, not relative to this skill's directory.

  • Ground truth - {repo_root}/.test/skills/{skill-name}/ground_truth.yaml

  • Candidates - {repo_root}/.test/skills/{skill-name}/candidates.yaml

  • Manifest - {repo_root}/.test/skills/{skill-name}/manifest.yaml

  • Routing tests - {repo_root}/.test/skills/_routing/ground_truth.yaml

  • Baselines - {repo_root}/.test/baselines/{skill-name}/baseline.yaml

For example, to test databricks-spark-declarative-pipelines in this repository:

/Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml

Not relative to the skill definition:

/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/... # WRONG

Directory Structure

.test/                          # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml              # Package config (pip install -e ".test/")
├── README.md                   # Contributor documentation
├── SKILL.md                    # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh       # Sync script
├── scripts/                    # Wrapper scripts
│   ├── _common.py              # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py           # Trace evaluation
│   ├── list_traces.py          # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/             # Python package
│       ├── cli/                # CLI commands module
│       ├── fixtures/           # Test fixture setup
│       ├── scorers/            # Evaluation scorers
│       ├── grp/                # Generate-Review-Promote pipeline
│       └── runners/            # Evaluation runners
├── skills/                     # Per-skill test definitions
│   ├── _routing/               # Routing test cases
│   └── {skill-name}/           # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                      # Unit tests
├── references/                 # Documentation references
└── baselines/                  # Regression baselines

