deep-debug-ensemble-v1

Benchmark variant of deep-debug that ensembles the hypothesis-plausibility judge phase (Phase 3) across three heterogeneous providers (Anthropic Sonnet 4.6, OpenAI GPT-5.4, and Google Gemini 2.5 Pro) instead of using a single Haiku judge. Its purpose is a skill-bench A/B comparison against baseline deep-debug, measuring whether heterogeneous judges classify hypothesis plausibility more honestly than a single-model panel. Trigger phrases, argument semantics, and all other phases (critic, probe, fix, architect escalation) are identical to deep-debug; ONLY the pass-1 blind and pass-2 informed hypothesis judges differ.
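This listing does not say how the three providers' verdicts are combined, so the following is only a minimal sketch of one plausible aggregation step, assuming each judge returns one label per hypothesis. The label set, the `Verdict` type, the `aggregate` function, and the strict-majority rule with an "uncertain" fallback are all hypothetical, not deep-debug's actual implementation. The same function could in principle be applied once to the pass-1 blind verdicts and once to the pass-2 informed verdicts.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical plausibility labels; the real deep-debug label set is not given here.
LABELS = ("plausible", "implausible", "uncertain")


@dataclass(frozen=True)
class Verdict:
    provider: str  # e.g. "anthropic", "openai", "google" (illustrative names)
    label: str     # one of LABELS


def aggregate(verdicts: list[Verdict]) -> tuple[str, bool]:
    """Combine per-provider verdicts for a single hypothesis.

    Returns (label, unanimous). Assumed rule: a label held by a strict
    majority of judges wins; otherwise fall back to "uncertain".
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    for v in verdicts:
        if v.label not in LABELS:
            raise ValueError(f"unknown label {v.label!r} from {v.provider}")
    counts = Counter(v.label for v in verdicts)
    label, n = counts.most_common(1)[0]
    unanimous = n == len(verdicts)
    if n * 2 <= len(verdicts):  # no strict majority, e.g. a three-way split
        return "uncertain", unanimous
    return label, unanimous


panel = [
    Verdict("anthropic", "plausible"),
    Verdict("openai", "plausible"),
    Verdict("google", "implausible"),
]
print(aggregate(panel))  # ('plausible', False)
```

Returning the unanimity flag alongside the label keeps cross-provider disagreement visible, which is the signal a heterogeneous-judge A/B comparison would presumably want to inspect rather than discard.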

Safety Notice

This listing is imported from SkillsMP metadata and should be treated as untrusted until upstream source review is completed.

Copy this and send it to your AI assistant to learn the skill:

Install skill "deep-debug-ensemble-v1" with this command: npx skills add npow/skillsmp-npow-npow-deep-debug-ensemble-v1

No markdown body

This source entry does not include full markdown content beyond metadata.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.


Related Skills

Related by shared tags or category signals.

General

loop-until-done

Use when a task must be driven to guaranteed completion through a PRD-driven persistence loop — breaking work into user stories with structured acceptance criteria, iterating story-by-story with independent verification, and terminating only when every criterion has fresh passing evidence and an independent reviewer approves. Trigger phrases include "keep going until done", "loop until complete", "don't stop until", "finish this completely", "iterate until done", "persistence loop", "PRD-driven execution", "work through all stories", "drive this to completion", "until all tests pass", "keep iterating", "loop this", "self-loop until finished". Honest termination labels; no self-approval.

Repository Source · Needs Review

-3npow

General

proposal-reviewer

Critically reviews project proposals, grant applications, and business plans. Use when the user asks to review, critique, evaluate, or assess a proposal, pitch, grant application, or business plan for viability, competition, or flaws. Fact-checks claims, maps competitive landscape, identifies structural problems, and provides honest recommendations.

Repository Source · Needs Review

-3npow

Research

deep-research-temporal

Use when researching, investigating, or exploring a topic systematically via a durable Temporal-backed workflow. Trigger phrases include "deep-research temporal", "sagaflow research", "durable research", "temporal research". Spawns parallel researchers across WHO/WHAT/HOW/WHERE/WHEN/WHY/LIMITS plus cross-cut dimensions (PRIOR-FAILURE, BASELINE, ADJACENT-EFFORTS, STRATEGIC-TIMING, ACTUAL-USAGE). Fact-verification, novelty classification, vocabulary bootstrap for cold-start topics. Fire-and-forget while you do other work.

Repository Source · Needs Review

-3npow

Security

deep-qa-ensemble-v1

Benchmark variant of deep-qa that ensembles the severity-judge phase across three heterogeneous providers (Anthropic Sonnet 4.6, OpenAI GPT-5.4, and Google Gemini 2.5 Pro) instead of using a single Haiku judge. Its purpose is a skill-bench A/B comparison against baseline deep-qa, measuring whether heterogeneous judges calibrate severity better than a single homogeneous judge. Trigger phrases, argument semantics, and the critic phase are identical to deep-qa; ONLY the pass-1 blind and pass-2 informed severity judges differ. The rationalization auditor (Phase 5.6) stays single-Haiku to preserve its independence role.

Repository Source · Needs Review

-3npow