peer-review

Multi-model peer review layer using local LLMs via Ollama to catch errors in cloud model output. Fan out critiques to 2-3 local models, aggregate flags, synthesize consensus.

Use when:
  • Validating trade analyses
  • Reviewing agent output quality
  • Testing local model accuracy
  • Checking any high-stakes Claude output before publishing or acting on it

Don't use when:
  • Simple fact-checking (just search the web)
  • Tasks that don't benefit from multi-model consensus
  • Time-critical decisions where 60s latency is unacceptable
  • Reviewing trivial or low-stakes content

Negative examples:
  • "Check if this date is correct" → No. Just web search it.
  • "Review my grocery list" → No. Not worth multi-model inference.
  • "I need this answer in 5 seconds" → No. Peer review adds 30-60s latency.

Edge cases:
  • Short text (<50 words) → Models may not find meaningful issues. Consider skipping.
  • Highly technical domain → Local models may lack domain knowledge. Weight flags lower.
  • Creative writing → Factual review doesn't apply well. Use only for logical consistency.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "peer-review" with this command: npx skills add staybased/peer-review

Peer Review — Local LLM Critique Layer

Hypothesis: Local LLMs can catch ≥30% of real errors in cloud output with <50% false positive rate.


Architecture

Cloud Model (Claude) produces analysis
        │
        ▼
┌────────────────────────┐
│   Peer Review Fan-Out  │
├────────────────────────┤
│  Drift (Mistral 7B)    │──► Critique A
│  Pip (TinyLlama 1.1B)  │──► Critique B
│  Lume (Llama 3.1 8B)   │──► Critique C
└────────────────────────┘
        │
        ▼
  Aggregator (consensus logic)
        │
        ▼
  Final: original + flagged issues

Swarm Bot Roles

| Bot | Model | Role | Strengths |
|---|---|---|---|
| Drift 🌊 | Mistral 7B | Methodical analyst | Structured reasoning, catches logical gaps |
| Pip 🐣 | TinyLlama 1.1B | Fast checker | Quick sanity checks, low latency |
| Lume 💡 | Llama 3.1 8B | Deep thinker | Nuanced analysis, catches subtle issues |

Scripts

| Script | Purpose |
|---|---|
| scripts/peer-review.sh | Send a single input to all models, collect critiques |
| scripts/peer-review-batch.sh | Run peer review across a corpus of samples |
| scripts/seed-test-corpus.sh | Generate a corpus with seeded errors for testing |

Usage

# Single file review
bash scripts/peer-review.sh <input_file> [output_dir]

# Batch review
bash scripts/peer-review-batch.sh <corpus_dir> [results_dir]

# Generate test corpus
bash scripts/seed-test-corpus.sh [count] [output_dir]

Scripts live at workspace/scripts/; they are not bundled into the skill, to avoid duplication.
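The fan-out step of peer-review.sh can be sketched as follows. This is a minimal sketch, not the bundled script: the model tags come from the Dependencies section, the request shape follows Ollama's /api/generate HTTP API, and build_payload / review_all are illustrative names.

```shell
#!/usr/bin/env sh
# Sketch of the fan-out: one Ollama request per reviewer model.
set -eu

MODELS="mistral:7b tinyllama:1.1b llama3.1:8b"
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434/api/generate}"

# jq --arg handles quoting and escaping, so arbitrary analysis text is safe.
build_payload() {
  jq -n --arg model "$1" --arg text "$2" \
    '{model: $model,
      prompt: ("You are a skeptical reviewer. Analyze the following text for errors.\n\nTEXT:\n---\n" + $text + "\n---"),
      stream: false}'
}

# Fan out: write one critique file per model under the output directory.
review_all() {
  text="$1"; outdir="$2"
  mkdir -p "$outdir"
  for model in $MODELS; do
    build_payload "$model" "$text" \
      | curl -s "$OLLAMA_URL" -d @- \
      | jq -r '.response' > "$outdir/$(printf '%s' "$model" | tr ':.' '__').txt"
  done
}

# Example (requires a running Ollama):
# review_all "$(cat analysis.md)" critiques/
```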


Critique Prompt Template

You are a skeptical reviewer. Analyze the following text for errors.

For each issue found, output JSON:
{"category": "factual|logical|missing|overconfidence|hallucinated_source",
 "quote": "...", "issue": "...", "confidence": 0-100}

If no issues found, output: {"issues": []}

TEXT:
---
{cloud_output}
---
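In practice, small local models rarely return only the requested JSON; they tend to wrap it in prose. A tolerant parse step before aggregation helps. This sketch (parse_critique is an illustrative name) assumes each flag object sits on a single line; multi-line JSON would need a real parser.

```shell
set -eu

# Pull JSON flag objects out of a chatty critique, dropping surrounding prose.
parse_critique() {
  printf '%s\n' "$1" | grep -o '{.*}' | while IFS= read -r line; do
    # Keep only spans that jq can actually parse as objects.
    printf '%s\n' "$line" | jq -c 'select(type == "object")' 2>/dev/null || true
  done
}
```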

Error Categories

| Category | Description | Example |
|---|---|---|
| factual | Wrong numbers, dates, names | "Bitcoin launched in 2010" |
| logical | Non-sequiturs, unsupported conclusions | "X is rising, therefore Y will fall" |
| missing | Important context omitted | Ignoring a major counterargument |
| overconfidence | Certainty without justification | "This will definitely happen" on 55% event |
| hallucinated_source | Citing nonexistent sources | "According to a 2024 Reuters report..." |

Discord Workflow

  1. Post analysis to #the-deep (or #swarm-lab)
  2. Drift, Pip, and Lume respond with independent critiques
  3. Celeste synthesizes: deduplicates flags, weights by model confidence
  4. If consensus (≥2 models agree) → flag is high-confidence
  5. Final output posted with recommendation: publish | revise | flag_for_human
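The synthesis in steps 3-4 can be sketched with jq: group flags by the quoted span and mark an entry high-confidence when at least two models raise it. Exact-quote matching is an assumption here; real deduplication may need fuzzier matching, and the "model" field is assumed to be tagged onto each flag upstream.

```shell
set -eu

# Merge per-model flag files (newline-delimited JSON) into consensus entries.
consensus() {
  jq -s '
    group_by(.quote)
    | map({quote: .[0].quote,
           models: (map(.model) | unique),
           categories: (map(.category) | unique),
           high_confidence: ((map(.model) | unique | length) >= 2)})
  ' "$@"
}
```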

Success Criteria

| Outcome | TPR | FPR | Decision |
|---|---|---|---|
| Strong pass | ≥50% | <30% | Ship as default layer |
| Pass | ≥30% | <50% | Ship as opt-in layer |
| Marginal | 20–30% | 50–70% | Iterate on prompts, retest |
| Fail | <20% | >70% | Abandon approach |

Scoring Rules

  • Flag = true positive if it identifies a real error (even if explanation is imperfect)
  • Flag = false positive if flagged content is actually correct
  • Duplicate flags across models count once for TPR but inform consensus metrics
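Under one reading of these rules (TPR measured against seeded errors, FPR against all emitted flags), scoring a batch run is plain arithmetic. The score helper below is illustrative:

```shell
set -eu

# seeded:      number of errors planted in the corpus
# caught:      seeded errors flagged by at least one model (true positives)
# flags_total: all flags emitted (duplicates already collapsed)
# flags_false: flags on content that was actually correct (false positives)
score() {
  awk -v s="$1" -v c="$2" -v t="$3" -v f="$4" \
    'BEGIN { printf "TPR=%.0f%% FPR=%.0f%%\n", 100 * c / s, 100 * f / t }'
}

score 20 7 15 6   # → TPR=35% FPR=40%
```

A run like the example (35% TPR, 40% FPR) would land in the Pass band of the success-criteria table: ship as an opt-in layer.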

Dependencies

  • Ollama running locally with models pulled: mistral:7b, tinyllama:1.1b, llama3.1:8b
  • jq and curl installed
  • Results stored in experiments/peer-review-results/
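A preflight check saves a confusing failure mid-run. A sketch of one, where require_cmd is an illustrative helper and the grep against ollama list assumes its tabular output includes the model tags:

```shell
set -eu

# Fail fast if a CLI dependency is missing.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

check_deps() {
  for cmd in jq curl ollama; do
    require_cmd "$cmd" || return 1
  done
  for model in mistral:7b tinyllama:1.1b llama3.1:8b; do
    ollama list | grep -q "$model" \
      || { echo "model not pulled: $model (try: ollama pull $model)" >&2; return 1; }
  done
  echo "all peer-review dependencies present"
}
```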

Integration

When peer review passes validation:

  • Package as Reef API endpoint: POST /review
  • Agents call before publishing any analysis
  • Configurable: model selection, consensus threshold, categories
  • Log all reviews to #reef-logs with TPR tracking
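The request an agent would send might look like this. The /review route, field names, and defaults below are assumptions about an endpoint that does not exist yet, chosen to mirror the configurables listed above:

```shell
set -eu

REEF_URL="${REEF_URL:-http://localhost:8080/review}"

# Assemble the hypothetical POST /review body; every field name here is a
# proposal, not an implemented contract.
build_review_request() {
  jq -n --arg text "$1" \
    '{text: $text,
      models: ["mistral:7b", "tinyllama:1.1b", "llama3.1:8b"],
      consensus_threshold: 2,
      categories: ["factual", "logical", "missing",
                   "overconfidence", "hallucinated_source"]}'
}

# Example (requires the endpoint to exist):
# build_review_request "$(cat analysis.md)" | curl -s "$REEF_URL" -d @-
```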
