
Results Audit — Authenticity & Statistical Validity

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the "results-audit" skill with this command: npx skills add paulbroadmission/ncea_denoise/paulbroadmission-ncea-denoise-results-audit


Automated Red Flag Scan

Quick anomaly detection

python3 - <<'EOF'
import json

results_file = 'workspace/results/iteration_001/test_results.json'  # adjust iteration

try:
    with open(results_file) as f:
        r = json.load(f)

    flags = []

    # Check for perfect metrics
    for k, v in r.items():
        if isinstance(v, float) and v >= 1.0:
            flags.append(f'SUSPICIOUS: {k} = {v} (perfect score)')
        if isinstance(v, float) and v == 0.0:
            flags.append(f'SUSPICIOUS: {k} = {v} (zero)')

    # Check seed is recorded
    if 'seed' not in r:
        flags.append('MISSING: random seed not recorded')

    if flags:
        print('🚩 RED FLAGS:')
        for f in flags:
            print(f'  - {f}')
    else:
        print('✅ No obvious red flags')

except Exception as e:
    print(f'❌ Cannot read results: {e}')
EOF

Verification Checklist

  1. Training Log Integrity
  • training_log.json exists and is complete

  • Loss curve is monotonically decreasing (overall trend)

  • No sudden jumps that suggest training restart without logging

  • Number of epochs matches config

  • Timestamps are sequential (not fabricated)
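The log-integrity checks above can be sketched as a single helper. It assumes training_log.json holds a list of per-epoch records with 'loss' and 'timestamp' fields — a hypothetical schema; adjust the field names to the actual log layout:

```python
def audit_training_log(entries, expected_epochs, jump_factor=1.5):
    """Scan a list of epoch records ({'loss': float, 'timestamp': float}) for red flags."""
    flags = []
    losses = [e['loss'] for e in entries]
    times = [e['timestamp'] for e in entries]
    # Overall trend: final loss should sit below the initial loss
    if losses and losses[-1] >= losses[0]:
        flags.append('loss does not decrease overall')
    # A sudden upward jump suggests a training restart without logging
    for prev, cur in zip(losses, losses[1:]):
        if cur > prev * jump_factor:
            flags.append(f'loss jump: {prev:.4f} -> {cur:.4f}')
    # Timestamps must be strictly increasing, not fabricated
    if any(b <= a for a, b in zip(times, times[1:])):
        flags.append('timestamps not sequential')
    # Epoch count must match the config
    if len(entries) != expected_epochs:
        flags.append(f'epoch count {len(entries)} != config {expected_epochs}')
    return flags
```

The jump_factor threshold of 1.5x is an illustrative default, not a value from this skill.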

  2. Results Plausibility

Read the expected range from workspace/logs/strategy_matrix.json:

  • Result within backward-induction expected range?

  • If result > expected + 5%: investigate data leak or bug

  • If result < expected - 10%: strategy may be failing

  • Variance across seeds is realistic (typically 0.5-3% for classification)
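A minimal sketch of the two checks above, assuming metrics are fractions in [0, 1] so the 5% / 10% margins become 0.05 / 0.10 (function names are illustrative):

```python
import statistics

def classify_plausibility(actual, expected_low, expected_high):
    """Map a result against the backward-induction expected range."""
    if actual > expected_high + 0.05:
        return 'SUSPICIOUS'       # investigate data leak or bug
    if actual < expected_low - 0.10:
        return 'OUTSIDE_RANGE'    # strategy may be failing
    return 'PLAUSIBLE'

def seed_variance_realistic(scores, lo=0.005, hi=0.03):
    """Check the per-seed std dev falls in the 0.5-3% band typical for classification."""
    return lo <= statistics.stdev(scores) <= hi
```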

  3. Cross-Consistency
  • test_results.json numbers match final epoch in training_log.json

  • Numbers in comparison_table.tex match comparison_results.json

  • Baseline numbers match their cited source (paper table/figure number)
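The first cross-check can be automated; the .tex and cited-paper comparisons still need eyeballing. This sketch assumes both files record the metric under the same key and that training_log.json is a list of epoch dicts (an assumed layout):

```python
def check_cross_consistency(test_results, training_log, key='accuracy', tol=1e-6):
    """Verify test_results.json matches the final epoch of training_log.json."""
    final_epoch = training_log[-1]
    if key not in test_results or key not in final_epoch:
        return [f'MISSING: {key} absent from one of the files']
    if abs(test_results[key] - final_epoch[key]) > tol:
        return [f'MISMATCH: {key} {test_results[key]} vs final epoch {final_epoch[key]}']
    return []
```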

  4. Statistical Significance

For each "our method vs baseline" comparison:

  • Paired t-test or Wilcoxon computed (p < 0.05)

  • Multiple seeds used (minimum 3, recommended 5)

  • Mean AND standard deviation reported

  • No cherry-picked seeds
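In practice scipy.stats.ttest_rel or scipy.stats.wilcoxon does this; the stdlib-only sketch below computes the paired t statistic by hand and compares it against hard-coded two-sided 5% critical values, which covers the 3-10 seed range the checklist recommends:

```python
import math
import statistics

# Two-sided critical t values at alpha = 0.05, keyed by degrees of freedom
T_CRIT_05 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def paired_t_significant(ours, baseline):
    """Paired t-test over per-seed scores (same seeds, same order), alpha = 0.05."""
    if len(ours) < 3 or len(ours) != len(baseline):
        raise ValueError('need at least 3 paired seeds')
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError('zero variance across seeds: a red flag in itself')
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return abs(t) > T_CRIT_05[len(diffs) - 1]
```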

  5. Reproducibility
  • Random seeds listed

  • requirements.txt complete

  • Data preprocessing is deterministic

  • Hardware info recorded
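A quick presence check for the reproducibility artifacts; the 'seeds' and 'hardware' keys are assumed names, not fields this skill defines:

```python
from pathlib import Path

def reproducibility_flags(run_dir, results):
    """Flag missing reproducibility artifacts in a run directory and its results dict."""
    flags = []
    if not results.get('seeds'):
        flags.append('MISSING: random seeds list')
    if not results.get('hardware'):
        flags.append('MISSING: hardware info')
    if not (Path(run_dir) / 'requirements.txt').exists():
        flags.append('MISSING: requirements.txt')
    return flags
```

Determinism of preprocessing cannot be checked by file presence alone; rerun it twice and diff the outputs.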

IMMEDIATE FAIL Conditions

Any of these → status = CRITICAL, pipeline halts:

  • Results too good (>5% above ALL baselines simultaneously)

  • Zero variance across multiple runs

  • Missing training logs entirely

  • Metrics that are mathematically impossible (e.g., precision > 1.0)

  • Baseline numbers don't match their original papers
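Four of the five fail conditions are mechanical; the baseline-vs-paper comparison needs a manual lookup and is omitted here. A sketch, again assuming 0-1 metrics so ">5%" means +0.05:

```python
def immediate_fail_reasons(result, baselines, seed_scores, has_training_log, metrics):
    """Return reasons to set status = CRITICAL and halt the pipeline."""
    reasons = []
    if baselines and all(result > b + 0.05 for b in baselines):
        reasons.append('result > 5% above ALL baselines simultaneously')
    if len(seed_scores) > 1 and len(set(seed_scores)) == 1:
        reasons.append('zero variance across multiple runs')
    if not has_training_log:
        reasons.append('missing training logs')
    for name, v in metrics.items():
        if name in ('precision', 'recall', 'accuracy', 'f1') and not 0.0 <= v <= 1.0:
            reasons.append(f'impossible metric: {name} = {v}')
    return reasons
```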

Output Format

Write to workspace/logs/results_audit.json:

    {
      "timestamp": "...",
      "iteration": N,
      "status": "PASS | WARN | CRITICAL",
      "expected_range": [low, high],
      "actual_result": X,
      "plausibility": "PLAUSIBLE | SUSPICIOUS | OUTSIDE_RANGE",
      "statistical_validity": "PASS | FAIL",
      "reproducibility": "CONFIRMED | UNCONFIRMED",
      "red_flags": [],
      "score": X
    }
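A sketch of a writer for this record. The plausibility rule follows the checklist above; the score formula (1 minus 0.2 per red flag) is a made-up placeholder, since this skill does not define how the score is computed:

```python
import json
import time

def write_audit(path, iteration, status, expected_range, actual_result, red_flags):
    """Assemble and write results_audit.json in the schema above."""
    low, high = expected_range
    report = {
        'timestamp': time.strftime('%Y-%m-%dT%H:%M:%S'),
        'iteration': iteration,
        'status': status,
        'expected_range': [low, high],
        'actual_result': actual_result,
        'plausibility': 'PLAUSIBLE' if low <= actual_result <= high else 'OUTSIDE_RANGE',
        'statistical_validity': 'PASS' if status == 'PASS' else 'FAIL',
        'reproducibility': 'UNCONFIRMED',
        'red_flags': red_flags,
        'score': max(0.0, 1.0 - 0.2 * len(red_flags)),  # placeholder formula
    }
    with open(path, 'w') as f:
        json.dump(report, f, indent=2)
    return report
```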
