pdf-brain-ingest

Ingest PDF/Markdown/TXT files into joelclaw's docs memory pipeline with Inngest durability, run monitoring, and OTEL verification. Use when adding docs, backfilling from manifest, reconciling coverage, or recovering stuck docs-ingest runs. Triggers on: 'ingest pdf', 'ingest markdown', 'docs add', 'pdf-brain ingest', 'backfill books', 'docs reconcile'.

Install skill "pdf-brain-ingest" with this command: npx skills add joelhooks/joelclaw/joelhooks-joelclaw-pdf-brain-ingest

PDF Brain Ingest (Joelclaw)

This is the joelclaw-native replacement for pdf-brain + swarm queue operations.

Use joelclaw docs and Inngest events instead of ad hoc queue workers:

  • docs/ingest.requested -> docs-ingest
  • docs/backlog.requested -> batch queueing from manifest
  • docs/backlog.drive.requested -> scheduled backlog driver with queue depth gates
  • docs/ingest.janitor.requested -> stuck-run detection and recovery

Core Workflow

1) Preflight

joelclaw status
joelclaw inngest status
joelclaw docs status

If registration is stale:

joelclaw inngest sync-worker --restart

2) Single File Ingest

joelclaw docs add "/absolute/path/to/file.pdf"
joelclaw docs add "/absolute/path/to/file.md"

Optional metadata:

joelclaw docs add "/absolute/path/to/file.pdf" \
  --title "Readable Title" \
  --tags "manifest,catalog-fill" \
  --category programming

Supported types: pdf, md, txt.
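
A minimal sketch of a pre-check that could run before `joelclaw docs add`, assuming you want to reject unsupported files locally instead of waiting for the run to fail. The function name `is_ingestable` is hypothetical and not part of the joelclaw CLI. The absolute-path requirement mirrors the examples above, and the case-sensitive extension match is an assumption: the source does not say whether relative paths or uppercase extensions are accepted.

```shell
# Hypothetical helper (not part of joelclaw): returns 0 when the path is
# absolute and ends in one of the supported extensions (pdf, md, txt).
is_ingestable() {
  path=$1
  # The examples above always pass an absolute path, so require one here.
  case "$path" in
    /*) ;;
    *) return 1 ;;
  esac
  # Everything after the last dot; if there is no dot this is the whole
  # path, which never matches the list below.
  ext=${path##*.}
  # Assumption: extensions are matched case-sensitively.
  case "$ext" in
    pdf|md|txt) return 0 ;;
    *) return 1 ;;
  esac
}
```

Typical use would be `is_ingestable "$f" && joelclaw docs add "$f"`, so an unsupported file never produces an ingest event.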

3) Bulk Backfill From Manifest

Queue a controlled batch:

joelclaw send docs/backlog.requested -d '{
  "maxEntries": 24,
  "booksOnly": true,
  "onlyMissing": true,
  "includePodcasts": false,
  "idempotencyPrefix": "manual"
}'
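
If the batch size changes between runs, the payload can be generated instead of edited by hand. The sketch below only builds the JSON string. `backlog_payload` is a hypothetical helper name, and the other field values are copied from the example above, not taken from a documented schema.

```shell
# Hypothetical helper: print a docs/backlog.requested payload for a given
# batch size. Only maxEntries varies; the other fields mirror the example.
backlog_payload() {
  max=$1
  # Reject empty input, non-digits, and leading zeros (invalid in JSON).
  case "$max" in
    ''|*[!0-9]*|0?*)
      echo "maxEntries must be a non-negative integer without leading zeros" >&2
      return 1
      ;;
  esac
  printf '{"maxEntries":%s,"booksOnly":true,"onlyMissing":true,"includePodcasts":false,"idempotencyPrefix":"manual"}\n' "$max"
}
```

It could then be used as `joelclaw send docs/backlog.requested -d "$(backlog_payload 12)"`.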

Or let the backlog driver decide whether to queue a batch, based on current queue depth:

joelclaw send docs/backlog.drive.requested -d '{
  "reason": "manual backfill kick",
  "maxEntries": 24,
  "force": false
}'

4) Monitor + Verify

joelclaw runs --count 20 --hours 1
joelclaw run <run-id>
joelclaw docs list --limit 20
joelclaw docs show <doc-id>
joelclaw docs search "your query"
joelclaw docs context <chunk-id> --mode snippet-window

5) Coverage Reconcile

joelclaw docs reconcile --sample 20

Use content_equivalent coverage to detect false-missing churn, that is, docs reported as missing only because of path or category aliasing.

6) OTEL Verification

joelclaw otel search "docs.file.validated" --hours 1
joelclaw otel search "docs.taxonomy.classified" --hours 1
joelclaw otel search "docs.chunks.indexed" --hours 1
joelclaw otel search "docs.path.aliases.updated" --hours 24
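
The four searches above can be wrapped in one function so a verification pass is a single call. This is a sketch only: `verify_docs_otel` is a hypothetical name, and the event names and `--hours` windows are exactly those listed above.

```shell
# Hypothetical wrapper: run the four OTEL searches listed above in order,
# printing a header before each so the output is easy to scan.
verify_docs_otel() {
  for name in docs.file.validated docs.taxonomy.classified docs.chunks.indexed; do
    echo "== $name (last 1h)"
    joelclaw otel search "$name" --hours 1
  done
  # Alias updates use the wider 24-hour window, as in the command above.
  echo "== docs.path.aliases.updated (last 24h)"
  joelclaw otel search "docs.path.aliases.updated" --hours 24
}
```

Call `verify_docs_otel` after an ingest run. A section with no results shows which stage has not emitted telemetry within its window.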

7) Recovery / Maintenance

joelclaw send docs/ingest.janitor.requested -d '{"reason":"manual janitor sweep"}'
joelclaw docs enrich <doc-id>
joelclaw docs reindex --doc <doc-id>
joelclaw docs reindex

Legacy Mapping (Old -> Joelclaw)

  • pdf-brain add <file> --enrich -> joelclaw docs add <absolute-path>
  • pdf-brain ingest <dir> --enrich -> joelclaw send docs/backlog.requested -d '{...}'
  • swarm queue submit pdf-ingest '{"path":"..."}' -> joelclaw docs add <absolute-path>
  • pdf-brain-worker (nice -n10, concurrency 1) -> built into docs-ingest + backlog driver + janitor

Acquisition Handoff (aa-book -> Inngest, end to end)

Use the event workflow so acquisition, inference, download, and docs queueing stay durable:

joelclaw send pipeline/book.download -d '{
  "query": "designing data-intensive applications",
  "format": "pdf",
  "reason": "memory backfill"
}'

Behavior:

  • Runs aa-book search
  • Uses pi inference (the Sonnet 4.6 model alias from the system-bus model registry) to select the MD5 to download
  • Runs aa-book download <md5> <outputDir> --keep-local
  • Attempts a non-fatal NAS backup to /volume1/home/joel/books/<year>/... via SSH/SCP
  • Emits docs/ingest.requested with the local filePath for immediate ingest, plus nasPath when the NAS backup succeeded
  • Emits pipeline/book.downloaded

Optional direct MD5 mode:

joelclaw send pipeline/book.download -d '{
  "md5": "0123456789abcdef0123456789abcdef",
  "outputDir": "/Users/joel/clawd/data/pdf-brain/incoming"
}'
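
Because direct MD5 mode skips the search and inference steps, a mistyped hash is not caught until the download runs. A small local check, sketched here under the assumption that the hash is passed as the usual 32 hexadecimal characters, can catch that earlier. `is_md5` is a hypothetical helper, not a joelclaw command.

```shell
# Hypothetical helper: succeed only if the argument is exactly 32
# hexadecimal characters, the textual form of an MD5 digest.
is_md5() {
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 32 ]              # also rejects empty or short input
}
```

For example, `is_md5 "$md5" || { echo "bad md5" >&2; exit 1; }` before calling `joelclaw send pipeline/book.download`.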

For full operator details and troubleshooting traces, see:

  • references/operator-guide.md
