jb-docs-scraper

Scrape documentation websites into local markdown files for AI context. Takes a base URL, crawls the documentation, and stores the results in ./docs (or a custom path). Uses crawl4ai with BFS deep crawling.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the skill with:

npx skills add bjesuiter/skills/bjesuiter-skills-jb-docs-scraper

Documentation Scraper

Scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.

Quick Start

# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>

# Examples
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind

Output goes to ./docs/<auto-detected-name>/ by default.
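The exact auto-detection rule lives in references/scrape_docs.py; a plausible sketch, assuming the folder name is built by joining the URL's non-empty path segments with underscores (consistent with the documentation_v3 example below):

```python
from urllib.parse import urlparse

def detect_output_name(url: str) -> str:
    """Derive an output folder name from the URL's path segments.

    Hypothetical sketch, not the script's actual implementation:
    "https://mediasoup.org/documentation/v3/" -> "documentation_v3"
    """
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Fall back to the host when the URL has no path
    return "_".join(segments) if segments else parsed.netloc

print(detect_output_name("https://mediasoup.org/documentation/v3/"))  # documentation_v3
```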

Prerequisites (First Time Only)

uv run --with crawl4ai playwright install

Usage

uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]

Options

Option                   Description               Default
-o, --output PATH        Output directory          ./docs/<auto-detected-name>
--max-depth N            Maximum link depth        6
--max-pages N            Maximum pages to scrape   500
--url-pattern PATTERN    URL filter (glob)         Auto-detected
-q, --quiet              Suppress verbose output   False

Examples

# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://mediasoup.org/documentation/v3/

# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://docs.rombo.co/tailwind \
  --output ./my-tailwind-docs

# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://tanstack.com/start/latest/docs/framework/react/overview \
  --max-pages 50 \
  --max-depth 3

# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://example.com/docs/api/v2/ \
  --url-pattern "*api/v2/*"
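Glob filters like the one above behave like Python's fnmatch matching over the full URL; a minimal illustration (the matching inside scrape_docs.py may differ in detail):

```python
from fnmatch import fnmatch

def url_allowed(url: str, pattern: str) -> bool:
    # Glob match over the whole URL string, as with --url-pattern "*api/v2/*"
    return fnmatch(url, pattern)

urls = [
    "https://example.com/docs/api/v2/auth",
    "https://example.com/docs/api/v1/auth",
    "https://example.com/blog/post",
]
kept = [u for u in urls if url_allowed(u, "*api/v2/*")]
print(kept)  # only the api/v2 URL survives
```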

How It Works

  1. Auto-detects domain and URL pattern from the input URL
  2. Crawls using BFS (breadth-first search) strategy
  3. Filters to stay within the documentation section
  4. Converts pages to clean markdown
  5. Saves with directory structure mirroring the URL paths
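The BFS strategy with the --max-depth and --max-pages caps can be sketched in pure Python over a toy link graph (the skill itself delegates this to crawl4ai; function and variable names here are illustrative only):

```python
from collections import deque

def bfs_crawl(start, get_links, max_depth=6, max_pages=500):
    """Breadth-first crawl: visit pages level by level, honoring both caps."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # at the depth limit: record the page, don't expand it
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages
graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
order = bfs_crawl("/", lambda u: graph.get(u, []), max_depth=1, max_pages=10)
print(order)  # ['/', '/a', '/b'] -- '/c' is beyond max_depth
```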

Output Structure

docs/<name>/
  index.md           # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
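One plausible mapping from URL paths to the mirrored files above, assuming the section root becomes index.md and leaf paths get a .md suffix (the real logic is in scrape_docs.py):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def url_to_file(url: str, base_path: str) -> str:
    """Map a page URL to a markdown path mirroring the URL structure.

    Hypothetical sketch: base_path is the documentation section's root path.
    """
    rel = urlparse(url).path[len(base_path):].strip("/")
    if not rel:
        return "index.md"  # the section root page
    return str(PurePosixPath(rel).with_suffix(".md"))

base = "/documentation/v3"
print(url_to_file("https://mediasoup.org/documentation/v3/", base))
print(url_to_file("https://mediasoup.org/documentation/v3/api/overview", base))
```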

Troubleshooting

Issue                                    Solution
Playwright browser binaries are missing  Run: uv run --with crawl4ai playwright install
Empty output                             Check that the URL pattern matches the actual doc URLs; try --url-pattern
Missing pages                            Increase --max-depth or --max-pages
Wrong pages scraped                      Use a stricter --url-pattern

Tips

  1. Test first - Use --max-pages 10 to verify the configuration before a full crawl
  2. Check the output name - The script auto-detects it from the URL path segments
  3. Safe to rerun - Existing files are overwritten and duplicates are skipped

