Scraping Documentation

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "scraping-documentation" with this command: npx skills add memyselfandm/cconami/memyselfandm-cconami-scraping-documentation

Scrape documentation sites and convert to organized markdown files.

Usage

/scraping-documentation <url> [options]

Arguments

URL (required): The base URL of the documentation site to scrape.

Options:

  • --output DIR : Output directory (default: ./docs)

  • --depth N : How many levels to crawl (default: 2)

  • --include PATTERN : Only include URLs matching pattern

  • --exclude PATTERN : Exclude URLs matching pattern

  • --format FORMAT : Output format (markdown|html|both)

  • --index : Generate index file
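
Taken together, these flags could be wired up with argparse roughly as follows (a sketch; the `prog` name and the `--format` default are assumptions not stated above):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring the options listed above."""
    p = argparse.ArgumentParser(prog="scraping-documentation")
    p.add_argument("url", help="Base documentation URL to scrape")
    p.add_argument("--output", default="./docs", metavar="DIR",
                   help="Output directory")
    p.add_argument("--depth", type=int, default=2, metavar="N",
                   help="How many levels to crawl")
    p.add_argument("--include", metavar="PATTERN",
                   help="Only include URLs matching pattern")
    p.add_argument("--exclude", metavar="PATTERN",
                   help="Exclude URLs matching pattern")
    p.add_argument("--format", choices=["markdown", "html", "both"],
                   default="markdown", help="Output format")
    p.add_argument("--index", action="store_true",
                   help="Generate index file")
    return p

args = build_parser().parse_args(
    ["https://docs.example.com", "--depth", "1", "--index"])
print(args.depth, args.index, args.output)  # → 1 True ./docs
```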

Examples

Scrape library documentation

/scraping-documentation https://docs.example.com --output ai_docs/knowledge/example

Limited depth crawl

/scraping-documentation https://api.example.com/docs --depth 1

Include only API reference

/scraping-documentation https://docs.example.com --include "/api/*"

Exclude changelog pages

/scraping-documentation https://docs.example.com --exclude "/changelog/*"

Workflow

Step 1: Discover Pages

```python
# Start from the base URL and breadth-first crawl outward.
# max_pages, max_depth, and the helpers come from the CLI options.
base_url = args.url
discovered = set()
to_crawl = [base_url]

while to_crawl and len(discovered) < max_pages:
    url = to_crawl.pop(0)

    if url in discovered:
        continue

    if not matches_include(url) or matches_exclude(url):
        continue

    # Fetch page
    content = fetch_url(url)

    # Extract links
    links = extract_links(content, base_url)

    # Add to queue (respect depth)
    depth = get_depth(url, base_url)
    if depth < max_depth:
        to_crawl.extend(links)

    discovered.add(url)
```

Step 2: Fetch and Convert

```python
for url in discovered:
    # Fetch content
    html = fetch_url(url)

    # Convert to markdown
    markdown = html_to_markdown(html)

    # Clean up
    markdown = clean_markdown(markdown)

    # Determine output path
    path = url_to_filepath(url, output_dir)

    # Write file
    write_file(path, markdown)
```
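
`url_to_filepath` is where the output layout is decided. A minimal sketch (the mapping rules are assumptions; the skill only requires that each page lands under the output directory):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def url_to_filepath(url: str, output_dir: str) -> str:
    """Map a page URL to a markdown path under output_dir.
    (Sketch: real code would also sanitize query strings
    and characters that are invalid in filenames.)"""
    path = urlparse(url).path.strip("/")
    if not path:
        name = "index.md"          # site root -> index.md
    elif path.endswith((".html", ".htm")):
        name = str(PurePosixPath(path).with_suffix(".md"))
    else:
        name = path + ".md"        # extension-less routes get .md
    return str(PurePosixPath(output_dir) / name)

print(url_to_filepath("https://docs.example.com/api/auth", "docs"))
# → docs/api/auth.md
```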

Step 3: HTML to Markdown Conversion

Handle common documentation patterns:

  • Code blocks with syntax highlighting

  • Tables

  • Admonitions/callouts

  • Navigation (strip)

  • Headers (preserve hierarchy)

  • Links (convert to relative)

  • Images (download and reference locally)
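
For the "links (convert to relative)" rule, a stdlib-only sketch (the handling of page URL versus base URL is one plausible choice, not behavior the skill specifies):

```python
import posixpath
from urllib.parse import urlparse

def to_relative_link(href: str, page_url: str, base_url: str) -> str:
    """Rewrite an in-site absolute link relative to the current page;
    external links are left untouched. (Sketch: real code would also
    map the result onto the .md filenames written in Step 2.)"""
    base = urlparse(base_url)
    target = urlparse(href)
    if target.netloc and target.netloc != base.netloc:
        return href                          # external: keep as-is
    page_dir = posixpath.dirname(urlparse(page_url).path)
    return posixpath.relpath(target.path or "/", start=page_dir or "/")

print(to_relative_link("https://docs.example.com/api/auth",
                       "https://docs.example.com/guides/quickstart",
                       "https://docs.example.com"))  # → ../api/auth
```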

Step 4: Generate Index

```markdown
# {Site Name} Documentation

Scraped from: {base_url}
Date: {timestamp}

## Contents

{for section in sections:}
### {section.title}
{for page in section.pages:}
- [{page.title}]({page.path})
```
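
Rendered in code, the template might look like this (the shape of `sections` is an assumption):

```python
from datetime import date

def render_index(site_name, base_url, sections):
    """Render the index template; `sections` is a list of
    (title, [(page_title, relative_path), ...]) tuples."""
    lines = [f"# {site_name} Documentation", "",
             f"Scraped from: {base_url}",
             f"Date: {date.today().isoformat()}", "",
             "## Contents", ""]
    for title, pages in sections:
        lines.append(f"### {title}")
        for page_title, path in pages:
            lines.append(f"- [{page_title}]({path})")
        lines.append("")
    return "\n".join(lines)

index = render_index(
    "Example", "https://docs.example.com",
    [("API", [("Authentication", "api/authentication.md")])])
print(index.splitlines()[0])  # → # Example Documentation
```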

Step 5: Report

```markdown
# Scraping Complete

## Summary

- Base URL: {base_url}
- Pages scraped: {count}
- Output directory: {output_dir}
- Total size: {size}

## Files Created

{for file in files:}
- {file.path} ({file.size})

## Structure

{directory tree}
```
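
A sketch of the report renderer, with a small helper for human-readable sizes (the exact `{size}` formatting is an assumption):

```python
def format_size(n: float) -> str:
    """Human-readable byte count for the report."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n:.0f} {unit}" if unit == "B" else f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"

def render_report(base_url, files, output_dir):
    """`files` is a list of (path, size_in_bytes) pairs -- an
    assumption about the bookkeeping done during Step 2."""
    total = sum(size for _, size in files)
    lines = ["# Scraping Complete", "", "## Summary", "",
             f"- Base URL: {base_url}",
             f"- Pages scraped: {len(files)}",
             f"- Output directory: {output_dir}",
             f"- Total size: {format_size(total)}", "",
             "## Files Created", ""]
    lines += [f"- {path} ({format_size(size)})" for path, size in files]
    return "\n".join(lines)

print(format_size(2048))  # → 2.0 KB
```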

Next Steps

Add to CLAUDE.md:

## Documentation
@{output_dir}/index.md

## Output Structure

```
ai_docs/knowledge/example/
├── index.md              # Table of contents
├── getting-started.md    # Converted pages
├── api/
│   ├── index.md
│   ├── authentication.md
│   └── endpoints.md
├── guides/
│   ├── index.md
│   └── quickstart.md
└── _assets/              # Downloaded images
    └── diagram.png
```

## Conversion Rules

### Code Blocks
```html
<pre><code class="language-python">print("hello")</code></pre>
```

→

```python
print("hello")
```
Tables

HTML tables → Markdown tables

Callouts

```html
<div class="warning">Important note</div>
```

→

⚠️ Warning: Important note

Navigation

Strip navigation, sidebars, footers - keep content only.

Error Handling

| Issue | Action |
| --- | --- |
| 404 page | Skip and log |
| Rate limited | Back off and retry |
| Login required | Report and skip |
| JavaScript rendered | Warn (content may be incomplete) |
| Large file | Skip with warning |
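
The rate-limit row maps naturally to exponential backoff. A sketch where `fetch` is an injected callable returning `(status, body)` (an assumption about the fetch layer, not a real library API):

```python
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    """Retry on HTTP 429, doubling the delay between attempts."""
    delay = base_delay
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status != 429:               # not rate limited: done
            return status, body
        if attempt < retries:
            time.sleep(delay)
            delay *= 2                  # exponential backoff
    return status, body                 # give up after final retry

# Usage with a fake fetch that succeeds on the third call:
calls = []
def fake_fetch(url):
    calls.append(url)
    return (429, "") if len(calls) < 3 else (200, "ok")

print(fetch_with_backoff(fake_fetch, "https://docs.example.com",
                         base_delay=0))  # → (200, 'ok')
```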

Best Practices

  • Respect robots.txt - Check before scraping

  • Rate limiting - Don't overload servers

  • Attribution - Keep source URL in files

  • Updates - Re-run periodically to update

  • Selection - Use include/exclude to get relevant content only
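
For the robots.txt check, the stdlib `urllib.robotparser` is enough. In this sketch the robots.txt content is passed in pre-fetched so the example stays offline (the user-agent string is an assumption):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "docs-scraper") -> bool:
    """Check a URL against already-fetched robots.txt content.
    (In practice, fetch {base_url}/robots.txt before crawling.)"""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

robots = "User-agent: *\nDisallow: /private/\n"
print(allowed(robots, "https://docs.example.com/api/auth"))   # → True
print(allowed(robots, "https://docs.example.com/private/x"))  # → False
```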
