web-content-extraction

Extract documentation and content from websites. Supports Mintlify, Starlight/Astro, Docusaurus, GitBook, ReadTheDocs, Sphinx, and generic sites. Uses a tiered approach - try the simplest method first (direct curl, Jina AI Reader) before falling back to Crawl4AI for JS-heavy sites.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "web-content-extraction" with this command: npx skills add nikhilmaddirala/gtd-cc/nikhilmaddirala-gtd-cc-web-content-extraction

Content extraction

Overview

This skill extracts documentation and website content as markdown files. It uses a tiered approach: try the simplest method first, fall back to heavier tools only when needed.

Context

User provides a URL to extract content from. This skill is appropriate when:

  • Extracting documentation sites (API docs, tutorials, reference guides)
  • Crawling entire websites to markdown for offline reference
  • Archiving websites or generating sitemaps
  • Converting multi-page documentation to organized markdown

Process

  1. Discover all pages (llms.txt, sitemap XML, nav link extraction, or progressive crawling)
  2. Detect platform to choose extraction method
  3. Batch extract content using the simplest working method
  4. Organize output and verify results

Page discovery (ordered by preference)

Before extracting content, discover all pages on the site. Try these methods in order:

Method 1: llms.txt

Some documentation platforms (notably Mintlify) serve an llms.txt file listing all pages:

# Check for llms.txt
curl -sL "https://example.com/docs/llms.txt" -o /tmp/llms.txt
head -20 /tmp/llms.txt

# Extract all URLs
grep -oE 'https://[^\s]+\.md' /tmp/llms.txt

This is the most reliable method when available. It gives you every page URL immediately.

Method 2: sitemap XML

Most documentation sites have a sitemap:

# Check common sitemap locations
curl -sL "https://example.com/sitemap.xml" | head -20
curl -sL "https://example.com/sitemap-index.xml" | head -20
curl -sL "https://example.com/docs/sitemap-index.xml" | head -20
curl -sL "https://example.com/docs/sitemap-0.xml" | head -20

# Extract all URLs from sitemap
curl -sL "https://example.com/sitemap-0.xml" | grep -oE 'https://[^<]+' | sort

Starlight/Astro sites reliably have sitemaps. Look for <link rel="sitemap"> in the HTML source.

Method 3: Jina AI Reader nav extraction

Use Jina to render the page and extract sidebar/nav links:

curl -sL "https://r.jina.ai/https://example.com/docs/" -o /tmp/nav.md
grep -oE '/docs/[a-z0-9/-]+' /tmp/nav.md | sort -u

Method 4: progressive crawling (Crawl4AI)

For sites where the above methods fail, use Crawl4AI to discover pages by following links. See the "Advanced: Crawl4AI" section below.

Platform detection and extraction

Mintlify sites (Anthropic, Stripe, many API docs)

Mintlify serves raw markdown when you append .md to any page URL. This is the ideal case: zero dependencies, clean output, no nav/footer noise.

Detection:

  • llms.txt exists
  • Appending .md to a URL returns markdown (not HTML)

Extraction:

# Test: does the .md URL return markdown?
curl -sL "https://example.com/docs/en/overview.md" | head -5

# If it starts with markdown (# heading, etc.), use direct curl for all pages:
OUT="./output"
mkdir -p "$OUT"

pages=(overview quickstart setup cli-reference)  # from llms.txt

for page in "${pages[@]}"; do
  curl -sL "https://example.com/docs/en/${page}.md" -o "${OUT}/${page}.md" &
done
wait

# Verify
ls -1 "$OUT" | wc -l
find "$OUT" -name "*.md" -size 0  # check for empty files
du -sh "$OUT"

Starlight/Astro sites (OpenCode, many OSS projects)

Starlight renders HTML server-side. Use Jina AI Reader to convert to markdown.

Detection:

  • HTML contains Starlight or astro in meta tags / generator
  • Has sitemap-index.xml

Extraction:

# Discover pages via sitemap
curl -sL "https://example.com/docs/sitemap-0.xml" | grep -oE 'https://[^<]+' > /tmp/urls.txt

OUT="./output"
mkdir -p "$OUT"

# Batch download via Jina (rate limit: max 5 concurrent)
while read url; do
  slug=$(echo "$url" | sed 's|.*/docs/||; s|/$||; s|/|-|g')
  [ -z "$slug" ] && slug="index"
  curl -sL "https://r.jina.ai/${url}" -o "${OUT}/${slug}.md" &

  running=$(jobs -r | wc -l)
  if [ "$running" -ge 5 ]; then
    wait -n
  fi
done < /tmp/urls.txt
wait

Note: Jina output includes nav/sidebar noise. For reference docs this is acceptable. For cleaner output, use Crawl4AI with CSS selectors.

Docusaurus, GitBook, ReadTheDocs, Sphinx

These platforms render HTML. Use Jina as the first attempt; fall back to Crawl4AI if Jina output is too noisy.

Platform-specific CSS selectors (for Crawl4AI fallback):

PlatformContent selectorExclude
Docusaurusarticle, .markdown, .theme-doc-markdownnav, footer, .pagination, .table-of-contents
GitBook.gitbook-root, .page-body, .theme-docnav, header, .sidebar, .navigation
ReadTheDocs.document, .role-content, .bd-contentnav, .sidebar, .toctree, .related-topics
Sphinx.document, .body, .sectionnav, .related, .sphinxsidebar, .toctree-wrapper

Generic sites

Try methods in this order:

  1. Check for llms.txt or .md URL suffix
  2. Check for sitemap XML
  3. Use Jina AI Reader
  4. Fall back to Crawl4AI

Complete workflow examples

Example 1: Mintlify docs (simplest case)

# 1. Discover pages
curl -sL "https://code.claude.com/docs/llms.txt" -o /tmp/llms.txt
grep -oE 'https://[^\s]+\.md' /tmp/llms.txt > /tmp/urls.txt

# 2. Extract page names
sed 's|.*/en/||; s|\.md||' /tmp/urls.txt > /tmp/pages.txt

# 3. Batch download
OUT="./claude-code-docs"
mkdir -p "$OUT"

while read page; do
  curl -sL "https://code.claude.com/docs/en/${page}.md" -o "${OUT}/${page}.md" &
done < /tmp/pages.txt
wait

# 4. Verify
echo "$(ls -1 "$OUT"/*.md | wc -l) files, $(du -sh "$OUT" | cut -f1)"
find "$OUT" -name "*.md" -size 0 -exec echo "EMPTY: {}" \;

Example 2: Starlight/Astro docs (Jina approach)

# 1. Discover pages via sitemap
curl -sL "https://opencode.ai/docs/sitemap-index.xml"  # find sitemap URL
curl -sL "https://opencode.ai/docs/sitemap-0.xml" | grep -oE 'https://[^<]+' > /tmp/urls.txt

# 2. Batch download via Jina (throttled)
OUT="./opencode-docs"
mkdir -p "$OUT"

while read url; do
  slug=$(echo "$url" | sed 's|.*/docs/||; s|/$||; s|/|-|g')
  [ -z "$slug" ] && slug="index"
  curl -sL "https://r.jina.ai/${url}" -o "${OUT}/${slug}.md" &
  running=$(jobs -r | wc -l)
  [ "$running" -ge 5 ] && wait -n
done < /tmp/urls.txt
wait

# 3. Verify
echo "$(ls -1 "$OUT"/*.md | wc -l) files, $(du -sh "$OUT" | cut -f1)"

Example 3: full site crawl to markdown (Crawl4AI)

For sites that need JavaScript rendering or where simpler methods fail:

# Quick CLI approach
uvx crawl4ai crawl \
  --url "https://example.com" \
  --output-dir "output/example-com-$(date +%Y%m%d-%H%M%S)" \
  --max-depth 3 \
  --format markdown

Python implementation for more control:

import asyncio
import os
from pathlib import Path
from datetime import datetime
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from urllib.parse import urljoin, urlparse

async def crawl_to_markdown(base_url, output_dir, max_depth=3):
    """Crawl entire website and save as markdown files"""

    base_domain = urlparse(base_url).netloc
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    crawl_dir = Path(output_dir) / f"{base_domain}-{timestamp}"
    crawl_dir.mkdir(parents=True, exist_ok=True)

    visited = set()
    queue = [(base_url, 0)]

    async with AsyncWebCrawler() as crawler:
        while queue and len(visited) < 100:
            url, depth = queue.pop(0)

            if url in visited or depth > max_depth:
                continue

            try:
                result = await crawler.arun(
                    url,
                    config=CrawlerRunConfig(
                        page_timeout=30000,
                        remove_overlay_elements=True
                    )
                )

                if result.success:
                    path = urlparse(url).path
                    if path.endswith('/') or path == '':
                        path = path + 'index.md'
                    else:
                        path = path + '.md' if not path.endswith('.md') else path

                    output_file = crawl_dir / path.lstrip('/')
                    output_file.parent.mkdir(parents=True, exist_ok=True)

                    with open(output_file, 'w', encoding='utf-8') as f:
                        f.write(f"# {result.metadata.get('title', 'Page')}\n\n")
                        f.write(str(result.markdown))

                    visited.add(url)

                    for link_info in result.links.get("internal", []):
                        href = link_info.get("href", "")
                        absolute_url = urljoin(base_url, href)
                        if (absolute_url not in visited and
                            not absolute_url.startswith('#') and
                            absolute_url.startswith(base_url)):
                            queue.append((absolute_url, depth + 1))

            except Exception as e:
                print(f"Failed: {url}: {e}")

    print(f"Done: {len(visited)} pages saved to {crawl_dir}")
    return crawl_dir

Advanced: Crawl4AI patterns

Universal documentation extraction

For sites where simple methods fail, use Crawl4AI with platform-aware selectors:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_documentation(url):
    """Universal documentation extractor with platform detection"""

    content_selectors = [
        "article", ".markdown", ".theme-doc-markdown",
        ".gitbook-root", ".document", ".bd-content",
        ".page-body", ".role-content", ".section",
        ".main-content", "[role='main']"
    ]

    config = CrawlerRunConfig(
        css_selector=", ".join(content_selectors),
        wait_for="css:article, .markdown, .document, [role='main']",
        remove_overlay_elements=True,
        exclude_tags=[
            "nav", "header", "footer",
            ".sidebar", ".navigation", ".menu",
            ".table-of-contents", ".toc",
            ".pagination", ".breadcrumbs",
            "script", "style", "noscript"
        ],
        page_timeout=45000
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            return str(result.markdown)

Relevance-based filtering

from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def extract_relevant_docs(url, focus_area):
    """Extract documentation focused on specific topics"""

    bm25_filter = BM25ContentFilter(
        user_query=focus_area,
        bm25_threshold=1.2,
        include_tables=True,
        include_code=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=bm25_filter,
        options={
            "ignore_links": False,
            "ignore_images": False,
            "code_block_format": "fenced"
        }
    )

    config = CrawlerRunConfig(
        css_selector="article, .document, .markdown",
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        return str(result.markdown.fit_markdown)

Batch extraction with concurrency

async def extract_complete_docs(urls, max_concurrent=3):
    """Extract multiple pages concurrently"""

    config = CrawlerRunConfig(
        css_selector="article, .document, .markdown",
        remove_overlay_elements=True,
        page_timeout=30000
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=config,
            max_concurrent=max_concurrent
        )

        extracted = {}
        for result in results:
            if result.success:
                extracted[result.url] = str(result.markdown)

        return extracted

API documentation extraction

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "api_endpoints",
    "baseSelector": ".api-endpoint, .method, .endpoint",
    "fields": [
        {"name": "method", "selector": ".http-method, .verb", "type": "text"},
        {"name": "path", "selector": ".path, .route", "type": "text"},
        {"name": "description", "selector": ".description", "type": "text"},
        {"name": "parameters", "selector": ".parameters", "type": "text"},
        {"name": "example", "selector": ".example, .code-example", "type": "text"}
    ]
}

async def extract_api_docs(url):
    extraction_strategy = JsonCssExtractionStrategy(schema=schema)
    config = CrawlerRunConfig(
        css_selector=".api-reference, .endpoints, .methods",
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        return result.extracted_content

Advanced: intelligent crawling

For sites that require link-following discovery (no sitemap, no llms.txt):

class IntelligentCrawler:
    def __init__(self, base_url, max_depth=5, max_pages=100):
        self.base_url = base_url
        self.base_domain = urlparse(base_url).netloc
        self.max_depth = max_depth
        self.max_pages = max_pages
        self.visited = set()
        self.queue = [(base_url, 0)]

    async def crawl(self):
        """Crawl with intelligent depth control and link prioritization"""

        async with AsyncWebCrawler() as crawler:
            while self.queue and len(self.visited) < self.max_pages:
                url, depth = self.queue.pop(0)

                if url in self.visited or depth > self.max_depth:
                    continue

                config = CrawlerRunConfig(
                    page_timeout=30000,
                    remove_overlay_elements=True,
                    exclude_tags=["script", "style", "nav", "footer"]
                )

                try:
                    result = await crawler.arun(url, config=config)
                    if result.success:
                        self.visited.add(url)
                        yield url, str(result.markdown), result.metadata

                        for link_info in result.links.get("internal", []):
                            href = link_info.get("href", "")
                            if self._is_valid(href):
                                absolute = urljoin(self.base_url, href)
                                if absolute not in self.visited:
                                    priority = self._priority(href, link_info)
                                    self.queue.append((absolute, depth + 1))

                except Exception as e:
                    print(f"Failed: {url}: {e}")

                await asyncio.sleep(1)

    def _is_valid(self, href):
        if href.startswith(('#', 'mailto:', 'tel:', 'javascript:')):
            return False
        skip = ['.pdf', '.jpg', '.png', '.gif', '.zip', '.exe']
        return not any(href.lower().endswith(ext) for ext in skip)

    def _priority(self, href, link_info):
        text = (href + link_info.get("text", "")).lower()
        if any(kw in text for kw in ["docs", "guide", "tutorial", "api"]):
            return 10
        return 5

Sitemap generation

After crawling, generate a sitemap for navigation:

def create_markdown_sitemap(pages, base_url):
    """Generate markdown sitemap from crawled pages"""
    lines = [f"# Sitemap for {base_url}\n"]
    for url, title in sorted(pages.items()):
        depth = url.replace(base_url, '').count('/')
        indent = "  " * depth
        lines.append(f"{indent}- [{title or url}]({url})")
    return "\n".join(lines)

Guidelines

  • Always try the simplest extraction method first (direct curl > Jina > Crawl4AI)
  • Check for llms.txt and sitemap XML before crawling
  • Use parallel curl with & + wait for batch downloads (no tokens wasted in agent context)
  • Throttle Jina requests to max 5 concurrent
  • For Crawl4AI, use arun_many() with max_concurrent=3 for multi-page docs
  • Add delays between requests when crawling large sites
  • Always verify downloads: check file count, empty files, and total size
  • Maintain heading hierarchy and code block formatting

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

tools-catppuccin

No summary provided by upstream source.

Repository SourceNeeds Review
General

productivity-todoist

No summary provided by upstream source.

Repository SourceNeeds Review
General

ai-chats

No summary provided by upstream source.

Repository SourceNeeds Review
General

ai-orchestrator

No summary provided by upstream source.

Repository SourceNeeds Review
web-content-extraction | V50.AI