# Sitemap Content Scraper
Use this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.
## Workflow
- Ask for the website or URL scope if it is not already provided.
- Run `python3 {baseDir}/scripts/discover_sitemaps.py <site-or-url>`.
- Summarize the discovered sitemap inventory in plain language.
- If the user gave a scoped URL (for example `https://example.com/docs`), use `scope_hint_substring` from the discovery output as the default filter guidance.
- Ask which content family the user wants, such as documentation, knowledge base, blog, academy, changelog, or another category.
- Map the user request to the most relevant sitemap by name and sample URL patterns.
- If multiple sitemaps still match, ask the user to choose one or give a tighter scope.
- Ask for the destination folder if it is missing.
- Run `python3 {baseDir}/scripts/scrape_sitemap.py --sitemap-url <chosen-sitemap> --output-dir <destination>`; when a scoped URL was provided, add `--include-substring <scope_hint_substring>` unless the user overrides the scope.
- Report what was scraped, where it was saved, and any skipped or failed pages.
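The scope hint in the workflow above can be derived from the URL path alone. A minimal sketch of that derivation — the real `discover_sitemaps.py` may compute it differently, and `scope_hint_substring` here is a stand-in function name:

```python
from urllib.parse import urlparse

def scope_hint_substring(url: str):
    """Derive a default --include-substring filter from a scoped URL.

    Returns the leading path segment wrapped in slashes (e.g. "/docs/"),
    or None when the URL points at the site root.
    """
    path = urlparse(url).path.strip("/")
    if not path:
        return None
    # Keep only the first segment so a deep link still widens to its section.
    return f"/{path.split('/')[0]}/"

print(scope_hint_substring("https://example.com/docs/getting-started"))  # /docs/
print(scope_hint_substring("https://example.com"))  # None
```

With this behavior, a prompt like "scrape `https://example.com/docs`" naturally maps to the `--include-substring /docs/` default mentioned above.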
## Quick Commands
Discover sitemap inventory:
```shell
python3 {baseDir}/scripts/discover_sitemaps.py https://example.com
```
Discover and preserve scope hint from a direct URL prompt:
```shell
python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs
```
Scrape one sitemap into a chosen folder:
```shell
python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/docs-sitemap.xml \
  --output-dir /tmp/example-docs
```
Filter to a subset of URLs when the sitemap mixes sections:
```shell
python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/sitemap.xml \
  --output-dir /tmp/example-docs \
  --include-substring /docs/ \
  --exclude-substring /tag/
```
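The substring flags behave as plain containment checks on each URL, as their names suggest. A small sketch of that semantics — illustrative only, not the bundled script's actual code:

```python
def filter_urls(urls, include=None, exclude=None):
    """Keep a URL only if it contains the include substring (when given)
    and does not contain the exclude substring (when given)."""
    kept = []
    for url in urls:
        if include and include not in url:
            continue
        if exclude and exclude in url:
            continue
        kept.append(url)
    return kept

urls = [
    "https://example.com/docs/install",
    "https://example.com/docs/tag/v1",
    "https://example.com/blog/hello",
]
print(filter_urls(urls, include="/docs/", exclude="/tag/"))
# ['https://example.com/docs/install']
```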
## Selection Rules
- Prefer sitemaps explicitly named for the requested content family, such as `docs-sitemap.xml`, `post-sitemap.xml`, `kb-sitemap.xml`, or `academy-sitemap.xml`.
- Use the sample URLs returned by `discover_sitemaps.py` to explain why a sitemap looks like docs, blog, help center, or another category.
- If the request is broad, offer the discovered choices instead of scraping everything by default.
- If no sitemap exists, stop and ask whether the user wants a bounded crawl workflow instead. Do not silently switch strategies.
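The name-based preference above can be sketched as a keyword lookup. The keyword table here is hypothetical — real sitemap names vary by site and CMS, which is why ambiguous names fall back to sample-URL inspection or a question to the user:

```python
# Hypothetical keyword table; real sitemap names vary per site and CMS.
FAMILY_KEYWORDS = {
    "docs": ("docs", "documentation"),
    "blog": ("post", "blog", "news"),
    "knowledge base": ("kb", "help", "support"),
    "academy": ("academy", "course", "learn"),
    "changelog": ("changelog", "release"),
}

def guess_family(sitemap_url: str):
    """Guess a content family from the sitemap filename, or return None
    when the name alone is not conclusive."""
    name = sitemap_url.rsplit("/", 1)[-1].lower()
    for family, keywords in FAMILY_KEYWORDS.items():
        if any(k in name for k in keywords):
            return family
    return None  # ambiguous: inspect sample URLs or ask the user

print(guess_family("https://example.com/docs-sitemap.xml"))  # docs
print(guess_family("https://example.com/post-sitemap.xml"))  # blog
print(guess_family("https://example.com/sitemap.xml"))       # None
```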
## Output Contract
- Save one Markdown file per scraped page.
- Save `manifest.json` at the output root with success and failure details.
- Keep source URLs in the Markdown header so the corpus remains traceable.
- Preserve a stable folder structure derived from the source URL path.
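The contract above can be sketched in a few lines: mirror the URL path under the output directory and lead each file with a front-matter header carrying the source URL. This is a sketch of the layout, not the bundled scraper's actual code, and the front-matter field names are assumptions:

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

def output_path(url: str, output_dir: str) -> Path:
    """Map a page URL to a stable Markdown path that mirrors the URL path."""
    path = urlparse(url).path.strip("/") or "index"
    return Path(output_dir) / PurePosixPath(path).with_suffix(".md")

def markdown_header(url: str, title: str) -> str:
    """Front matter keeping the source URL so the corpus stays traceable."""
    return f"---\nsource_url: {url}\ntitle: {title}\n---\n"

print(output_path("https://example.com/docs/getting-started", "/tmp/example-docs"))
# /tmp/example-docs/docs/getting-started.md
```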
Read `{baseDir}/references/sitemap-selection.md` when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.
## Trigger Examples
- "Scrape `example.com/docs` content into `./out/docs`."
- "Pull the help center pages from `https://example.com/help`."
- "Find blog sitemaps for `example.com` and scrape only posts."
## Guardrails
- Scrape only public content.
- Accept only `http` and `https` targets.
- Reject `localhost`, private IP ranges, and internal-only hostnames.
- Enforce public-only targets using both hostname resolution checks and redirect-target checks at request time.
- Respect the chosen sitemap scope instead of broad site crawling.
- Avoid login flows, private dashboards, carts, checkout paths, or user-specific pages.
- Do not use authentication headers, cookies, or tokens.
- Ask before writing outside the intended working area.
- Tell the user when extraction quality looks weak on JavaScript-heavy pages. The bundled scraper is HTML-first and may miss client-rendered content.
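The scheme, hostname, and resolution guardrails above can be sketched as a single pre-flight check. This is a minimal illustration, not the bundled scripts' actual implementation; a real scraper must repeat the same check on every redirect target, as the guardrails require:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_public_http_target(url: str) -> bool:
    """Pre-flight guardrail: accept only http(s) URLs whose host is neither
    an internal-style name nor resolves to a non-global address."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    # Reject obvious internal-only names before touching DNS.
    if host == "localhost" or host.endswith((".local", ".internal")):
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected, not retried
    # Every resolved address must be globally routable (rejects loopback,
    # private ranges, and link-local addresses).
    return all(ipaddress.ip_address(info[4][0]).is_global for info in infos)
```

For example, `is_public_http_target("http://127.0.0.1/")` and `is_public_http_target("ftp://example.com/")` both return `False`, while a URL resolving only to public addresses passes.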