crawl-site

Use this skill when the user asks to "crawl a site", "download a website", "mirror a site", "scrape a site", or wants to download web pages for offline access or analysis. Configures Crawlio settings based on site type, starts the crawl, monitors progress, and reports results.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "crawl-site" with this command: npx skills add crawlio-app/crawlio-plugin/crawlio-app-crawlio-plugin-crawl-site

crawl-site

Crawl a website using Crawlio. Configures settings based on site type, starts the crawl, monitors progress, and reports results.

When to Use

Use this skill when the user wants to download, mirror, or crawl a website for offline access, analysis, or archival.

Workflow

1. Determine Site Type

Before configuring settings, identify the site type. Ask the user or infer from context:

Site TypeIndicatorsRecommended Settings
Static siteHTML/CSS, no JS frameworksmaxDepth: 5, maxConcurrent: 8
SPA (React, Vue, etc.)JS-heavy, client-side routingmaxDepth: 3, includeSupportingFiles: true, consider using crawlio-agent for enrichment first
CMS (WordPress, etc.)/wp-content/, admin pathsmaxDepth: 5, excludePatterns: ["/wp-admin/*", "/wp-json/*"]
Documentation site/docs/, versioned pathsmaxDepth: 10, excludePatterns: ["/v[0-9]*/*"] for old versions
Single page snapshotUser wants just one pagemaxDepth: 0, includeSupportingFiles: true

2. Configure Settings

Use update_settings to set appropriate configuration:

update_settings({
  settings: {
    maxConcurrent: 4,      // Parallel downloads (increase for large sites)
    crawlDelay: 0.5,       // Be polite — seconds between requests
    timeout: 60,           // Request timeout
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,  // Get CDN assets
    autoUpgradeHTTP: true             // Use HTTPS
  }
})

3. Start the Crawl

start_crawl({ url: "https://example.com" })

For multi-page targeted downloads:

start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] })

4. Monitor Progress

Poll get_crawl_status with the sequence number for efficient change detection:

get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status

5. Check for Issues

After crawl completes:

get_failed_urls()     // Any failures to retry?
get_errors()          // Any engine errors?
get_site_tree()       // What was downloaded?

6. Retry Failures (if any)

recrawl_urls({ urls: ["https://example.com/failed-page"] })

7. Report Results

Summarize: pages downloaded, failures, site structure, any notable findings.

Tips

  • For large sites (1000+ pages), set maxPagesPerCrawl to avoid runaway crawls
  • Use excludePatterns to skip known junk paths (admin panels, API routes, search results)
  • If a site requires authentication, set customCookies or customHeaders in settings
  • For SPA sites, combine with the crawlio-agent Chrome extension for framework detection and JavaScript-rendered content

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Security

audit-site

No summary provided by upstream source.

Repository SourceNeeds Review
General

observe

No summary provided by upstream source.

Repository SourceNeeds Review
General

finding

No summary provided by upstream source.

Repository SourceNeeds Review