website-to-vite-scraper

Multi-provider website scraper that converts any website (including CSR/SPA) to deployable static sites. Uses Playwright, Apify RAG Browser, Crawl4AI, and Firecrawl for comprehensive scraping. Triggers on requests to clone, reverse-engineer, or convert websites.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "website-to-vite-scraper" with this command: npx skills add breverdbidder/life-os/breverdbidder-life-os-website-to-vite-scraper

Website-to-Vite Scraper V2

Multi-provider website scraper with AI-powered extraction for any website type.

Scraping Methods

| Method | Best For | Anti-Bot | JS Rendering | Cost |
| --- | --- | --- | --- | --- |
| Playwright | General sites, Next.js/React apps | | ✅ Full | FREE |
| Apify RAG Browser | LLM/RAG-optimized content | | ✅ Adaptive | Credits |
| Crawl4AI | AI training data, clean extraction | | | Credits |
| Firecrawl | Protected sites, anti-bot bypass | ✅ | ✅ | $16/mo |

Quick Start

GitHub Actions (Recommended)

# Go to: Actions → Website Scraper V2 → Run workflow
# Options:
#   - URL: https://www.reventure.app/
#   - Project name: reventure-clone
#   - Method: all (tries all providers)
#   - Deploy: true

API MEGA LIBRARY Integration

The following APIs from our library enhance this scraper:

| API | Purpose | Status |
| --- | --- | --- |
| APIFY_API_TOKEN | RAG Browser, Crawl4AI, Web Scraper | ✅ Configured |
| FIRECRAWL_API_KEY | Anti-bot bypass, stealth mode | ✅ Configured |
| BROWSERLESS_API_KEY | Alternative headless browser | 🔄 Available |

MCP Server Integration

Connect Claude Desktop/Cursor to Apify MCP for AI-powered scraping:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "your-apify-api-token"
      }
    }
  }
}

Or use the hosted server: https://mcp.apify.com?token=YOUR_TOKEN

Apify Actors Used

apify/rag-web-browser

  • Purpose: LLM-optimized web content extraction
  • Output: Markdown, HTML, text
  • Features:
    • Playwright adaptive (handles JS)
    • Clean content extraction
    • Link following
    • Metadata extraction

raizen/ai-web-scraper (Crawl4AI)

  • Purpose: AI training data collection
  • Output: Cleaned markdown, structured links
  • Features:
    • Excludes boilerplate (headers, footers, nav)
    • Word count thresholding
    • External link filtering

Firecrawl

  • Purpose: Anti-bot protected sites
  • Output: Markdown, HTML, screenshots
  • Features:
    • Anti-detection technology
    • JavaScript rendering
    • Main content extraction
    • 5-second wait for dynamic content
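
The Firecrawl features listed above (main content extraction, Markdown/HTML/screenshot output, a 5-second wait) could be requested roughly as follows. This is a minimal sketch, not the skill's actual script: the endpoint and field names (`formats`, `onlyMainContent`, `waitFor`) reflect my understanding of Firecrawl's v1 scrape API and should be checked against the current Firecrawl documentation, and `build_firecrawl_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: Firecrawl's v1 scrape endpoint; verify against current docs.
FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"


def build_firecrawl_request(url: str, api_key: str, wait_ms: int = 5000) -> urllib.request.Request:
    """Build (but do not send) a scrape request for one page."""
    body = {
        "url": url,
        # Output types named in the feature list above.
        "formats": ["markdown", "html", "screenshot"],
        # Ask for main content only, dropping navigation and footers.
        "onlyMainContent": True,
        # Wait for dynamic content before capturing (milliseconds).
        "waitFor": wait_ms,
    }
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` with the key read from `FIRECRAWL_API_KEY`.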

Output Structure

project-name/
├── dist/
│   ├── index.html      # Best merged HTML
│   ├── screenshot.png  # Full page capture
│   ├── meta.json       # Scrape metadata
│   └── assets/
│       ├── images/     # Downloaded images
│       ├── css/        # Stylesheets
│       └── js/         # Scripts
└── results/
    ├── playwright/     # Raw Playwright output
    ├── apify-rag/      # RAG Browser output
    ├── crawl4ai/       # Crawl4AI output
    └── firecrawl/      # Firecrawl output
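
To make the layout above concrete, here is a small sketch that creates the same directory skeleton. The folder names come from the tree; the function name `create_layout` and the constants are hypothetical, and the real workflow may build these folders differently.

```python
from pathlib import Path

# Folder names taken from the output tree above.
PROVIDERS = ("playwright", "apify-rag", "crawl4ai", "firecrawl")
ASSET_KINDS = ("images", "css", "js")


def create_layout(root: Path, project_name: str) -> Path:
    """Create the dist/ and results/ skeleton and return the project directory."""
    project = root / project_name
    for kind in ASSET_KINDS:
        (project / "dist" / "assets" / kind).mkdir(parents=True, exist_ok=True)
    for provider in PROVIDERS:
        (project / "results" / provider).mkdir(parents=True, exist_ok=True)
    return project
```

Calling `create_layout(Path("."), "reventure-clone")` would produce an empty `reventure-clone/` tree ready to receive `index.html`, `screenshot.png`, and `meta.json`.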

Handling CSR/SPA Sites

Sites built with client-side frameworks such as Next.js, React, or Vue require JavaScript execution before their content appears in the HTML:

  1. Playwright waits for networkidle + 5 seconds
  2. Apify RAG uses adaptive crawler (Playwright when needed)
  3. Firecrawl has built-in JS rendering
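
The Playwright step in the list above (wait for networkidle, then 5 more seconds) might look like the following with Playwright's Python sync API. This is an illustrative sketch, assuming Chromium and a hypothetical function name `render_page`; the skill's own script may differ.

```python
def render_page(url: str, extra_wait_ms: int = 5000) -> str:
    """Return the rendered HTML of `url` after the network goes idle."""
    # Imported lazily so this module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until there are no in-flight network requests.
            page.goto(url, wait_until="networkidle")
            # Extra fixed delay for late client-side rendering.
            page.wait_for_timeout(extra_wait_ms)
            return page.content()
        finally:
            browser.close()
```

Raising `extra_wait_ms` is the simplest way to give slow SPAs more time to hydrate.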

For __NEXT_DATA__ extraction (Next.js sites):

  • Playwright automatically extracts and saves to next_data.json
  • Can be parsed to reconstruct static pages
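
As an illustration of what parsing `__NEXT_DATA__` involves, the sketch below pulls the JSON payload out of already-rendered HTML using only the standard library. It is a stand-in for whatever the Playwright script does internally; `NextDataParser` and `extract_next_data` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text of the <script id="__NEXT_DATA__"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def payload(self) -> str:
        return "".join(self._chunks)


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ JSON, or None if the page has none."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    return json.loads(parser.payload) if parser.payload else None
```

The returned dictionary can be written to `next_data.json` with `json.dump` and inspected for page props when reconstructing static pages.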

Workflow Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Website URL to scrape |
| project_name | string | required | Output folder/Cloudflare project name |
| scrape_method | choice | playwright | Method to use |
| extract_assets | boolean | true | Download images/CSS/JS |
| deploy_cloudflare | boolean | true | Deploy to Cloudflare Pages |
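
These parameters can also be supplied programmatically through GitHub's workflow dispatch REST endpoint. The sketch below only builds the request; the input names and defaults come from the table above, while the owner, repository, workflow file name, and `ref` default of `main` are placeholders you would need to replace.

```python
import json
import urllib.request


def build_dispatch_payload(url, project_name, scrape_method="playwright",
                           extract_assets=True, deploy_cloudflare=True, ref="main"):
    """Build the JSON body for a workflow_dispatch call."""
    return {
        "ref": ref,  # assumption: the workflow lives on the main branch
        "inputs": {
            "url": url,
            "project_name": project_name,
            "scrape_method": scrape_method,
            # Dispatch inputs are sent as strings, so booleans become "true"/"false".
            "extract_assets": str(extract_assets).lower(),
            "deploy_cloudflare": str(deploy_cloudflare).lower(),
        },
    }


def build_dispatch_request(owner, repo, workflow_file, token, payload):
    """Build (but do not send) the POST request that triggers the workflow."""
    endpoint = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the resulting request to `urllib.request.urlopen` triggers the run; the token needs permission to dispatch workflows in the target repository.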

Cost Optimization

| Scenario | Recommended Method |
| --- | --- |
| Simple static site | Playwright (FREE) |
| JS-heavy SPA | Playwright → Apify RAG fallback |
| Protected site (Cloudflare) | Firecrawl |
| AI/RAG pipeline | Apify RAG or Crawl4AI |
| Maximum coverage | `all` method |
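
The "Playwright → Apify RAG fallback" idea can be sketched as trying providers cheapest-first and stopping at the first usable result. This is only an illustration of the pattern: the `looks_blank` heuristic, its 200-character threshold, and the function names are assumptions, not the skill's actual logic.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is a (name, fetch) pair where fetch(url) returns HTML.
Provider = Tuple[str, Callable[[str], str]]


def looks_blank(html: str, min_chars: int = 200) -> bool:
    """Heuristic: treat very short output as a blank page."""
    return len(html.strip()) < min_chars


def scrape_with_fallback(url: str, providers: Sequence[Provider]) -> Optional[Tuple[str, str]]:
    """Try each provider in order; return (name, html) for the first usable result."""
    for name, fetch in providers:
        try:
            html = fetch(url)
        except Exception:
            # A failing provider should not stop the chain.
            continue
        if not looks_blank(html):
            return name, html
    return None
```

Ordering the list as free first, paid last means credits are spent only when Playwright returns nothing useful.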

Security Assessment

Per API_MEGA_LIBRARY guidelines:

| API | Security Score | Recommendation |
| --- | --- | --- |
| Apify | 85/100 | ✅ ADOPT |
| Firecrawl | 82/100 | ✅ ADOPT |
| Playwright | 90/100 | ✅ ADOPT (local) |

Troubleshooting

Site returns blank page

  1. Try scrape_method: all to use multiple providers
  2. Increase wait time in Playwright
  3. Check if site blocks datacenter IPs → use Firecrawl

Assets not downloading

  1. Some sites block direct asset requests
  2. Use relative paths from original HTML
  3. Check for CORS restrictions
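
When debugging missing assets, it helps to see how a relative reference in the original HTML resolves to an absolute URL and where it would land locally. The sketch below uses only the standard library; the helper names and the `assets/<kind>/<file>` mapping (modeled on the output structure) are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


def resolve_asset(page_url: str, asset_ref: str) -> str:
    """Resolve an asset reference from the HTML against the page URL."""
    return urljoin(page_url, asset_ref)


def local_asset_path(asset_url: str, kind: str) -> str:
    """Map an absolute asset URL to a relative path under assets/<kind>/."""
    # Use only the URL path so query strings do not end up in the file name.
    name = PurePosixPath(urlparse(asset_url).path).name or "index"
    return f"assets/{kind}/{name}"
```

For example, `../img/logo.png` referenced from `https://example.com/blog/post.html` resolves to `https://example.com/img/logo.png`, which would be stored as `assets/images/logo.png`.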

Cloudflare protection detected

  1. Use Firecrawl (has anti-bot bypass)
  2. Or use Apify with residential proxies

Related Skills

  • auction-results - Uses similar scraping for auction data
  • bcpao-scraper - BCPAO property data extraction
  • youtube-transcript - Video content extraction

Changelog

V2.0 (Dec 2025)

  • Added multi-provider support (Playwright, Apify, Firecrawl)
  • MCP server integration
  • Automatic provider fallback
  • Asset downloading
  • Cloudflare Pages deployment

V1.0 (Dec 2025)

  • Initial Playwright-only scraper
  • Basic HTML/CSS/JS extraction

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
