Apify Scraper Builder

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following and send it to your AI assistant to install the skill:

Install skill "apify-scraper-builder" with this command: npx skills add dvorkinguy/claude-skills-agents/dvorkinguy-claude-skills-agents-apify-scraper-builder

Apify Scraper Builder

Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.

Crawler Type Decision Tree

| Scenario | Crawler | Why |
| --- | --- | --- |
| Static HTML, no JavaScript | CheerioCrawler | Fastest, lowest memory |
| JavaScript-rendered content | PlaywrightCrawler | Modern, cross-browser |
| Legacy sites, specific Chrome behavior | PuppeteerCrawler | Chrome-specific features |
| Need to handle both static and JS | PlaywrightCrawler | More versatile |
| High-volume scraping (1000s of pages) | CheerioCrawler | Best performance |
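
When it is unclear whether a target page needs JavaScript rendering, a quick probe of the raw HTML can settle the choice. A minimal sketch, assuming Node.js 18+ (built-in fetch) and cheerio installed; the URL and selector are placeholders:

import * as cheerio from 'cheerio';

// Fetch the page without executing any JavaScript.
const response = await fetch('https://example.com/products');
const $ = cheerio.load(await response.text());

// If the data is already in the static HTML, CheerioCrawler is enough;
// otherwise the content is likely rendered client-side.
console.log($('.product').length > 0
    ? 'Static HTML is sufficient: use CheerioCrawler'
    : 'Likely JS-rendered: use PlaywrightCrawler');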

Actor Creation Workflow

Step 1: Initialize Project

python scripts/init_actor.py my-scraper --type cheerio

Or manually create structure:

my-scraper/
├── .actor/
│   ├── actor.json          # REQUIRED
│   ├── input_schema.json   # Recommended
│   └── Dockerfile          # REQUIRED
├── src/
│   └── main.ts             # Entry point
├── package.json
└── tsconfig.json

Step 2: Configure actor.json

{ "actorSpecification": 1, "name": "my-scraper", "version": "0.0", "buildTag": "latest", "input": "./input_schema.json", "dockerfile": "./Dockerfile" }

Step 3: Define Input Schema

python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"

Or use templates from references/input-schema-guide.md

Step 4: Implement Crawler

Use patterns from references/crawlee-patterns.md

Step 5: Validate Configuration

python scripts/validate_actor.py /path/to/actor

Step 6: Deploy

apify login
apify push

Project Structure

Required Files

.actor/actor.json

{ "actorSpecification": 1, "name": "my-scraper", "version": "0.0", "buildTag": "latest", "minMemoryMbytes": 256, "maxMemoryMbytes": 4096, "dockerfile": "./Dockerfile", "input": "./input_schema.json", "storages": { "dataset": "./dataset_schema.json" } }

.actor/Dockerfile (Node.js)

FROM apify/actor-node:20

COPY package*.json ./

RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && npm list || true \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./

CMD npm start

Note that with --omit=dev the TypeScript compiler is not installed in the image, so dist/ must already exist when COPY runs (build locally or in CI), or use a multi-stage build.

package.json

{ "name": "my-scraper", "version": "0.0.1", "type": "module", "main": "dist/main.js", "scripts": { "start": "node dist/main.js", "build": "tsc" }, "dependencies": { "apify": "^3.0.0", "crawlee": "^3.0.0" }, "devDependencies": { "typescript": "^5.0.0" } }

Input Schema Editors

| Editor | Use Case | Example |
| --- | --- | --- |
| textfield | Single-line text | Name, URL |
| textarea | Multi-line text | CSS selectors, notes |
| requestListSources | URL list with labels | Start URLs |
| proxy | Proxy configuration | Apify Proxy settings |
| json | JSON object/array | Custom configuration |
| select | Dropdown options | Country, category |
| checkbox | Boolean toggle | Debug mode |
| number | Integer/float | Max items, delay |
| datepicker | Date selection | Date range filter |

Common Input Schema Pattern

{ "title": "Scraper Input", "type": "object", "schemaVersion": 1, "properties": { "startUrls": { "title": "Start URLs", "type": "array", "description": "URLs to start scraping from", "editor": "requestListSources", "prefill": [{"url": "https://example.com"}] }, "maxItems": { "title": "Max Items", "type": "integer", "description": "Maximum number of items to scrape", "default": 100, "minimum": 1 }, "proxyConfig": { "title": "Proxy Configuration", "type": "object", "description": "Proxy settings for the scraper", "editor": "proxy", "default": {"useApifyProxy": true} } }, "required": ["startUrls"] }

Crawlee Patterns

CheerioCrawler (Fast HTML Parsing)

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
}>();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
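
The validation checklist below calls for error handling on network failures. A minimal sketch of how that could look with the same crawler, using Crawlee's built-in retry options; the retry count and log format are illustrative assumptions:

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    // Retry transient network failures before giving up on a request.
    maxRequestRetries: 3,
    async requestHandler({ request, $ }) {
        await Dataset.pushData({
            url: request.url,
            title: $('h1').text().trim(),
        });
    },
    // Called once all retries are exhausted: log the failure and move on.
    failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();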

PlaywrightCrawler (JavaScript Rendering)

import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
    proxyConfig?: { useApifyProxy?: boolean };
}>();

const proxyConfiguration = await Actor.createProxyConfiguration(
    input?.proxyConfig,
);

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            }))
        );

        for (const product of products) {
            await Dataset.pushData({
                url: request.url,
                ...product,
            });
        }

        await enqueueLinks({
            selector: 'a.pagination',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
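
The checklist also asks for rate limiting and delays; Crawlee exposes these as crawler options. A minimal sketch with illustrative values, to be tuned per target site:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    // Cap the number of pages scraped in parallel.
    maxConcurrency: 5,
    // Throttle the overall request rate.
    maxRequestsPerMinute: 60,
    // Space out requests that hit the same domain.
    sameDomainDelaySecs: 1,
    async requestHandler({ page, request }) {
        console.log(`Scraped ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();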

PuppeteerCrawler (Chrome-specific)

import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.content');

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            content: document.querySelector('.content')?.innerHTML,
        }));

        await Dataset.pushData({
            url: request.url,
            ...data,
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

Scripts

Initialize New Actor

python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]

Validate Actor Configuration

python scripts/validate_actor.py <actor-path>

Generate Input Schema

python scripts/generate_input_schema.py "<description>" [--output <path>]

Deployment Commands

Install Apify CLI

npm install -g @apify/cli

Login to Apify

apify login

Create new Actor from template (interactive)

apify create my-actor

Run Actor locally

apify run --purge

Push to Apify platform

apify push

Build Actor remotely

apify actors build

Call Actor remotely

apify actors call <actor-id>

Pull Actor code from Apify

apify actors pull <actor-id>

Validation Checklist

Before Building

  • Correct crawler type selected for target site

  • Input schema defines all required parameters

  • Dependencies in package.json are correct

Configuration

  • actor.json has actorSpecification: 1

  • actor.json has valid name and version

  • Dockerfile uses correct Node.js base image

  • Input schema editors match field types

Code Quality

  • Error handling for network failures

  • Proxy configuration used for production

  • Rate limiting/delays configured

  • Data validation before pushData (see the sketch after this list)
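
A minimal sketch of that last item, checking scraped fields before pushing to the dataset; the specific required-field rule is an illustrative assumption:

import { Dataset } from 'crawlee';

interface Product {
    url: string;
    title?: string;
    price?: string;
}

// Only push records that carry the fields downstream consumers need.
async function pushIfValid(item: Product): Promise<void> {
    if (!item.title || !item.price) {
        console.warn(`Skipping incomplete item from ${item.url}`);
        return;
    }
    await Dataset.pushData(item);
}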

Pre-Deployment

  • apify run --purge succeeds locally

  • Output data structure is correct

  • Memory limits are appropriate

References

| Topic | File |
| --- | --- |
| actor.json Specification | references/actor-json-spec.md |
| Input Schema Editors | references/input-schema-guide.md |
| Crawlee Patterns | references/crawlee-patterns.md |

Templates

| Template | Description | Path |
| --- | --- | --- |
| Cheerio | Fast HTML scraping | templates/crawlee-cheerio/ |
| Playwright | JS-rendered content | templates/crawlee-playwright/ |
| Puppeteer | Chrome-specific | templates/crawlee-puppeteer/ |

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • vertical-playbooks (Automation)

  • n8n-workflow-builder (Automation)

  • rtl-css (Automation)

  • supabase-rls (Automation)