# Apify Scraper Builder
Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.
## Crawler Type Decision Tree

| Scenario | Crawler | Why |
| --- | --- | --- |
| Static HTML, no JavaScript | CheerioCrawler | Fastest, lowest memory |
| JavaScript-rendered content | PlaywrightCrawler | Modern, cross-browser |
| Legacy sites, specific Chrome behavior | PuppeteerCrawler | Chrome-specific features |
| Need to handle both static and JS | PlaywrightCrawler | More versatile |
| High-volume scraping (1000s of pages) | CheerioCrawler | Best performance |
## Actor Creation Workflow

### Step 1: Initialize Project

```bash
python scripts/init_actor.py my-scraper --type cheerio
```
Or manually create structure:
```
my-scraper/
├── .actor/
│   ├── actor.json          # REQUIRED
│   ├── input_schema.json   # Recommended
│   └── Dockerfile          # REQUIRED
├── src/
│   └── main.ts             # Entry point
├── package.json
└── tsconfig.json
```
### Step 2: Configure actor.json

```json
{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
```
### Step 3: Define Input Schema

```bash
python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"
```
Or use templates from references/input-schema-guide.md
### Step 4: Implement Crawler
Use patterns from references/crawlee-patterns.md
### Step 5: Validate Configuration

```bash
python scripts/validate_actor.py /path/to/actor
```
### Step 6: Deploy

```bash
apify login
apify push
```
## Project Structure

### Required Files

#### .actor/actor.json

```json
{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "minMemoryMbytes": 256,
    "maxMemoryMbytes": 4096,
    "dockerfile": "./Dockerfile",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```
#### .actor/Dockerfile (Node.js)

```dockerfile
FROM apify/actor-node:20

COPY package*.json ./

RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && npm list || true \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./

CMD npm start
```
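Note that this Dockerfile installs only production dependencies, so with the TypeScript setup below (`main: dist/main.js`) the compiled `dist/` directory must already exist when the image is built: either run `npm run build` before pushing and include `dist/`, or compile inside the image. A multi-stage sketch of the latter, assuming `npm run build` runs `tsc` as in the package.json below:

```dockerfile
# Builder stage: install dev dependencies (TypeScript) and compile.
FROM apify/actor-node:20 AS builder
COPY package*.json ./
RUN npm install --include=dev --audit=false
COPY . ./
RUN npm run build

# Final stage: production dependencies plus the compiled dist/ only.
FROM apify/actor-node:20
COPY --from=builder /usr/src/app/dist ./dist
COPY package*.json ./
RUN npm install --omit=dev --omit=optional
COPY . ./
CMD npm start
```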
#### package.json

```json
{
    "name": "my-scraper",
    "version": "0.0.1",
    "type": "module",
    "main": "dist/main.js",
    "scripts": {
        "start": "node dist/main.js",
        "build": "tsc"
    },
    "dependencies": {
        "apify": "^3.0.0",
        "crawlee": "^3.0.0"
    },
    "devDependencies": {
        "typescript": "^5.0.0"
    }
}
```
## Input Schema Editors

| Editor | Use Case | Example |
| --- | --- | --- |
| `textfield` | Single-line text | Name, URL |
| `textarea` | Multi-line text | CSS selectors, notes |
| `requestListSources` | URL list with labels | Start URLs |
| `proxy` | Proxy configuration | Apify Proxy settings |
| `json` | JSON object/array | Custom configuration |
| `select` | Dropdown options | Country, category |
| `checkbox` | Boolean toggle | Debug mode |
| `number` | Integer/float | Max items, delay |
| `datepicker` | Date selection | Date range filter |
### Common Input Schema Pattern

```json
{
    "title": "Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://example.com" }]
        },
        "maxItems": {
            "title": "Max Items",
            "type": "integer",
            "description": "Maximum number of items to scrape",
            "default": 100,
            "minimum": 1
        },
        "proxyConfig": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for the scraper",
            "editor": "proxy",
            "default": { "useApifyProxy": true }
        }
    },
    "required": ["startUrls"]
}
```
## Crawlee Patterns

### CheerioCrawler (Fast HTML Parsing)

```typescript
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
}>();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
```
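A common extension of this pattern routes list pages and detail pages separately via request labels. A sketch, with assumed selectors (`a.product-link`, `a.next-page`) and an assumed start URL:

```typescript
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        if (request.label === 'DETAIL') {
            // Detail pages: extract and store the record.
            await Dataset.pushData({
                url: request.url,
                title: $('h1').text().trim(),
            });
            return;
        }
        // List pages: enqueue detail links with the DETAIL label...
        await enqueueLinks({ selector: 'a.product-link', label: 'DETAIL' });
        // ...and keep following pagination (handled as list pages again).
        await enqueueLinks({ selector: 'a.next-page' });
    },
});

await crawler.run(['https://example.com/products']); // assumed start URL
await Actor.exit();
```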
### PlaywrightCrawler (JavaScript Rendering)

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
    proxyConfig?: { useApifyProxy: boolean };
}>();

const proxyConfiguration = await Actor.createProxyConfiguration(
    input?.proxyConfig,
);

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            })),
        );

        for (const product of products) {
            await Dataset.pushData({
                url: request.url,
                ...product,
            });
        }

        await enqueueLinks({
            selector: 'a.pagination',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
```
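Browser crawls spend much of their time loading assets, so blocking images and fonts often speeds them up noticeably. A sketch using a `preNavigationHook` and Playwright's `page.route`; the URL patterns and start URL are illustrative assumptions:

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort image and font requests before navigation starts.
            await page.route('**/*.{png,jpg,jpeg,gif,svg,woff,woff2}', route => route.abort());
        },
    ],
    async requestHandler({ page, request }) {
        await Dataset.pushData({
            url: request.url,
            title: await page.title(),
        });
    },
});

await crawler.run(['https://example.com']); // assumed start URL
await Actor.exit();
```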
### PuppeteerCrawler (Chrome-specific)

```typescript
import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.content');

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            content: document.querySelector('.content')?.innerHTML,
        }));

        await Dataset.pushData({
            url: request.url,
            ...data,
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
```
## Scripts

### Initialize New Actor

```bash
python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]
```

### Validate Actor Configuration

```bash
python scripts/validate_actor.py <actor-path>
```

### Generate Input Schema

```bash
python scripts/generate_input_schema.py "<description>" [--output <path>]
```
## Deployment Commands

```bash
# Install Apify CLI
npm install -g @apify/cli

# Login to Apify
apify login

# Create new Actor from template (interactive)
apify create my-actor

# Run Actor locally
apify run --purge

# Push to Apify platform
apify push

# Build Actor remotely
apify actors build

# Call Actor remotely
apify actors call <actor-id>

# Pull Actor code from Apify
apify actors pull <actor-id>
```
## Validation Checklist

### Before Building

- [ ] Correct crawler type selected for target site
- [ ] Input schema defines all required parameters
- [ ] Dependencies in package.json are correct
### Configuration

- [ ] actor.json has `actorSpecification: 1`
- [ ] actor.json has valid name and version
- [ ] Dockerfile uses correct Node.js base image
- [ ] Input schema editors match field types
### Code Quality

- [ ] Error handling for network failures (see the sketch below)
- [ ] Proxy configuration used for production
- [ ] Rate limiting/delays configured
- [ ] Data validation before `pushData`
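A minimal sketch showing these practices together in one CheerioCrawler configuration; the limits, selectors, and start URL are illustrative, not prescriptive:

```typescript
import { Actor } from 'apify';
import { CheerioCrawler, Dataset, log } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    // Rate limiting: cap concurrency and request throughput.
    maxConcurrency: 10,
    maxRequestsPerMinute: 120,
    // Network failures: retry each request before giving up.
    maxRequestRetries: 3,
    async requestHandler({ request, $ }) {
        const title = $('h1').text().trim();
        // Data validation: skip records missing required fields.
        if (!title) {
            log.warning(`Skipping ${request.url}: no title found`);
            return;
        }
        await Dataset.pushData({ url: request.url, title });
    },
    // Called once a request has exhausted all retries.
    failedRequestHandler({ request }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']); // assumed start URL
await Actor.exit();
```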
### Pre-Deployment

- [ ] `apify run --purge` succeeds locally
- [ ] Output data structure is correct
- [ ] Memory limits are appropriate
## References

| Topic | File |
| --- | --- |
| actor.json Specification | references/actor-json-spec.md |
| Input Schema Editors | references/input-schema-guide.md |
| Crawlee Patterns | references/crawlee-patterns.md |
## Templates

| Template | Description | Path |
| --- | --- | --- |
| Cheerio | Fast HTML scraping | templates/crawlee-cheerio/ |
| Playwright | JS-rendered content | templates/crawlee-playwright/ |
| Puppeteer | Chrome-specific | templates/crawlee-puppeteer/ |