web-scraper

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Handles pagination, monitoring, and CSV/JSON export.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "web-scraper" with this command: npx skills add sickn33/antigravity-awesome-skills/sickn33-antigravity-awesome-skills-web-scraper

Web Scraper

Overview

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Handles pagination, monitoring, and CSV/JSON export.

When to Use This Skill

  • When the user mentions "scraper" or related topics
  • When the user mentions "scraping" or related topics
  • When the user mentions "extrair dados web" or related topics
  • When the user mentions "web scraping" or related topics
  • When the user mentions "raspar dados" or related topics
  • When the user mentions "coletar dados site" or related topics

Do Not Use This Skill When

  • The task is unrelated to web scraper
  • A simpler, more specific tool can handle the request
  • The user needs general-purpose assistance without domain expertise

How It Works

Execute phases in strict order. Each phase feeds the next.

1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.

Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.


Capabilities

  • Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
  • Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
  • Output formats: Markdown tables (default), JSON, CSV
  • Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
  • Multi-URL: extract same structure across sources with comparison and diff
  • Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
  • Auto-escalation: WebFetch fails silently -> automatic Browser fallback
  • Data transforms: cleaning, normalization, deduplication, enrichment
  • Differential mode: detect changes between scraping runs

Web Scraper

Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.

Phase 1: Clarify

Establish extraction parameters before touching any URL.

Required Parameters

| Parameter | Resolve | Default |
| :--- | :--- | :--- |
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |

Optional Parameters

| Parameter | Resolve | Default |
| :--- | :--- | :--- |
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |

Clarification Rules

  • If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
  • If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
  • Default to Markdown table output. Mention alternatives only if relevant.
  • Accept requests in any language. Always respond in the user's language.
  • If user says "everything" or "all data", perform recon first, then present what's available and let user choose.

Discovery Mode

When user has a topic but no specific URL:

  1. Use WebSearch to find the most relevant pages
  2. Present top 3-5 URLs with descriptions
  3. Let user choose which to scrape, or scrape all
  4. Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract


Phase 2: Reconnaissance

Analyze the target page before extraction.

Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)

Step 2.2: Evaluate Fetch Quality

| Signal | Interpretation | Action |
| :--- | :--- | :--- |
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |

Step 2.3: Content Classification

Classify into an extraction mode:

| Mode | Indicators | Examples |
| :--- | :--- | :--- |
| table | HTML `<table>`, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User specified CSS selectors or fields | Anything not matching above |

Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).

If user asked for "everything", present the available fields and let them choose.


Phase 3: Strategy Selection

Choose the extraction approach based on recon results.

Decision Tree

Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user
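
The decision tree above can be sketched as a pure function. The boolean flags are illustrative names for the Phase 2 recon findings, not part of any real tool API:

```python
def choose_strategy(has_structured_data, visible_in_webfetch,
                    needs_precise_targeting, js_rendered, api_or_download):
    """Encode the strategy decision tree; flags come from recon notes."""
    if has_structured_data:
        return "E"  # extract JSON-LD / microdata directly
    if visible_in_webfetch:
        return "B" if needs_precise_targeting else "A"
    if js_rendered:
        return "B"
    if api_or_download:
        return "C"
    return "report"  # access issue: surface it to the user
```

For example, a static page whose data is fully visible in WebFetch and needs no precise element targeting resolves to Strategy A.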

Strategy A: WebFetch With AI Extraction

Best for: Static pages, articles, simple tables, well-structured HTML.

Use WebFetch with a targeted extraction prompt tailored to the mode:

WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)

Auto-escalation: If WebFetch returns suspiciously few items (less than 50% of expected from recon), or mostly empty fields, automatically escalate to Strategy B without asking user. Log the escalation in notes.

Strategy B: Browser Automation

Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.

Sequence:

  1. Get tab context: tabs_context_mcp(createIfEmpty=true) -> get tabId
  2. Navigate to URL: navigate(url=TARGET_URL, tabId=TAB)
  3. Wait for content to load: computer(action="wait", duration=3, tabId=TAB)
  4. Check for cookie/consent banners: find(query="cookie consent or accept button", tabId=TAB)
    • If found, dismiss it (prefer privacy-preserving option)
  5. Read page structure: read_page(tabId=TAB) or get_page_text(tabId=TAB)
  6. Locate target elements: find(query="[DESCRIPTION]", tabId=TAB)
  7. Extract with JavaScript for precise data via javascript_tool
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
  8. For lazy-loaded content, scroll and re-extract: computer(action="scroll", scroll_direction="down", tabId=TAB) then computer(action="wait", duration=2, tabId=TAB)

Strategy C: Bash (curl + jq)

Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.


# JSON API
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

# CSV download
curl -s "CSV_URL" -o /tmp/scraped_data.csv

# XML parsing
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"

Strategy D: Hybrid

When a single strategy is insufficient, combine:

  1. WebSearch to discover relevant URLs
  2. WebFetch for initial content assessment
  3. Browser automation for JS-heavy sections
  4. Bash for post-processing (jq, python for data cleaning)

Strategy E: Structured Data Extraction

When JSON-LD, microdata, or OpenGraph is present:

  1. Use Browser javascript_tool to extract structured data:
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
  2. This often provides cleaner, more reliable data than DOM scraping
  3. Fall back to DOM extraction only for fields not in structured data

Pagination Handling

When pagination is detected and user wants multiple pages:

Page-number pagination (any strategy):

  1. Extract data from current page
  2. Identify URL pattern (e.g. ?page=N, /page/N, &offset=N)
  3. Iterate through pages up to user's max (default: 5 pages)
  4. Show progress: "Extracting page 2/5..."
  5. Concatenate all results, deduplicate if needed
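
For the ?page=N pattern, the URL iteration in steps 2-3 can be sketched in Python. The helper name and defaults are illustrative; /page/N path patterns would need a different builder:

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def page_urls(base_url, param="page", start=1, max_pages=5):
    """Yield paginated URLs for a ?page=N pattern, capped at max_pages."""
    parts = urlparse(base_url)
    query = parse_qs(parts.query)
    for n in range(start, start + max_pages):
        query[param] = [str(n)]  # overwrite or add the page parameter
        yield urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

Existing query parameters (filters, sort order) are preserved while only the page number changes.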

Infinite scroll (Browser only):

  1. Extract currently visible data
  2. Record item count
  3. Scroll down: computer(action="scroll", scroll_direction="down", tabId=TAB)
  4. Wait: computer(action="wait", duration=2, tabId=TAB)
  5. Extract newly loaded data
  6. Compare count - if no new items after 2 scrolls, stop
  7. Repeat until no new content or max iterations (default: 5)

"Load More" button (Browser only):

  1. Extract currently visible data
  2. Find button: find(query="load more button", tabId=TAB)
  3. Click it: computer(action="left_click", ref=REF, tabId=TAB)
  4. Wait and extract new content
  5. Repeat until button disappears or max iterations reached

Phase 4: Extract

Execute the selected strategy using mode-specific patterns. See references/extraction-patterns.md for CSS selectors and JavaScript snippets.

Table Mode

WebFetch prompt:

"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."

List Mode

WebFetch prompt:

"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."

Article Mode

WebFetch prompt:

"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."

Product Mode

WebFetch prompt:

"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
  availability, description (first 200 chars), rating, reviewCount,
  specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."

Also check for JSON-LD Product schema (Strategy E) first.

Contact Mode

WebFetch prompt:

"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."

FAQ Mode

WebFetch prompt:

"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."

Pricing Mode

WebFetch prompt:

"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."

Events Mode

WebFetch prompt:

"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."

Jobs Mode

WebFetch prompt:

"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."

Custom Mode

When user provides specific selectors or field descriptions:

  • Use Browser automation with javascript_tool and user's CSS selectors
  • Or use WebFetch with a prompt built from user's field descriptions
  • Always confirm extracted schema with user before proceeding to multi-URL

Multi-URL Extraction

When extracting from multiple URLs:

  1. Extract from the first URL to establish the data schema
  2. Show user the first results and confirm the schema is correct
  3. Extract from remaining URLs using the same schema
  4. Add a source column/field to every record with the origin URL
  5. Combine all results into a single output
  6. Show progress: "Extracting 3/7 URLs..."

Phase 5: Transform

Clean, normalize, and enrich extracted data before validation. See references/data-transforms.md for patterns.

Automatic Transforms (Always Apply)

| Transform | Action |
| :--- | :--- |
| Whitespace cleanup | Trim, collapse multiple spaces, remove `\n` in cells |
| HTML entity decode | `&amp;` -> `&`, `&lt;` -> `<`, `&#39;` -> `'` |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | `""` -> `null` (for JSON), `""` -> N/A (for tables) |
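
A minimal sketch of these always-on transforms, using only the Python standard library (`html.unescape`, `unicodedata.normalize`, and a whitespace-collapsing regex):

```python
import html
import re
import unicodedata

def clean_cell(value):
    """Trim/collapse whitespace, decode HTML entities, NFKC-normalize,
    and map empty strings to None."""
    if value is None:
        return None
    text = html.unescape(str(value))
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text or None
```

Applied per cell before any conditional transforms run.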

Conditional Transforms (Apply When Relevant)

| Transform | When | Action |
| :--- | :--- | :--- |
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
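
Two of these transforms sketched in Python. `normalize_price` is a simplified illustration that assumes US-style separators ("$1,299.00"); European decimal commas would need extra handling. `absolutize` wraps the standard `urljoin`:

```python
import re
from urllib.parse import urljoin

def normalize_price(raw):
    """Return (amount, currency_symbol) from a US-style price string."""
    m = re.search(r"([$€£])\s*([\d,]+(?:\.\d+)?)", raw)
    if not m:
        return None, None
    symbol, digits = m.groups()
    return float(digits.replace(",", "")), symbol

def absolutize(base_url, href):
    """Resolve a scraped relative link against the page URL."""
    return urljoin(base_url, href) if href else None
```

Keep the original string alongside the normalized value so nothing is lost if the parse is wrong.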

Data Enrichment (Only When Useful)

| Enrichment | When | Action |
| :--- | :--- | :--- |
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |

Deduplication Strategy

When combining data from multiple pages or URLs:

  1. Exact match: rows with identical values in all fields -> keep first
  2. Near match: rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes
  3. Report: "Removed N duplicate rows" in delivery notes
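
A sketch of the near-match rule in Python: rows sharing the key fields collapse to the most complete copy, which also removes exact duplicates. The function name and dict-shaped rows are assumptions for illustration:

```python
def dedupe(rows, key_fields):
    """Collapse rows with identical key fields, keeping the copy
    with the fewest None values."""
    best = {}
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        completeness = sum(v is not None for v in row.values())
        if key not in best or completeness > best[key][0]:
            best[key] = (completeness, row)
    return [row for _, row in best.values()]
```

Report `len(rows) - len(result)` as the "Removed N duplicate rows" figure in the delivery notes.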

Phase 6: Validate

Verify extraction quality before delivering results.

Validation Checks

| Check | Action |
| :--- | :--- |
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |

Confidence Rating

Assign to every extraction:

| Rating | Criteria |
| :--- | :--- |
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |

Always report confidence with specifics:

Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
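
The criteria can be sketched as a small rating function. Names are illustrative; `empty_ratio` is the share of N/A/null cells across all fields (0.0 to 1.0):

```python
def rate_confidence(expected_count, extracted_count, empty_ratio):
    """Map validation results onto HIGH / MEDIUM / LOW."""
    count_ok = expected_count is None or extracted_count == expected_count
    if count_ok and empty_ratio == 0:
        return "HIGH"
    if empty_ratio < 0.10:
        return "MEDIUM"  # minor gaps or count slightly differs
    return "LOW"
```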

Auto-Recovery (Try Before Reporting Issues)

| Issue | Auto-Recovery Action |
| :--- | :--- |
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |

Log all recovery attempts in delivery notes. Inform user of any irrecoverable gaps with specific details.


Phase 7: Format And Deliver

Structure results according to user preference. See references/output-templates.md for complete formatting templates.

Delivery Envelope

ALWAYS wrap results with this metadata header:


## Extraction Results

**Source:** [Page Title](http://example.com)
**Date:** YYYY-MM-DD HH:MM UTC
**Items:** N records (M fields each)
**Confidence:** HIGH | MEDIUM | LOW
**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
**Format:** Markdown Table | JSON | CSV

---

[DATA HERE]

---

**Notes:**
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]

Markdown Table Rules

  • Left-align text columns (:---), right-align numbers (---:)
  • Consistent column widths for readability
  • Include summary row for numeric data when useful (totals, averages)
  • Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format
  • Truncate long cell values to 60 chars with ... indicator
  • Use N/A for missing values, never leave cells empty
  • For multi-page results, show combined table (not per-page)
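
A sketch of a renderer that applies the alignment, N/A, and 60-char truncation rules (function and parameter names are illustrative):

```python
def to_markdown_table(headers, rows, numeric_cols=(), max_len=60):
    """Render rows as a markdown table: :--- for text, ---: for numbers,
    N/A for missing cells, long values truncated with '...'."""
    def fmt(v):
        s = "N/A" if v is None or v == "" else str(v)
        return s[: max_len - 3] + "..." if len(s) > max_len else s
    align = ["---:" if h in numeric_cols else ":---" for h in headers]
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join(align) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(fmt(c) for c in row) + " |")
    return "\n".join(lines)
```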

JSON Rules

  • Use camelCase for keys (e.g. productName, unitPrice)
  • Wrap in metadata envelope:
    {
      "metadata": {
        "source": "URL",
        "title": "Page Title",
        "extractedAt": "ISO-8601",
        "itemCount": 47,
        "fieldCount": 6,
        "confidence": "HIGH",
        "strategy": "A",
        "transforms": ["deduplication", "priceNormalization"],
        "notes": []
      },
      "data": [ ... ]
    }
    
  • Pretty-print with 2-space indentation
  • Numbers as numbers (not strings), booleans as booleans
  • null for missing values (not empty strings)

CSV Rules

  • First row is always headers
  • Quote any field containing commas, quotes, or newlines
  • UTF-8 encoding with BOM for Excel compatibility
  • Use , as delimiter (standard)
  • Include metadata as comments: # Source: URL
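
A sketch of these rules with Python's standard `csv` module; the BOM prefix and comment line follow the bullets above:

```python
import csv
import io

def to_csv_bytes(headers, rows, source_url=None):
    """Emit UTF-8 CSV with a BOM (for Excel) and an optional
    '# Source:' comment line; quoting handled by the csv module."""
    buf = io.StringIO()
    if source_url:
        buf.write(f"# Source: {source_url}\r\n")
    writer = csv.writer(buf)  # QUOTE_MINIMAL quotes commas/quotes/newlines
    writer.writerow(headers)
    writer.writerows(rows)
    return b"\xef\xbb\xbf" + buf.getvalue().encode("utf-8")
```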

File Output

When user requests file save:

  • Markdown: .md extension
  • JSON: .json extension
  • CSV: .csv extension
  • Confirm path before writing
  • Report full file path and item count after saving

Multi-URL Comparison Format

When comparing data across multiple sources:

  • Add Source as the first column/field
  • Use short identifiers for sources (domain name or user label)
  • Group by source or interleave based on user preference
  • Highlight differences if user asks for comparison
  • Include summary: "Best price: $X at store-b.com"

Differential Output

When user requests change detection (diff mode):

  • Compare current extraction with previous run
  • Mark new items with [NEW]
  • Mark removed items with [REMOVED]
  • Mark changed values with [WAS: old_value]
  • Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
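
A minimal diff sketch, assuming each run is a list of dicts keyed on one field; the key choice and return shape are illustrative:

```python
def diff_runs(previous, current, key="name"):
    """Compare two extraction runs keyed on one field and report
    the new / removed / changed items for the summary line."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    return {
        "new": [k for k in curr if k not in prev],
        "removed": [k for k in prev if k not in curr],
        "changed": [k for k in curr if k in prev and curr[k] != prev[k]],
    }
```

The old record for each changed key supplies the `[WAS: old_value]` annotation.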

Rate Limiting

  • Maximum 1 request per 2 seconds for sequential page fetches
  • For multi-URL jobs, process sequentially with pauses
  • If a site returns 429 (Too Many Requests), stop and report to user
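
The 1-request-per-2-seconds rule can be sketched as a sequential fetch loop; `fetch` is any callable and the names are illustrative:

```python
import time

def throttled(urls, fetch, min_interval=2.0):
    """Call fetch(url) sequentially, sleeping so consecutive requests
    stay at least min_interval seconds apart."""
    results, last = [], 0.0
    for url in urls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        results.append(fetch(url))
    return results
```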

Access Respect

  • If a page blocks access (403, CAPTCHA, login wall), report to user
  • Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
  • Do NOT scrape behind authentication unless user explicitly provides access
  • Respect robots.txt directives when known

Copyright

  • Do NOT reproduce large blocks of copyrighted article text
  • For articles: extract factual data, statistics, and structured info; summarize narrative content
  • Always include source attribution (http://example.com) in output

Data Scope

  • Extract ONLY what the user explicitly requested
  • Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)
  • Do not store or transmit extracted data beyond what the user sees

Failure Protocol

When extraction fails or is blocked:

  1. Explain the specific reason (JS rendering, bot detection, login, etc.)
  2. Suggest alternatives (different URL, API if available, manual approach)
  3. Never retry aggressively or escalate access attempts

Quick Reference: Mode Cheat Sheet

| User Says... | Mode | Strategy | Output Default |
| :--- | :--- | :--- | :--- |
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |

References

Best Practices

  • Provide clear, specific context about your project and requirements
  • Review all suggestions before applying them to production code
  • Combine with other complementary skills for comprehensive analysis

Common Pitfalls

  • Using this skill for tasks outside its domain expertise
  • Applying recommendations without understanding your specific context
  • Not providing enough project context for accurate analysis

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • AutoClaw Browser Automation (Automation): Complete browser automation skill with MCP protocol support and Chrome extension
  • Metal Price (Automation): Price query and export skill for the Global Ferroalloy Network. Automatically logs in to www.qqthj.com, queries same-day price data for specified metals (e.g. ferromanganese, ferrovanadium), scrapes the price table, and exports it to an Excel file.
  • Web Scraping & Data Extraction Engine (Automation): Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrap...
  • HARPA AI (Automation): Automate web browsers, scrape pages, search the web, and run AI prompts on live websites via HARPA AI Grid REST API