Web Scraper
Extract structured data from websites using BeautifulSoup and requests - turn any webpage into usable data.
When to Use This Skill
- Competitor research - Scrape pricing, features, positioning
- Lead generation - Extract contact info from directories
- Content audit - Pull headings, links, meta data
- Price monitoring - Track competitor pricing changes
- Data collection - Gather research data from multiple sources
What Claude Does vs What You Decide
| Claude Does | You Decide |
| --- | --- |
| Structures analysis frameworks | Strategic priorities |
| Synthesizes market data | Competitive positioning |
| Identifies opportunities | Resource allocation |
| Creates strategic options | Final strategy selection |
| Suggests implementation approaches | Execution decisions |
Dependencies
pip install beautifulsoup4 requests pandas click lxml
Commands
Scrape Elements
python scripts/main.py scrape https://example.com --selector "h1,h2,p"
python scripts/main.py scrape https://example.com --selector ".product-price"
Extract Links
python scripts/main.py links https://example.com
python scripts/main.py links https://example.com --internal-only
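To illustrate what an internal-only filter involves, here is a minimal sketch using only the standard library. It is not the code in scripts/main.py; the function name `internal_links` and the sample hrefs are assumptions made for illustration. It resolves each href against the page URL and keeps those whose host matches.

```python
from urllib.parse import urljoin, urlparse


def internal_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique links on the same host.

    Hypothetical helper for illustration; not part of scripts/main.py.
    """
    base_host = urlparse(base_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if parsed.netloc == base_host and absolute not in kept:
            kept.append(absolute)
    return kept


# href values as they might be read from a[href] elements on a page
hrefs = [
    "/about",
    "pricing",
    "https://other.example.org/x",
    "mailto:hi@example.com",
    "/about",
]
print(internal_links("https://example.com/", hrefs))
```

Comparing hosts exactly treats subdomains such as blog.example.com as external; whether that is the right rule depends on the site being audited.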
Extract Emails
python scripts/main.py emails https://example.com
python scripts/main.py emails https://example.com --depth 2
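Email extraction usually comes down to pattern matching over page text. The sketch below shows one simple way to do it; the pattern and the `find_emails` helper are assumptions for illustration and are not taken from scripts/main.py. The regular expression is deliberately loose, so it can miss unusual addresses and accept a few invalid ones.

```python
import re

# Simple, permissive pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return unique, lowercased email-like strings in order of appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in found:
            found.append(email)
    return found


sample = "Contact sales@example.com or Support@Example.com; press: press@example.org."
print(find_emails(sample))
```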
Extract Structured Data
python scripts/main.py structured https://example.com/article --schema article
python scripts/main.py structured https://example.com/product --schema product
Examples
Example 1: Scrape Competitor Pricing
python scripts/main.py scrape https://competitor.com/pricing --selector ".price,.plan-name"
Output:
Extracted 6 elements
1. Starter - $29/mo
2. Pro - $99/mo
3. Enterprise - Contact us
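A grouped selector such as ".price,.plan-name" matches both kinds of element, which is one plausible reason six matches could appear as three paired lines. The sketch below reproduces that with BeautifulSoup on inline HTML; the markup is invented for illustration, and the pairing logic is an assumption rather than the script's actual behavior.

```python
from bs4 import BeautifulSoup

# Invented sample markup; a real pricing page will be structured differently.
html = """
<div class="plan"><span class="plan-name">Starter</span><span class="price">$29/mo</span></div>
<div class="plan"><span class="plan-name">Pro</span><span class="price">$99/mo</span></div>
<div class="plan"><span class="plan-name">Enterprise</span><span class="price">Contact us</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The grouped selector matches every plan name and every price.
matches = soup.select(".price,.plan-name")
print(f"Extracted {len(matches)} elements")

# Pair names with prices by position to get one line per plan.
names = [el.get_text(strip=True) for el in soup.select(".plan-name")]
prices = [el.get_text(strip=True) for el in soup.select(".price")]
for i, (name, price) in enumerate(zip(names, prices), start=1):
    print(f"{i}. {name} - {price}")
```

Pairing by position only works when every plan has both elements; selecting each plan container and reading its children is more robust when some plans omit a price.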
Example 2: Extract Article Content
python scripts/main.py structured https://blog.example.com/post --schema article
Output: article_data.json
{
"title": "How to Scale Your Startup",
"author": "Jane Doe",
"date": "2024-01-15",
"content": "...",
"word_count": 1523
}
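As a rough illustration of how a record with these fields could be assembled, the sketch below parses inline HTML with BeautifulSoup. Where each field comes from (the title tag, an author meta tag, a time element, paragraphs inside article) is an assumption; the actual article schema in scripts/main.py may read different elements.

```python
import json

from bs4 import BeautifulSoup

# Invented sample page used only for this sketch.
html = """
<html><head>
<title>How to Scale Your Startup</title>
<meta name="author" content="Jane Doe">
</head><body>
<article>
<time datetime="2024-01-15">January 15, 2024</time>
<p>Growth starts with focus.</p>
<p>Hire slowly and measure everything.</p>
</article>
</body></html>
"""


def extract_article(page_html):
    """Build an article record; field sources are assumptions."""
    soup = BeautifulSoup(page_html, "html.parser")
    author = soup.select_one('meta[name="author"]')
    published = soup.select_one("time[datetime]")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.select("article p")]
    content = "\n\n".join(paragraphs)
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "author": author["content"] if author else None,
        "date": published["datetime"] if published else None,
        "content": content,
        "word_count": len(content.split()),  # whitespace-delimited words
    }


record = extract_article(html)
print(json.dumps(record, indent=2))
```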
CSS Selector Reference
| Selector | Description | Example |
| --- | --- | --- |
| tag | Element type | h1, p, div |
| .class | Class name | .price, .title |
| #id | Element ID | #main-content |
| tag.class | Tag with class | div.product |
| tag[attr] | Has attribute | a[href] |
| parent > child | Direct child | ul > li |
| tag1, tag2 | Multiple | h1, h2, h3 |
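The selectors above can be tried directly with BeautifulSoup's `select` method. The following sketch runs several of them against a small invented HTML fragment; the `texts` helper is hypothetical and exists only to keep the output compact.

```python
from bs4 import BeautifulSoup

# Invented fragment that exercises the selector types in the reference.
html = """
<div id="main-content">
  <h1 class="title">Plans</h1>
  <ul>
    <li><a href="/starter">Starter</a></li>
    <li><a>No link</a></li>
  </ul>
  <div class="product"><p class="price">$29/mo</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the stripped text of every element matching the selector."""
    return [el.get_text(strip=True) for el in soup.select(selector)]


print(texts("h1"))                  # tag
print(texts(".price"))              # class
print(texts("#main-content > h1"))  # id combined with direct child
print(texts("div.product"))         # tag with class
print(texts("a[href]"))             # only anchors that have an href
print(texts("ul > li"))             # direct children of the list
print(texts("h1, p"))               # multiple selectors, document order
```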
Ethical Scraping Guidelines
- Check robots.txt - Respect site's scraping policy
- Rate limit - Don't overload servers (1-2 req/sec)
- Identify yourself - Use descriptive User-Agent
- Cache requests - Don't re-scrape unchanged pages
- Terms of Service - Check if scraping is allowed
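Three of these guidelines (robots.txt, rate limiting, and a descriptive User-Agent) can be sketched with the standard library. The example below parses robots.txt rules from an inline list so it runs without network access; the `USER_AGENT` string, the `RateLimiter` class, and the sample rules are all assumptions for illustration. In real use the rules would be loaded from the site's own /robots.txt.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier; use a name and contact that are actually yours.
USER_AGENT = "example-research-bot/0.1 (contact: ops@example.com)"


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval  # 0.5 s is roughly 2 requests/sec
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Sample rules parsed inline; a live scraper would fetch the real file.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

limiter = RateLimiter(min_interval=0.5)
for url in ["https://example.com/pricing", "https://example.com/private/data"]:
    if not robots.can_fetch(USER_AGENT, url):
        print("skip (disallowed by robots.txt):", url)
        continue
    limiter.wait()
    # A real fetch would go here, for example:
    # requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    print("fetch:", url)
```

Caching and Terms of Service checks are not shown; the latter requires reading the site's terms and cannot be automated this way.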
Skill Boundaries
What This Skill Does Well
- Structuring strategic analysis
- Identifying market opportunities
- Creating strategic frameworks
- Synthesizing competitive data
What This Skill Cannot Do
- Replace market research
- Guarantee strategic success
- Know proprietary competitor info
- Make executive decisions
Related Skills
- competitor-monitor - Monitor competitor changes
- pdf-extractor - Extract from PDFs
Skill Metadata
- Mode: centaur
category: automation
subcategory: data-extraction
dependencies: [beautifulsoup4, requests, pandas]
difficulty: intermediate
time_saved: 5+ hours/week