Web Scraper
A toolkit for extracting content from web pages using Python.
When to Use This Skill
Activate this skill when the user needs to:
- Fetch the HTML content of a web page
- Extract all links from a page
- Get readable text content from HTML
- Scrape data from websites
- Download and analyze web content
Requirements
This skill requires external packages:
pip install requests beautifulsoup4
Available Scripts
Always run scripts with --help first to see all available options.
Script            Purpose
fetch_page.py     Download HTML content from a URL
extract_links.py  Extract all links from a page
extract_text.py   Extract readable text from HTML
Decision Tree
Task → What do you need?
│
├─ Raw HTML content?
│   └─ Use: fetch_page.py <url>
│
├─ List of links on a page?
│   └─ Use: extract_links.py <url>
│
└─ Text content (no HTML tags)?
    └─ Use: extract_text.py <url>
Quick Examples
Fetch page HTML:
python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html
Extract all links:
python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter ".pdf$"
Extract text content:
python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
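For readers who want to see roughly what link extraction with --absolute and --filter involves, here is a minimal sketch. It is not the source of extract_links.py, whose implementation is not shown here. The function name extract_links and its parameters are hypothetical, and the sketch uses only the standard library so it runs without the packages listed under Requirements.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, absolute=False, pattern=None):
    """Hypothetical helper: return links found in `html`.

    absolute -- resolve relative links against base_url (assumed to mirror --absolute)
    pattern  -- keep only links matching this regex (assumed to mirror --filter)
    """
    collector = _LinkCollector()
    collector.feed(html)
    links = collector.hrefs
    if absolute:
        links = [urljoin(base_url, href) for href in links]
    if pattern is not None:
        regex = re.compile(pattern)
        links = [href for href in links if regex.search(href)]
    return links


if __name__ == "__main__":
    sample = '<a href="/docs/a.pdf">A</a> <a href="page.html">B</a>'
    print(extract_links(sample, "https://example.com/x/", absolute=True, pattern=r"\.pdf$"))
```

If the filter is treated as a regular expression, as this sketch assumes, an unescaped dot matches any character, so "\.pdf$" is the stricter form of ".pdf$".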
Best Practices
- Respect robots.txt - Check whether scraping is allowed
- Add delays - Don't overwhelm servers with rapid requests
- Use an appropriate User-Agent - Identify your scraper properly
- Handle errors gracefully - Websites may block requests or time out
- Cache responses - Don't re-fetch unchanged pages
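The practices above can be combined in a small wrapper around whatever function performs the download. The following is a sketch under stated assumptions, not part of the bundled scripts: PoliteFetcher, build_robots, and the USER_AGENT string are hypothetical names, and the fetch callable is injected so that the logic can be exercised without network access. The robots.txt check uses the standard library's urllib.robotparser.

```python
import time
from urllib import robotparser

# Hypothetical identifier; replace with a string that identifies your scraper.
USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"


def build_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteFetcher:
    """Checks robots.txt, spaces out requests, and caches bodies in memory."""

    def __init__(self, fetch, robots, delay=1.0, user_agent=USER_AGENT):
        self.fetch = fetch          # callable(url, user_agent) -> body
        self.robots = robots        # RobotFileParser for the target site
        self.delay = delay          # minimum seconds between real requests
        self.user_agent = user_agent
        self.cache = {}
        self._last_request = None

    def get(self, url):
        # Serve repeated requests from the cache instead of re-fetching.
        if url in self.cache:
            return self.cache[url]
        if not self.robots.can_fetch(self.user_agent, url):
            raise PermissionError(f"robots.txt disallows {url}")
        # Wait until at least `delay` seconds have passed since the last request.
        if self._last_request is not None:
            remaining = self.delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        body = self.fetch(url, self.user_agent)
        self._last_request = time.monotonic()
        self.cache[url] = body
        return body
```

With the requests package from the Requirements section, the injected callable could be as simple as `lambda url, ua: requests.get(url, headers={"User-Agent": ua}, timeout=10).text`. The in-memory cache here never expires; a real scraper would likely want expiry or conditional requests.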
Common Issues
- 403 Forbidden: Site may be blocking scrapers. Try with --user-agent flag.
- Timeout: Site may be slow. Increase --timeout value.
- Empty content: Page may require JavaScript. These scripts handle static HTML only.
- Encoding issues: Use --encoding flag if text appears garbled.
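Two of the issues above, garbled text and empty content, can be illustrated without any network access. The sketch below is an assumption about how one might diagnose them, not a description of how the bundled scripts behave. The helper names decode_body, visible_text, and looks_js_rendered are hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def decode_body(raw, encoding="utf-8"):
    """Decode response bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


def visible_text(html):
    """Return the page's text content joined with single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def looks_js_rendered(html):
    """Heuristic: static HTML with no visible text often needs JavaScript to render."""
    return visible_text(html) == ""


if __name__ == "__main__":
    raw = "<p>Caf\u00e9</p>".encode("latin-1")
    print(visible_text(decode_body(raw)))             # garbled under the wrong encoding
    print(visible_text(decode_body(raw, "latin-1")))  # correct under the right one
```

The empty-text check is only a heuristic: a page can contain some static text and still load its main content with JavaScript.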
Reference Files
See references/selectors.md for CSS selector syntax reference.
Ethical Considerations
- Only scrape public data
- Respect rate limits and robots.txt
- Don't scrape personal/private information
- Check website terms of service
- Consider using official APIs when available