web-scraper

A toolkit for extracting content from web pages using Python.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "web-scraper" with this command: npx skills add ivanvza/dspy-skills/ivanvza-dspy-skills-web-scraper

When to Use This Skill

Activate this skill when the user needs to:

  • Fetch the HTML content of a web page

  • Extract all links from a page

  • Get readable text content from HTML

  • Scrape data from websites

  • Download and analyze web content

Requirements

This skill requires external packages:

pip install requests beautifulsoup4

Available Scripts

Always run scripts with --help first to see all available options.

Script            Purpose
fetch_page.py     Download HTML content from a URL
extract_links.py  Extract all links from a page
extract_text.py   Extract readable text from HTML

Decision Tree

Task → What do you need?
│
├─ Raw HTML content?
│  └─ Use: fetch_page.py <url>
│
├─ List of links on a page?
│  └─ Use: extract_links.py <url>
│
└─ Text content (no HTML tags)?
   └─ Use: extract_text.py <url>

Quick Examples

Fetch page HTML:

python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html
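
For readers who want to see what a fetch step of this kind involves, here is a minimal sketch using only the Python standard library. It is not the source of fetch_page.py (which, per the requirements, depends on requests); the function names, the default User-Agent string, and the 10-second timeout are assumptions chosen for illustration.

```python
from urllib.request import Request, urlopen

# Hypothetical default; replace with a string that identifies your scraper.
DEFAULT_USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10.0, user_agent=DEFAULT_USER_AGENT):
    """Download a page and return its body decoded as text.

    Errors such as urllib.error.HTTPError (for example a 403) or a
    timeout propagate to the caller, who decides how to handle them.
    """
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("https://example.com") would perform the actual download; build_request is split out so the headers can be inspected without touching the network.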

Extract all links:

python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter ".pdf$"
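
The idea behind the --absolute and --filter options can be sketched as follows. This is an illustration under assumptions, not the script's implementation: it uses the standard library's html.parser in place of beautifulsoup4, and the names LinkCollector and extract_links are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url=None, pattern=None):
    """Return links from html, optionally made absolute and regex-filtered."""
    collector = LinkCollector()
    collector.feed(html)
    links = collector.links
    if base_url is not None:
        # Resolve relative hrefs against the page URL.
        links = [urljoin(base_url, link) for link in links]
    if pattern is not None:
        # re.search, so a pattern like r"\.pdf$" anchors at the end of the URL.
        links = [link for link in links if re.search(pattern, link)]
    return links


sample = '<a href="/docs/report.pdf">Report</a> <a href="about.html">About</a>'
print(extract_links(sample, base_url="https://example.com/", pattern=r"\.pdf$"))
# → ['https://example.com/docs/report.pdf']
```

Note that in a regular expression an unescaped dot matches any character, so ".pdf$" also matches a URL ending in "xpdf"; escaping the dot, as in the sketch, is stricter.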

Extract text content:

python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
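
As a rough picture of what "readable text" means here, the sketch below strips tags and drops the contents of script and style elements. It is a simplified stand-in built on the standard library's html.parser, not the code of extract_text.py; TextCollector and extract_text are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only nodes.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of html, one text node per line."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>First   paragraph.</p>"
    "<script>var x = 1;</script><p>Second paragraph.</p>"
    "</body></html>"
)
print(extract_text(sample))
```

This sketch puts every text node on its own line, so inline markup such as <b> splits a sentence across lines; a fuller implementation would join inline runs before emitting them.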

Best Practices

  • Respect robots.txt - Check if scraping is allowed

  • Add delays - Don't overwhelm servers with rapid requests

  • Use appropriate User-Agent - Identify your scraper properly

  • Handle errors gracefully - Websites may block or timeout

  • Cache responses - Don't re-fetch unchanged pages
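
The first two practices above can be sketched with the standard library alone. This is an illustrative outline, not part of the skill's scripts: the Throttle class, the build_robots_checker helper, the "example-scraper" agent name, and the 0.2-second interval are all assumptions. The robots.txt text is passed in as a string so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_request = None

    def wait(self):
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


robots = build_robots_checker("User-agent: *\nDisallow: /private/\n")
throttle = Throttle(min_interval_seconds=0.2)

for url in ["https://example.com/", "https://example.com/private/data"]:
    if not robots.can_fetch("example-scraper", url):
        print("skipping (disallowed):", url)
        continue
    throttle.wait()  # the first call returns immediately
    print("ok to fetch:", url)
```

In real use, RobotFileParser can also load the file itself via set_url() and read(); parsing a string keeps the sketch self-contained.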

Common Issues

  • 403 Forbidden: The site may be blocking scrapers. Try the --user-agent flag.

  • Timeout: The site may be slow. Increase the --timeout value.

  • Empty content: The page may require JavaScript. These scripts handle static HTML only.

  • Encoding issues: Use the --encoding flag if text appears garbled.
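
Garbled text usually means the response bytes were decoded with the wrong character encoding. How the --encoding flag is handled inside the scripts is not documented here; the sketch below only illustrates the general idea of trying candidate encodings in order. The function name and the default candidate list are assumptions.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback("café".encode("cp1252"))
print(text, used)
# → café cp1252
```

Trying UTF-8 first is a reasonable default because invalid UTF-8 fails loudly, whereas single-byte encodings accept almost any input and can silently produce wrong characters.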

Reference Files

See references/selectors.md for CSS selector syntax reference.

Ethical Considerations

  • Only scrape public data

  • Respect rate limits and robots.txt

  • Don't scrape personal/private information

  • Check website terms of service

  • Consider using official APIs when available

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
