Scrapy Web Scraping Skill

Scrapy is a fast, high-level Python web crawling and scraping framework. It enables structured data extraction from websites, supports crawling entire sites, and integrates pipelines to process and store scraped data.

When to use

Crawl entire websites or follow links across many pages
Extract structured data (prices, articles, product listings) into JSON/CSV
Run scheduled or large-scale scraping pipelines
Rely on built-in support for request throttling, retries, and middlewares

Required tools / APIs

No external API required
Python 3.8+ required
Scrapy: Web crawling and scraping framework

Install options:

pip

pip install scrapy

Ubuntu/Debian

sudo apt-get install -y python3-pip && pip install scrapy

macOS

brew install python && pip install scrapy

Verify installation

scrapy version

Skills

basic_usage

Create and run a simple Scrapy spider that scrapes a page and follows its pagination links.

Create a new Scrapy project

scrapy startproject myproject
cd myproject

Generate a spider

scrapy genspider quotes quotes.toscrape.com

Run the spider and save to JSON

scrapy crawl quotes -o output.json

Run the spider and save to CSV

scrapy crawl quotes -o output.csv

Python spider (quotes.py):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }

        # Follow pagination links
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
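The pagination step works because response.follow accepts a relative href such as /page/2/ and resolves it against the current page URL. The sketch below reproduces that resolution with the standard library only, assuming the page has no <base> tag; resolve_link and same_host are hypothetical helper names, not part of Scrapy.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url: str, href: str) -> str:
    """Turn a possibly relative href into an absolute URL."""
    return urljoin(page_url, href)


def same_host(url_a: str, url_b: str) -> bool:
    """Return True when both URLs point at the same host."""
    return urlparse(url_a).netloc.lower() == urlparse(url_b).netloc.lower()


page = "https://quotes.toscrape.com/page/1/"
next_url = resolve_link(page, "/page/2/")

print(next_url)
print(same_host(page, next_url))
print(same_host(page, "https://example.com/about"))
```

A host check like same_host is one way to keep a crawl from wandering onto other sites when following arbitrary links.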

robust_usage

Production-oriented spider with settings, item pipelines, and error handling.

Run with custom settings (rate limiting, retries)

scrapy crawl quotes \
  -s DOWNLOAD_DELAY=1 \
  -s AUTOTHROTTLE_ENABLED=True \
  -s RETRY_TIMES=3 \
  -o output.json

Run from a script (no project required)

scrapy runspider spider.py -o output.json

Python with error handling and structured items:

import scrapy
from scrapy.crawler import CrawlerProcess


class ArticleSpider(scrapy.Spider):
    name = "articles"
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1,
        "AUTOTHROTTLE_MAX_DELAY": 10,
        "ROBOTSTXT_OBEY": True,
        "USER_AGENT": "open-skills-bot/1.0 (+https://github.com/besoeasy/open-skills)",
        "RETRY_TIMES": 3,
        "FEEDS": {"output.json": {"format": "json"}},
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url or "https://quotes.toscrape.com"]

    def parse(self, response):
        for article in response.css("article, div.post, div.entry"):
            yield {
                "url": response.url,
                "title": article.css("h1::text, h2::text").get("").strip(),
                "body": " ".join(article.css("p::text").getall()),
            }

        # Rough same-site heuristic: follow relative links and links containing the current URL
        for link in response.css("a::attr(href)").getall():
            if link.startswith("/") or response.url in link:
                # The errback must be passed explicitly; Scrapy does not call it by name
                yield response.follow(link, self.parse, errback=self.errback)

    def errback(self, failure):
        # Called when a followed request fails (network error or non-2xx response)
        self.logger.error(f"Request failed: {failure.request.url}: {failure.value}")


# Run without a Scrapy project
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ArticleSpider, start_url="https://quotes.toscrape.com")
    process.start()
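The description above mentions item pipelines, which are plain Python classes exposing a process_item(item, spider) method that receives every yielded item. The following is a minimal sketch of one, not a definitive implementation: CleanQuotePipeline and the myproject.pipelines module path are hypothetical names chosen for illustration.

```python
class CleanQuotePipeline:
    """Hypothetical pipeline: trims whitespace and surrounding curly quotes from string fields."""

    def process_item(self, item, spider):
        cleaned = dict(item)
        for key, value in cleaned.items():
            if isinstance(value, str):
                # Strip outer whitespace first, then any leading/trailing curly quotes
                cleaned[key] = value.strip().strip("\u201c\u201d")
        return cleaned


# Enabling it in a project's settings.py (the module path is an assumption):
# ITEM_PIPELINES = {"myproject.pipelines.CleanQuotePipeline": 300}

pipeline = CleanQuotePipeline()
result = pipeline.process_item(
    {"url": "https://quotes.toscrape.com", "title": "  Hello  ", "body": "\u201cText\u201d"},
    spider=None,
)
print(result)
```

Calling process_item directly, as done here, is also a convenient way to unit-test a pipeline without starting a crawl.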

extract_with_xpath

Use XPath selectors for precise extraction from complex HTML structures.

import scrapy


class XPathSpider(scrapy.Spider):
    name = "xpath_example"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.xpath(".//span[@class='text']/text()").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
                "tags": quote.xpath(".//a[@class='tag']/text()").getall(),
            }

Output format

Scrapy yields Python dicts (or Item objects) per scraped record. When saved to file:

output.json — Array of JSON objects, one per item
output.csv — CSV with headers matching dict keys
output.jsonl — One JSON object per line (memory-efficient for large crawls)

Example item:

{ "text": "The world as we have created it is a process of our thinking.", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"] }
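Because a .jsonl feed stores one JSON object per line, it can be read back one item at a time instead of loading the whole file. A minimal standard-library sketch follows; iter_jsonl is a hypothetical helper name, and the two sample rows are illustrative stand-ins for real crawl output.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one item per non-empty line without loading the whole file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo with a throwaway file standing in for a real output.jsonl feed
with tempfile.TemporaryDirectory() as tmp:
    feed = Path(tmp) / "output.jsonl"
    feed.write_text(
        '{"author": "Albert Einstein", "tags": ["change", "world"]}\n'
        '{"author": "Jane Austen", "tags": ["books"]}\n',
        encoding="utf-8",
    )
    authors = [item["author"] for item in iter_jsonl(feed)]

print(authors)
```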

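Error shape: Scrapy logs errors to stderr. A failed request (network error or non-2xx response) is passed to an errback only when one is set on that request via errback=; a spider method that is merely named errback is not called automatically.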

Rate limits / Best practices

Enable ROBOTSTXT_OBEY = True to respect robots.txt automatically
Set DOWNLOAD_DELAY (seconds between requests) to avoid overloading servers
Enable AUTOTHROTTLE_ENABLED = True for adaptive rate limiting
Set a descriptive USER_AGENT identifying your bot
Use CONCURRENT_REQUESTS_PER_DOMAIN = 1 for polite single-domain crawling
Cache responses during development: HTTPCACHE_ENABLED = True
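Taken together, the settings listed above could look like this in a project's settings.py. This is a sketch using the values suggested in this section; the USER_AGENT string and its URL are placeholders to replace with your own bot's details.

```python
# settings.py (sketch; values follow the recommendations above)

ROBOTSTXT_OBEY = True               # respect robots.txt
DOWNLOAD_DELAY = 1                  # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server response times
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
HTTPCACHE_ENABLED = True            # cache responses; intended for development

# Placeholder identity: replace with a name and contact URL for your bot
USER_AGENT = "example-bot/1.0 (+https://example.com/bot-info)"
```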

Agent prompt

You have Scrapy web-scraping capability. When a user asks to scrape or crawl a website:

Confirm the target URL and data fields to extract (e.g., title, price, link)
Create a Scrapy spider using CSS or XPath selectors to target those fields
Enable ROBOTSTXT_OBEY=True and set DOWNLOAD_DELAY>=1 to be polite
Follow pagination links if the user needs data across multiple pages
Save results to output.json or output.csv

Always identify your bot with a descriptive USER_AGENT and never scrape login-protected or paywalled content.

Troubleshooting

Error: "Forbidden by robots.txt"

Symptom: Spider skips URLs and logs "Forbidden by robots.txt"
Solution: Review the site's robots.txt; only scrape paths that are allowed, or set ROBOTSTXT_OBEY = False if you have explicit permission from the site owner

Error: "Empty or missing data"

Symptom: Items are yielded with empty strings or None values
Solution: Inspect the page source (scrapy shell <url>) and adjust your CSS/XPath selectors to match the actual HTML structure

Error: "Too many redirects / 429 Too Many Requests"

Symptom: Requests fail with HTTP 429 or redirect loops
Solution: Increase DOWNLOAD_DELAY, enable AUTOTHROTTLE_ENABLED = True, or add a middleware that respects the Retry-After header
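A middleware that respects Retry-After first has to interpret the header, which may hold either a number of seconds or an HTTP-date. The helper below is a standard-library sketch of that parsing step only, not a complete Scrapy middleware; retry_after_seconds is a hypothetical name, and wiring it into a downloader middleware is left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After value into a non-negative delay in seconds.

    Returns None when the value is neither a whole number of seconds nor an HTTP-date.
    """
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        # Treat dates without an explicit offset as UTC
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120"))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
print(retry_after_seconds("soon"))
```

Passing now explicitly keeps the function deterministic in tests; in a crawl it would default to the current time.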

Error: "JavaScript-rendered content not found"

Symptom: Expected data is missing because the site uses client-side rendering
Solution: Use scrapy-playwright or scrapy-splash middleware to render JavaScript before parsing
