ecommerce-market-analyzer

Scrape e-commerce homepages from multiple websites in a target market, handle popups automatically, capture screenshots and HTML, extract product data, and generate comprehensive market analysis reports. Use when asked to "analyze [market] e-commerce market", "scrape e-commerce websites", "find hot products in [country]", "analyze product trends", or "generate market report for [region]". Works with German, English, and other international markets.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy the following and send it to your AI assistant to install the skill:

Install skill "ecommerce-market-analyzer" with this command: npx skills add nepp-an/ecommerce-market-analyzer-skill

E-commerce Market Analyzer

Automated workflow for scraping e-commerce websites, handling popups, extracting product data, and generating comprehensive market analysis reports.

Workflow Overview

This skill follows a 4-step workflow:

  1. Setup & Scraping - Run Playwright scraper to capture homepages
  2. Visual Analysis - Analyze screenshots to identify product categories
  3. Data Extraction - Parse HTML to extract specific products and prices
  4. Report Generation - Create comprehensive market analysis report
User provides website list
         ↓
Step 1: Run scraper (handles popups automatically)
         ↓
Step 2: Analyze screenshots visually
         ↓
Step 3: Extract structured data from HTML
         ↓
Step 4: Generate final report

Step 1: Setup & Scraping

Quick Start

When the user provides a list of e-commerce websites, immediately run the scraper:

# Create output directory
mkdir -p screenshots_clean

# Run the scraper
uv run python scripts/scrape_websites.py

Customizing the Website List

Edit scripts/scrape_websites.py and update the WEBSITES list:

WEBSITES = [
    "amazon.de",
    "ebay.de",
    "otto.de",
    # Add more websites...
]

Key Features

The scraper automatically:

  • Handles cookie consent popups (German, English, universal selectors)
  • Handles region/language selection dialogs
  • Captures full-page screenshots (1920x1080)
  • Saves HTML source code
  • Uses German locale settings (or customize for other markets)
  • Waits for page stabilization

Important: The script uses popup patterns from references/popup_patterns.md. Consult this file if dealing with new popup types.
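
The dismissal logic amounts to trying each known selector and moving on when it doesn't match. A minimal sketch, assuming a selector list like the one in references/popup_patterns.md (the shipped script may differ in details):

POPUP_SELECTORS = [
    'button:has-text("Alle akzeptieren")',  # German cookie consent
    'button:has-text("Accept all")',        # English cookie consent
    '#onetrust-accept-btn-handler',         # Common consent-framework ID
]

async def dismiss_popups(page):
    for selector in POPUP_SELECTORS:
        try:
            # Short timeout: popups either appear quickly or not at all
            await page.locator(selector).first.click(timeout=2000)
        except Exception:
            continue  # Selector absent on this site; try the next one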

Expected Output

After running, you'll have:

  • screenshots_clean/*.png - Full-page screenshots
  • screenshots_clean/*.html - HTML source files
  • Console output with success/failure summary

Success rate target: 85-95%

Common failures:

  • Anti-bot protection (requires manual intervention)
  • HTTP/2 protocol errors (some sites block automation)
  • Timeout on slow-loading sites

Step 2: Visual Analysis

Read Screenshots

After scraping, read the screenshot files to visually identify:

  • Product categories
  • Featured products
  • Promotional items
  • Visual design patterns

Example approach:

from pathlib import Path

screenshot_dir = Path("screenshots_clean")
screenshots = list(screenshot_dir.glob("*.png"))

# Read screenshots using the Read tool
for screenshot in screenshots[:5]:  # Start with 5 sites
    # Use the Read tool to view each image, noting product
    # categories and featured items as you go.
    print(f"Review: {screenshot.name}")

What to Look For

Product Categories:

  • Clothing & Fashion (Bekleidung)
  • Electronics (Elektronik)
  • Home & Furniture (Möbel & Wohnen)
  • Food & Groceries (Lebensmittel)
  • Books & Media (Bücher)
  • Beauty & Personal Care (Beauty & Pflege)
  • Sports & Outdoor (Sport)
  • Toys & Baby (Spielzeug & Baby)

Featured Products:

  • Homepage banners
  • Promotional sections
  • "Deal of the day" items
  • New arrivals

Take notes on recurring patterns across multiple sites - these indicate market trends.


Step 3: Data Extraction

Strategy Selection

Choose extraction strategy based on site structure. See references/html_parsing_patterns.md for complete patterns.

Quick decision tree:

  1. Try JSON-LD schema extraction (best for structured data; see the sketch after this list)
  2. Fall back to data attribute extraction
  3. Fall back to class-based extraction
  4. Last resort: keyword matching
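
For option 1, a minimal JSON-LD sketch; the pattern and field names follow the schema.org Product convention, so treat this as a starting point rather than the skill's exact implementation:

import json
import re
from pathlib import Path

def extract_jsonld_products(html_path):
    """Pull schema.org Product entries from JSON-LD script blocks."""
    content = Path(html_path).read_text(encoding='utf-8')
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    products = []
    for block in re.findall(pattern, content, re.DOTALL):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Malformed or truncated block; skip it
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict) and item.get('@type') == 'Product':
                products.append(item)
    return products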

Example: Extract from REWE.de

import re
from pathlib import Path

html_file = Path("screenshots_clean/rewe.de.html")
content = html_file.read_text(encoding='utf-8')

# REWE-specific patterns
title_pattern = r'data-offer-title="([^"]+)"'
price_pattern = r'<div class="cor-offer-price__tag-price">([^<]+)</div>'

titles = re.findall(title_pattern, content)
prices = re.findall(price_pattern, content)

for i, title in enumerate(titles[:10]):
    price = prices[i] if i < len(prices) else "N/A"
    print(f"{title}: {price}€")

Platform-Specific Parsing

Each e-commerce platform has a unique HTML structure. Consult references/html_parsing_patterns.md for:

  • Amazon.de patterns
  • eBay.de patterns
  • Otto.de patterns
  • Zalando/AboutYou patterns
  • REWE/Lidl supermarket patterns
  • And more...

Price Normalization

Always normalize prices:

def normalize_price(price_str):
    """Convert German format (1.234,56 €) to float."""
    price_str = price_str.replace('€', '').replace('EUR', '').strip()
    if ',' in price_str and '.' in price_str:
        # Thousands separator + decimal comma: 1.234,56 -> 1234.56
        price_str = price_str.replace('.', '').replace(',', '.')
    elif ',' in price_str:
        # Decimal comma only: 89,90 -> 89.90
        price_str = price_str.replace(',', '.')
    try:
        return float(price_str)
    except ValueError:
        return None
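
For example, normalize_price("1.234,56 €") returns 1234.56 and normalize_price("89,90€") returns 89.9.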

Handling Large Files

For HTML files >25k tokens:

# Use grep to search for specific patterns
grep -o 'data-product-name="[^"]*"' amazon.de.html | head -20

# Or extract specific sections
grep -A 5 'product-title' ebay.de.html

Extraction Best Practices

  1. Try multiple patterns - Start with JSON-LD, fall back as needed
  2. Validate extractions - Check for reasonable length (10-100 chars)
  3. Remove duplicates - Use sets to track seen products
  4. Limit results - Cap at 10-20 products per site
  5. Handle encoding - Always use encoding='utf-8'
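
A minimal sketch applying practices 2-4, assuming raw_products is a list of (title, price) tuples from your extraction step:

def clean_products(raw_products, max_results=20):
    """Validate, dedupe, and cap extracted (title, price) pairs."""
    seen = set()
    cleaned = []
    for title, price in raw_products:
        title = title.strip()
        if not 10 <= len(title) <= 100:  # Skip fragments and nav text
            continue
        if title in seen:  # Remove duplicates
            continue
        seen.add(title)
        cleaned.append((title, price))
        if len(cleaned) >= max_results:  # Cap results per site
            break
    return cleaned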

Step 4: Report Generation

Use the Report Template

Copy and customize assets/report_template.md:

cp assets/report_template.md final_report.md

Report Structure

The template includes these sections:

  1. Executive Summary - Key findings
  2. Top Product Categories - Ranked list with percentages
  3. Verified Product Prices - Extracted data with exact prices
  4. Platform-Specific Analysis - Per-site breakdown
  5. Market Trends - Growth trends and consumer behavior
  6. Seasonal Characteristics - Current and predicted
  7. Technical Implementation - Success metrics and limitations
  8. Business Insights - Opportunities and recommendations
  9. Data Sources - Success/failure breakdown
  10. Conclusions - Actionable takeaways

Filling the Template

Replace placeholder tokens:

  • {MARKET} → German, UK, US, etc.
  • {NUM_SITES} → 23, 25, etc.
  • {DATE} → 2026-03-19
  • {SUCCESS_RATE} → 92
  • {CATEGORY_1} → Clothing & Fashion
  • {PERCENTAGE_1} → 28
  • And so on...
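
Filling can be scripted with plain string replacement; the values below are illustrative placeholders, not real results:

from pathlib import Path

replacements = {
    "{MARKET}": "German",
    "{NUM_SITES}": "23",
    "{DATE}": "2026-03-19",
    "{SUCCESS_RATE}": "92",
    "{CATEGORY_1}": "Clothing & Fashion",
    "{PERCENTAGE_1}": "28",
}

report = Path("final_report.md").read_text(encoding='utf-8')
for token, value in replacements.items():
    report = report.replace(token, value)
Path("final_report.md").write_text(report, encoding='utf-8')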

Data Quality Indicators

Include these metrics:

  • Success rate: % of successfully scraped sites
  • Popup handling: # of sites with popups handled
  • Price accuracy: % of verified prices
  • Screenshot quality: Resolution and file size
  • HTML completeness: Average file size
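
Most of these fall out of the output directory; a quick sketch, assuming 25 attempted sites (adjust to your WEBSITES list):

from pathlib import Path

out = Path("screenshots_clean")
shots = list(out.glob("*.png"))
pages = list(out.glob("*.html"))

attempted = 25  # Hypothetical: the length of your WEBSITES list
print(f"Success rate: {100 * len(shots) / attempted:.0f}%")
if pages:
    avg_kb = sum(p.stat().st_size for p in pages) / len(pages) / 1024
    print(f"HTML completeness: {avg_kb:.0f} KB average")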

Writing Tips

Be bilingual (for German market):

  • Product names: German + Chinese/English translation
  • Categories: "Bekleidung / Clothing"
  • Maintain both languages throughout

Be specific:

  • ❌ "Electronics are popular"
  • ✅ "AirPods 4 (89,90€ on eBay), PlayStation 5, and Samsung smartphones are top electronics"

Include evidence:

  • Reference screenshot file names
  • Quote exact prices with sources
  • Link specific platforms to products

Troubleshooting

Issue: Popup Not Closed

Solution: Check references/popup_patterns.md for the specific site. Add custom selector if needed:

# In scripts/scrape_websites.py, add to popup_selectors list:
popup_selectors = [
    # ... existing selectors ...
    'button:has-text("Neue Popup Text")',  # Add custom
]

Issue: HTML Parsing Returns Empty

Diagnose:

  1. Check if HTML file exists and has content
  2. Verify the pattern with grep: grep -o "your-pattern" file.html
  3. Try alternative patterns from references/html_parsing_patterns.md
  4. Use keyword matching as fallback

Issue: Anti-Bot Detection

Symptoms: CAPTCHA, "Verify you are human", IP blocking

Solutions:

  1. Add delays between requests (already in the script)
  2. Customize user agent string
  3. Use browser fingerprinting evasion
  4. For production: consider proxy rotation (not included)

Issue: Timeout Errors

Solution: Adjust the timeout in the script:

await page.goto(url, wait_until="domcontentloaded", timeout=120000)  # 2min

Or use more relaxed loading strategy:

await page.goto(url, wait_until="load", timeout=90000)

Market-Specific Configuration

German Market (Default)

context = await browser.new_context(
    locale="de-DE",
    timezone_id="Europe/Berlin",
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."
)

Popup patterns: See references/popup_patterns.md → German Market section

UK Market

context = await browser.new_context(
    locale="en-GB",
    timezone_id="Europe/London",
)

Popup patterns: Use English/International selectors

US Market

context = await browser.new_context(
    locale="en-US",
    timezone_id="America/New_York",
)

Other Markets

Adjust locale and timezone_id accordingly, and update the popup selectors in the script to match the market's language.


Advanced Usage

Parallel Scraping

For large website lists, modify the script to use concurrent scraping:

import asyncio
from playwright.async_api import async_playwright

# capture_homepage and output_dir come from scripts/scrape_websites.py
async def scrape_all(websites):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        tasks = [capture_homepage(browser, url, output_dir) for url in websites]
        results = await asyncio.gather(*tasks)
        await browser.close()
    return results

Note: Be respectful of rate limits. Use delays.
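
One way to honor rate limits while staying concurrent is a semaphore; this is an assumption about how you might bound it, not part of the shipped script:

import asyncio

sem = asyncio.Semaphore(3)  # At most three sites in flight at once

async def scrape_limited(browser, url, output_dir):
    # capture_homepage comes from scripts/scrape_websites.py, as above
    async with sem:
        result = await capture_homepage(browser, url, output_dir)
        await asyncio.sleep(2)  # Breathing room between requests
        return result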

Custom Analysis

Beyond the standard workflow, you can:

  • Compare prices across platforms
  • Track price changes over time (run periodically)
  • Identify pricing patterns (premium vs discount)
  • Analyze promotional strategies
  • Monitor competitor activity
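
For the first of these, a sketch of cross-platform price comparison, assuming product dicts shaped like those in the CSV export example below (the rows here are hypothetical):

from collections import defaultdict

products = [
    {"platform": "eBay", "name": "AirPods 4", "price": 89.90},
    {"platform": "Amazon", "name": "AirPods 4", "price": 94.99},  # Hypothetical
]

by_name = defaultdict(list)
for p in products:
    by_name[p["name"]].append((p["platform"], p["price"]))

for name, offers in by_name.items():
    if len(offers) > 1:  # Listed on more than one platform
        platform, price = min(offers, key=lambda o: o[1])
        print(f"{name}: cheapest at {price}€ on {platform}")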

Exporting Data

Consider exporting to structured formats:

  • CSV: For spreadsheet analysis
  • JSON: For programmatic access
  • Database: For long-term tracking

Example CSV export:

import csv

# products: the list of product dicts assembled in Step 3
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Platform', 'Product', 'Price', 'Category'])
    for product in products:
        writer.writerow([product['platform'], product['name'],
                        product['price'], product['category']])

Best Practices

Ethical Scraping

  1. Respect robots.txt - Check before scraping (see the sketch after this list)
  2. Rate limiting - Don't overwhelm servers (script includes delays)
  3. Terms of Service - Review site ToS
  4. Personal use - This skill is for market research, not commercial resale
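
For point 1, Python's standard library can do the check; a minimal sketch (network error handling is left to you):

from urllib.robotparser import RobotFileParser

def allowed_to_fetch(domain, path="/", agent="*"):
    rp = RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()  # Fetches robots.txt; wrap in try/except for flaky hosts
    return rp.can_fetch(agent, f"https://{domain}{path}")

# e.g. allowed_to_fetch("otto.de") before adding otto.de to WEBSITES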

Data Quality

  1. Verify prices - Cross-check suspicious values
  2. Update regularly - E-commerce changes fast
  3. Document assumptions - Note any manual adjustments
  4. Keep raw data - Save screenshots and HTML for reference

Report Quality

  1. Be objective - Base conclusions on data
  2. Show your work - Reference sources
  3. Contextualize - Explain market-specific factors
  4. Actionable - Provide specific recommendations

Resources Reference

scripts/scrape_websites.py

Main scraper with automatic popup handling. Uses Playwright to capture homepages.

Usage: uv run python scripts/scrape_websites.py

references/popup_patterns.md

Comprehensive collection of popup selectors for different markets and platforms.

When to read: When encountering new popup types or troubleshooting popup handling.

references/html_parsing_patterns.md

Platform-specific HTML parsing patterns and extraction strategies.

When to read: When extracting product data from HTML files. Contains patterns for Amazon, eBay, REWE, Otto, Zalando, and generic strategies.

assets/report_template.md

Structured template for the final market analysis report.

Usage: Copy and fill in with analysis results.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.


gitlab-mr-reviewer

Use when reviewing GitLab merge requests, checking MR diff risk, posting GitLab review comments, performing approve/request changes, or sending MR review notifications.


Voice Transcriber Toolkit

Voice-to-Text Transcription Toolkit - speech recognition and transcription with Whisper/Vosk engines, batch processing, and subtitle export.


Gigo Lobster Taster

🦞 GIGO · gigo-lobster-taster: official tasting mode: runs the full evaluation, uploads to the cloud by default, generates a personal results page, and joins the leaderboard. Triggers: 试吃我的龙虾 / 品鉴我的龙虾 / lobster taste / lobster taster.


Gigo Lobster Local

🦞 GIGO · gigo-lobster-local: local mode: runs the full evaluation but does not upload to the cloud or register a personal results page; the certificate QR code links back to the official homepage. Triggers: 本地试吃龙虾 / 离线试吃龙虾 / local lobster taste / offline lobster taste.
