# Web Scraping
Extract data with the lightest reliable method first.
## Choose the approach
- Use `web_fetch` for simple public pages when the needed content is already in the HTML; a minimal fetch-and-parse sketch follows this list.
- Use `browser` when the site is dynamic or requires clicking, infinite scroll, filters, tabs, or login/session state.
- Use `web_search` only to discover candidate pages when the target URL is unknown.
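As a rough illustration of the "lightest method first" rule, here is a minimal fetch-and-parse sketch using `requests` and `BeautifulSoup` as stand-ins for a `web_fetch`-style call. The URL and selectors are hypothetical placeholders, not real endpoints.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target: a static listing page whose content is already in the HTML.
URL = "https://example.com/listings"

resp = requests.get(URL, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

# If the items render without JavaScript, this is all that is needed.
# If this returns nothing, the page is likely dynamic and the browser path applies.
items = [a.get_text(strip=True) for a in soup.select("h2.title a")]
print(items)
```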
## Default workflow
- Identify the target site and exact fields to collect.
- Test one page first.
- Decide the extraction method:
  - `web_fetch` for readable article/listing text
  - `browser` snapshot for dynamic DOM inspection
- Normalize the output into a stable schema; the sketch after this list shows this together with request pacing and deduplication.
- If scraping multiple pages, avoid tight loops and serialize requests.
- Deduplicate by URL or stable item id.
- Save results in the workspace when the task is larger than a quick one-off.
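A minimal sketch of the normalize/pace/deduplicate steps, assuming a hypothetical `fetch_page` helper that returns one raw dict per page; the field names match the schema shown under Output guidance below.

```python
import time

def scrape_pages(urls, fetch_page):
    """Fetch pages one at a time, normalize each record, and dedupe by URL.

    `fetch_page` is a hypothetical callable returning a raw dict per page.
    """
    seen = set()      # stable item ids (here: URLs) already collected
    records = []
    for url in urls:
        raw = fetch_page(url)          # one request at a time, no tight loop
        record = {                     # normalize into a stable schema
            "title": raw.get("title", "").strip(),
            "url": url,
            "source": raw.get("source", ""),
            "date": raw.get("date", ""),
            "summary": raw.get("summary", ""),
        }
        if record["url"] not in seen:  # dedupe by URL
            seen.add(record["url"])
            records.append(record)
        time.sleep(1)                  # pace requests between pages
    return records
```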
## Browser scraping pattern
- Open the page.
- Take a snapshot.
- Interact only as needed: search, click filters, pagination, expand sections.
- Re-snapshot after each meaningful state change.
- Extract only the fields the user asked for.
- Close tabs when finished (a sketch of this loop follows the list).
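If the browser tool were backed by something like Playwright (an assumption; the actual tool and its API may differ), the pattern above could look like this sketch. The URL and selectors are hypothetical.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    try:
        # Open the page.
        page.goto("https://example.com/results")   # hypothetical URL

        # Interact only as needed: apply one filter, then wait for the new state.
        page.click("button#filter-recent")          # hypothetical selector
        page.wait_for_selector("div.result")

        # "Re-snapshot": read the current DOM and extract only the requested fields.
        titles = page.locator("div.result h3").all_inner_texts()
        print(titles)
    finally:
        browser.close()  # always close what you opened
```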
## Output guidance
Prefer one of these formats:
- concise bullet summary
- JSON array of objects
- CSV/TSV when the user wants exportable rows (a conversion sketch follows the JSON example)
Use explicit keys, for example:
```json
[
  {
    "title": "...",
    "url": "...",
    "source": "...",
    "date": "...",
    "summary": "..."
  }
]
```
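When exportable rows are requested, the same records convert directly to CSV with the standard library; a minimal sketch:

```python
import csv
import sys

FIELDS = ["title", "url", "source", "date", "summary"]

def write_csv(records, out=sys.stdout):
    """Write normalized records as CSV rows with an explicit header."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```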
## Reliability rules
- Do not invent missing fields.
- If a site blocks access, say so and switch sources when appropriate.
- For news/results pages, capture source + title + link at minimum.
- For large jobs, checkpoint partial results to a workspace file; a batching sketch follows this list.
- Prefer fewer, larger writes over many tiny writes.
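One way to checkpoint partial results while keeping writes few and large, assuming a workspace directory; the path and batch size are illustrative:

```python
import json
from pathlib import Path

CHECKPOINT = Path("workspace/scrape_results.partial.jsonl")  # illustrative path
BATCH_SIZE = 50  # buffer this many records per write

def checkpoint_records(records):
    """Append records to a JSONL checkpoint file in batches."""
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    with CHECKPOINT.open("a", encoding="utf-8") as f:
        buffer = []
        for record in records:
            buffer.append(json.dumps(record, ensure_ascii=False))
            if len(buffer) >= BATCH_SIZE:
                f.write("\n".join(buffer) + "\n")  # one larger write
                buffer.clear()
        if buffer:
            f.write("\n".join(buffer) + "\n")      # flush the remainder
```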
## Cleanup
- Close browser tabs opened for scraping.
- If you create state/output files, store them under the workspace and name them clearly.