Web Extractor
Extract complete text content from web pages, even when content is dynamically loaded by JavaScript, behind authentication, or uses virtual scrolling.
When This Skill Is Needed
Many modern web pages don't serve their content as static HTML. Instead, content is loaded by JavaScript after the page renders, making simple HTTP fetches return empty or partial results. Common scenarios:
- Authentication-protected pages: Sites requiring login (Google Docs, Notion, etc.)
- JS-rendered SPAs: React/Vue/Angular apps where content lives in JavaScript state
- Virtual scrolling: Long documents that only render visible content in the DOM (the content that scrolled past is removed, and content below isn't yet created)
- Lazy-loaded content: Sections that load as you scroll down
The key insight: even though JS loads content dynamically, once it renders, the content enters the DOM and becomes readable via querySelector / innerText. The challenge is making sure each section gets rendered (usually by scrolling) and reading it before it gets removed (in virtual scroll cases).
Strategy Decision Tree
Start: navigate to URL, wait 3-5s, call get_page_text
│
├─ Got complete content? → DONE (Strategy 1: Simple)
│
├─ Got partial content?
│   ├─ Page has "load more" or infinite scroll? → Strategy 2: Lazy Load
│   └─ Content seems truncated at viewport boundary? → Strategy 3: Virtual Scroll
│
├─ Got almost nothing / page is data-heavy?
│   └─ Check read_network_requests for API calls → Strategy 4: API Intercept
│
└─ Got only script/boilerplate? Page uses <canvas> (Unity/WebGL/WebGPU)?
    └─ Strategy 5: Canvas Visual Extraction (screenshot + transcription)
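The tree above can also be read as a small pure function. The sketch below is only an illustration: the `obs` flags (`complete`, `partial`, `hasLoadMore`, `infiniteScroll`, `truncatedAtViewport`, `onlyScripts`, `hasCanvas`) are hypothetical names for what you observed after the first get_page_text call, not fields returned by any tool.

```javascript
// Hypothetical observation flags, filled in by whoever inspects the first
// get_page_text result; none of these names come from a real tool API.
function chooseStrategy(obs) {
  if (obs.complete) {
    return 'Strategy 1: Simple';
  }
  if (obs.partial && (obs.hasLoadMore || obs.infiniteScroll)) {
    return 'Strategy 2: Lazy Load';
  }
  if (obs.partial && obs.truncatedAtViewport) {
    return 'Strategy 3: Virtual Scroll';
  }
  if (obs.onlyScripts && obs.hasCanvas) {
    return 'Strategy 5: Canvas Visual Extraction';
  }
  // Almost nothing came back, or the page is data-heavy: inspect API calls.
  return 'Strategy 4: API Intercept';
}

console.log(chooseStrategy({ partial: true, truncatedAtViewport: true }));
// prints "Strategy 3: Virtual Scroll"
```

Falling through to Strategy 4 when nothing else matches is a choice made for this sketch; the tree itself does not say what to do with an ambiguous partial result.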
Strategy 1: Simple Pages
For pages where all content loads once and stays in the DOM.
Steps:
1. navigate to the URL
2. wait 3-5 seconds for JS to finish rendering
3. get_page_text, which should capture everything
Works for: blogs, news articles, server-side rendered documentation.
Strategy 2: Lazy-Loaded Pages
Content appends to the DOM as you scroll, but previously-loaded content stays.
Steps:
1. Navigate and wait for initial load
2. get_page_text to capture the first section
3. Scroll down using computer action (scroll or End key)
4. wait 1-2 seconds for new content to load
5. get_page_text again; new content will be at the bottom
6. Repeat steps 3-5 until get_page_text returns no new content
7. Merge all captured text, removing duplicate overlapping sections
Works for: infinite-scroll feeds, long articles that load in chunks.
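The merge in step 7 can be sketched as a line-based overlap join. This is a minimal illustration, assuming each capture is a plain string and that overlapping regions are textually identical line for line; `mergeCaptures` is a hypothetical helper name, not part of any tool.

```javascript
// Joins captures in order, dropping the longest run of lines that the end of
// the merged text and the start of the next capture have in common.
function mergeCaptures(captures) {
  let merged = [];
  for (const text of captures) {
    const lines = text.split('\n');
    const max = Math.min(merged.length, lines.length);
    let overlap = 0;
    // Try the largest possible overlap first, then shrink.
    for (let k = max; k > 0; k--) {
      const tail = merged.slice(merged.length - k);
      if (tail.every((line, i) => line === lines[i])) {
        overlap = k;
        break;
      }
    }
    merged = merged.concat(lines.slice(overlap));
  }
  return merged.join('\n');
}

console.log(mergeCaptures(['a\nb\nc', 'b\nc\nd']));
// prints a, b, c, d on separate lines
```

Because lazy-loaded pages keep earlier content in the DOM, a later capture often begins with everything captured so far; the same overlap search handles that case by treating the whole earlier text as the overlap.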
Strategy 3: Virtual Scrolling Pages
This is the hardest case. The page actively removes off-screen content from the DOM and only renders what's currently in the viewport. The full content never exists in the DOM simultaneously.
Step 1: Find the Scroll Container
Most virtual-scroll pages scroll inside a specific <div>, not the browser window. Use javascript_tool to locate it:
// Quick search by common class names
const el = document.querySelector(
  '.doc-content, .article-content, .ql-editor, ' +
  '[class*="editor"], [class*="scroll-container"], ' +
  '[class*="virtual"], main article'
);
// A real scroll container is noticeably taller than its visible area
if (el && el.scrollHeight > el.clientHeight + 200) {
  // Evaluates to a one-line summary of the container that was found
  `Found: <${el.tagName}> class="${el.className}" ` +
  `scrollHeight=${el.scrollHeight} clientHeight=${el.clientHeight}`;
}
If the quick search misses it, use a broader search:
// Broad search: first element that scrolls vertically and overflows its box
let found = 'No scroll container found';
const all = document.querySelectorAll('*');
for (const el of all) {
  const s = getComputedStyle(el);
  if ((s.overflowY === 'auto' || s.overflowY === 'scroll') &&
      el.scrollHeight > el.clientHeight + 200) {
    found = `Found: <${el.tagName}> class="${el.className}" ` +
      `scrollHeight=${el.scrollHeight} clientHeight=${el.clientHeight}`;
    break;
  }
}
found;
Record the scrollHeight (total document length) and clientHeight (viewport height).
Step 2: Scroll and Capture
Move through the document in steps equal to clientHeight, reading at each position:
positions = [0, clientHeight, 2*clientHeight, ..., scrollHeight]
for each position:
  1. javascript_tool: set container.scrollTop = position
  2. wait 1-2 seconds (separate tool call; don't loop in JS, it will time out)
  3. get_page_text to capture the currently visible content
  4. store the result
Each tool call is separate because javascript_tool has a 30-second timeout. Trying to scroll+wait+read in a single JS execution for many positions will fail.
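The list of scroll positions for Step 2 can be computed up front. The sketch below is an illustration rather than a prescribed method: `scrollPositions` and its optional `overlap` parameter are hypothetical. It clamps the final position to scrollHeight - clientHeight, since that is the largest scrollTop a browser will actually apply.

```javascript
// Returns the scrollTop values to visit, stepping by one viewport height
// (minus an optional overlap in pixels) and ending at the last reachable offset.
function scrollPositions(scrollHeight, clientHeight, overlap = 0) {
  const step = clientHeight - overlap;
  if (clientHeight <= 0 || step <= 0) {
    throw new RangeError('clientHeight must be positive and larger than overlap');
  }
  const maxTop = Math.max(0, scrollHeight - clientHeight);
  const positions = [];
  for (let top = 0; top < maxTop; top += step) {
    positions.push(top);
  }
  positions.push(maxTop); // always read the very bottom of the document
  return positions;
}

console.log(scrollPositions(1000, 300));
// prints [ 0, 300, 600, 700 ]
```

A small overlap (for example 100 pixels) makes consecutive captures share a few lines, which makes the later merge step easier to verify.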
Step 3: Merge and Deduplicate
Each capture includes repeated elements (navigation, sidebar, TOC) plus the unique content visible at that scroll position. To merge:
- The repeating parts (nav, sidebar, headers) appear identically in every capture
- The unique content portion changes with each scroll position
- Strip the repeated portions and concatenate the unique content in order
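One way to approximate this merge is to treat any line that appears in every capture as page chrome. The sketch below assumes captures are plain strings split on newlines; `mergeUniqueContent` is a hypothetical helper, and the "present in every capture" rule is a heuristic chosen for this example.

```javascript
// Drops lines that occur in every capture (nav, sidebar, TOC) and
// concatenates what remains in capture order.
function mergeUniqueContent(captures) {
  const lineLists = captures.map((text) =>
    text.split('\n').map((line) => line.trim()).filter((line) => line !== '')
  );
  // With a single capture there is nothing to compare against.
  if (lineLists.length < 2) {
    return lineLists.flat().join('\n');
  }
  const lineSets = lineLists.map((lines) => new Set(lines));
  const isRepeated = (line) => lineSets.every((set) => set.has(line));
  return lineLists
    .map((lines) => lines.filter((line) => !isRepeated(line)))
    .flat()
    .join('\n');
}

const captures = [
  'Nav\nIntro\nFooter',
  'Nav\nBody\nFooter',
  'Nav\nEnd\nFooter',
];
console.log(mergeUniqueContent(captures));
// prints Intro, Body, End on separate lines
```

The heuristic would also drop a genuine content line that happens to be visible at every scroll position, so it is worth spot-checking the result against one raw capture.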
Real-World Example: 轻雀文档
Here's what we actually did to extract a 33,000px-tall document:
- Found scrollable container: .vodka-appview-editor (scrollHeight: 33202)
- Scrolled to positions: 0, 2500, 5000, 10000, 15000, 17500, 20000, 22500, 25000, 30000
- Called get_page_text at each position
- Each call returned ~2000-4000 chars of unique content
- Merged into a structured Markdown file with proper headings
Strategy 4: API Interception
Instead of scraping the rendered DOM, capture the raw data the page fetches.
Steps:
1. Navigate to the URL and let it load
2. read_network_requests with a urlPattern filter (e.g., /api/, /graphql)
3. Identify responses containing the document data (usually JSON)
4. Extract content directly from the API response payload
This is often the cleanest approach for data-heavy pages, dashboards, or apps with clear REST/GraphQL backends. The data is structured and complete, without the noise of navigation elements.
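Once a JSON response has been identified, pulling the text out is usually a matter of walking the payload. The sketch below is illustrative only: the payload shape is invented, and `collectStrings` with its `minLength` cutoff (used here to skip ids and short labels) is a hypothetical helper, not a feature of any API.

```javascript
// Recursively collects string values of at least minLength characters,
// in the order they appear in the parsed JSON.
function collectStrings(node, minLength = 20, out = []) {
  if (typeof node === 'string') {
    const text = node.trim();
    if (text.length >= minLength) {
      out.push(text);
    }
  } else if (Array.isArray(node)) {
    for (const item of node) {
      collectStrings(item, minLength, out);
    }
  } else if (node !== null && typeof node === 'object') {
    for (const value of Object.values(node)) {
      collectStrings(value, minLength, out);
    }
  }
  return out;
}

// Hypothetical response body; real payloads will be shaped differently.
const payload = JSON.parse(
  '{"data":{"doc":{"id":"abc","blocks":[' +
  '{"text":"First paragraph of the document."},' +
  '{"text":"Second paragraph of the document."}]}}}'
);
console.log(collectStrings(payload).join('\n\n'));
```

When the payload has a known schema, reading the specific fields directly is more reliable than a length cutoff.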
Strategy 5: Canvas-Rendered Content (Unity WebGL, WebGPU, etc.)
When text is rendered inside an HTML <canvas> by a game engine or custom WebGL/WebGPU application, it exists only as GPU-drawn pixels — not as DOM text. All DOM-based extraction methods (get_page_text, innerText, clipboard) will fail completely.
How to Detect Canvas-Rendered Pages
Signs that you're dealing with canvas-rendered content:
- get_page_text returns only JavaScript loader code, no readable content
- The page contains a prominent <canvas> element (check with read_page)
- Page title mentions "Unity", "Unreal", "WebGL", "WebGPU", or a game engine
- The HTML source has createUnityInstance, .wasm, or .data file references
- The page loads large binary assets (multi-MB .data or .wasm files)
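Several of these signs can be checked mechanically once you have the page title, HTML source, and get_page_text output as strings. The sketch below is a rough heuristic under stated assumptions: `canvasSignals`, the signal names, and the 200-character threshold for "little readable text" are all invented for illustration.

```javascript
// Returns the names of the canvas-rendering signs that the inputs exhibit.
function canvasSignals({ title = '', html = '', pageText = '' }) {
  const signals = [];
  if (/<canvas[\s>]/i.test(html)) {
    signals.push('canvas-element');
  }
  if (/\b(unity|unreal|webgl|webgpu)\b/i.test(title)) {
    signals.push('engine-in-title');
  }
  if (/createUnityInstance|\.wasm\b|\.data\b/.test(html)) {
    signals.push('engine-assets');
  }
  // Assumed threshold: very short text suggests only loader code came back.
  if (pageText.trim().length < 200) {
    signals.push('little-readable-text');
  }
  return signals;
}

const signals = canvasSignals({
  title: 'Unity WebGL Player | SRW2',
  html: '<canvas id="unity-canvas"></canvas>' +
        '<script>createUnityInstance(canvas, { dataUrl: "Build/app.data" })</script>',
  pageText: '',
});
console.log(signals.length); // prints 4
```

No single sign is conclusive; treating two or more as a reason to try Strategy 5 is a reasonable starting point.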
What Does NOT Work (Do Not Attempt)
These approaches have been tested and confirmed to fail for canvas content:
- get_page_text / DOM extraction: Canvas is opaque to the DOM, so this returns nothing useful
- Ctrl+A / Ctrl+C (keyboard copy): Game engines intercept keyboard events; clipboard stays empty
- navigator.clipboard.readText(): Yielded no text in testing; there is no text to copy from a canvas
- Unity JavaScript API (SendMessage): The Unity instance is typically not exposed globally (captured inside a .then() callback). Even if found, there's no standard method to export UI text
- Binary data file parsing (Python strings): Unity's serialization fragments text across the binary, so extraction produces heavily corrupted output with garbled characters
- Scrollbar drag (left_click_drag): Unity's UI drag handlers don't respond to browser drag events
- Keyboard scrolling (Page Down, Home, arrow keys): Unity captures these keys but doesn't bind them to Scroll Rect components by default
What DOES Work: Mouse Wheel Scrolling + Screenshot Transcription
Unity's Scroll Rect component natively responds to OnScroll events from the mouse wheel. This is the one browser input that reliably propagates to Unity's EventSystem.
Steps:
1. Navigate to the URL and wait for the canvas to finish loading
2. Take an initial screenshot to see the first portion of visible text
3. Use computer tool with scroll action (mouse wheel) over the canvas area:
   - Start with 3–5 scroll ticks downward
   - Use the coordinate parameter to target the center of the text area
   - Take a screenshot after each scroll
4. Continue scrolling and capturing until you see empty space below the last line of text
5. If gaps exist between captures, scroll back up in smaller increments (1–2 ticks) and re-capture
6. Visually read/transcribe the text from the series of screenshots into a text file
7. Save the transcribed content to the user's workspace
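The gap check in step 5 can be made explicit once each screenshot has been transcribed. The sketch below assumes one string per screenshot and that overlapping lines were transcribed identically, which will not always hold for hand or OCR transcription; `findGaps` is a hypothetical helper.

```javascript
// Returns the indexes of transcripts that share no line with the one before,
// which suggests content was skipped between those two screenshots.
function findGaps(transcripts) {
  const nonEmptyLines = (text) =>
    text.split('\n').map((line) => line.trim()).filter((line) => line !== '');
  const gaps = [];
  for (let i = 1; i < transcripts.length; i++) {
    const previous = new Set(nonEmptyLines(transcripts[i - 1]));
    const sharesLine = nonEmptyLines(transcripts[i]).some((line) => previous.has(line));
    if (!sharesLine) {
      gaps.push(i);
    }
  }
  return gaps;
}

console.log(findGaps(['a\nb\nc', 'c\nd\ne', 'g\nh']));
// prints [ 2 ]
```

An index in the result means scrolling back up between that screenshot and the previous one and re-capturing with fewer ticks.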
Scroll Increment Guidelines
| Content density | Recommended ticks | Overlap strategy |
|-----------------|-------------------|------------------|
| Large text | 3–5 ticks | ~2 lines overlap between screenshots |
| Small text | 1–3 ticks | ~3 lines overlap between screenshots |
| Unknown | 3 ticks | Check overlap, adjust as needed |
Example: Unity WebGL Text Extraction
- Navigated to http://127.0.0.1:5500/SRW2/index.html
- get_page_text → returned only JS loader code (Strategy 5 triggered)
- screenshot → saw "Self Rendered Web Test" title + first portion of text
- scroll(coordinate=[650,500], direction=down, amount=3) → took screenshot
- Repeated scroll+screenshot ~8 more times
- Scrolled back up by 3 ticks to fill one gap between captures
- Transcribed all visible text from ~10 screenshots
- Saved complete content (2 paragraphs, ~400 words) to extracted_content.txt
Applicability Beyond Unity
This strategy works for any canvas-based rendering:
- Unity WebGL builds
- Unreal Engine HTML5 exports
- Godot web exports
- Custom WebGL/WebGPU applications
- PDF.js canvas-rendered PDFs (when DOM fallback is disabled)
- Figma embedded previews
- Any app drawing text to <canvas> via 2D context or WebGL
Common Pitfalls
Reading too fast after scrolling: JS needs time to render. Always use a separate wait call (1-2 seconds) between scrolling and reading.
Virtual scroll erases content: Don't scroll to the bottom expecting to read everything — only the bottom section exists in the DOM at that point.
Wrong scroll target: Many apps scroll inside a <div>, not window. If window.scrollTo doesn't trigger new content, find the real scroll container.
Cookie banners / modals blocking content: Dismiss these first. Use find to locate "Accept" or close buttons and click them before extracting.
JS timeout in loops: javascript_tool times out at 30s. Never put await loops or setTimeout chains in a single JS call. Instead, make each scroll-wait-read cycle a separate set of tool calls.
Blocked content: Some text in javascript_tool results may be blocked by safety filters. If a substring retrieval returns [BLOCKED], try get_page_text at that scroll position instead, which uses a different extraction path.
Canvas content mistaken for DOM content: If get_page_text returns only script/boilerplate and the page has a <canvas>, don't keep trying DOM-based methods; switch immediately to Strategy 5 (visual extraction). All DOM approaches will fail for canvas-rendered text.
Scrollbar drag on canvas apps: Never use left_click_drag to try to scroll inside Unity/Unreal/Godot canvas areas. These engines use their own event systems. Only mouse wheel (scroll action) works reliably.
Binary asset parsing temptation: It's tempting to extract text from Unity .data files or Unreal .pak files using strings or byte scanning. This produces corrupted output due to engine-specific serialization. Don't waste time on this — go straight to visual extraction.
Output Format
Save extracted content as Markdown:
[Document Title]
Source: [URL]
Extracted: [YYYY-MM-DD]
[Content organized with proper headings and structure]
Save to the user's workspace folder with a descriptive filename.
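Assembling the output file can be sketched as a small formatter. This is an illustration, not a required layout: `formatExtraction` is a hypothetical helper, and the leading "# " heading marker and blank-line spacing are assumptions about how the template above is meant to render as Markdown.

```javascript
// Builds the Markdown document: title heading, source and date lines, content.
function formatExtraction(title, url, content, date = new Date()) {
  // toISOString is always UTC, so the date is the UTC calendar day.
  const day = date.toISOString().slice(0, 10);
  return `# ${title}\n\nSource: ${url}\nExtracted: ${day}\n\n${content.trim()}\n`;
}

const markdown = formatExtraction(
  'Example Document',
  'https://example.com/doc',
  'First paragraph.\n\nSecond paragraph.\n',
  new Date('2024-01-02T00:00:00Z')
);
console.log(markdown);
```

Passing the date explicitly keeps the output reproducible; omitting it stamps the file with the current UTC day.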