article-extractor

Extract clean article content from URLs, removing ads, navigation, and clutter. Save as readable text files for research, archiving, or offline reading.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "article-extractor" with this command: npx skills add founderjourney/claude-skills/founderjourney-claude-skills-article-extractor

Article Extractor Skill

This skill extracts clean article content from web URLs, removing ads, navigation, sidebars, and other clutter to save readable text files.

When to Use This Skill

  • Downloading article text from a URL
  • Saving blog posts as clean text
  • Removing distractions from web articles
  • Archiving content for offline reading
  • Extracting content for research
  • Creating a local reading library

How to Use

Basic Extraction

Extract the article from https://example.com/article

Save to Specific Location

Extract this article and save to ~/reading/
https://example.com/interesting-post

Multiple Articles

Extract these articles:
- https://example.com/post-1
- https://example.com/post-2
- https://example.com/post-3

Extraction Methods

The skill tries the following tools in priority order, moving to the next one when a tool is missing or fails:

1. Reader (Mozilla Readability)

  • Uses Firefox Reader View algorithm
  • Excellent at removing clutter
  • Preserves article structure

2. Trafilatura (Python)

  • Accurate main-text extraction
  • Works well for blogs and news sites
  • CLI options: --no-comments (skip comment sections), --precision (favor precision over recall)

3. Fallback (curl + parsing)

  • No extra packages required beyond curl
  • Basic HTML parsing
  • Less reliable, but available when neither tool above is installed
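
The priority order above can be sketched in Python as a simple fallback chain. This is a minimal illustration, not the skill's actual implementation: the function names, the method labels, and the assumption that the Readability CLI is on PATH as `reader` are all hypothetical.

```python
import importlib.util
import shutil


def available_methods(reader_cmd: str = "reader") -> list[str]:
    """Return the extraction methods usable on this machine, best first."""
    methods = []
    # Assumption: the Readability CLI is exposed on PATH under `reader_cmd`.
    if shutil.which(reader_cmd):
        methods.append("readability")
    # Trafilatura is detected as an importable Python package.
    if importlib.util.find_spec("trafilatura") is not None:
        methods.append("trafilatura")
    # The curl-style fallback is always listed last.
    methods.append("fallback")
    return methods


def extract_with_fallback(url, extractors, order):
    """Try each method in `order`; return (method, text) for the first success.

    `extractors` maps a method name to a callable taking a URL and
    returning text. A method counts as failed if it is missing from the
    mapping, raises, or returns empty or whitespace-only text.
    """
    for name in order:
        try:
            text = extractors[name](url)
        except Exception:
            continue  # tool missing or extraction failed: try the next one
        if text and text.strip():
            return name, text
    raise RuntimeError(f"All extraction methods failed for {url}")
```

Passing the extractors in as plain callables keeps the ordering logic separate from the tools themselves, so each tool can be swapped or stubbed out without touching the chain.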

What Gets Preserved

  • Article text and paragraphs
  • Section headings
  • Author information
  • Publication date
  • Article structure

What Gets Removed

  • Navigation bars
  • Advertisements
  • Newsletter signup forms
  • Sidebars
  • Comments sections
  • Social sharing buttons
  • Cookie notices
  • Related article widgets
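
As a rough idea of what the basic fallback parsing could look like, the sketch below uses only Python's standard library to drop a few clutter elements and keep paragraph and heading text. The tag lists and the names `ArticleTextParser` and `html_to_text` are assumptions made for illustration; real pages usually need the smarter heuristics that Readability or Trafilatura provide.

```python
from html.parser import HTMLParser

# Assumption: clutter is approximated by these container tags.
SKIP_TAGS = {"script", "style", "noscript", "nav", "aside", "footer", "form"}
# Tags whose boundaries separate one block of text from the next.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote"}


class ArticleTextParser(HTMLParser):
    """Collect text from block elements, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped container
        self.chunks = []     # finished text blocks
        self._buf = []       # text pieces of the current block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.flush_block()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self._buf.append(data)

    def flush_block(self):
        # Collapse runs of whitespace and store the block if non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.chunks.append(text)
        self._buf = []


def html_to_text(html: str) -> str:
    """Return the kept text blocks separated by blank lines."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    return "\n\n".join(parser.chunks)
```

This approach only recognizes clutter by tag name, so ads or cookie notices wrapped in generic `div` elements would pass through.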

Filename Generation

Filenames are generated as follows:

  1. Start from the article title
  2. Remove or replace special characters (/, :, ?, ", <, >, |)
  3. Limit the length to 80-100 characters
  4. Append the .txt extension

Example:

"How to Build a Great Product: A Guide"
  → "How to Build a Great Product - A Guide.txt"
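
The naming steps above could be implemented along these lines. This is a sketch under stated assumptions: the function name `make_filename`, the 100-character default, the "untitled" placeholder, and the choice to turn a colon into " - " (inferred from the example) are illustrative, not taken from the skill's scripts.

```python
import re


def make_filename(title: str, max_length: int = 100) -> str:
    """Build a .txt filename from an article title."""
    # Assumption: a colon becomes " - ", matching the example above.
    name = re.sub(r"\s*:\s*", " - ", title)
    # Drop the remaining special characters: / ? " < > |
    name = re.sub(r'[/?"<>|]', "", name)
    # Collapse whitespace and trim the ends.
    name = " ".join(name.split())
    # Limit the length of the stem, excluding the extension.
    name = name[:max_length].rstrip()
    # Assumption: titles that sanitize to nothing get a placeholder name.
    return (name or "untitled") + ".txt"
```

For example, `make_filename("How to Build a Great Product: A Guide")` returns `"How to Build a Great Product - A Guide.txt"`.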

Output Format

After extraction, the output has this layout:

Title: [Article Title]
Author: [Author Name]
Date: [Publication Date]
Source: [Original URL]

---

[Clean article content...]
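
A small helper that produces this layout might look as follows. The function name and the "Unknown" placeholder for missing metadata are assumptions for illustration only.

```python
def format_article(title, author, date, source, body):
    """Render the metadata header, a separator, and the article body."""
    header = "\n".join([
        f"Title: {title}",
        # Assumption: missing metadata is shown as "Unknown".
        f"Author: {author or 'Unknown'}",
        f"Date: {date or 'Unknown'}",
        f"Source: {source}",
    ])
    return f"{header}\n\n---\n\n{body.strip()}\n"
```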

Error Handling

The skill handles:

  • Paywalled content: Extracts available preview
  • Missing tools: Falls back to alternatives
  • Invalid URLs: Provides clear error message
  • Failed extraction: Suggests manual copy
  • Filename issues: Auto-sanitizes problematic characters
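
The invalid-URL case can be caught before any tool runs. The check below is a minimal sketch using the standard library; the function name and the error messages are hypothetical, and the skill may validate URLs differently.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return the trimmed URL, or raise ValueError with a clear message."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Invalid URL {url!r}: expected http:// or https://")
    if not parsed.netloc:
        raise ValueError(f"Invalid URL {url!r}: missing host name")
    return parsed.geturl()
```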

Advanced Options

Metadata Only

Extract just the title and author from this URL

Specific Format

Extract this article as markdown

Research Mode

Extract and summarize the key points from this article

Best Practices

  1. Check Output: Always verify extraction quality
  2. Save Originals: Keep the source URL for reference
  3. Organize Files: Use meaningful folder structures
  4. Batch Processing: Extract multiple related articles together
  5. Respect Copyright: Use for personal research only

Dependencies

For best results, install:

# Mozilla Readability
npm install -g @nicolo-ribaudo/readability-cli

# Or Trafilatura (Python)
pip install trafilatura

Without either tool installed, the skill uses the curl-based fallback.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
