Article Extractor

This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

When to Use This Skill

Activate when the user:

Provides an article/blog URL and wants the text content
Asks to "download this article"
Wants to "extract the content from [URL]"
Asks to "save this blog post as text"
Needs clean article text without distractions

How It Works

Steps, in order:

Check if tools are installed (reader or trafilatura)
Download and extract article using best available tool
Clean up the content (remove extra whitespace, format properly)
Save to file with article title as filename
Confirm location and show preview

Installation Check

Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)

command -v reader

If not installed:

npm install -g @mozilla/readability-cli

or

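npm install -g reader-cli

After installing, run command -v reader again to confirm that the package provides a reader binary.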

Option 2: trafilatura (Python-based, very good)

command -v trafilatura

If not installed:

pip3 install trafilatura

Option 3: Fallback (curl + simple parsing)

If no tools available, use basic curl + text extraction (less reliable but works)
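The check order above can also be sketched in Python, the language the fallback method already relies on. This is a minimal sketch rather than part of the original skill: pick_tool and its which parameter are hypothetical names, and it assumes the binaries are called reader and trafilatura as in the commands above.

```python
import shutil

# Preference order from the installation check above.
TOOL_ORDER = ("reader", "trafilatura")


def pick_tool(which=shutil.which):
    """Return the first installed tool name, or "fallback" if none is found.

    `which` is injectable (hypothetical parameter) so the lookup can be
    exercised without the tools being installed.
    """
    for name in TOOL_ORDER:
        if which(name) is not None:
            return name
    return "fallback"


if __name__ == "__main__":
    print(f"Using {pick_tool()}")
```

Passing the lookup function as a parameter keeps the order logic separate from the machine it runs on.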

Extraction Methods

Method 1: Using reader (Best for most articles)

Extract article

reader "URL" > article.txt

Pros:

Based on Mozilla's Readability algorithm
Excellent at removing clutter
Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)

Extract article

trafilatura --URL "URL" --output-format txt > article.txt

Or with more options

trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt

Pros:

Very accurate extraction
Good with various site structures
Handles multiple languages

Options:

--no-comments : Skip comment sections
--no-tables : Skip data tables
--precision : Favor precision over recall
--recall : Extract more content (may include some noise)
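When the trafilatura command is assembled programmatically, the options above can be mapped to an argument list. The sketch below uses only the flags listed in this section; build_trafilatura_cmd and its parameters are hypothetical names, and treating --precision and --recall as mutually exclusive is a design assumption of this sketch, not something the tool is stated to require.

```python
def build_trafilatura_cmd(url, no_comments=False, no_tables=False, mode=None):
    """Build the trafilatura argument list for one URL.

    mode is None, "precision", or "recall" (hypothetical parameter that maps
    to the --precision / --recall flags listed above).
    """
    if mode not in (None, "precision", "recall"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["trafilatura", "--URL", url, "--output-format", "txt"]
    if no_comments:
        cmd.append("--no-comments")
    if no_tables:
        cmd.append("--no-tables")
    if mode is not None:
        cmd.append(f"--{mode}")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True). Using a list instead of a single string avoids shell quoting problems with unusual URLs.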

Method 3: Fallback (curl + basic parsing)

Download and extract basic content

curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
            self.in_content = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only once content has started and outside skipped elements
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt

Note: This is less reliable but works without dependencies.

Getting Article Title

Extract title for filename:

Using reader:

reader outputs markdown with title at top

TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')

Using trafilatura:

Get metadata including title

TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")

Using curl (fallback):

TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | head -n 1 | sed 's/ - .*//' | sed 's/ | .*//')
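The -P option used in the curl fallback is specific to GNU grep, so it is not available with the BSD grep shipped on macOS. A portable alternative, sketched here with Python's standard html.parser module, reads the first title element and drops a trailing site name. TitleParser, extract_title, and the "Article" default are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html, default="Article"):
    """Return the page title with a trailing site name removed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, including newlines inside the title.
    title = " ".join("".join(parser.parts).split())
    # Like the sed calls above: drop everything from the first " - " or " | ".
    for sep in (" - ", " | "):
        title = title.split(sep, 1)[0]
    return title or default
```

It can be fed from curl in the same way as the fallback extractor, for example by reading sys.stdin.read() and printing the result.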

Filename Creation

Clean title for filesystem:

Get title

TITLE="Article Title from Website"

Clean for filesystem (remove special chars, limit length)

FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

Add extension

FILENAME="${FILENAME}.txt"
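The same cleaning rules can be sketched in Python for cases where the title is already handled there. This is an illustrative sketch: clean_filename, max_len, and the "article" default for empty titles are hypothetical and not part of the original skill.

```python
def clean_filename(title, max_len=100, default="article"):
    """Turn an article title into a filesystem-safe .txt filename."""
    # Replace / : | with a dash and delete ? " < >, matching the rules above.
    table = str.maketrans({"/": "-", ":": "-", "|": "-",
                           "?": None, '"': None, "<": None, ">": None})
    name = title.translate(table)
    # Limit the length first, then trim surrounding spaces.
    name = name[:max_len].strip(" ")
    return (name or default) + ".txt"
```

For example, clean_filename('What is "AI"? A Guide: Part 1/2') gives "What is AI A Guide- Part 1-2.txt".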

Complete Workflow

ARTICLE_URL="https://example.com/article"

Check for tools

if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

Extract article

case "$TOOL" in
reader)
    # Get content
    reader "$ARTICLE_URL" > temp_article.txt

    # Get title (first line after # in markdown)
    TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
    ;;

trafilatura)
    # Get title from metadata
    METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
    TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")

    # Get clean content
    trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
    ;;

fallback)
    # Get title
    TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
    TITLE=${TITLE%% - *}  # Remove site name
    TITLE=${TITLE%% | *}  # Remove site name (alternate)

    # Get content (basic extraction); the Python source must start at column 0
    curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in {'p', 'article', 'main'}:
            self.in_content = True
        if tag in {'h1', 'h2', 'h3'}:
            self.content.append('\n')

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
    ;;
esac

Clean filename

FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

Move to final filename

mv temp_article.txt "$FILENAME"

Show result

echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"

Error Handling

Common Issues

Tool not installed

Try alternate tool (reader → trafilatura → fallback)
Offer to install: "Install reader with: npm install -g reader-cli"

Paywall or login required

Extraction tools may fail
Inform user: "This article requires authentication. Cannot extract."

Invalid URL

Check URL format
Try with and without redirects

No content extracted

Site may use heavy JavaScript
Try fallback method
Inform user if extraction fails

Special characters in title

Clean title for filesystem
Remove: / , : , ? , " , < , > , |
Replace with - or remove
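For the "No content extracted" case, a simple word-count check can flag an empty or near-empty result before the file is saved. The sketch below is only a heuristic: looks_extracted is a hypothetical helper, and the 50-word threshold is an arbitrary assumption that should be tuned to the kind of articles being extracted.

```python
def looks_extracted(text, min_words=50):
    """Heuristic check: True if the text has at least min_words words.

    min_words is an assumed threshold, not a value from the tools above.
    """
    return len(text.split()) >= min_words


if __name__ == "__main__":
    sample = "Subscribe to continue reading"
    if not looks_extracted(sample):
        print("Extraction looks empty; try another tool or inform the user.")
```

A short result does not prove failure, since some posts really are brief, so the check is best used to trigger a retry or a question to the user rather than a hard error.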

Output Format

Saved File Contains:

Article title (if available)
Author (if available from tool)
Main article text
Section headings
No navigation, ads, or clutter

What Gets Removed:

Navigation menus
Ads and promotional content
Newsletter signup forms
Related articles sidebars
Comment sections (optional)
Social media buttons
Cookie notices

Tips for Best Results

Use reader for most articles

Best all-around tool
Based on Firefox Reader View
Works on most news sites and blogs

Use trafilatura for:

Academic articles
News sites
Blogs with complex layouts
Non-English content

Fallback method limitations:

May include some noise
Less accurate paragraph detection
Better than nothing for simple sites

Check extraction quality:

Always show preview to user
Ask if it looks correct
Offer to try different tool if needed

Example Usage

Simple extraction:

User: "Extract https://example.com/article"

reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"

With error handling:

if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
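The same try-the-next-tool pattern can be written generically in Python. In this sketch, first_successful and run_cli are hypothetical helpers: each extractor is a callable that takes a URL and returns text, and any exception or blank output moves on to the next one.

```python
import subprocess


def run_cli(cmd):
    """Return an extractor that runs cmd + [url] and returns its stdout."""
    def extract(url):
        result = subprocess.run(cmd + [url], capture_output=True,
                                text=True, check=True)
        return result.stdout
    return extract


def first_successful(url, extractors):
    """Try (name, extract) pairs in order; return (name, text) of the first
    extractor that yields non-blank text."""
    errors = []
    for name, extract in extractors:
        try:
            text = extract(url)
        except Exception as exc:  # a failed tool should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("Could not extract article (" + "; ".join(errors) + ")")
```

A call might look like first_successful(url, [("reader", run_cli(["reader"])), ("trafilatura", run_cli(["trafilatura", "--output-format", "txt", "--URL"]))]). A missing binary raises FileNotFoundError inside run_cli, which is treated like any other failure, and the returned name tells the user which tool was used.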

Best Practices

✅ Always show preview after extraction (first 10 lines)
✅ Verify extraction succeeded before saving
✅ Clean filename for filesystem compatibility
✅ Try fallback method if primary fails
✅ Inform user which tool was used
✅ Keep filename length reasonable (< 100 chars)

After Extraction

Display to user:

"✓ Extracted: [Article Title]"
"✓ Saved to: [filename]"
Show preview (first 10-15 lines)
File size and location
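The items shown to the user can be gathered in one step. The following is a minimal sketch, assuming the saved file is UTF-8 text; describe_result and preview_lines are hypothetical names.

```python
from pathlib import Path


def describe_result(path, preview_lines=10):
    """Return a summary with the file location, its size, and a preview."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    preview = "\n".join(text.splitlines()[:preview_lines])
    return (f"✓ Saved to: {p.resolve()}\n"
            f"Size: {p.stat().st_size} bytes\n"
            f"Preview (first {preview_lines} lines):\n{preview}")
```

Printing the returned string gives the location, size, and preview together, so the user can judge the extraction quality at a glance.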

Ask if needed:

"Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
"Should I extract another article?"
