Pdf Extractor Skill

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "Pdf Extractor Skill" with this command: npx skills add a851445115/pdf-extractor-skill

Extract text and mathematical formulas from academic PDF papers. Supports both English and Chinese content.

When to Use This Skill

Use this skill when:

  • User needs to extract text and LaTeX formulas from PDF papers
  • User mentions "PDF转文本" (PDF to text), "PDF提取公式" (extract formulas from a PDF), or "论文OCR" (paper OCR)
  • User wants to convert academic papers to Markdown format

Tool Selection

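| Tool | Best For | Languages | Math Quality |
| --- | --- | --- | --- |
| Marker (recommended) | Chinese and English papers, complex formulas | Chinese + English | Excellent |
| Nougat | English-only papers, arXiv | English only | Excellent |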

Marker is recommended: it handles mixed Chinese and English text and gives better formula recognition.
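The selection rule in the table can be sketched as a small helper. This is only an illustration: `choose_tool` is a hypothetical name that is not part of the skill's scripts, and the Unicode ranges used to detect Chinese text are an assumption (basic CJK ideographs plus Extension A), applied to whatever sample text is available, such as a title.

```python
import re

# CJK Unified Ideographs plus Extension A. An illustrative range,
# not an exhaustive definition of "Chinese text".
_CJK = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def choose_tool(sample_text: str) -> str:
    """Return 'marker' if the sample contains Chinese characters, else 'nougat'.

    Hypothetical helper that only mirrors the tool selection table.
    """
    return "marker" if _CJK.search(sample_text) else "nougat"


if __name__ == "__main__":
    print(choose_tool("基于深度学习的图像分割"))
    print(choose_tool("Attention Is All You Need"))
```

Since Marker also handles English well, always returning `"marker"` is an equally valid policy; the helper merely shows where Nougat becomes an option.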


Environment Setup

Conda Environment: pdf-extractor

Python Path: D:\anaconda3\envs\pdf-extractor\python.exe

Key Dependencies

  • PyTorch 2.10.0+cu128 (CUDA 12.8)
  • marker-pdf (Surya OCR + Texify)
  • nougat-ocr 0.1.17
  • transformers

Important: Keep This Skill Self-Contained (No Extra Installs)

This skill is expected to run using ONLY the existing pdf-extractor conda environment and the scripts in scripts/.

Rules:

  • Do NOT run pip install ... / conda install ... / download random libraries during extraction.
  • If a dependency is missing (e.g., Nougat crashes due to missing torchvision), do NOT try to fix by installing packages. Switch tools (prefer Marker) or report the environment issue.
  • Slow runtime is normal for Marker (especially with --ark-code-latest). Prefer splitting the PDF rather than changing tools or adding dependencies.
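One way to follow these rules is to check which modules are importable before choosing a tool, and to report what is missing instead of installing anything. The sketch below uses only the standard library; `missing_modules` and `pick_tool` are hypothetical helpers, and the module names in the usage note are assumptions about how the packages are imported.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def pick_tool(preferred, requirements):
    """Pick the first usable tool, trying `preferred` first.

    `requirements` maps a tool name to the module names it needs.
    Returns (tool, report); tool is None when no tool is usable, and
    report lists the missing modules per tool so they can be reported.
    """
    order = [preferred] + [t for t in requirements if t != preferred]
    report = {t: missing_modules(requirements[t]) for t in order}
    for tool in order:
        if not report[tool]:
            return tool, report
    return None, report
```

For example, `pick_tool("nougat", {"nougat": ["nougat", "torchvision"], "marker": ["marker"]})` would fall back to Marker when `torchvision` is absent, assuming those are the import names in the pdf-extractor environment.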

Recommended approach for long PDFs:

  • Use --page-range (0-based) to extract per page or small page batches.
  • Merge the resulting markdown files afterward (simple concatenation is fine). Keep the combined file in the same folder as the per-page outputs so image links remain valid.

Example (per-page extraction with LLM mode):

D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out/page_01.md"
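The merge step can be a plain concatenation in page order. The following is a minimal sketch, assuming the per-page files follow the `page_01.md` naming used in the example; `merge_pages` and the `combined.md` file name are hypothetical. Sorting is numeric so that `page_10.md` comes after `page_02.md`, and the combined file is written into the same folder so relative image links keep working.

```python
import re
from pathlib import Path


def merge_pages(out_dir, combined_name="combined.md"):
    """Concatenate page_*.md files in numeric order into one file in out_dir."""
    out_dir = Path(out_dir)

    def page_number(path):
        # Use the first run of digits in the file name as the sort key.
        match = re.search(r"(\d+)", path.stem)
        return int(match.group(1)) if match else 0

    pages = sorted(out_dir.glob("page_*.md"), key=page_number)
    parts = [p.read_text(encoding="utf-8").strip() for p in pages]
    combined = out_dir / combined_name
    combined.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return combined
```

Calling `merge_pages("out")` after all pages have been extracted returns the path of the combined file.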

Tool 1: Marker (recommended, supports Chinese and English)

Command Line

# Convert a Chinese paper (Chinese and English are supported by default)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "论文.pdf"

# Specify the output path
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" -o "output.md"

# Force OCR (for scanned PDFs)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "scanned.pdf" --force-ocr

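# Use the Volcano Engine Ark (火山方舟) Coding Plan (OpenAI-compatible) to improve conversion quality (more stable tables, formulas, and cross-page structure)
# Note: requests go to ark-code-latest by default; the backend automatically routes them to a suitable model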
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest

# Run only the first page as a quick check (0-based page index)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out_first_page.md"

# For custom settings (not recommended), you can also pass --openai-base-url/--openai-api-key/--openai-model manually

# Specify languages
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --languages Chinese English Japanese
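Because `--page-range` is 0-based while output files are usually numbered from 1, it is easy to get the two out of step when scripting per-page runs. The sketch below builds one argument list per page using only flags shown above; `page_commands` is a hypothetical helper, and `page_count` must be supplied by the caller because the skill does not document a way to query it.

```python
PYTHON_EXE = r"D:\anaconda3\envs\pdf-extractor\python.exe"
SCRIPT = r"C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py"


def page_commands(pdf_path, page_count, out_dir="out", use_llm=True):
    """Build one argv list per page; --page-range receives a 0-based index."""
    commands = []
    for index in range(page_count):
        argv = [PYTHON_EXE, SCRIPT, pdf_path]
        if use_llm:
            argv.append("--ark-code-latest")
        # Output files are named 1-based (page_01.md) while the flag is 0-based.
        argv += ["--page-range", str(index), "-o", f"{out_dir}/page_{index + 1:02d}.md"]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in page_commands("paper.pdf", 3):
        print(argv[-4:])
```

Each list can be passed to `subprocess.run(argv, check=True)`; building the commands separately makes it possible to inspect them before starting a long run.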

Python API

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2md_marker import convert_pdf, convert_pdf_cli

# Simple usage
output_file = convert_pdf_cli('论文.pdf', 'output.md')

# Full API
markdown_text, metadata = convert_pdf(
    'paper.pdf',
    output_dir='./output',
    force_ocr=False,
    batch_multiplier=2,
    languages=['Chinese', 'English']
)
print(markdown_text)

Marker Options

| Option | Description |
| --- | --- |
| -o, --output | Output file (.md) or directory |
| --force-ocr | Force OCR even for text PDFs |
| --batch-multiplier | Batch size multiplier (default: 2) |
| --languages | Languages in document (default: Chinese English) |

Tool 2: Nougat (English-only papers)

Command Line

# Convert entire PDF
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf"

# Convert specific pages
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -p 0-5

# Custom output
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -o output.mmd

# Save each page separately
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" --per-page

Python API

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2latex import load_model, process_pdf, save_results

# Load model (uses GPU if available)
model, device = load_model()

# Process PDF
results = process_pdf('paper.pdf', model, device)

# Save as single markdown file
save_results(results, 'output.mmd')

# Or save per page
save_results(results, 'output_pages/', format='pages')

Nougat Options

| Option | Description |
| --- | --- |
| -o, --output | Output file or directory |
| -p, --pages | Page range (e.g., "0-5" or "1,3,5") |
| -m, --model | Model tag (default: 0.1.0-base) |
| --dpi | Render DPI (default: 300) |
| --cpu | Force CPU mode |
| --per-page | Save each page separately |
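To make the two page-spec forms accepted by `-p` concrete, here is a sketch of how such a spec could be expanded and validated before it is passed on the command line. It is not the parser used by pdf2latex.py: `parse_pages` is a hypothetical helper, and treating ranges as inclusive at both ends is an assumption.

```python
def parse_pages(spec):
    """Expand a spec such as "0-5" or "1,3,5" into a sorted list of page indices.

    Assumption: ranges are inclusive at both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("0-5"))
    print(parse_pages("1,3,5"))
```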

Output Format

Both tools output Markdown with LaTeX math:

  • Text is extracted as regular markdown
  • Mathematical formulas are in LaTeX format:
    • Inline: $formula$
    • Display: $$formula$$
  • Tables, figures, and references are preserved
  • Marker also extracts images to a separate folder
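Given the `$...$` and `$$...$$` delimiters described above, formulas can be pulled out of the extracted Markdown with two regular expressions. This is a rough sketch rather than a full Markdown parser: `extract_formulas` is a hypothetical helper, and it assumes the text contains no escaped dollar signs (for example currency amounts).

```python
import re

# Display math may span several lines, so "." must also match newlines.
DISPLAY = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math: a single "$" on each side, never part of a "$$" pair.
INLINE = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def extract_formulas(markdown):
    """Return (display, inline) lists of LaTeX strings found in the markdown."""
    display = [m.strip() for m in DISPLAY.findall(markdown)]
    # Remove display blocks first so their contents are not matched as inline math.
    without_display = DISPLAY.sub(" ", markdown)
    inline = [m.strip() for m in INLINE.findall(without_display)]
    return display, inline
```

The two lists can then be checked or compiled separately, for instance to spot formulas that were truncated during OCR.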

Comparison

| Feature | Marker | Nougat |
| --- | --- | --- |
| Chinese Support | ✓ Excellent | ✗ Poor |
| English Support | ✓ Excellent | ✓ Excellent |
| Math Formulas | ✓ (Texify) | ✓ (Native) |
| Table Extraction | | |
| Image Extraction | | |
| Speed (RTX 4060) | ~2 min/page | ~10-15 sec/page |
| OCR Quality | Excellent | Good |
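The per-page speeds translate into very different total runtimes. As a rough worked example, using only the RTX 4060 figures from the comparison (actual times will vary with hardware and page complexity, and `estimate_minutes` is a hypothetical helper):

```python
# Per-page timings copied from the comparison (RTX 4060), in seconds.
MARKER_SEC_PER_PAGE = 120
NOUGAT_SEC_PER_PAGE = (10, 15)


def estimate_minutes(page_count):
    """Return rough runtimes in minutes: marker, and a (low, high) pair for nougat."""
    marker = page_count * MARKER_SEC_PER_PAGE / 60
    nougat = tuple(page_count * s / 60 for s in NOUGAT_SEC_PER_PAGE)
    return marker, nougat


marker, (low, high) = estimate_minutes(20)
print(f"Marker: ~{marker:.0f} min, Nougat: ~{low:.1f}-{high:.1f} min")
# prints "Marker: ~40 min, Nougat: ~3.3-5.0 min"
```

For a 20-page paper this is roughly 40 minutes against 3 to 5 minutes, which is why splitting long PDFs into page batches matters more for Marker.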

Troubleshooting

Import Errors

Make sure you're using the correct Python:

D:\anaconda3\envs\pdf-extractor\python.exe your_script.py

CUDA Out of Memory

Try CPU mode (Nougat) or reduce batch size (Marker):

# Nougat: use CPU
D:\anaconda3\envs\pdf-extractor\python.exe pdf2latex.py paper.pdf --cpu

# Marker: reduce batch multiplier
D:\anaconda3\envs\pdf-extractor\python.exe pdf2md_marker.py paper.pdf --batch-multiplier 1

Chinese Characters Not Recognized

Use Marker instead of Nougat for Chinese documents.

Slow Processing

  • Marker is slower but more accurate (uses multiple ML models)
  • For faster processing on English-only papers, use Nougat
  • Ensure GPU is being used (check CUDA availability)
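A quick way to check CUDA availability is to ask PyTorch directly from the pdf-extractor interpreter. The sketch below wraps the check so that it returns a message instead of raising when PyTorch itself is not importable; `cuda_status` is a hypothetical helper name.

```python
def cuda_status():
    """Describe whether PyTorch can see a CUDA device, without raising on a missing install."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not importable in this interpreter"
    if not torch.cuda.is_available():
        return "PyTorch is installed but CUDA is not available (CPU only)"
    return f"CUDA available: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```

If the first message appears, the script is probably being run with a Python other than the one in the pdf-extractor environment.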

Model Information

Marker Models (downloaded automatically):

  • Surya OCR: Text detection and recognition
  • Texify: Math formula recognition
  • Layout analysis models

Nougat Base Model (1.31 GB):

  • Location: C:\Users\cr\.cache\torch\hub\nougat-0.1.0-base
  • Best for: Standard academic papers, arXiv papers

Example Workflow

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')

def extract_paper(pdf_path, is_chinese=True):
    """
    Extract text and formulas from academic paper.
    
    Args:
        pdf_path: Path to PDF file
        is_chinese: True for Chinese papers, False for English only
    
    Returns:
        Extracted markdown text
    """
    if is_chinese:
        from pdf2md_marker import convert_pdf
        text, _ = convert_pdf(pdf_path, languages=['Chinese', 'English'])
    else:
        from pdf2latex import load_model, process_pdf
        model, device = load_model()
        results = process_pdf(pdf_path, model, device)
        text = '\n\n'.join([t for _, t in results])
    
    return text

# Usage
text = extract_paper('中文论文.pdf', is_chinese=True)
print(text)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
