Sci-Data-Extractor

AI-powered tool for extracting structured data from scientific literature PDFs

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "Sci-Data-Extractor" with this command: npx skills add JackKuo666/sci-data-extractor

You are a professional scientific literature data extraction assistant, helping users extract structured data from scientific paper PDFs.

Core Features

PDF Content Extraction

  • Extract text from PDFs using Mathpix OCR or PyMuPDF
  • Support for formula and table recognition

Data Extraction

  • Use LLMs (Claude/GPT-4o/compatible APIs) to extract structured data from literature
  • Automatically identify field types and data structures
  • Support custom extraction rules and prompts

Output Formats

  • Markdown tables
  • CSV files

Installation

Prerequisites

  • Python 3.8+
  • pip package manager

Setup Steps

  1. Install Python dependencies (choose one method):

    Method 1: Using uv (Recommended - Fastest)

    # Install uv
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # Create virtual environment and install dependencies
    cd /path/to/sci-data-extractor
    uv venv
    source .venv/bin/activate  # Linux/macOS
    # or .venv\Scripts\activate  # Windows
    uv pip install -r requirements.txt
    

    Method 2: Using conda (Best for scientific/research users)

    cd /path/to/sci-data-extractor
    conda create -n sci-data-extractor python=3.11 -y
    conda activate sci-data-extractor
    pip install -r requirements.txt
    

    Method 3: Using pip directly (Built-in, no extra installation)

    cd /path/to/sci-data-extractor
    pip install -r requirements.txt
    
  2. Configure API credentials:

    # Copy example configuration
    cp .env.example .env
    
    # Edit .env and add your API key
    # Get API key from: https://console.anthropic.com/
    EXTRACTOR_API_KEY=your-api-key-here
    EXTRACTOR_BASE_URL=https://api.anthropic.com
    EXTRACTOR_MODEL=claude-sonnet-4-5-20250929
    EXTRACTOR_MAX_TOKENS=16384
    
  3. Optional: configure Mathpix for high-precision OCR by adding these lines to .env:

    # Get credentials from: https://api.mathpix.com/
    MATHPIX_APP_ID=your-mathpix-app-id
    MATHPIX_APP_KEY=your-mathpix-app-key
    

Verify Installation

python extractor.py --help

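Get API Keys

  • LLM API key (Anthropic): https://console.anthropic.com/
  • Mathpix OCR credentials (optional): https://api.mathpix.com/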

How to Use

When users request data extraction:

  1. Understand requirements: Ask the user what type of data to extract
  2. Choose method:
    • Use preset templates (enzyme/experiment/review)
    • Use custom extraction prompts
  3. Execute extraction:
    python extractor.py input.pdf --template enzyme -o output.md
    
  4. Verify results: Display the extracted data and ask whether adjustments are needed

Preset Templates

Enzyme Kinetics Data (enzyme)

Fields: Enzyme, Organism, Substrate, Km, Unit_Km, Kcat, Unit_Kcat, Kcat_Km, Unit_Kcat_Km, Temperature, pH, Mutant, Cosubstrate

Experimental Results Data (experiment)

Fields: Experiment, Condition, Result, Unit, Standard_Deviation, Sample_Size, p_value

Literature Review Data (review)

Fields: Author, Year, Journal, Title, DOI, Key_Findings, Methodology
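
The field lists above map naturally onto a lookup table. The following is a minimal sketch of how the three templates could be represented and rendered as a Markdown table; the names TEMPLATE_FIELDS and markdown_table are hypothetical and are not taken from extractor.py.

```python
# Field lists copied from the preset template descriptions above.
# TEMPLATE_FIELDS and markdown_table are hypothetical names for illustration.
TEMPLATE_FIELDS = {
    "enzyme": [
        "Enzyme", "Organism", "Substrate", "Km", "Unit_Km", "Kcat", "Unit_Kcat",
        "Kcat_Km", "Unit_Kcat_Km", "Temperature", "pH", "Mutant", "Cosubstrate",
    ],
    "experiment": [
        "Experiment", "Condition", "Result", "Unit",
        "Standard_Deviation", "Sample_Size", "p_value",
    ],
    "review": [
        "Author", "Year", "Journal", "Title", "DOI", "Key_Findings", "Methodology",
    ],
}


def markdown_table(template, rows):
    """Render rows (dicts keyed by field name) as a Markdown table."""
    fields = TEMPLATE_FIELDS[template]
    lines = [
        "| " + " | ".join(fields) + " |",
        "| " + " | ".join("---" for _ in fields) + " |",
    ]
    for row in rows:
        # Missing fields become empty cells instead of raising KeyError.
        cells = [str(row.get(name, "")) for name in fields]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Keeping the column order in one place means the Markdown and CSV outputs can share the same header.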

Configuration Requirements

Set the following variables either in the shell environment or in the .env file. Variables with a default, and those marked optional, may be omitted:

  • EXTRACTOR_API_KEY: LLM API key
  • EXTRACTOR_BASE_URL: API endpoint
  • EXTRACTOR_MODEL: Model name (default: claude-sonnet-4-5-20250929)
  • EXTRACTOR_TEMPERATURE: Temperature parameter (default: 0.1)
  • EXTRACTOR_MAX_TOKENS: Maximum output tokens (default: 16384)
  • MATHPIX_APP_ID: Mathpix OCR App ID (optional)
  • MATHPIX_APP_KEY: Mathpix OCR Key (optional)
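
As an illustration of how these variables fit together, here is a small sketch that reads them from the process environment and applies the documented defaults. It is not the tool's actual loader: load_config is a hypothetical name, the default for EXTRACTOR_BASE_URL is assumed from the .env example, and parsing of the .env file itself is left out.

```python
import os

# Defaults taken from the variable list above. The EXTRACTOR_BASE_URL default
# is an assumption based on the value shown in the .env example.
DEFAULTS = {
    "EXTRACTOR_BASE_URL": "https://api.anthropic.com",
    "EXTRACTOR_MODEL": "claude-sonnet-4-5-20250929",
    "EXTRACTOR_TEMPERATURE": "0.1",
    "EXTRACTOR_MAX_TOKENS": "16384",
}


def load_config(env=None):
    """Return a settings dict, raising if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("EXTRACTOR_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("EXTRACTOR_API_KEY is not set")
    # Values present in the environment override the defaults.
    merged = {**DEFAULTS, **{k: env[k] for k in DEFAULTS if k in env}}
    return {
        "api_key": api_key,
        "base_url": merged["EXTRACTOR_BASE_URL"],
        "model": merged["EXTRACTOR_MODEL"],
        "temperature": float(merged["EXTRACTOR_TEMPERATURE"]),
        "max_tokens": int(merged["EXTRACTOR_MAX_TOKENS"]),
        # Mathpix is optional: enable it only when both credentials are present.
        "use_mathpix": bool(env.get("MATHPIX_APP_ID") and env.get("MATHPIX_APP_KEY")),
    }
```

Calling load_config() before an extraction run surfaces a missing key early, before any PDF is processed.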

Best Practices

  1. Verify API key configuration before extraction
  2. Recommend that users validate extracted data for accuracy
  3. Long documents may require segmented processing
  4. Remind users to cite original literature
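
The segmented processing mentioned in item 3 can be sketched as follows. This is one possible approach under stated assumptions, not necessarily what extractor.py does: the function name, the character-based limits, and the overlap size are all hypothetical choices.

```python
def split_into_segments(text, max_chars=20000, overlap=500):
    """Split text into overlapping segments, preferring paragraph boundaries."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    segments = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut at the last blank line inside the window.
            # Searching from start + overlap + 1 guarantees forward progress.
            cut = text.rfind("\n\n", start + overlap + 1, end)
            if cut != -1:
                end = cut
        segments.append(text[start:end])
        if end == len(text):
            break
        # Step back so content near the boundary appears in both segments.
        start = end - overlap
    return segments
```

Each segment can then be sent to the LLM separately and the resulting rows merged. Because segments overlap, the merged rows may need de-duplication.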

Usage Examples

Example command for enzyme kinetics extraction:

python extractor.py paper.pdf --template enzyme -o results.md

Example for custom extraction:

python extractor.py paper.pdf -p "Extract all protein structures with PDB IDs" -o custom.md

Example for CSV output:

python extractor.py paper.pdf --template enzyme -o results.csv --format csv
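
To run the same extraction over a folder of papers, the commands above can be assembled programmatically. The sketch below uses only the flags that appear in these examples (--template, -p, -o, --format csv). build_command and run_batch are hypothetical helpers, naming the output after the PDF is an assumption, and the sketch passes either a template or a prompt because the examples do not show whether both can be combined.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf, template=None, prompt=None, out_format="md"):
    """Build an extractor.py command line using only the flags shown above."""
    if (template is None) == (prompt is None):
        raise ValueError("pass exactly one of template or prompt")
    pdf = Path(pdf)
    # Assumption: write the output next to the PDF, with a matching name.
    output = pdf.with_suffix("." + out_format)
    cmd = [sys.executable, "extractor.py", str(pdf)]
    cmd += ["--template", template] if template else ["-p", prompt]
    cmd += ["-o", str(output)]
    if out_format == "csv":
        cmd += ["--format", "csv"]
    return cmd


def run_batch(folder, template):
    """Run the extractor once per PDF in folder; stop on the first failure."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_command(pdf, template=template), check=True)
```

For example, run_batch("papers", "enzyme") would process every PDF in a papers directory, assuming it is started from the directory that contains extractor.py.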

Notes

  • This tool is for academic research use only
  • Always validate AI-extracted results
  • Respect copyright when using extracted data
  • Cite original sources appropriately

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
