arxiv-paper-processor

Tool-only paper processing skill with a manual language parameter: supports batch artifact download for many papers or single-paper download, then the model manually reads source/PDF and writes summary.md in the selected language. Use when per-paper comprehension should be model-driven instead of script-generated.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "arxiv-paper-processor" with this command: npx skills add xukp20/arxiv-paper-processor

ArXiv Paper Processor

Use this skill for per-paper manual summarization, with optional batch artifact download.

  • Single-paper mode: process one paper directory (e.g. <run_dir>/<arxiv_id>/).
  • Batch predownload mode: process many paper directories under one run dir before writing summaries.

Language Parameter

  • Use a workflow language parameter (for example English or Chinese) and apply it manually.
  • The per-paper summary.md must be written in the selected language.
  • If download scripts are called directly, pass --language <LANG> for traceability.

Core Principle

Scripts only fetch artifacts. The model performs reading and writing.

Non-negotiable Constraint

  • Do not generate summary.md by script-based snippet extraction, regex harvesting, or template autofill.
  • Do not use Python/shell scripts to auto-compose section text from abstract/introduction fragments.
  • Scripts in this skill are only for artifact download (source/pdf) and trace logs.
  • The final summary.md must come from model-side reading and synthesis of the paper content.

Optional Batch Artifact Download (Many Papers)

Use this first when Stage B has many papers:

python3 scripts/download_papers_batch.py \
  --run-dir /path/to/run \
  --artifact source_then_pdf \
  --max-workers 3 \
  --min-interval-sec 5 \
  --language English

Key behavior:

  • Supports --artifact source, --artifact pdf, or --artifact source_then_pdf (default).
  • Supports concurrency (--max-workers) and safe throttling/retry (--min-interval-sec, retry args).
  • Uses run-local throttle state by default (<run_dir>/.runtime/arxiv_download_state.json) to reduce 429 risk.
  • Skips papers that already have usable source/source_extract/*.tex or existing source/paper.pdf (unless --force).
  • Resume-friendly: if a paper already has a completed summary.md, you can skip that paper's summary-writing step.
  • Writes batch log to <run_dir>/download_batch_log.json by default.

Step 1: Download Source (Preferred)

python3 scripts/download_arxiv_source.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English

This writes:

  • source/source_bundle.bin
  • source/source_extract/
  • source/download_source_log.json

If usable source already exists and --force is not set, the script reuses local artifacts.

Step 2: If Needed, Download PDF

python3 scripts/download_arxiv_pdf.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English

This writes:

  • source/paper.pdf
  • source/download_pdf_log.json

If PDF already exists and --force is not set, the script reuses local artifacts.

Step 3: Model Reads and Summarizes

  1. If summary.md already exists and follows the required format, skip this paper and mark it complete.
  2. Read metadata.md first.
  3. If source/source_extract/ already exists with readable .tex files, use it directly.
  4. Otherwise, if source/paper.pdf already exists, use PDF directly.
  5. If neither exists, run download scripts (single-paper scripts or batch script) first.
  6. Manually write summary.md in the same paper directory, in the selected language.

Do not rely on rule-based auto summarization. Do not rely on auto-extracted snippets as the primary writing basis.

Quality Requirement

  • Every section should include paper-specific details that are traceable to full-text reading.
  • Section 4/5/10 should reflect concrete method and evaluation details, not generic wording.
  • If key details are unclear in the source, explicitly note uncertainty instead of guessing.
  • Match the detail level shown in references/summary-example-en.md and references/summary-example-zh.md.
  • If your draft is clearly shorter or less specific than the examples, expand it before finishing.

Required Output

  • <paper_dir>/summary.md in fixed section format.
  • Pay special attention to section ## 10. Brief Conclusion: write a 3-4 sentence mini-conclusion that covers contribution, method, evaluation setup, and results with paper-specific details.
  • In section ## 1. Paper Snapshot, use exact keys: ArXiv ID, Title, Authors, Publish date, Primary category, Reading basis.
  • Do not use key variants such as Reading source, Author list, Published on, or lowercase key names.

See references/summary-format.md for exact section requirements.

Related Skills

This skill is a sub-skill of arxiv-summarizer-orchestrator.

Pipeline position:

  1. Step 1 (upstream): arxiv-search-collector produces the selected paper directories and metadata.
  2. Step 2 (this skill): arxiv-paper-processor downloads artifacts and writes one summary.md per paper.
  3. Step 3 (downstream): arxiv-batch-reporter uses these per-paper summaries to generate the final collection report.

Use this skill together with Step 1 and Step 3 for full end-to-end execution.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

deep-research-surf

Conducts deep, multi-angle research using Surf MCP tools and parallel subagents. Use for deep research, competitive landscape analysis, strategic intelligenc...

Registry SourceRecently Updated
Research

audio-quality-check

Analyze audio recording quality - echo detection, loudness, speech intelligibility, SNR, spectral analysis. Use when the user wants to check a recording's qu...

Registry SourceRecently Updated
1120Profile unavailable
Research

GEO Performance Analysis

Analyzes a brand’s presence and sentiment in LLM-generated industry recommendations, extracting mention context and competitor comparisons.

Registry SourceRecently Updated
201Profile unavailable
Research

Paperspace

Paperspace integration. Manage data, records, and automate workflows. Use when the user wants to interact with Paperspace data.

Registry SourceRecently Updated
1470Profile unavailable