PaperBanana: Academic Illustration Pipeline
Automates publication-ready academic illustrations via 5 specialized agents: Retriever (categorize & select references) -> Planner (multimodal description) -> Stylist (polish) -> Visualizer (render) -> Critic (evaluate & refine). In diagram mode, each agent is implemented as a separate Gemini API call; plot mode follows the same five roles but generates code instead of calling Gemini for the plot itself.
Two output modes:
- DIAGRAM MODE: Each agent is a Python script calling Gemini VLM/image APIs. Run scripts/orchestrate.py for end-to-end execution.
- PLOT MODE: Statistical plots generated as executable Python matplotlib/seaborn code (code-based to eliminate data hallucination).
Requirements: GOOGLE_API_KEY env var (used for VLM calls in retriever/planner/stylist/critic AND image generation in visualizer), Python 3.10+ with google-genai, matplotlib, seaborn, numpy, pillow.
Paper: PaperBanana: Automating Academic Illustrations with Multi-Agent Systems (arXiv:2601.23265, Google/PKU)
Step 1: Determine Output Mode
Decide which track to follow:
| Signal | Mode |
|---|---|
| User provides raw data, table, CSV + visual intent (bar chart, scatter, etc.) | PLOT MODE |
| User provides methodology text, description, or figure caption | DIAGRAM MODE |
| User provides existing figure to improve | Match original type |
Critical rule: PLOT MODE always generates Python code (never image generation for data visualizations). Code-based generation eliminates data hallucination errors that corrupt numerical accuracy in image-based approaches.
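The signal table above can be read as a small routing function. The sketch below is one possible reading of it; the name `choose_mode` and its inputs are hypothetical and are not part of the PaperBanana scripts.

```python
from typing import Optional


def choose_mode(has_raw_data: bool,
                has_methodology_text: bool,
                existing_figure_type: Optional[str] = None) -> str:
    """Return "plot" or "diagram" following the signal table (hypothetical helper)."""
    # An existing figure keeps its original type.
    if existing_figure_type in ("plot", "diagram"):
        return existing_figure_type
    # Raw data plus a visual intent always goes to code-based plotting.
    if has_raw_data:
        return "plot"
    # Methodology text, a description, or a caption goes to the diagram pipeline.
    if has_methodology_text:
        return "diagram"
    raise ValueError("need raw data, methodology text, or an existing figure")
```

Checking the existing-figure case first reflects the "Match original type" row; that ordering is an assumption about how conflicting signals should be resolved.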
Step 2: Execute Pipeline
DIAGRAM MODE — Automated Pipeline
Primary entry point: Run the end-to-end orchestrator:
python scripts/orchestrate.py \
--methodology-file methodology.txt \
--caption "Figure 1: Overview of proposed framework" \
--mode diagram \
--output output/diagram.png
Or with inline text:
python scripts/orchestrate.py \
--methodology "Our framework consists of three modules..." \
--caption "Figure 1: System overview" \
--mode diagram \
--output output/diagram.png
The orchestrator chains all 5 agents automatically and handles the Critic's refinement loop (up to 3 iterations). Intermediate outputs are saved to output/work/ for inspection.
Pipeline Details
Read references/DIAGRAM-PROMPTS.md for the actual Gemini prompt templates used by each agent.
Phase 1: RETRIEVER (scripts/retriever.py) — Gemini VLM call
- Classifies methodology into 1 of 4 categories from references/DIAGRAM-CATEGORIES.md
- Selects 2 most relevant reference diagrams from the 13 curated examples in assets/references/
- Identifies visual intent: Framework Overview, Pipeline/Flow, Detailed Module, Architecture Diagram
Phase 2: PLANNER (scripts/planner.py) — Multimodal Gemini VLM call
- Sends the 2 selected reference images + methodology text to Gemini as a multimodal prompt
- The VLM "sees" what good methodology diagrams look like (in-context learning from images)
- Generates an extremely detailed textual description of the target diagram
- Critical: Natural language only for all visual attributes. NEVER hex codes or pixel dimensions
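The "natural language only" rule can be checked mechanically before the description moves on to the Stylist. The following is a minimal sketch; `find_forbidden_tokens` and both regular expressions are assumptions about what counts as a hex code or pixel dimension, not code from scripts/planner.py.

```python
import re

# Assumed patterns: "#RGB" / "#RRGGBB" hex colors and sizes such as "12px" or "40 pixels".
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")
PIXEL_DIM = re.compile(r"\b\d+\s*(?:px|pixels?)\b", re.IGNORECASE)


def find_forbidden_tokens(description: str) -> list[str]:
    """Return hex codes, then pixel dimensions, found in a Planner description."""
    return HEX_COLOR.findall(description) + PIXEL_DIM.findall(description)
```

An empty result means the description passes this check. The hex pattern can also flag text such as "#123" used as an identifier, so a hit is a prompt to review rather than a hard failure.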
Phase 3: STYLIST (scripts/stylist.py) — Gemini VLM call
- Takes the Planner's description + full NeurIPS 2025 style guide
- Applies domain-specific styling based on the category from Phase 1
- Follows 5 critical rules: preserve aesthetics, intervene minimally, respect domain, enrich details, preserve content
- Outputs the polished description only
Phase 4: VISUALIZER (scripts/generate_image.py) — Gemini Image API call
- Uses gemini-3-pro-image-preview to generate the diagram image from the styled description
- Prepends quality prefix (high-res, legible text, clean background, no watermarks)
- Aspect ratio selected based on visual intent (16:9 for pipelines, 3:2 for modules)
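The quality prefix and aspect-ratio rule above could be assembled as follows. This is a sketch only: the prefix wording, the mapping of "pipelines" and "modules" to the intent names from Phase 1, the fallback ratio, and the function `build_image_request` are all assumptions, not the contents of scripts/generate_image.py.

```python
# Assumed prefix wording; the real script's prefix text is not shown in this document.
QUALITY_PREFIX = "High-resolution image, legible text, clean background, no watermarks. "

# Only these two ratios are stated in the text; the intent names are an assumed mapping.
ASPECT_RATIOS = {"Pipeline/Flow": "16:9", "Detailed Module": "3:2"}
DEFAULT_ASPECT_RATIO = "16:9"  # assumed fallback for the remaining intents


def build_image_request(styled_description: str, visual_intent: str) -> dict:
    """Return the prompt and aspect ratio for one image-generation call."""
    return {
        "prompt": QUALITY_PREFIX + styled_description.strip(),
        "aspect_ratio": ASPECT_RATIOS.get(visual_intent, DEFAULT_ASPECT_RATIO),
    }
```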
Phase 5: CRITIC (scripts/critic.py) — Multimodal Gemini VLM call
- Sends the generated image + methodology text to Gemini for multimodal evaluation
- Scores on 4 dimensions (faithfulness, readability, conciseness, aesthetics)
- If faithfulness < 7 OR readability < 7: generates revised description → loops to Phase 4
- Maximum 3 refinement iterations
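The refinement rule above amounts to a bounded loop. The sketch below shows one way to express it; `render` and `critique` are hypothetical stand-ins for Phase 4 and Phase 5, and the way iterations are counted is an assumption about the orchestrator, not a copy of it.

```python
from typing import Callable

THRESHOLD = 7        # revise when faithfulness or readability falls below this
MAX_ITERATIONS = 3   # maximum number of refinement passes


def needs_revision(scores: dict) -> bool:
    """Apply the rule: faithfulness < 7 OR readability < 7."""
    return scores["faithfulness"] < THRESHOLD or scores["readability"] < THRESHOLD


def refine(description: str,
           render: Callable[[str], str],
           critique: Callable[[str], dict]) -> tuple[str, int]:
    """Render once, then critique and re-render until scores pass or passes run out.

    `render` maps a description to an image path. `critique` maps an image path to
    a dict with "faithfulness", "readability", and "revised_description" keys.
    Returns the final image path and the number of refinement passes performed.
    """
    image = render(description)
    for completed in range(MAX_ITERATIONS):
        result = critique(image)
        if not needs_revision(result):
            return image, completed
        image = render(result["revised_description"])
    return image, MAX_ITERATIONS
```

Passing the two phases in as callables keeps the loop testable without any API calls.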
DIAGRAM MODE — Manual Execution
You can also run each agent individually for more control:
# Phase 1: Retriever
python scripts/retriever.py --methodology-file text.txt --output work/retriever.json
# Phase 2: Planner
python scripts/planner.py --methodology-file text.txt --caption "Figure 1: ..." \
--references work/retriever.json --output work/planner.json
# Phase 3: Stylist
python scripts/stylist.py --description work/planner.json --output work/stylist.json
# Phase 4: Visualizer (extract styled_description from JSON, pass to generate_image.py)
python scripts/generate_image.py --prompt-file work/styled_desc.txt --output output/diagram.png
# Phase 5: Critic
python scripts/critic.py --image output/diagram.png --methodology-file text.txt \
--description work/stylist.json --output work/critic.json
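The Phase 4 command reads work/styled_desc.txt, while Phase 3 writes work/stylist.json, so the description has to be extracted in between. A minimal sketch of that step follows; it assumes `styled_description` is a top-level string key in the Stylist output, as the Phase 4 comment suggests, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def extract_styled_description(stylist_json: str, prompt_file: str) -> str:
    """Copy the styled_description field from the Stylist output into a text file."""
    data = json.loads(Path(stylist_json).read_text(encoding="utf-8"))
    # Assumption: the field is a top-level string; adjust if the JSON is nested.
    description = data["styled_description"]
    Path(prompt_file).write_text(description, encoding="utf-8")
    return description


# Example call between Phase 3 and Phase 4:
# extract_styled_description("work/stylist.json", "work/styled_desc.txt")
```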
PLOT MODE
Read references/PLOT-PROMPTS.md for detailed agent prompts. Read references/PLOT-STYLE-GUIDE.md for aesthetic rules.
Plot mode uses Claude (or the host agent) for reasoning and code generation — no Gemini API calls needed for plot generation itself.
Phase 1: CATEGORIZE (Retriever)
Match data characteristics and visual intent:
| Data Type | Plot Types |
|---|---|
| Categorical comparison | Bar chart, grouped bar, stacked bar |
| Continuous trends | Line chart, area chart |
| Correlation/distribution | Scatter plot, histogram, box plot, violin |
| Matrix/similarity | Heatmap, confusion matrix |
| Multi-dimensional | Radar/spider chart |
| Proportional | Pie/donut chart, treemap |
Phase 2: PLAN (Planner)
Create a detailed specification that explicitly enumerates:
- Every raw data point with exact coordinates/values
- Axis ranges, labels, tick marks, scales (linear/log)
- Color assignments for each series/category
- Font sizes for title, axis labels, tick labels, legend
- Line widths, marker sizes, marker shapes
- Legend placement and formatting
- Grid style (major/minor, dashed/solid)
- Figure dimensions and DPI
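One way to keep such a specification explicit is a small typed structure that can be validated before any code is generated. The sketch below is illustrative: the class names, field names, and default values are assumptions and do not describe the schema of plot_config.json.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class SeriesSpec:
    label: str
    x: list
    y: list[float]
    color: str
    marker: str = "o"
    line_width: float = 1.5
    marker_size: float = 5.0


@dataclass
class PlotSpec:
    title: str
    x_label: str
    y_label: str
    series: list[SeriesSpec]
    y_range: tuple[float, float]
    y_scale: str = "linear"  # "linear" or "log"
    font_sizes: dict = field(
        default_factory=lambda: {"title": 12, "axis": 10, "tick": 8, "legend": 8})
    legend_loc: str = "upper right"
    grid_style: str = "dashed"
    figsize_inches: tuple[float, float] = (6.0, 4.0)
    dpi: int = 300

    def validate(self) -> None:
        """Raise ValueError if the specification is internally inconsistent."""
        if self.y_scale not in ("linear", "log"):
            raise ValueError(f"unknown scale: {self.y_scale!r}")
        low, high = self.y_range
        if not low < high:
            raise ValueError("y_range must be (low, high) with low < high")
        for s in self.series:
            if len(s.x) != len(s.y):
                raise ValueError(f"series {s.label!r}: x and y lengths differ")
```

Calling `asdict(spec)` yields a plain dictionary that can be written out as JSON for inspection.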
Phase 3: STYLE (Stylist)
Read references/PLOT-STYLE-GUIDE.md for NeurIPS 2025 plot aesthetics.
Key styling rules:
- White backgrounds only
- Colorblind-friendly palettes (see assets/palettes/colorblind_safe.json)
- Sans-serif fonts (Helvetica, Arial, or DejaVu Sans)
- Markers on line charts for print readability
- Inward-facing tick marks
- Subtle grid lines (light gray, dashed)
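Several of these rules map directly onto matplotlib rcParams. The sketch below collects them in a dictionary; it is an approximation of the rules as listed, not the contents of academic_default.mplstyle, and the function name and grid line width are assumptions.

```python
def academic_rc_params() -> dict:
    """Return matplotlib rcParams approximating the styling rules listed above."""
    return {
        # White backgrounds only.
        "figure.facecolor": "white",
        "axes.facecolor": "white",
        "savefig.facecolor": "white",
        # Sans-serif fonts, tried in this order.
        "font.family": "sans-serif",
        "font.sans-serif": ["Helvetica", "Arial", "DejaVu Sans"],
        # Inward-facing tick marks.
        "xtick.direction": "in",
        "ytick.direction": "in",
        # Subtle grid lines: light gray, dashed (line width is an assumed value).
        "axes.grid": True,
        "grid.color": "lightgray",
        "grid.linestyle": "--",
        "grid.linewidth": 0.5,
    }
```

Apply it with `matplotlib.rcParams.update(academic_rc_params())`. Markers and palette colors are left out here because they are better set per series when plotting.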
Phase 4: VISUALIZE (Visualizer — Code Generation)
Generate complete, self-contained Python matplotlib/seaborn code. Use scripts/plot_generator.py as a reference implementation or run it directly with a JSON config:
python scripts/plot_generator.py --config plot_config.json --output figure.pdf
Code requirements:
- Self-contained: all data defined inline, no external file dependencies
- Apply .mplstyle from assets/matplotlib_styles/academic_default.mplstyle
- Set OUTPUT_PATH variable for output file location
- 300 DPI, bbox_inches='tight'
- No plt.show(); save only
- Support both PDF and PNG output
After generating the code, execute it to produce the plot image.
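A generated script that meets the code requirements above might look like the following. It is a minimal sketch with placeholder data, not output from scripts/plot_generator.py; the model names, values, bar color, and figure size are illustrative assumptions.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend: render to file without opening a window
import matplotlib.pyplot as plt

OUTPUT_PATH = "figure.png"  # use a .pdf suffix for vector output
STYLE_PATH = Path("assets/matplotlib_styles/academic_default.mplstyle")

# All data is defined inline (placeholder values for illustration only).
models = ["Model A", "Model B", "Model C"]
f1_scores = [92.3, 88.1, 95.7]

# Fall back to matplotlib defaults if the style sheet is not present.
if STYLE_PATH.exists():
    plt.style.use(str(STYLE_PATH))

fig, ax = plt.subplots(figsize=(5, 3.5))
ax.bar(models, f1_scores, color="#0072B2")
ax.set_ylabel("F1 score")
ax.set_ylim(0, 100)

# savefig infers PDF or PNG from the file suffix; no plt.show() call is made.
fig.savefig(OUTPUT_PATH, dpi=300, bbox_inches="tight")
plt.close(fig)
```

Because the output format follows the suffix of `OUTPUT_PATH`, the same script covers both PDF and PNG.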
Phase 5: CRITIQUE (Critic)
Same rubric as diagram mode, plus plot-specific checks:
- Data fidelity: Every data point correctly plotted
- Axis accuracy: Ranges, labels, scales match specification
- Layout: No overlapping labels, legends, or data points
- Code correctness: Syntax valid, imports available, output saved
If code execution fails, analyze the error, simplify the approach, and regenerate.
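The code-correctness check can be partly automated before execution. The sketch below uses only the standard library to test syntax, import availability, and the save-only rule; `check_plot_code` is a hypothetical helper and is not the logic of scripts/validate_output.py.

```python
import ast
import importlib.util


def check_plot_code(source: str) -> list[str]:
    """Return a list of problems found in generated plotting code (empty if none)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        # Collect top-level package names from "import x.y" and "from x.y import z".
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            roots = [node.module.split(".")[0]]
        else:
            roots = []
        for root in roots:
            if importlib.util.find_spec(root) is None:
                problems.append(f"import not available: {root}")
        # Flag any method call named "show", such as plt.show().
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "show"):
            problems.append("calls show(); the script should only save")
    # Crude text check that the output is saved somewhere.
    if "savefig" not in source:
        problems.append("no savefig call found")
    return problems
```

A non-empty result can be fed back into regeneration alongside any runtime traceback. Data fidelity and layout still need a visual or numerical check that this static pass does not cover.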
Quick Start Examples
Diagram (automated): Run scripts/orchestrate.py with your methodology text file and caption.
Diagram (via agent): "Generate a methodology diagram for my transformer architecture. Here is the methodology section: [paste text]. Caption: Overview of our proposed multi-head attention framework."
Plot: "Create a bar chart comparing model performance. Data: {BERT: 92.3, GPT-4: 88.1, Claude: 95.7, Gemini: 91.2}. Intent: F1 score comparison across language models."
Improve: "Improve the aesthetics of this diagram: [paste existing description or attach current figure]"
File Reference
| File | Purpose | When to Read |
|---|---|---|
| scripts/orchestrate.py | End-to-end pipeline runner | Diagram mode primary entry point |
| scripts/retriever.py | VLM-based reference selection | Phase 1 (diagram mode) |
| scripts/planner.py | Multimodal description generation | Phase 2 (diagram mode) |
| scripts/stylist.py | VLM-based style application | Phase 3 (diagram mode) |
| scripts/generate_image.py | Gemini Image API call | Phase 4 (diagram mode) |
| scripts/critic.py | VLM-based image evaluation | Phase 5 (diagram mode) |
| scripts/plot_generator.py | Template-based matplotlib generator | Phase 4 (plot mode) |
| scripts/validate_output.py | Output validation and dependency check | Post-generation validation |
| references/DIAGRAM-PROMPTS.md | Actual Gemini prompt templates for diagrams | All diagram phases |
| references/PLOT-PROMPTS.md | Agent prompts for plots | All plot phases |
| references/DIAGRAM-STYLE-GUIDE.md | NeurIPS 2025 diagram aesthetics | Phase 3 (Style) |
| references/PLOT-STYLE-GUIDE.md | NeurIPS 2025 plot aesthetics | Phase 3 (Style) |
| references/EVALUATION-RUBRIC.md | Critic scoring criteria (4 dimensions) | Phase 5 (Critique) |
| references/DIAGRAM-CATEGORIES.md | 4 diagram categories with keywords | Phase 1 (Categorize) |
| assets/references/index.json | 13 curated reference diagram metadata | Phase 1 (Retriever) |
| assets/references/*.jpg | 13 curated reference diagram images | Phase 2 (Planner multimodal input) |
| assets/palettes/*.json | Color palette definitions | Phase 3 (Style) |
| assets/matplotlib_styles/*.mplstyle | Matplotlib style sheets | Phase 4 (plot mode) |
Environment Setup
# Required for all Gemini API calls (VLM reasoning + image generation)
export GOOGLE_API_KEY="your-api-key-here"
# Install dependencies
pip install google-genai matplotlib seaborn numpy pillow
Verify setup: python scripts/validate_output.py --check-deps