RunPod Serverless Builder

Build end-to-end RunPod serverless endpoints optimized for extremely short cold start times.

Capabilities

Create production-ready RunPod serverless workers for:

vLLM - High-performance LLM inference
ComfyUI - Image/video generation with workflow support
Custom Inference - User-provided Python inference code

Loading Strategies:

Baked Models: Models embedded in Docker image for fastest cold starts (<5s)
Dynamic Loading: Models loaded from network storage at runtime (shared across workers)

Quick Start

Use the interactive project generator:

python3 scripts/init_project.py

This generates a complete project with:

Optimized Dockerfile
RunPod handler (worker.py)
Startup scripts (for dynamic loading)
Configuration files
Documentation

Project Generation Workflow

Step 1: Run the Generator

Execute the script and answer prompts:

import subprocess
skill_dir = "/path/to/runpod-serverless-builder"
subprocess.run(["python3", f"{skill_dir}/scripts/init_project.py"])

The script prompts for:

Project name - e.g., "my-vllm-worker"
Workload type - vLLM, ComfyUI, or Custom
Loading strategy - Baked or Dynamic
Model configuration - Model name, quantization, etc.
Output directory - Where to generate files

Step 2: Customize Generated Files

The generator creates a complete project structure:

my-runpod-worker/
├── Dockerfile          # Optimized for cold starts
├── worker.py           # RunPod handler function
├── startup.sh          # Dynamic loading (if applicable)
├── requirements.txt    # Python dependencies
├── .dockerignore       # Build optimization
├── .env.example        # Environment variables
└── README.md           # Project documentation

Review and customize:

worker.py: Modify handler logic, add custom processing
Dockerfile: Add custom dependencies, adjust configurations
startup.sh: Add custom initialization steps
requirements.txt: Add additional Python packages

Step 3: Build and Deploy

# Build Docker image
docker build -t my-worker:latest .

# Push to registry
docker push registry/my-worker:latest

# Deploy to RunPod Dashboard
# 1. Create template with image
# 2. Set environment variables
# 3. Create endpoint

Manual Implementation (Without Generator)

If you prefer manual implementation or need to understand the patterns:

vLLM Worker

Baked Model Approach:

Copy Dockerfile template:

shutil.copy("assets/dockerfiles/vllm_baked.dockerfile", "Dockerfile")

Copy worker template:

shutil.copy("assets/workers/worker_vllm.py", "worker.py")

Build with model:

docker build -t my-vllm:latest \
  --build-arg MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
  --build-arg BASE_PATH="/models" \
  .

Dynamic Loading Approach:

Copy Dockerfile and startup script:

shutil.copy("assets/dockerfiles/vllm_dynamic.dockerfile", "Dockerfile")
shutil.copy("assets/startup_scripts/startup_vllm.sh", "startup.sh")

Set environment variables in RunPod:

MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=hf_your_token
GPU_MEMORY_UTILIZATION=0.95

ComfyUI Worker

Baked Model Approach:

Use ComfyUI baked template:

shutil.copy("assets/dockerfiles/comfyui_baked.dockerfile", "Dockerfile")
shutil.copy("assets/workers/worker_comfyui.py", "worker.py")

Modify Dockerfile to download models:

# Add model downloads
RUN aria2c -x 16 -s 16 https://huggingface.co/... \
    -d /ComfyUI/models/checkpoints

Dynamic Loading Approach:

Use dynamic template with startup script:

shutil.copy("assets/dockerfiles/comfyui_dynamic.dockerfile", "Dockerfile")
shutil.copy("assets/startup_scripts/startup_comfyui.sh", "startup.sh")
shutil.copy("assets/config/extra_model_paths.yaml", "extra_model_paths.yaml")

Configure network storage paths in extra_model_paths.yaml
Set environment variables:

GITHUB_PAT=ghp_token  # For private repos
CUSTOM_NODES=https://github.com/org/node1.git,https://github.com/org/node2.git

Custom Inference Worker

Use custom templates:

shutil.copy("assets/dockerfiles/custom_inference.dockerfile", "Dockerfile")
shutil.copy("assets/workers/worker_custom.py", "worker.py")

Implement your inference logic in worker.py:

def initialize_model():
    # Load your model
    return your_model

def handler(job):
    model = initialize_model()
    # Your inference logic
    result = model.predict(job["input"])
    return {"result": result}

Handler Patterns

vLLM Handler

from vllm import LLM, SamplingParams

llm = None

def initialize_model():
    global llm
    if llm is None:
        llm = LLM(model=MODEL_NAME, gpu_memory_utilization=0.95)
    return llm

def handler(job):
    model = initialize_model()
    messages = job["input"]["messages"]

    # Apply chat template
    tokenizer = model.get_tokenizer()
    prompt = tokenizer.apply_chat_template(messages, tokenize=False)

    # Generate
    outputs = model.generate([prompt], SamplingParams(...))
    return {"text": outputs[0].outputs[0].text}

ComfyUI Handler

def update_workflow(workflow, parameters):
    # Update prompt node
    workflow[parameters["prompt_node_id"]]["inputs"]["text"] = parameters["prompt"]
    # Update seed node
    workflow[parameters["seed_node_id"]]["inputs"]["seed"] = parameters.get("seed", 42)
    return workflow

def handler(job):
    # Load workflow JSON
    with open(job["input"]["workflow_path"]) as f:
        workflow = json.load(f)

    # Update with parameters
    workflow = update_workflow(workflow, job["input"])

    # Execute with ComfyUI API
    output = execute_comfyui_workflow(workflow)
    return {"image_base64": output}

Cold Start Optimization

Key strategies (see references/cold_start_optimization.md for details):

Baked Models Strategy

Models embedded in image
Target: <5 second cold starts
Best for: Small-medium models, latency-critical workloads

Dynamic Loading Strategy

Models on network storage
Target: <60 second cold starts
Best for: Large models, shared across workers

Dockerfile Optimization

# Use BuildKit cache mounts
RUN --mount=type=cache,target=/root/.cache/pip pip install ...

# Order from least to most frequently changing
COPY requirements.txt /
RUN pip install -r requirements.txt
COPY worker.py /

# Combine commands to reduce layers
RUN apt-get update && apt-get install -y pkg1 pkg2 && apt-get clean

Worker Optimization

# Module-level initialization (runs once per container)
MODEL = load_model()  # Cached across warm starts

def handler(job):
    # MODEL already loaded for warm starts
    return MODEL.predict(job["input"])

Reference Documentation

Consult reference files for detailed guidance:

references/cold_start_optimization.md - Comprehensive cold start optimization strategies
references/vllm_guide.md - vLLM configuration, API patterns, troubleshooting
references/comfyui_guide.md - ComfyUI workflow management, custom nodes, video workflows

Load references when needed:

# For cold start optimization questions
with open("references/cold_start_optimization.md") as f:
    cold_start_guide = f.read()

# For vLLM-specific configuration
with open("references/vllm_guide.md") as f:
    vllm_guide = f.read()

# For ComfyUI workflow patterns
with open("references/comfyui_guide.md") as f:
    comfyui_guide = f.read()

Common Scenarios

Scenario 1: vLLM with Baked Model

User request: "Create a RunPod endpoint for Llama 3.1 8B with the fastest possible cold starts"

Implementation:

Run init_project.py or copy vllm_baked templates
Set MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" in Dockerfile
Build with model baked in
Deploy to RunPod

Scenario 2: ComfyUI with Dynamic Loading

User request: "Build a ComfyUI video generation endpoint that loads models from network storage"

Implementation:

Run init_project.py selecting ComfyUI + Dynamic
Configure extra_model_paths.yaml for network storage
Implement workflow update logic in worker.py
Deploy with CUSTOM_NODES environment variable

Scenario 3: Custom Inference with User Code

User request: "I have a custom object detection model, help me deploy it to RunPod"

Implementation:

Run init_project.py selecting Custom
Copy user's model code to inference/ directory
Implement initialize_model() and handler() in worker.py
Add dependencies to requirements.txt
Build and deploy

Troubleshooting

Slow Cold Starts

Check if models are baked vs downloaded at runtime
Review Dockerfile layer caching
Minimize dependencies in requirements.txt
Consult references/cold_start_optimization.md

Worker Errors

Check logs in RunPod dashboard
Test worker.py locally: python3 worker.py
Verify environment variables in .env.example
Check model loading in initialize_model()

Build Failures

Verify base image compatibility
Check requirements.txt for conflicting versions
Test Dockerfile locally: docker build .

Best Practices

Always use the generator first - It implements proven patterns
Start with baked models - Optimize for cold starts, then consider dynamic loading if needed
Pin dependency versions - Avoid "latest" tags and unpinned packages
Profile cold starts - Measure and optimize based on actual metrics
Test locally before deploying - Run worker.py and docker build locally
Consult references - Load reference docs for detailed guidance on specific topics