model-deployment

LLM deployment strategies including vLLM, TGI, and cloud inference endpoints.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install this skill with:

npx skills add pluginagentmarketplace/custom-plugin-ai-engineer/pluginagentmarketplace-custom-plugin-ai-engineer-model-deployment

Model Deployment

Deploy LLMs to production with optimal performance.

Quick Start

vLLM Server

# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 1

# Query (OpenAI-compatible)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello, how are you?",
        "max_tokens": 100
    }'
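
Because the server speaks the OpenAI API, the official openai Python client can be pointed at it as well. A minimal sketch, assuming the openai package (v1 or later) is installed and the server above is running locally; the API key is a placeholder since no --api-key was configured:

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Hello, how are you?",
    max_tokens=100,
)
print(completion.choices[0].text)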

Text Generation Inference (TGI)

# Docker deployment
docker run --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192

# Query
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}}'

Ollama (Local Deployment)

# Install and run
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
}'
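
The Ollama API is also easy to call from Python. A small sketch, assuming the requests package and the default daemon port; "stream": False makes the whole completion arrive as a single JSON object:

import requests

# Non-streaming request to the local Ollama daemon
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
print(response.json()["response"])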

Deployment Options Comparison

| Platform | Ease | Cost | Scale | Latency | Best For |
|---|---|---|---|---|---|
| vLLM | ⭐⭐ | Self-host | High | Low | Production |
| TGI | ⭐⭐ | Self-host | High | Low | HuggingFace ecosystem |
| Ollama | ⭐⭐⭐ | Free | Low | Medium | Local dev |
| OpenAI | ⭐⭐⭐ | Pay-per-token | Very High | Low | Quick start |
| AWS Bedrock | ⭐⭐ | Pay-per-token | Very High | Medium | Enterprise |
| Replicate | ⭐⭐⭐ | Pay-per-second | High | Medium | Prototyping |
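
For the managed platforms in the table, deployment reduces to an API call against the provider. A hedged sketch of the AWS Bedrock path, assuming boto3 credentials are configured and the account has been granted model access; the model ID and request body schema are illustrative and should be verified against the current Bedrock documentation:

import json
import boto3

# Bedrock runtime client; the region and model ID below are assumptions for illustration
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    modelId="meta.llama2-13b-chat-v1",
    body=json.dumps({"prompt": "What is AI?", "max_gen_len": 100, "temperature": 0.7}),
    contentType="application/json",
)
print(json.loads(response["body"].read())["generation"])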

FastAPI Inference Server

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load model at startup
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    text: str
    tokens_used: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True
            )

        # Decode only the newly generated tokens so the response excludes the prompt
        new_token_ids = outputs[0][inputs.input_ids.shape[1]:]
        generated = tokenizer.decode(new_token_ids, skip_special_tokens=True)
        new_tokens = len(new_token_ids)

        return GenerateResponse(
            text=generated,
            tokens_used=new_tokens
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": model_name}

Docker Deployment

Dockerfile

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python and curl (curl is used by the Compose healthcheck below)
RUN apt-get update && apt-get install -y python3 python3-pip curl && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Pre-download the model at build time (Llama 2 is gated, so this needs a Hugging
# Face token and accepted license terms); alternatively, mount a cache volume at runtime
RUN python3 -c "from transformers import AutoModelForCausalLM; \
    AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Docker Compose

version: '3.8'
services:
  llm-server:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: llm
        image: llm-inference:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Optimization Techniques

Quantization

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization (QLoRA-style)
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config_4bit,
    device_map="auto"
)

Batching

# Dynamic batching with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# vLLM automatically batches concurrent requests
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
sampling = SamplingParams(temperature=0.7, max_tokens=100)

outputs = llm.generate(prompts, sampling)  # Batched execution

KV Cache Optimization

# vLLM with PagedAttention
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=4096
)

Monitoring

from prometheus_client import Counter, Histogram, start_http_server
import time

# Metrics for the FastAPI `app` defined in the inference server example above
REQUEST_COUNT = Counter('inference_requests_total', 'Total requests')
LATENCY = Histogram('inference_latency_seconds', 'Request latency')
TOKENS = Counter('tokens_generated_total', 'Total tokens generated')  # increment from the /generate handler

@app.middleware("http")
async def metrics_middleware(request, call_next):
    REQUEST_COUNT.inc()
    start = time.time()
    response = await call_next(request)
    LATENCY.observe(time.time() - start)
    return response

# Start metrics server
start_http_server(9090)

Best Practices

  1. Use quantization: 4-bit for dev, 8-bit for production
  2. Implement batching: vLLM/TGI handle this automatically
  3. Monitor everything: Latency, throughput, errors, GPU utilization
  4. Cache responses: Reuse results for repeated queries (see the sketch after this list)
  5. Set timeouts: Prevent hung requests
  6. Load balance: Multiple replicas for high availability
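
For item 4, a minimal response-cache sketch built on the FastAPI server from earlier. It assumes exact-match prompts and a single worker process; across multiple replicas a shared store such as Redis would be needed:

import hashlib

# Naive in-memory cache keyed on the full request; illustration only
_cache: dict[str, GenerateResponse] = {}

def cache_key(req: GenerateRequest) -> str:
    raw = f"{req.prompt}|{req.max_tokens}|{req.temperature}|{req.top_p}"
    return hashlib.sha256(raw.encode()).hexdigest()

@app.post("/generate_cached", response_model=GenerateResponse)
async def generate_cached(request: GenerateRequest):
    key = cache_key(request)
    if key not in _cache:
        _cache[key] = await generate(request)  # reuse the handler defined above
    return _cache[key]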

Error Handling & Retry

from tenacity import retry, stop_after_attempt, wait_exponential
import httpx

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def call_inference_api(prompt: str):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/generate",
            json={"prompt": prompt},
            timeout=30.0
        )
        response.raise_for_status()  # surface HTTP errors so tenacity retries them
        return response.json()

Troubleshooting

| Symptom | Cause | Solution |
|---|---|---|
| OOM on load | Model too large | Use quantization |
| High latency | No batching | Enable vLLM batching |
| Connection refused | Server not started | Check health endpoint |
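
For the "connection refused" row, a small readiness probe that polls the health endpoint before traffic is sent. A sketch assuming the FastAPI server above and the requests package:

import time
import requests

def wait_until_healthy(url: str = "http://localhost:8000/health", retries: int = 30) -> bool:
    # Poll /health until the model has finished loading (or give up)
    for _ in range(retries):
        try:
            if requests.get(url, timeout=2).json().get("status") == "healthy":
                return True
        except requests.RequestException:
            pass
        time.sleep(5)
    return False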

Unit Test Template

from fastapi.testclient import TestClient
from main import app  # FastAPI app from the inference server example above
client = TestClient(app)

def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals: prompt-engineering, fine-tuning, java-spring-boot, java-testing (no summaries provided by the upstream source).