
High-Performance Inference

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following command and send it to your AI assistant to install the skill:

npx skills add yonatangross/orchestkit/yonatangross-orchestkit-high-performance-inference


Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan ) ships with PyTorch 2.9.0 and CUDA 12.9, introduces the AttentionConfig API, and recommends Python 3.12+.

Overview

Use this skill when:

  • Deploying LLMs with low latency requirements

  • Reducing GPU memory for larger models

  • Maximizing throughput for batch inference

  • Edge/mobile deployment with constrained resources

  • Cost optimization through efficient hardware utilization

Quick Reference

Basic vLLM server

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

With quantization + speculative decoding

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
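
The server exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming the default local port 8000; the request is built locally and the actual POST is left commented out, since it needs a running server:

```python
import json

# Assumed endpoint for a local vLLM server (default port 8000).
BASE_URL = "http://localhost:8000/v1/chat/completions"

# Build an OpenAI-compatible chat completion request.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize PagedAttention in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)

# To actually send it (requires the server above to be running):
#   import urllib.request
#   req = urllib.request.Request(
#       BASE_URL, data=body.encode(),
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```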

vLLM 0.14.x Key Features

| Feature | Benefit |
| --- | --- |
| PagedAttention | Up to 24x throughput via efficient KV cache |
| Continuous Batching | Dynamic request batching for max utilization |
| CUDA Graphs | Fast model execution with graph capture |
| Tensor Parallelism | Scale across multiple GPUs |
| Prefix Caching | Reuse KV cache for shared prefixes |
| AttentionConfig | New API replacing the VLLM_ATTENTION_BACKEND env var |
| Semantic Router | vLLM SR v0.1 "Iris" for intelligent LLM routing |

Python vLLM Integration

from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
prompts = ["Explain PagedAttention in one paragraph."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Quantization Methods

| Method | Bits | Memory Savings | Speed | Quality |
| --- | --- | --- | --- | --- |
| FP16 | 16 | Baseline | Baseline | Best |
| INT8 | 8 | 50% | +10-20% | Very Good |
| AWQ | 4 | 75% | +20-40% | Good |
| GPTQ | 4 | 75% | +15-30% | Good |
| FP8 | 8 | 50% | +30-50% | Very Good |

When to Use Each:

  • FP16: Maximum quality, sufficient memory

  • INT8/FP8: Balance of quality and efficiency

  • AWQ: Best 4-bit quality, activation-aware

  • GPTQ: Faster quantization, good quality
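
The memory-savings column follows directly from bits per weight. A quick weight-only sizing sketch (approximate; excludes KV cache and activation memory):

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
fp16 = weight_memory_gb(params_70b, 16)  # 140 GB
awq4 = weight_memory_gb(params_70b, 4)   # 35 GB, the 75% savings in the table

print(f"FP16: {fp16:.0f} GB, AWQ 4-bit: {awq4:.0f} GB, "
      f"savings: {1 - awq4 / fp16:.0%}")
```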

Speculative Decoding

Accelerate generation by predicting multiple tokens:

N-gram based (no extra model)

speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}
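
To see what prompt lookup does, here is a toy version of the idea in plain Python (an illustration, not vLLM's implementation): match the tail of the sequence against earlier text and propose the tokens that followed the match.

```python
def propose_ngram(prompt, generated, lookup_min=2, lookup_max=5,
                  num_speculative=5):
    """Propose draft tokens by matching the tail of the sequence
    against an earlier occurrence in the prompt + generated text."""
    context = prompt + generated
    for n in range(lookup_max, lookup_min - 1, -1):  # prefer longest match
        if len(context) < n:
            continue
        tail = context[-n:]
        # Search backwards for the most recent earlier occurrence of the tail.
        for start in range(len(context) - n - 1, -1, -1):
            if context[start:start + n] == tail:
                follow = context[start + n:start + n + num_speculative]
                if follow:
                    return follow
    return []  # no match: fall back to normal decoding

tokens = "the cat sat on the mat and the cat".split()
# Tail "the cat" occurred earlier, so the tokens after it are proposed:
print(propose_ngram(tokens, []))  # ['sat', 'on', 'the', 'mat', 'and']
```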

Draft model (higher quality)

speculative_config = {
    "method": "draft_model",
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
}

Expected Gains: 1.5-2.5x throughput for autoregressive tasks.
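
These gains can be sanity-checked against a simplified analytical model, assuming an independent per-token acceptance probability alpha, k speculative tokens, and one bonus token per verification step; draft_cost is the draft model's assumed cost relative to the target (0 for n-gram lookup). Real speedups depend on the workload.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass, with k
    speculative tokens accepted independently with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, draft_cost: float = 0.0) -> float:
    """Throughput speedup vs. plain decoding (one token per pass)."""
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost)

# alpha=0.7, k=5, free drafts (n-gram): ~2.9x under this model
print(round(speedup(0.7, 5), 2))
```

A draft model eats into the gain through its own cost, which is why small drafts (e.g. a 1B model against a 70B target) are preferred.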

Key Decisions

| Decision | Recommendation |
| --- | --- |
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Batch size | Dynamic via continuous batching |
| GPU memory | 0.85-0.95 utilization |
| Parallelism | Tensor parallel across GPUs |
| KV cache | Enable prefix caching for shared contexts |
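
Most of that 0.85-0.95 budget goes to KV cache once weights are loaded. A rough sizing sketch, using the published Llama-3.1-8B dimensions (32 layers, 8 KV heads via GQA, head dim 128) and a hypothetical 20 GB cache budget:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys + values across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # fp16
print(per_token)  # 131072 bytes = 128 KiB per token

budget_gb = 20  # hypothetical KV budget left after weights
max_tokens = budget_gb * 1024**3 // per_token
print(max_tokens)  # total cacheable tokens across all concurrent sequences
```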

Common Mistakes

  • Using GPTQ without calibration data (poor quality)

  • Over-allocating GPU memory (OOM on peak loads)

  • Ignoring warmup requests (cold start latency)

  • Not benchmarking actual workload patterns

  • Mixing quantization with incompatible features
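
The warmup mistake is easy to avoid with a generic helper; a sketch assuming any callable that takes a prompt (e.g. a thin wrapper around llm.generate or an HTTP call to the server):

```python
import time

def warmup(generate, num_requests: int = 3, prompt: str = "Hello") -> float:
    """Run a few short requests before accepting traffic, so CUDA graph
    capture and cache allocation don't land on the first real user.
    Returns total warmup time in seconds."""
    start = time.perf_counter()
    for _ in range(num_requests):
        generate(prompt)
    return time.perf_counter() - start

# Stub standing in for the real generate callable:
calls = []
elapsed = warmup(lambda p: calls.append(p), num_requests=3)
print(len(calls))  # 3
```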

Performance Benchmarking

import time

from vllm import LLM, SamplingParams

def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark tokens per second."""
    total_tokens = 0
    total_time = 0

    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start

        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed

    return total_tokens / total_time  # tokens/sec
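
Throughput alone hides tail latency; a small nearest-rank percentile helper for per-request timings, usable alongside the benchmark above (the sample latencies here are made up):

```python
def percentile(latencies, p):
    """Nearest-rank percentile of a list of latencies in seconds."""
    ordered = sorted(latencies)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

# One slow outlier dominates p99 but barely moves the median:
lat = [0.12, 0.15, 0.11, 0.95, 0.13, 0.14, 0.12, 0.16, 0.13, 0.14]
print(percentile(lat, 50), percentile(lat, 99))  # 0.13 0.95
```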

Advanced Patterns

See references/ for:

  • vLLM Deployment: PagedAttention, batching, production config

  • Quantization Guide: AWQ, GPTQ, INT8, FP8 comparison

  • Speculative Decoding: Draft models, n-gram, throughput tuning

  • Edge Deployment: Mobile, resource-constrained optimization

Related Skills

  • llm-streaming: Streaming token responses

  • function-calling: Tool use with inference

  • ollama-local: Local inference with Ollama

  • prompt-caching: Reduce redundant computation

  • semantic-caching: Cache full responses

Capability Details

vllm-deployment

Keywords: vllm, inference server, deploy, serve, production

Solves:

  • Deploy LLMs with vLLM for production

  • Configure tensor parallelism and batching

  • Optimize GPU memory utilization

quantization

Keywords: quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory

Solves:

  • Reduce model memory footprint

  • Choose appropriate quantization method

  • Maintain quality with lower precision

speculative-decoding

Keywords: speculative, draft model, faster generation, predict tokens

Solves:

  • Accelerate autoregressive generation

  • Configure draft models or n-gram speculation

  • Tune speculative token count

edge-inference

Keywords: edge, mobile, embedded, constrained, optimization

Solves:

  • Deploy on resource-constrained devices

  • Optimize for mobile/edge hardware

  • Balance quality and resource usage

throughput-optimization

Keywords: throughput, latency, performance, benchmark, optimize

Solves:

  • Maximize requests per second

  • Reduce time to first token

  • Benchmark and tune performance

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
