uv-tensorrt-llm


Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "uv-tensorrt-llm" with this command: npx skills add uv-xiao/pkbllm/uv-xiao-pkbllm-uv-tensorrt-llm

TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:

  • Deploying on NVIDIA GPUs (A100, H100, GB200)

  • Need maximum throughput (24,000+ tokens/sec on Llama 3 8B with an H100)

  • Require low latency for real-time applications

  • Working with quantized models (FP8, INT4, FP4)

  • Scaling across multiple GPUs or nodes

Use vLLM instead when:

  • Need simpler setup and Python-first API

  • Want PagedAttention without TensorRT compilation

  • Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead when:

  • Deploying on CPU or Apple Silicon

  • Need edge deployment without NVIDIA GPUs

  • Want simpler GGUF quantization format

Quick start

Installation

Docker (recommended)

```bash
docker pull nvidia/tensorrt_llm:latest
```

pip install

```bash
pip install tensorrt_llm==1.2.0rc3
```

Requires CUDA 13.0.0, TensorRT 10.13.2, and Python 3.10-3.12.
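
A quick import is a reasonable post-install sanity check (a minimal sketch; it assumes the package exposes the usual __version__ attribute):

```python
# Verify the wheel loads against the local CUDA stack.
import tensorrt_llm

print(tensorrt_llm.__version__)
```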

Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)  # each result carries its completions in .outputs
```

Serving with trtllm-serve

Start server (automatic model download and compilation)

```bash
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096
```

Client request

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
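
Because the endpoint is OpenAI-compatible, the standard openai Python client works as well. A minimal sketch; the placeholder api_key is an assumption, since a local trtllm-serve instance typically does not check it:

```python
from openai import OpenAI

# Point the client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```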

Key features

Performance optimizations

  • In-flight batching: Dynamic batching during generation

  • Paged KV cache: Efficient memory management (tuning sketch after this list)

  • Flash Attention: Optimized attention kernels

  • Quantization: FP8, INT4, FP4 for 2-4× faster inference

  • CUDA graphs: Reduced kernel launch overhead
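
Some of these knobs are reachable from the LLM API. Below is a sketch of KV cache tuning, assuming the llmapi KvCacheConfig interface; field names may differ across versions, so check your installed release:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,  # share of free VRAM reserved for paged KV blocks
        enable_block_reuse=True,       # reuse cached blocks across requests with shared prefixes
    ),
)
```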

Parallelism

  • Tensor parallelism (TP): Split model across GPUs (see the sketch after this list)

  • Pipeline parallelism (PP): Layer-wise distribution

  • Expert parallelism: For Mixture-of-Experts models

  • Multi-node: Scale beyond single machine
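
The two main schemes compose. A minimal sketch, assuming the LLM API's tensor_parallel_size and pipeline_parallel_size arguments (4-way TP times 2-way PP occupies 8 GPUs):

```python
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,    # split each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 sequential stages
)
```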

Advanced features

  • Speculative decoding: Faster generation with draft models (illustrated after this list)

  • LoRA serving: Efficient multi-adapter deployment

  • Disaggregated serving: Separate prefill and generation
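
To make the speculative decoding idea concrete, here is a conceptual draft-then-verify loop (greedy variant). This illustrates the algorithm, not TensorRT-LLM's API; the callables are hypothetical stand-ins for real models:

```python
from typing import Callable, List

def speculative_step(
    draft_next: Callable[[List[int]], int],   # cheap draft model (greedy)
    target_next: Callable[[List[int]], int],  # expensive target model (greedy)
    context: List[int],
    k: int = 4,
) -> List[int]:
    # 1) Draft: the cheap model proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2) Verify: accept proposals until the target disagrees, then emit
    #    the target's own token. In a real engine verification is a single
    #    batched forward pass over all k positions, which is the speedup.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first rejected draft token
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy stand-ins: both "models" emit the next integer, so every draft
# token is accepted and one step yields k tokens for one target pass.
next_int = lambda ctx: ctx[-1] + 1
print(speculative_step(next_int, next_int, [1, 2, 3]))  # [4, 5, 6, 7]
```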

Common patterns

Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192,
)

# Inference is the same as before
outputs = llm.generate(["Summarize this article..."])
```

Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B",  # the 405B model ships as Llama 3.1
    tensor_parallel_size=8,
    dtype="fp8",
)
```

Batch inference

```python
# Process 100 prompts efficiently; in-flight batching schedules them
# automatically for maximum throughput
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200),
)
```

Performance benchmarks

Meta Llama 3-8B (H100 GPU):

  • Throughput: 24,000 tokens/sec (aggregate across concurrent requests)

  • Latency: ~10 ms per token per stream

  • vs PyTorch: up to 100× faster than an unoptimized baseline

At ~10 ms/token per stream, the aggregate figure corresponds to roughly 240 concurrent sequences.

Llama 3-70B (8× A100 80GB):

  • FP8 quantization: 2× faster than FP16

  • Memory: 50% reduction with FP8

Supported models

  • LLaMA family: Llama 2, Llama 3, CodeLlama

  • GPT family: GPT-2, GPT-J, GPT-NeoX

  • Qwen: Qwen, Qwen2, QwQ

  • DeepSeek: DeepSeek-V2, DeepSeek-V3

  • Mixtral: Mixtral-8x7B, Mixtral-8x22B

  • Vision: LLaVA, Phi-3-vision

  • 100+ models on Hugging Face

References

  • Optimization Guide - Quantization, batching, KV cache tuning

  • Multi-GPU Setup - Tensor/pipeline parallelism, multi-node

  • Serving Guide - Production deployment, monitoring, autoscaling
