gptq

GPTQ (Generative Pre-trained Transformer Quantization)

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "gptq" with this command: npx skills add orchestra-research/ai-research-skills/orchestra-research-ai-research-skills-gptq

Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

When to use GPTQ

Use GPTQ when:

  • Need to fit large models (70B+) on limited GPU memory

  • Want 4× memory reduction with <2% accuracy loss

  • Deploying on consumer GPUs (RTX 4090, 3090)

  • Need faster inference (3-4× speedup vs FP16)

Use AWQ instead when:

  • Need slightly better accuracy (<1% loss)

  • Have newer GPUs (Ampere, Ada)

  • Want Marlin kernel support (2× faster on some GPUs)

Use bitsandbytes instead when:

  • Need simple integration with transformers

  • Want 8-bit quantization (less compression, better quality)

  • Don't need pre-quantized model files

Quick start

Installation

# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install "auto-gptq[triton]"

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate

Load pre-quantized model

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False,  # Set True on Linux for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Quantize your own model

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load tokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Group size (recommended: 128)
    desc_act=False,     # Activation order (False for CUDA kernel)
    damp_percent=0.01,  # Dampening factor
)

# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
)

# Prepare calibration data: 128 samples, each truncated to 512 tokens.
# Each example is a dict with "input_ids" and "attention_mask".
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = []
for example in dataset.take(128):
    enc = tokenizer(example["text"], truncation=True, max_length=512)
    calibration_data.append(
        {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
    )

# Quantize
model.quantize(calibration_data)

# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")

# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")

Group-wise quantization

How GPTQ works:

  • Group weights: Divide each weight matrix into groups (typically 128 elements)

  • Quantize per-group: Each group has its own scale/zero-point

  • Minimize error: Uses Hessian information to minimize quantization error

  • Result: 4-bit weights with near-FP16 accuracy
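The grouping and per-group scale/zero-point steps above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses simple round-to-nearest inside each group and deliberately leaves out the Hessian-based error compensation that GPTQ adds on top. The helper names (`quantize_group`, `dequantize_group`, `quantize_row`) are hypothetical and are not part of AutoGPTQ.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, int]  # (quantized codes, scale, zero-point)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one group (not full GPTQ)."""
    qmax = (1 << bits) - 1                 # 15 for 4-bit
    w_min = min(min(weights), 0.0)         # make sure the range contains zero
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax or 1.0  # fall back to 1.0 for an all-zero group
    zero_point = round(-w_min / scale)     # integer code that represents 0.0
    codes = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row: List[float], group_size: int = 128, bits: int = 4) -> List[Group]:
    """Split one row of a weight matrix into groups and quantize each separately."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Two groups with very different ranges: each one gets its own scale.
row = [((i * 37) % 101 - 50) / 100 for i in range(128)]
row += [((i * 53) % 101 - 50) / 10 for i in range(128)]
groups = quantize_row(row, group_size=128, bits=4)
for codes, scale, zero_point in groups:
    print(f"scale={scale:.4f} zero_point={zero_point} first codes={codes[:4]}")
```

Because each group carries its own scale, the small-range group is not forced onto the coarse grid needed by the large-range group, which is the intuition behind the accuracy gain from grouping.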

Group size trade-off:

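| Group Size | Model Size | Accuracy | Speed | Recommendation |
| --- | --- | --- | --- | --- |
| -1 (per-column) | Smallest | Lowest | Fastest | Research only |
| 32 | Largest | Best | Slowest | High accuracy needed |
| 128 | Medium | Good | Fast | Recommended default |
| 256 | Smaller | Lower | Faster | Speed critical |
| 1024 | Smaller still | Lower still | Faster still | Not recommended |

Smaller groups store more scales and zero-points, so the model file grows while accuracy improves.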

Example:

Weight matrix: [1024, 4096] = 4.2M elements

Group size = 128:

  • Groups: 4.2M / 128 = 32,768 groups
  • Each group: own FP16 scale + 4-bit zero-point
  • Result: Better granularity → better accuracy
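The arithmetic in this example can be checked with a few lines of Python. The sketch below also estimates the effective bits per weight once the per-group metadata is counted, assuming each group stores one 16-bit scale and one zero-point as wide as the weights; the exact on-disk layout depends on the kernel format, and `group_overhead` is a hypothetical helper.

```python
from typing import Tuple


def group_overhead(rows: int, cols: int, group_size: int,
                   bits: int = 4, scale_bits: int = 16) -> Tuple[int, float]:
    """Return (number of groups, effective bits per weight).

    Assumes one `scale_bits`-wide scale and one `bits`-wide zero-point per group.
    """
    n_weights = rows * cols
    n_groups = n_weights // group_size
    metadata_bits = n_groups * (scale_bits + bits)
    return n_groups, bits + metadata_bits / n_weights


for g in (32, 128, 256, 1024):
    n_groups, eff_bits = group_overhead(1024, 4096, g)
    print(f"group_size={g:>4}: {n_groups:>7,} groups, {eff_bits:.3f} bits/weight")
```

Under these assumptions, group size 128 costs about 0.16 extra bits per weight, while group size 32 costs about 0.63, which is why smaller groups give a slightly larger model.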

Quantization configurations

Standard 4-bit (recommended)

from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Standard group size
    desc_act=False,     # Faster CUDA kernel
    damp_percent=0.01,  # Dampening factor
)

Performance:

  • Memory: 4× reduction (70B model: 140GB → 35GB)

  • Accuracy: ~1.5% perplexity increase

  • Speed: 3-4× faster than FP16

Higher compression (3-bit with standard groups)

config = BaseQuantizeConfig(
    bits=3,          # 3-bit (more compression)
    group_size=128,  # Keep standard group size
    desc_act=True,   # Better accuracy (slower)
    damp_percent=0.01,
)

Trade-off:

  • Memory: 5× reduction

  • Accuracy: ~3% perplexity increase

  • Speed: 5× faster (but less accurate)

Maximum accuracy (4-bit with small groups)

config = BaseQuantizeConfig(
    bits=4,
    group_size=32,       # Smaller groups (better accuracy)
    desc_act=True,       # Activation reordering
    damp_percent=0.005,  # Lower dampening
)

Trade-off:

  • Memory: 3.5× reduction (slightly larger)

  • Accuracy: ~0.8% perplexity increase (best)

  • Speed: 2-3× faster (kernel overhead)

Kernel backends

ExLlamaV2 (default, fastest)

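# AutoGPTQ selects kernels through disable_* flags; use_exllama and
# exllama_config={"version": 2} are options of transformers' GPTQConfig instead.
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    disable_exllamav2=False,  # Keep ExLlamaV2 enabled
)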

Performance: 1.5-2× faster than Triton

Marlin (Ampere+ GPUs)

# Quantize with a Marlin-compatible configuration
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # Required for Marlin
)

# Quantization itself is unchanged; the Marlin kernel is selected at load time
model.quantize(calibration_data)

# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True,  # 2× faster on A100/H100
)

Requirements:

  • NVIDIA Ampere or newer (A100, H100, RTX 40xx)

  • Compute capability ≥ 8.0
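The compute-capability requirement can be checked before asking for the Marlin kernel. The following is a small sketch, assuming PyTorch is the way the GPU is queried; `supports_marlin` and `current_gpu_capability` are hypothetical helper names, and the (8, 0) threshold is taken from the requirement listed above.

```python
from typing import Optional, Tuple

MARLIN_MIN_CAPABILITY = (8, 0)  # Ampere and newer, per the requirement above


def supports_marlin(capability: Tuple[int, int]) -> bool:
    """True when a (major, minor) compute capability meets the Marlin minimum."""
    return tuple(capability) >= MARLIN_MIN_CAPABILITY


def current_gpu_capability(index: int = 0) -> Optional[Tuple[int, int]]:
    """Return (major, minor) for a CUDA device, or None without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(index)


capability = current_gpu_capability()
if capability is None:
    print("No CUDA device detected; cannot check Marlin support.")
else:
    print(f"Compute capability {capability}: Marlin usable = {supports_marlin(capability)}")
```

A check like this lets a loading script fall back to another backend on older cards (for example a T4 at capability 7.5) instead of failing when the kernel is requested.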

Triton (Linux only)

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=True,  # Linux only
)

Performance: 1.2-1.5× faster than CUDA backend

Integration with transformers

Direct transformers usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False,
)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)

LoRA fine-tuning on a GPTQ model (QLoRA-style)

from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Add LoRA adapters (the quantized base weights stay frozen)
model = get_peft_model(model, lora_config)

# Fine-tune (memory efficient!)
# 70B model trainable on single A100 80GB

Performance benchmarks

Memory reduction

| Model | FP16 | GPTQ 4-bit | Reduction |
| --- | --- | --- | --- |
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3.1-405B | 810 GB | 203 GB | 4× |

Enables:

  • 70B on single A100 80GB (vs 2× A100 needed for FP16)

  • 405B on 3× A100 80GB (vs 11× A100 needed for FP16)

  • 13B on RTX 4090 24GB (vs OOM with FP16)
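The figures above follow from parameter count times bits per weight. The sketch below reproduces them as a rough estimate that covers weights only, ignoring the KV cache, activations, and per-group metadata; `weight_memory_gb` and `gpus_needed` are hypothetical helpers, and 1 GB is taken as 1e9 bytes.

```python
import math


def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight storage in GB for a model with the given bit width."""
    return params_billion * bits / 8


def gpus_needed(memory_gb: float, gpu_gb: float = 80.0) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_gb)


for name, size in [("13B", 13), ("70B", 70), ("405B", 405)]:
    fp16 = weight_memory_gb(size, 16)
    gptq = weight_memory_gb(size, 4)
    print(f"{name}: FP16 {fp16:.0f} GB ({gpus_needed(fp16)}x 80GB), "
          f"4-bit {gptq:.1f} GB ({gpus_needed(gptq)}x 80GB)")
```

In practice some headroom is needed beyond these numbers, so a result that lands close to a GPU's capacity should be read as a tight fit.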

Inference speed (Llama 2-7B, A100)

| Precision | Tokens/sec | vs FP16 |
| --- | --- | --- |
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |

Accuracy (perplexity on WikiText-2)

| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
| --- | --- | --- | --- |
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |

Excellent quality preservation - less than 2% degradation!
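The degradation column is the relative perplexity increase over the FP16 baseline. A short sketch of that calculation, using the perplexity values from the table above (`relative_increase` is a hypothetical helper name):

```python
def relative_increase(baseline: float, quantized: float) -> float:
    """Perplexity increase of the quantized model, as a percentage of the baseline."""
    return (quantized - baseline) / baseline * 100


# (FP16 perplexity, GPTQ 4-bit perplexity) as listed in the table above
results = {
    "Llama 2-7B": (5.47, 5.55),
    "Llama 2-13B": (4.88, 4.95),
    "Llama 2-70B": (3.32, 3.38),
}

for model, (fp16_ppl, gptq_ppl) in results.items():
    print(f"{model}: +{relative_increase(fp16_ppl, gptq_ppl):.1f}%")
```

The same function can be applied to perplexities measured on your own evaluation set to compare quantization configurations on equal terms.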

Common patterns

Multi-GPU deployment

# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto",                  # Automatically split across GPUs
    max_memory={0: "40GB", 1: "40GB"},  # Limit per GPU
)

# Manual device mapping: one entry per module.
# Range keys such as "model.layers.0-39" are not valid module names.
device_map = {"model.embed_tokens": 0, "model.norm": 1, "lm_head": 1}
for i in range(80):  # Llama 2-70B has 80 decoder layers
    device_map[f"model.layers.{i}"] = 0 if i < 40 else 1  # First 40 on GPU 0, last 40 on GPU 1

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map,
)
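For more than two GPUs, a manual device map can be generated rather than typed out. The sketch below is one possible helper, assuming Llama-style module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) as used in the example above; `build_device_map` is a hypothetical function, and other architectures name their modules differently.

```python
import math
from typing import Dict


def build_device_map(num_layers: int, num_gpus: int) -> Dict[str, int]:
    """Spread decoder layers evenly across GPUs, one map entry per module."""
    per_gpu = math.ceil(num_layers / num_gpus)
    device_map = {"model.embed_tokens": 0}           # embeddings on the first GPU
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = i // per_gpu
    last_gpu = (num_layers - 1) // per_gpu           # GPU that holds the final layer
    device_map["model.norm"] = last_gpu              # final norm and head follow it
    device_map["lm_head"] = last_gpu
    return device_map


device_map = build_device_map(num_layers=80, num_gpus=2)
print(device_map["model.layers.39"], device_map["model.layers.40"])
```

An even split ignores the extra memory taken by embeddings and the output head, so the first and last GPU may need a few layers fewer in practice.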

CPU offloading

# Offload some layers to CPU (for very large models)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,  # Repo id or path of a GPTQ checkpoint too large for the GPUs alone
    device_map="auto",
    max_memory={
        0: "80GB",       # GPU 0
        1: "80GB",       # GPU 1
        2: "80GB",       # GPU 2
        "cpu": "200GB",  # Offload overflow to CPU
    },
)

Batch inference

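# Process multiple prompts efficiently
prompts = [
    "Explain AI",
    "Explain ML",
    "Explain DL",
]

# Decoder-only models need a pad token and left padding for batched generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)

for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output, skip_special_tokens=True)}")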

Finding pre-quantized models

TheBloke on HuggingFace:

Search:

Find GPTQ models on HuggingFace

https://huggingface.co/models?library=gptq

Download:

from auto_gptq import AutoGPTQForCausalLM

# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device="cuda:0",
)

Supported models

  • LLaMA family: Llama 2, Llama 3, Code Llama

  • Mistral: Mistral 7B, Mixtral 8x7B, 8x22B

  • Qwen: Qwen, Qwen2, QwQ

  • DeepSeek: V2, V3

  • Phi: Phi-2, Phi-3

  • Yi, Falcon, BLOOM, OPT

  • 100+ models on HuggingFace

References

  • Calibration Guide - Dataset selection, quantization process, quality optimization

  • Integration Guide - Transformers, PEFT, vLLM, TensorRT-LLM

  • Troubleshooting - Common issues, performance optimization
