bitsandbytes - LLM Quantization

Quick start

bitsandbytes reduces LLM memory use by 50% (8-bit) or 75% (4-bit) with minimal accuracy loss (typically under 1-2%).

Installation:

pip install bitsandbytes transformers accelerate

8-bit quantization (50% memory reduction):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)

Memory: 14GB → 7GB

4-bit quantization (75% memory reduction):

import torch

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)

Memory: 14GB → 3.5GB
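
To confirm the savings on your own hardware, Transformers models expose get_memory_footprint(); a minimal check, assuming the 4-bit model above is already loaded:

# Size of the loaded (quantized) weights in bytes; roughly 3.5GB for a 4-bit 7B model
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f}GB")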

Common workflows

Workflow 1: Load large model in limited GPU memory

Copy this checklist:

Quantization Loading:

  • Step 1: Calculate memory requirements
  • Step 2: Choose quantization level (4-bit or 8-bit)
  • Step 3: Configure quantization
  • Step 4: Load and verify model

Step 1: Calculate memory requirements

Estimate model memory:

FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9

Example (Llama 2 7B):
  FP16: 7B × 2 / 1e9 = 14 GB
  INT8: 7B × 1 / 1e9 = 7 GB
  INT4: 7B × 0.5 / 1e9 = 3.5 GB
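
The same arithmetic as a small helper (a sketch; the function name is illustrative, and the estimate covers weights only, ignoring activation and KV-cache overhead):

def estimate_weight_memory_gb(num_params: float) -> dict:
    """Rough weight-only memory estimates in GB for each precision."""
    return {
        "fp16": num_params * 2 / 1e9,
        "int8": num_params * 1 / 1e9,
        "int4": num_params * 0.5 / 1e9,
    }

print(estimate_weight_memory_gb(7e9))
# {'fp16': 14.0, 'int8': 7.0, 'int4': 3.5}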

Step 2: Choose quantization level

GPU VRAM   Model Size   Recommended
8 GB       3B           4-bit
12 GB      7B           4-bit
16 GB      7B           8-bit or 4-bit
24 GB      13B          8-bit (or 70B in 4-bit with CPU offload)
40+ GB     70B          8-bit (~70 GB of weights, so 80 GB cards or multi-GPU)
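
A quick feasibility check for a given card and model size (hypothetical helper; weight-only estimate, so real usage needs extra room for activations and the KV cache):

def fits_in_vram(vram_gb: float, num_params: float, bits: int) -> bool:
    """True if the quantized weights alone fit with ~20% VRAM headroom."""
    weight_gb = num_params * bits / 8 / 1e9
    return weight_gb <= vram_gb * 0.8

print(fits_in_vram(12, 7e9, 4))    # True  -> 7B in 4-bit fits a 12 GB card
print(fits_in_vram(24, 70e9, 4))   # False -> 70B in 4-bit needs CPU offload at 24 GB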

Step 3: Configure quantization

For 8-bit (better accuracy):

from transformers import BitsAndBytesConfig
import torch

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,           # Outlier threshold
    llm_int8_has_fp16_weight=False
)

For 4-bit (maximum memory savings):

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,   # Compute in FP16
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True          # Nested quantization
)

Step 4: Load and verify model

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained( "meta-llama/Llama-2-13b-hf", quantization_config=config, device_map="auto", # Automatic device placement torch_dtype=torch.float16 )

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

Test inference

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))

Check memory

import torch
print(f"Memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")

Workflow 2: Fine-tune with QLoRA (4-bit training)

QLoRA enables fine-tuning large models on consumer GPUs.

Copy this checklist:

QLoRA Fine-tuning:

  • Step 1: Install dependencies
  • Step 2: Configure 4-bit base model
  • Step 3: Add LoRA adapters
  • Step 4: Train with standard Trainer

Step 1: Install dependencies

pip install bitsandbytes transformers peft accelerate datasets

Step 2: Configure 4-bit base model

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

Step 3: Add LoRA adapters

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

Prepare model for training

model = prepare_model_for_kbit_training(model)

Configure LoRA

lora_config = LoraConfig(
    r=16,                  # LoRA rank
    lora_alpha=32,         # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

Add LoRA adapters

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.06%

Step 4: Train with standard Trainer

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments( output_dir="./qlora-output", per_device_train_batch_size=4, gradient_accumulation_steps=4, num_train_epochs=3, learning_rate=2e-4, fp16=True, logging_steps=10, save_strategy="epoch" )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

trainer.train()

Save LoRA adapters (only ~20MB)

model.save_pretrained("./qlora-adapters")
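
To reuse the adapters for inference later, reload the 4-bit base model and attach them with PEFT (a sketch; the paths and model name mirror the examples above):

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the quantized base model, then attach the trained LoRA adapters
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "./qlora-adapters")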

Workflow 3: 8-bit optimizer for memory-efficient training

Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.

8-bit Optimizer Setup:

  • Step 1: Replace standard optimizer
  • Step 2: Configure training
  • Step 3: Monitor memory savings

Step 1: Replace standard optimizer

import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

Instead of torch.optim.AdamW

model = AutoModelForCausalLM.from_pretrained("model-name")

training_args = TrainingArguments( output_dir="./output", per_device_train_batch_size=8, optim="paged_adamw_8bit", # 8-bit optimizer learning_rate=5e-5 )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()

Manual optimizer usage:

import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8
)

Training loop

for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Step 2: Configure training

Compare memory:

Standard AdamW optimizer memory = model_params × 8 bytes (states)
8-bit AdamW memory = model_params × 2 bytes
Savings = 75% of optimizer memory

Example (Llama 2 7B):
  Standard: 7B × 8 = 56 GB
  8-bit: 7B × 2 = 14 GB
  Savings: 42 GB
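
The same comparison as a quick calculation (a sketch covering Adam optimizer states only; weights and gradients are extra):

def adam_state_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for Adam's optimizer states, at the given total bytes per parameter."""
    return num_params * bytes_per_param / 1e9

print(adam_state_memory_gb(7e9, 8))  # 56.0 -> standard FP32 AdamW states
print(adam_state_memory_gb(7e9, 2))  # 14.0 -> 8-bit AdamW states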

Step 3: Monitor memory savings

import torch

before = torch.cuda.memory_allocated()

Training step

optimizer.step()

after = torch.cuda.memory_allocated()
print(f"Memory used: {(after-before)/1e9:.2f}GB")

When to use vs alternatives

Use bitsandbytes when:

  • GPU memory limited (need to fit larger model)

  • Training with QLoRA (fine-tune 70B on single GPU)

  • Inference only (50-75% memory reduction)

  • Using HuggingFace Transformers

  • 0-2% accuracy degradation is acceptable

Use alternatives instead:

  • GPTQ/AWQ: Production serving (faster inference than bitsandbytes)

  • GGUF: CPU inference (llama.cpp)

  • FP8: H100 GPUs (hardware FP8 faster)

  • Full precision: Accuracy critical, memory not constrained

Common issues

Issue: CUDA error during loading

Make sure the installed bitsandbytes build matches your CUDA version:

Check CUDA version

nvcc --version

Install matching bitsandbytes

pip install bitsandbytes --no-cache-dir

Issue: Model loading slow

Use CPU offload for large models:

model = AutoModelForCausalLM.from_pretrained( "model-name", quantization_config=config, device_map="auto", max_memory={0: "20GB", "cpu": "30GB"} # Offload to CPU )

Issue: Lower accuracy than expected

Try 8-bit instead of 4-bit:

config = BitsAndBytesConfig(load_in_8bit=True)

8-bit has <0.5% accuracy loss vs 1-2% for 4-bit

Or use NF4 with double quantization:

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Better than fp4
    bnb_4bit_use_double_quant=True      # Extra accuracy
)
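
To confirm whether quantization is actually the cause, a quick perplexity spot-check on a few held-out texts can help (a minimal sketch, assuming the quantized model and tokenizer are already loaded; compare against the full-precision model if you can load it):

import torch

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under the model; a large jump after quantization signals quality loss."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog."))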

Issue: OOM even with 4-bit

Enable CPU offload:

model = AutoModelForCausalLM.from_pretrained( "model-name", quantization_config=config, device_map="auto", offload_folder="offload", # Disk offload offload_state_dict=True )

Advanced topics

QLoRA training guide: See references/qlora-training.md for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.

Quantization formats: See references/quantization-formats.md for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.

Memory optimization: See references/memory-optimization.md for CPU offloading strategies, gradient checkpointing, and memory profiling.

Hardware requirements

  • GPU: NVIDIA with compute capability 7.0+ (Turing, Ampere, Hopper)

  • VRAM: Depends on model and quantization

  • 4-bit Llama 2 7B: 4GB

  • 4-bit Llama 2 13B: 8GB

  • 4-bit Llama 2 70B: ~35GB (CPU offload needed on 24GB cards)

  • CUDA: 11.1+ (12.0+ recommended)

  • PyTorch: 2.0+

Supported platforms: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)
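
A quick way to check whether a machine meets these requirements (a sketch; values are read from the local PyTorch install):

import torch

# Compute capability >= (7, 0) covers the Turing/Ampere/Hopper-class GPUs listed above
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
print(f"CUDA (as built into PyTorch): {torch.version.cuda}")
print(f"PyTorch: {torch.__version__}")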
