LLM Fine-Tuning Infrastructure

Train and fine-tune open-source LLMs efficiently — from LoRA on a single GPU to distributed full fine-tuning across multi-node clusters.

When to Use This Skill

Use this skill when:

Fine-tuning an LLM on domain-specific data (legal, medical, code, support)
Running QLoRA to fine-tune 70B models on consumer GPUs
Setting up distributed training with DeepSpeed or FSDP
Exporting fine-tuned adapters for production serving
Implementing RLHF, DPO, or instruction tuning pipelines

Prerequisites

NVIDIA GPU(s) with 24GB+ VRAM (RTX 4090 / A100 / H100)
CUDA 12.1+ and nvidia-smi working
Python 3.10+ with pip
Hugging Face account and HF_TOKEN for gated models
500GB+ disk for model weights and training data

Quick Start: QLoRA Fine-Tuning

pip install transformers datasets trl peft bitsandbytes accelerate

python - <<'EOF' from datasets import load_dataset from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig from peft import LoraConfig, get_peft_model from trl import SFTTrainer, SFTConfig import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

4-bit quantization (QLoRA)

bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, )

model = AutoModelForCausalLM.from_pretrained( model_id, quantization_config=bnb_config, device_map="auto" ) tokenizer = AutoTokenizer.from_pretrained(model_id)

LoRA configuration

peft_config = LoraConfig( r=16, # rank lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM", )

dataset = load_dataset("your-org/your-dataset", split="train")

trainer = SFTTrainer( model=model, args=SFTConfig( output_dir="./output", num_train_epochs=3, per_device_train_batch_size=2, gradient_accumulation_steps=8, learning_rate=2e-4, bf16=True, logging_steps=10, save_strategy="epoch", report_to="wandb", ), train_dataset=dataset, peft_config=peft_config, processing_class=tokenizer, ) trainer.train() trainer.save_model("./fine-tuned-model") EOF

Axolotl (Production Fine-Tuning Framework)

config.yaml — Axolotl QLoRA config for Llama 3.1

base_model: meta-llama/Llama-3.1-8B-Instruct model_type: LlamaForCausalLM tokenizer_type: PreTrainedTokenizerFast

load_in_4bit: true adapter: qlora lora_r: 32 lora_alpha: 64 lora_dropout: 0.05 lora_target_modules:

q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj

datasets:

path: your-org/your-dataset type: alpaca # or sharegpt, chat_template, etc.

dataset_prepared_path: ./prepared-data val_set_size: 0.05 output_dir: ./output

sequence_len: 4096 sample_packing: true # pack multiple short samples for efficiency

micro_batch_size: 2 gradient_accumulation_steps: 8 num_epochs: 3 learning_rate: 2e-4 optimizer: adamw_bnb_8bit lr_scheduler: cosine warmup_ratio: 0.05

bf16: true flash_attention: true

logging_steps: 10 eval_steps: 100 save_steps: 200 wandb_project: my-fine-tune

Run with Axolotl

pip install axolotl[flash-attn,deepspeed] accelerate launch -m axolotl.cli.train config.yaml

Distributed Training with DeepSpeed

// deepspeed_zero3.json — ZeRO Stage 3 (split optimizer + gradients + params) { "zero_optimization": { "stage": 3, "offload_optimizer": {"device": "cpu", "pin_memory": true}, "offload_param": {"device": "cpu", "pin_memory": true}, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "gather_16bit_weights_on_model_save": true }, "bf16": {"enabled": true}, "gradient_clipping": 1.0, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto" }

Launch 4-GPU DeepSpeed training

deepspeed --num_gpus=4 train.py
--deepspeed deepspeed_zero3.json
--model_name meta-llama/Llama-3.1-70B-Instruct
--output_dir ./output

DPO / RLHF Alignment

from trl import DPOTrainer, DPOConfig from datasets import load_dataset

Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}

dataset = load_dataset("your-org/preference-data")

trainer = DPOTrainer( model=model, ref_model=None, # None = implicit reference with peft args=DPOConfig( output_dir="./dpo-output", beta=0.1, # KL divergence weight num_train_epochs=1, per_device_train_batch_size=1, gradient_accumulation_steps=16, learning_rate=5e-7, bf16=True, ), train_dataset=dataset["train"], peft_config=peft_config, processing_class=tokenizer, ) trainer.train()

Merging LoRA Adapters for Deployment

from peft import PeftModel from transformers import AutoModelForCausalLM

Load base model in full precision

base_model = AutoModelForCausalLM.from_pretrained( "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="cpu", )

Load and merge LoRA adapter

model = PeftModel.from_pretrained(base_model, "./fine-tuned-model") merged_model = model.merge_and_unload()

Save merged model (ready for vLLM serving)

merged_model.save_pretrained("./merged-model", safe_serialization=True) tokenizer.save_pretrained("./merged-model")

Push to Hugging Face Hub

merged_model.push_to_hub("your-org/your-fine-tuned-model")

Kubernetes Training Job

apiVersion: batch/v1 kind: Job metadata: name: llm-fine-tune spec: template: spec: restartPolicy: OnFailure nodeSelector: nvidia.com/gpu.product: A100-SXM4-80GB containers: - name: trainer image: nvcr.io/nvidia/pytorch:24.05-py3 command: ["accelerate", "launch", "-m", "axolotl.cli.train", "/config/config.yaml"] resources: limits: nvidia.com/gpu: "4" memory: "320Gi" requests: nvidia.com/gpu: "4" volumeMounts: - name: config mountPath: /config - name: model-cache mountPath: /root/.cache/huggingface - name: output mountPath: /output env: - name: HUGGING_FACE_HUB_TOKEN valueFrom: secretKeyRef: name: hf-token key: token - name: WANDB_API_KEY valueFrom: secretKeyRef: name: wandb-token key: key volumes: - name: config configMap: name: axolotl-config - name: model-cache persistentVolumeClaim: claimName: model-cache-pvc - name: output persistentVolumeClaim: claimName: training-output-pvc

Common Issues

Issue Cause Fix

CUDA out of memory

Batch too large Reduce micro_batch_size ; increase gradient_accumulation_steps

Training loss NaN Learning rate too high Lower LR to 1e-4 or 5e-5 ; add warmup

Slow training No Flash Attention Install flash-attn ; enable flash_attention: true

Poor fine-tune quality Bad data formatting Validate dataset format; check sample_packing compatibility

Adapter merge errors Mixed quantization Merge in bf16 on CPU, not in 4-bit

Best Practices

Use Flash Attention 2 — it's 2–4× faster and uses less memory.
Monitor training loss/eval loss via W&B or MLflow; overfit = more dropout or less data.
Validate with a held-out eval set (5–10%); MMLU or custom evals for quality gates.
Start with LoRA r=16 before increasing — higher rank = more parameters, diminishing returns.
Use sample_packing in Axolotl to maximize GPU utilization on short sequences.

Related Skills

vllm-server - Serve fine-tuned models
gpu-server-management - GPU setup
llm-inference-scaling - Deploy at scale
ai-pipeline-orchestration - Training pipelines

llm-fine-tuning

Safety Notice

Copy this and send it to your AI assistant to learn