ML Training Patterns
Purpose: Provide production-ready training templates, configuration files, and automation scripts for common ML training scenarios including classification, generation, fine-tuning, and PEFT/LoRA approaches.
Activation Triggers:
- Building text classification models (sentiment, intent, NER, etc.)
- Training text generation models (summarization, Q&A, chatbots)
- Fine-tuning pre-trained models for specific tasks
- Implementing PEFT (Parameter-Efficient Fine-Tuning) with LoRA
- Setting up training pipelines with HuggingFace Transformers
- Configuring training hyperparameters and optimization
- Preparing datasets for model training
Key Resources:
- scripts/setup-classification.sh - Classification training setup automation
- scripts/setup-generation.sh - Generation training setup automation
- scripts/setup-fine-tuning.sh - Full fine-tuning setup automation
- scripts/setup-peft.sh - PEFT/LoRA training setup automation
- templates/classification-config.yaml - Classification training configuration
- templates/generation-config.yaml - Generation training configuration
- templates/peft-config.json - PEFT/LoRA configuration
- examples/sentiment-classifier.md - Complete sentiment classification example
- examples/text-generator.md - Complete text generation example
Training Scenarios Overview
1. Text Classification
Use cases: Sentiment analysis, intent classification, topic categorization, spam detection, named entity recognition (NER)
Key characteristics:
- Input: Text → Output: Class label(s)
- Typically uses encoder models (BERT, RoBERTa, DistilBERT)
- Fast inference, suitable for production
- Requires labeled training data

Setup command:
```bash
./scripts/setup-classification.sh <project-name> <model-name> <num-classes>
```
Example:
```bash
./scripts/setup-classification.sh sentiment-model distilbert-base-uncased 3
```
2. Text Generation
Use cases: Summarization, question answering, chatbots, text completion, translation, code generation
Key characteristics:
- Input: Text (prompt) → Output: Generated text
- Uses decoder or encoder-decoder models (GPT-2, T5, BART)
- More computationally intensive
- Can be trained with or without labeled data

Setup command:
```bash
./scripts/setup-generation.sh <project-name> <model-name> <generation-type>
```
Example:
```bash
./scripts/setup-generation.sh qa-bot t5-small question-answering
```
3. Full Fine-Tuning
Use cases: When you have sufficient data and compute to retrain all model parameters
Key characteristics:
- Updates all model weights
- Requires significant compute (GPU with 16GB+ VRAM)
- Best for substantial domain adaptation
- Training time: hours to days

Setup command:
```bash
./scripts/setup-fine-tuning.sh <project-name> <model-name> <task-type>
```
Example:
```bash
./scripts/setup-fine-tuning.sh medical-classifier bert-base-uncased classification
```
4. PEFT (Parameter-Efficient Fine-Tuning)
Use cases: Limited compute resources, quick experimentation, domain adaptation with small datasets
Key characteristics:
- Only trains a small subset of parameters (LoRA adapters)
- 10-100x less memory than full fine-tuning
- Fast training (minutes to hours)
- Can fine-tune large models (7B+) on consumer GPUs
- Uses techniques like LoRA, QLoRA, Prefix Tuning, and Adapter Layers

Setup command:
```bash
./scripts/setup-peft.sh <project-name> <model-name> <peft-method>
```
Example:
```bash
./scripts/setup-peft.sh efficient-classifier roberta-base lora
```
Classification Training Pattern
Configuration Template
File: templates/classification-config.yaml
Key parameters:
```yaml
model:
  name: distilbert-base-uncased
  num_labels: 3
  task_type: classification

dataset:
  train_file: data/train.csv
  validation_file: data/val.csv
  test_file: data/test.csv
  text_column: text
  label_column: label

training:
  output_dir: ./outputs
  num_epochs: 3
  batch_size: 16
  learning_rate: 2e-5
  warmup_steps: 500
  weight_decay: 0.01
  evaluation_strategy: epoch
  save_strategy: epoch
  logging_steps: 100
  fp16: true  # Mixed precision training
  gradient_accumulation_steps: 1

optimizer:
  name: adamw
  betas: [0.9, 0.999]
  epsilon: 1e-8
```
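One way to wire this template into code is to load it and map its fields onto TrainingArguments. A minimal sketch, assuming the field names shown above and PyYAML installed (the setup scripts may wire this differently):

```python
import yaml
from transformers import TrainingArguments

with open('templates/classification-config.yaml') as f:
    cfg = yaml.safe_load(f)['training']

training_args = TrainingArguments(
    output_dir=cfg['output_dir'],
    num_train_epochs=cfg['num_epochs'],
    per_device_train_batch_size=cfg['batch_size'],
    learning_rate=float(cfg['learning_rate']),  # YAML may parse 2e-5 as a string
    warmup_steps=cfg['warmup_steps'],
    weight_decay=cfg['weight_decay'],
    evaluation_strategy=cfg['evaluation_strategy'],
    save_strategy=cfg['save_strategy'],
    logging_steps=cfg['logging_steps'],
    fp16=cfg['fp16'],
    gradient_accumulation_steps=cfg['gradient_accumulation_steps'],
)
```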
Training Pipeline
1. Dataset Preparation:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load from CSV
dataset = load_dataset('csv', data_files={
    'train': 'data/train.csv',
    'validation': 'data/val.csv',
    'test': 'data/test.csv'
})

# Preprocess: tokenize the text column
def preprocess(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=512
    )

dataset = dataset.map(preprocess, batched=True)
```
2. Model Initialization:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_classes,
    id2label={0: 'negative', 1: 'neutral', 2: 'positive'},
    label2id={'negative': 0, 'neutral': 1, 'positive': 2}
)
```
3. Training:
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./outputs',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    fp16=True,  # Enable mixed precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    compute_metrics=compute_metrics,  # defined in step 4 below
)

trainer.train()
```
4. Evaluation:
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted'
    )
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# Evaluate on the test set
results = trainer.evaluate(dataset['test'])
print(results)
```
Generation Training Pattern
Configuration Template
File: templates/generation-config.yaml
Key parameters:
```yaml
model:
  name: t5-small
  task_type: generation
  generation_type: question-answering  # or summarization, translation, etc.

dataset:
  train_file: data/train.json
  validation_file: data/val.json
  input_column: question
  target_column: answer
  max_input_length: 512
  max_target_length: 128

training:
  output_dir: ./outputs
  num_epochs: 5
  batch_size: 8
  learning_rate: 3e-4
  warmup_steps: 1000
  weight_decay: 0.01
  evaluation_strategy: steps
  eval_steps: 500
  save_steps: 500
  logging_steps: 100
  fp16: true
  gradient_accumulation_steps: 2
  predict_with_generate: true

generation:
  max_length: 128
  min_length: 10
  num_beams: 4
  length_penalty: 2.0
  early_stopping: true
  no_repeat_ngram_size: 3
```
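The generation block maps one-to-one onto HuggingFace generation parameters. A minimal sketch of reusing those values at inference time via GenerationConfig (assuming a recent Transformers release; this is not taken from the setup scripts):

```python
from transformers import GenerationConfig

# Mirrors the generation: section of the config above
gen_config = GenerationConfig(
    max_length=128,
    min_length=10,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

# Pass it to generate() instead of repeating individual kwargs:
# outputs = model.generate(**inputs, generation_config=gen_config)
```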
Training Pipeline
1. Dataset Preparation:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')

# Load from JSON (question-answer pairs)
dataset = load_dataset('json', data_files={
    'train': 'data/train.json',
    'validation': 'data/val.json'
})

# Preprocess for seq2seq
def preprocess(examples):
    inputs = tokenizer(
        examples['question'],
        max_length=512,
        truncation=True,
        padding='max_length'
    )
    # Tokenize targets (text_target replaces the deprecated
    # as_target_tokenizer() context manager)
    targets = tokenizer(
        text_target=examples['answer'],
        max_length=128,
        truncation=True,
        padding='max_length'
    )
    inputs['labels'] = targets['input_ids']
    return inputs

dataset = dataset.map(preprocess, batched=True)
```
2. Model & Training:
```python
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

training_args = Seq2SeqTrainingArguments(
    output_dir='./outputs',
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=3e-4,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
)

trainer.train()
```
3. Generation & Evaluation:
```python
# Generate predictions
def generate_answer(question):
    inputs = tokenizer(question, return_tensors='pt', max_length=512, truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test
question = "What is machine learning?"
answer = generate_answer(question)
print(f"Q: {question}\nA: {answer}")
```
PEFT/LoRA Training Pattern
Why PEFT/LoRA?
Traditional fine-tuning challenges:
- Requires updating all model parameters (millions to billions)
- High GPU memory requirements (often 40GB+ for 7B models)
- Slow training (hours to days)
- Risk of catastrophic forgetting

PEFT/LoRA benefits:
- Only trains ~0.1-1% of parameters (LoRA adapters)
- 10-100x less memory usage
- 3-10x faster training
- Can fine-tune 7B+ models on consumer GPUs (RTX 3090, 4090)
- Multiple task adapters for the same base model
Configuration Template
File: templates/peft-config.json
{ "peft_type": "LORA", "task_type": "SEQ_CLS", "inference_mode": false, "r": 8, "lora_alpha": 16, "lora_dropout": 0.1, "target_modules": [ "query", "key", "value", "dense" ], "bias": "none", "modules_to_save": ["classifier"] }
Key parameters:
- r: LoRA rank (lower = fewer parameters, typically 4-64)
- lora_alpha: Scaling factor (typically 2x the rank)
- lora_dropout: Dropout for LoRA layers (0.05-0.1)
- target_modules: Which layers to apply LoRA to (query, key, value, dense)
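To see how r drives adapter size: LoRA adds two low-rank matrices next to each frozen target layer, contributing r × (d_in + d_out) trainable parameters per layer. A small illustrative calculation (the 768×768 layer shape is a typical base-size encoder projection, assumed here for illustration):

```python
def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * d_in + d_out * r

# Example: a 768x768 attention projection
for r in (4, 8, 16):
    print(f"r={r:>2}: {lora_params_per_layer(768, 768, r):,} trainable params per layer")
# r= 4: 6,144 / r= 8: 12,288 / r=16: 24,576
# Doubling r doubles the adapter size; the base weights stay frozen either way.
```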
Training Pipeline
1. Install PEFT:
```bash
pip install peft
```
2. Setup PEFT Model:
```python
from transformers import AutoModelForSequenceClassification
from peft import get_peft_model, LoraConfig, TaskType

# Load the base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-base',
    num_labels=3
)

# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=['query', 'key', 'value', 'dense']
)

# Apply PEFT
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()
# Output: trainable params: 296,448 || all params: 124,940,546 || trainable%: 0.237%
```
3. Training:
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./peft_outputs',
    num_train_epochs=3,
    per_device_train_batch_size=16,  # Can use a larger batch size!
    learning_rate=1e-3,              # Higher learning rate for PEFT
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
)

trainer.train()
```
4. Save & Load Adapters:
```python
# Save only the LoRA adapters (tiny file, ~1-10MB)
model.save_pretrained('./lora_adapters')

# Load the adapters later
from peft import PeftModel

base_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=3)
model = PeftModel.from_pretrained(base_model, './lora_adapters')
```
QLoRA (Quantized LoRA)
For even more memory efficiency with large models:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    quantization_config=bnb_config,
    device_map='auto'
)

# Apply LoRA on top of the quantized model
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
# A 7B model can now be fine-tuned on a 16GB GPU
```
Setup Scripts Usage
Classification Setup
```bash
cd /home/gotime2022/.claude/plugins/marketplaces/ai-dev-marketplace/plugins/ml-training/skills/training-patterns
./scripts/setup-classification.sh my-classifier distilbert-base-uncased 3
```
Creates:
- Project directory structure
- Training script with Trainer API
- Configuration file (classification-config.yaml)
- Dataset preparation script
- requirements.txt
- README with instructions

Arguments:
- project-name: Name of the training project
- model-name: HuggingFace model identifier
- num-classes: Number of classification labels
Generation Setup
```bash
./scripts/setup-generation.sh my-generator t5-small summarization
```
Creates:
- Seq2Seq training pipeline
- Generation configuration
- Dataset processing for input-target pairs
- Evaluation with ROUGE/BLEU metrics
- Inference script

Arguments:
- project-name: Name of the training project
- model-name: HuggingFace model identifier
- generation-type: summarization, question-answering, translation, etc.
Fine-Tuning Setup
```bash
./scripts/setup-fine-tuning.sh domain-model bert-base-uncased classification
```
Creates:
- Full fine-tuning pipeline
- GPU memory optimization configs
- Gradient checkpointing setup
- Mixed precision training
- Model checkpointing strategy

Arguments:
- project-name: Name of the training project
- model-name: HuggingFace model identifier
- task-type: classification or generation
PEFT Setup
```bash
./scripts/setup-peft.sh efficient-trainer roberta-base lora
```
Creates:
- PEFT training pipeline with LoRA
- Adapter configuration
- Memory-efficient training setup
- Adapter save/load utilities
- Multi-adapter management

Arguments:
- project-name: Name of the training project
- model-name: HuggingFace model identifier
- peft-method: lora, qlora, prefix-tuning, or adapter
Dataset Formats
Classification Dataset (CSV)
```csv
text,label
"This product is amazing!",positive
"Terrible experience",negative
"It's okay, nothing special",neutral
```
Generation Dataset (JSON)
[ { "question": "What is the capital of France?", "answer": "The capital of France is Paris." }, { "question": "How does photosynthesis work?", "answer": "Photosynthesis is the process where plants convert light energy into chemical energy..." } ]
HuggingFace Datasets Integration
```python
from datasets import load_dataset

# Load from the HuggingFace Hub
dataset = load_dataset('glue', 'sst2')            # Sentiment classification
dataset = load_dataset('squad')                   # Question answering
dataset = load_dataset('cnn_dailymail', '3.0.0')  # Summarization

# Load local files
dataset = load_dataset('csv', data_files='data.csv')
dataset = load_dataset('json', data_files='data.json')
```
Training Best Practices
1. Hyperparameter Selection
Learning Rate:
- Full fine-tuning: 1e-5 to 5e-5
- PEFT/LoRA: 1e-4 to 1e-3 (can be higher)
- Rule of thumb: start with 2e-5 for full fine-tuning, 3e-4 for PEFT

Batch Size:
- As large as GPU memory allows
- Use gradient accumulation if batch size is limited
- Effective batch size = batch_size × gradient_accumulation_steps

Epochs:
- Classification: 3-5 epochs
- Generation: 5-10 epochs
- Watch validation metrics for overfitting

Warmup Steps:
- Typically ~10% of total training steps (see the sketch below)
- Helps stabilize training initially
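A minimal sketch of turning these rules of thumb into concrete numbers (the dataset size and batch settings are hypothetical; plug in your own):

```python
# Hypothetical training setup for illustration
num_examples = 10_000
num_epochs = 3
per_device_batch_size = 16
gradient_accumulation_steps = 2

# Effective batch size = batch_size × gradient_accumulation_steps
effective_batch_size = per_device_batch_size * gradient_accumulation_steps  # 32

# Optimizer steps per epoch, then total over the whole run
steps_per_epoch = num_examples // effective_batch_size  # 312
total_steps = steps_per_epoch * num_epochs              # 936

# Warmup for ~10% of total training steps
warmup_steps = int(0.1 * total_steps)                   # 93
```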
2. GPU Memory Optimization
Techniques:
- Mixed precision (fp16/bf16): 2x memory reduction
- Gradient checkpointing: 30-50% memory reduction (slower training)
- Gradient accumulation: simulate larger batch sizes
- PEFT/LoRA: 10-100x memory reduction
- 8-bit/4-bit quantization: 2-4x memory reduction
Example:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./outputs',         # required argument
    fp16=True,                      # Mixed precision
    gradient_checkpointing=True,    # Memory optimization
    gradient_accumulation_steps=4,  # Effective batch size = 4 × per-device batch
    per_device_train_batch_size=4,  # Small batch per GPU
)
```
3. Monitoring Training
Track these metrics:
- Training loss (should decrease steadily)
- Validation loss (should decrease, not increase)
- Validation accuracy/F1/ROUGE (should increase)
- Learning rate schedule
- GPU memory usage
Use Weights & Biases:
training_args = TrainingArguments( report_to='wandb', run_name='my-training-run', )
4. Early Stopping
```python
from transformers import EarlyStoppingCallback

# Requires evaluation during training and load_best_model_at_end=True
# in TrainingArguments for the patience check to apply
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```
5. Model Checkpointing
```python
training_args = TrainingArguments(
    output_dir='./outputs',
    save_strategy='epoch',        # Save after each epoch
    save_total_limit=3,           # Keep at most 3 checkpoints on disk
    load_best_model_at_end=True,  # Load the best checkpoint after training
    metric_for_best_model='f1',   # Choose the best model by F1 score
)
```
Common Training Patterns
Pattern 1: Quick Experimentation (PEFT)
- When: Testing ideas, limited compute, small datasets
- Approach: LoRA with a small rank (r=4-8)
- Time: minutes to 1 hour
- Memory: can fine-tune 7B models on a 16GB GPU
Pattern 2: Production Classification (Full Fine-Tuning)
- When: Production deployment with sufficient labeled data
- Approach: full fine-tuning with early stopping
- Time: 1-6 hours
- Memory: 16GB GPU for base models (110M-340M params)
Pattern 3: Domain Adaptation (PEFT + Full)
- When: Adapting to a specific domain, then task-specific fine-tuning
- Approach:
  1. PEFT on domain data (unlabeled or weakly labeled)
  2. Full fine-tuning on task data (labeled)
- Time: 2-12 hours total
- Memory: 16-40GB GPU
Pattern 4: Multi-Task Learning (Multiple LoRA Adapters)
- When: One model serves multiple tasks
- Approach: train separate LoRA adapters per task and swap them at inference (see the sketch below)
- Time: 1-3 hours per task
- Memory: 16GB GPU; adapters are tiny (1-10MB each)
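A minimal sketch of the adapter-swapping flow with the PEFT API (the adapter paths and names are hypothetical, and each task's head must match the base model's configuration):

```python
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

base_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

# Attach the first task's adapter, then register a second one under its own name
model = PeftModel.from_pretrained(base_model, './adapters/sentiment', adapter_name='sentiment')
model.load_adapter('./adapters/topic', adapter_name='topic')

# Switch the active adapter per request; the frozen base weights are shared
model.set_adapter('sentiment')
# ... run sentiment predictions ...
model.set_adapter('topic')
# ... run topic predictions ...
```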
Troubleshooting
Out of Memory (OOM) Errors:
- Reduce the batch size
- Enable gradient checkpointing
- Use gradient accumulation
- Switch to PEFT/LoRA
- Use 8-bit quantization
Training Not Converging:
- Lower the learning rate
- Increase warmup steps
- Check data quality and preprocessing
- Verify labels are correct
- Try a different optimizer (AdamW vs SGD)
Overfitting:
- Add dropout
- Use weight decay
- Get more training data
- Use data augmentation
- Use early stopping
Slow Training:
- Enable fp16 mixed precision
- Increase the batch size
- Reduce gradient accumulation steps
- Disable gradient checkpointing
- Use a faster model variant (DistilBERT vs BERT)
Poor Evaluation Metrics:
- Check the data distribution (train vs val vs test)
- Verify preprocessing is consistent
- Try a different model architecture
- Increase model size or training time
- Check for label imbalance (see the class-weight sketch below)
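When label imbalance is the problem, a common fix is weighting the loss by inverse class frequency. A minimal sketch against the classification setup earlier in this document (the WeightedLossTrainer subclass is illustrative, not part of the setup scripts):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

# Inverse-frequency weights computed from the training labels
train_labels = np.array(dataset['train']['label'])
weights = compute_class_weight('balanced', classes=np.unique(train_labels), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop('labels')
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```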
Supported Models:
- Classification: BERT, RoBERTa, DistilBERT, ALBERT, DeBERTa
- Generation: T5, BART, GPT-2, Llama-2, Mistral, Phi
- PEFT: compatible with all transformer models

Requirements:
- Python 3.11+
- PyTorch 2.0+
- Transformers 4.30+
- PEFT 0.7+ (for LoRA)
- Datasets 2.14+
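A one-line install matching these minimums (a sketch; pin exact versions in your project's requirements.txt):

```bash
pip install "torch>=2.0" "transformers>=4.30" "peft>=0.7" "datasets>=2.14" scikit-learn
```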
Best Practice: Start with PEFT/LoRA for quick iteration, and switch to full fine-tuning only when necessary.