training-patterns


Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To install the "training-patterns" skill, copy this command (or send it to your AI assistant):

npx skills add vanman2024/ai-dev-marketplace/vanman2024-ai-dev-marketplace-training-patterns

ML Training Patterns

Purpose: Provide production-ready training templates, configuration files, and automation scripts for common ML training scenarios including classification, generation, fine-tuning, and PEFT/LoRA approaches.

Activation Triggers:

  • Building text classification models (sentiment, intent, NER, etc.)

  • Training text generation models (summarization, Q&A, chatbots)

  • Fine-tuning pre-trained models for specific tasks

  • Implementing PEFT (Parameter-Efficient Fine-Tuning) with LoRA

  • Setting up training pipelines with HuggingFace Transformers

  • Configuring training hyperparameters and optimization

  • Preparing datasets for model training

Key Resources:

  • scripts/setup-classification.sh

  • Classification training setup automation

  • scripts/setup-generation.sh

  • Generation training setup automation

  • scripts/setup-fine-tuning.sh

  • Full fine-tuning setup automation

  • scripts/setup-peft.sh

  • PEFT/LoRA training setup automation

  • templates/classification-config.yaml

  • Classification training configuration

  • templates/generation-config.yaml

  • Generation training configuration

  • templates/peft-config.json

  • PEFT/LoRA configuration

  • examples/sentiment-classifier.md

  • Complete sentiment classification example

  • examples/text-generator.md

  • Complete text generation example

Training Scenarios Overview

  1. Text Classification

Use cases: Sentiment analysis, intent classification, topic categorization, spam detection, named entity recognition (NER)

Key characteristics:

  • Input: Text → Output: Class label(s)

  • Typically uses encoder models (BERT, RoBERTa, DistilBERT)

  • Fast inference, suitable for production

  • Requires labeled training data

Setup command:

./scripts/setup-classification.sh <project-name> <model-name> <num-classes>

Example:

./scripts/setup-classification.sh sentiment-model distilbert-base-uncased 3

  2. Text Generation

Use cases: Summarization, question answering, chatbots, text completion, translation, code generation

Key characteristics:

  • Input: Text (prompt) → Output: Generated text

  • Uses decoder or encoder-decoder models (GPT-2, T5, BART)

  • More computationally intensive

  • Can be trained with or without labeled data

Setup command:

./scripts/setup-generation.sh <project-name> <model-name> <generation-type>

Example:

./scripts/setup-generation.sh qa-bot t5-small question-answering

  3. Full Fine-Tuning

Use cases: When you have sufficient data and compute to retrain all model parameters

Key characteristics:

  • Updates all model weights

  • Requires significant compute (GPU with 16GB+ VRAM)

  • Best for substantial domain adaptation

  • Training time: hours to days

Setup command:

./scripts/setup-fine-tuning.sh <project-name> <model-name> <task-type>

Example:

./scripts/setup-fine-tuning.sh medical-classifier bert-base-uncased classification

  4. PEFT (Parameter-Efficient Fine-Tuning)

Use cases: Limited compute resources, quick experimentation, domain adaptation with small datasets

Key characteristics:

  • Only trains a small subset of parameters (LoRA adapters)

  • 10-100x less memory than full fine-tuning

  • Fast training (minutes to hours)

  • Can fine-tune large models (7B+) on consumer GPUs

  • Uses techniques like LoRA, QLoRA, Prefix Tuning, Adapter Layers

Setup command:

./scripts/setup-peft.sh <project-name> <model-name> <peft-method>

Example:

./scripts/setup-peft.sh efficient-classifier roberta-base lora

Classification Training Pattern

Configuration Template

File: templates/classification-config.yaml

Key parameters:

model:
  name: distilbert-base-uncased
  num_labels: 3
  task_type: classification

dataset:
  train_file: data/train.csv
  validation_file: data/val.csv
  test_file: data/test.csv
  text_column: text
  label_column: label

training:
  output_dir: ./outputs
  num_epochs: 3
  batch_size: 16
  learning_rate: 2e-5
  warmup_steps: 500
  weight_decay: 0.01
  evaluation_strategy: epoch
  save_strategy: epoch
  logging_steps: 100
  fp16: true  # Mixed precision training
  gradient_accumulation_steps: 1

optimizer:
  name: adamw
  betas: [0.9, 0.999]
  epsilon: 1e-8

Training Pipeline

  1. Dataset Preparation:

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer matching the model named in the config
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Load from CSV
dataset = load_dataset('csv', data_files={
    'train': 'data/train.csv',
    'validation': 'data/val.csv',
    'test': 'data/test.csv'
})

# Preprocess
def preprocess(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=512
    )

dataset = dataset.map(preprocess, batched=True)

  2. Model Initialization:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_classes,
    id2label={0: 'negative', 1: 'neutral', 2: 'positive'},
    label2id={'negative': 0, 'neutral': 1, 'positive': 2}
)

  3. Training:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./outputs',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    fp16=True,  # Enable mixed precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    compute_metrics=compute_metrics,
)

trainer.train()

  4. Evaluation:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted'
    )

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# Evaluate on test set
results = trainer.evaluate(dataset['test'])
print(results)
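
A quick smoke test of the trained classifier is often worth adding. The following is a minimal inference sketch, assuming trainer.save_model('./outputs') and tokenizer.save_pretrained('./outputs') were run after training; the path and labels are illustrative, not produced by the scripts above:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Illustrative path -- adjust to wherever the model and tokenizer were saved
model = AutoModelForSequenceClassification.from_pretrained('./outputs')
tokenizer = AutoTokenizer.from_pretrained('./outputs')

def predict(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    label_id = logits.argmax(dim=-1).item()
    return model.config.id2label[label_id]

print(predict("This product is amazing!"))  # expected: 'positive'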

Generation Training Pattern

Configuration Template

File: templates/generation-config.yaml

Key parameters:

model:
  name: t5-small
  task_type: generation
  generation_type: question-answering  # or summarization, translation, etc.

dataset:
  train_file: data/train.json
  validation_file: data/val.json
  input_column: question
  target_column: answer
  max_input_length: 512
  max_target_length: 128

training:
  output_dir: ./outputs
  num_epochs: 5
  batch_size: 8
  learning_rate: 3e-4
  warmup_steps: 1000
  weight_decay: 0.01
  evaluation_strategy: steps
  eval_steps: 500
  save_steps: 500
  logging_steps: 100
  fp16: true
  gradient_accumulation_steps: 2
  predict_with_generate: true

generation:
  max_length: 128
  min_length: 10
  num_beams: 4
  length_penalty: 2.0
  early_stopping: true
  no_repeat_ngram_size: 3

Training Pipeline

  1. Dataset Preparation:

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer matching the model named in the config
tokenizer = AutoTokenizer.from_pretrained('t5-small')

# Load from JSON (question-answer pairs)
dataset = load_dataset('json', data_files={
    'train': 'data/train.json',
    'validation': 'data/val.json'
})

# Preprocess for seq2seq
def preprocess(examples):
    inputs = tokenizer(
        examples['question'],
        max_length=512,
        truncation=True,
        padding='max_length'
    )

    # Tokenize targets
    with tokenizer.as_target_tokenizer():
        targets = tokenizer(
            examples['answer'],
            max_length=128,
            truncation=True,
            padding='max_length'
        )

    inputs['labels'] = targets['input_ids']
    return inputs

dataset = dataset.map(preprocess, batched=True)

  2. Model & Training:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

training_args = Seq2SeqTrainingArguments(
    output_dir='./outputs',
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=3e-4,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
)

trainer.train()

  3. Generation & Evaluation:

# Generate predictions
def generate_answer(question):
    inputs = tokenizer(question, return_tensors='pt', max_length=512, truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test
question = "What is machine learning?"
answer = generate_answer(question)
print(f"Q: {question}\nA: {answer}")

PEFT/LoRA Training Pattern

Why PEFT/LoRA?

Traditional fine-tuning challenges:

  • Requires updating all model parameters (millions to billions)

  • High GPU memory requirements (often 40GB+ for 7B models)

  • Slow training (hours to days)

  • Risk of catastrophic forgetting

PEFT/LoRA benefits:

  • Only trains ~0.1-1% of parameters (LoRA adapters)

  • 10-100x less memory usage

  • 3-10x faster training

  • Can fine-tune 7B+ models on consumer GPUs (RTX 3090, 4090)

  • Multiple task adapters for same base model

Configuration Template

File: templates/peft-config.json

{ "peft_type": "LORA", "task_type": "SEQ_CLS", "inference_mode": false, "r": 8, "lora_alpha": 16, "lora_dropout": 0.1, "target_modules": [ "query", "key", "value", "dense" ], "bias": "none", "modules_to_save": ["classifier"] }

Key parameters:

  • r : LoRA rank (lower = fewer parameters, typically 4-64; see the sketch after this list)

  • lora_alpha : Scaling factor (typically 2x rank)

  • lora_dropout : Dropout for LoRA layers (0.05-0.1)

  • target_modules : Which layers to apply LoRA (query, key, value, dense)
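
To get a feel for how r drives adapter size, here is a back-of-the-envelope sketch. The 768x768 projection shape is illustrative (typical of base-size encoder models); the exact count reported by print_trainable_parameters also depends on which modules PEFT matches in the chosen architecture and on modules_to_save:

# A LoRA adapter on a weight of shape (d_out, d_in) adds r * (d_in + d_out) parameters
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

# One 768x768 attention projection at different ranks
for r in (4, 8, 16, 64):
    print(r, lora_params(768, 768, r))  # 6144, 12288, 24576, 98304 parameters per adapted layer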

Training Pipeline

  1. Install PEFT:

pip install peft

  2. Setup PEFT Model:

from transformers import AutoModelForSequenceClassification
from peft import get_peft_model, LoraConfig, TaskType

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-base',
    num_labels=3
)

# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=['query', 'key', 'value', 'dense']
)

# Apply PEFT
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()
# Output: trainable params: 296,448 || all params: 124,940,546 || trainable%: 0.237%

  3. Training:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./peft_outputs',
    num_train_epochs=3,
    per_device_train_batch_size=16,  # Can use a larger batch size!
    learning_rate=1e-3,              # Higher learning rate for PEFT
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
)

trainer.train()

  4. Save & Load Adapters:

# Save only the LoRA adapters (tiny file, ~1-10MB)
model.save_pretrained('./lora_adapters')

# Load the adapters later
from peft import PeftModel

base_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=3)
model = PeftModel.from_pretrained(base_model, './lora_adapters')

QLoRA (Quantized LoRA)

For even more memory efficiency with large models:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, get_peft_model

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    quantization_config=bnb_config,
    device_map='auto'
)

# Apply LoRA on top of the quantized model
# (peft_config as above, but with task_type=TaskType.CAUSAL_LM for a causal LM)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

# A 7B model can now be fine-tuned on a 16GB GPU

Setup Scripts Usage

Classification Setup

cd /home/gotime2022/.claude/plugins/marketplaces/ai-dev-marketplace/plugins/ml-training/skills/training-patterns
./scripts/setup-classification.sh my-classifier distilbert-base-uncased 3

Creates:

  • Project directory structure

  • Training script with Trainer API

  • Configuration file (classification-config.yaml)

  • Dataset preparation script

  • requirements.txt

  • README with instructions

Arguments:

  • project-name : Name of training project

  • model-name : HuggingFace model identifier

  • num-classes : Number of classification labels

Generation Setup

./scripts/setup-generation.sh my-generator t5-small summarization

Creates:

  • Seq2Seq training pipeline

  • Generation configuration

  • Dataset processing for input-target pairs

  • Evaluation with ROUGE/BLEU metrics

  • Inference script

Arguments:

  • project-name : Name of training project

  • model-name : HuggingFace model identifier

  • generation-type : summarization, question-answering, translation, etc.

Fine-Tuning Setup

./scripts/setup-fine-tuning.sh domain-model bert-base-uncased classification

Creates:

  • Full fine-tuning pipeline

  • GPU memory optimization configs

  • Gradient checkpointing setup

  • Mixed precision training

  • Model checkpointing strategy

Arguments:

  • project-name : Name of training project

  • model-name : HuggingFace model identifier

  • task-type : classification or generation

PEFT Setup

./scripts/setup-peft.sh efficient-trainer roberta-base lora

Creates:

  • PEFT training pipeline with LoRA

  • Adapter configuration

  • Memory-efficient training setup

  • Adapter save/load utilities

  • Multi-adapter management

Arguments:

  • project-name : Name of training project

  • model-name : HuggingFace model identifier

  • peft-method : lora, qlora, prefix-tuning, or adapter

Dataset Formats

Classification Dataset (CSV)

text,label
"This product is amazing!",positive
"Terrible experience",negative
"It's okay, nothing special",neutral

Generation Dataset (JSON)

[ { "question": "What is the capital of France?", "answer": "The capital of France is Paris." }, { "question": "How does photosynthesis work?", "answer": "Photosynthesis is the process where plants convert light energy into chemical energy..." } ]

HuggingFace Datasets Integration

from datasets import load_dataset

# Load from the HuggingFace Hub
dataset = load_dataset('glue', 'sst2')            # Sentiment classification
dataset = load_dataset('squad')                   # Question answering
dataset = load_dataset('cnn_dailymail', '3.0.0')  # Summarization

# Load local files
dataset = load_dataset('csv', data_files='data.csv')
dataset = load_dataset('json', data_files='data.json')

Training Best Practices

  1. Hyperparameter Selection

Learning Rate:

  • Full fine-tuning: 1e-5 to 5e-5

  • PEFT/LoRA: 1e-4 to 1e-3 (can be higher)

  • Rule of thumb: Start with 2e-5 for full, 3e-4 for PEFT

Batch Size:

  • As large as GPU memory allows

  • Use gradient accumulation if batch size limited

  • Effective batch size = batch_size × gradient_accumulation_steps

Epochs:

  • Classification: 3-5 epochs

  • Generation: 5-10 epochs

  • Watch for overfitting with validation metrics

Warmup Steps:

  • Typically 10% of total training steps (see the sketch after this list)

  • Helps stabilize training initially
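
To turn these rules of thumb into concrete numbers, a minimal sketch (the dataset size below is made up for illustration):

# Hypothetical run: 10,000 training examples, batch size 16, accumulation 2, 3 epochs
num_examples = 10_000
batch_size = 16
grad_accum = 2
epochs = 3

effective_batch = batch_size * grad_accum          # 32
steps_per_epoch = num_examples // effective_batch  # 312
total_steps = steps_per_epoch * epochs             # 936
warmup_steps = int(0.10 * total_steps)             # 93, i.e. ~10% of total steps
print(effective_batch, total_steps, warmup_steps)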

  2. GPU Memory Optimization

Techniques:

  • Mixed precision (fp16/bf16): 2x memory reduction

  • Gradient checkpointing: 30-50% memory reduction (slower training)

  • Gradient accumulation: Simulate larger batch sizes

  • PEFT/LoRA: 10-100x memory reduction

  • 8-bit/4-bit quantization: 2-4x memory reduction

Example:

from transformers import TrainingArguments

training_args = TrainingArguments(
    fp16=True,                      # Mixed precision
    gradient_checkpointing=True,    # Memory optimization
    gradient_accumulation_steps=4,  # Effective batch size x 4
    per_device_train_batch_size=4,  # Small batch per GPU
)

  3. Monitoring Training

Track these metrics:

  • Training loss (should decrease steadily)

  • Validation loss (should decrease, not increase)

  • Validation accuracy/F1/ROUGE (should increase)

  • Learning rate schedule

  • GPU memory usage

Use Weights & Biases:

training_args = TrainingArguments(
    report_to='wandb',
    run_name='my-training-run',
)

  4. Early Stopping

from transformers import EarlyStoppingCallback

# Requires an evaluation strategy and load_best_model_at_end=True in TrainingArguments
trainer = Trainer(
    # ... model, args, and datasets as in the earlier examples ...
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

  5. Model Checkpointing

training_args = TrainingArguments(
    save_strategy='epoch',        # Save after each epoch
    save_total_limit=3,           # Keep at most 3 checkpoints
    load_best_model_at_end=True,  # Load the best model after training
    metric_for_best_model='f1',   # Choose best by F1 score
)

Common Training Patterns

Pattern 1: Quick Experimentation (PEFT)

When: Testing ideas, limited compute, small datasets
Approach: LoRA with small rank (r=4-8)
Time: Minutes to 1 hour
Memory: Can fine-tune 7B models on a 16GB GPU

Pattern 2: Production Classification (Full Fine-Tuning)

When: Production deployment, sufficient labeled data
Approach: Full fine-tuning with early stopping
Time: 1-6 hours
Memory: 16GB GPU for base models (110M-340M params)

Pattern 3: Domain Adaptation (PEFT + Full)

When: Adapting to a specific domain, then task-specific fine-tuning
Approach:

  • PEFT on domain data (unlabeled or weakly labeled)

  • Full fine-tuning on task data (labeled)

Time: 2-12 hours total
Memory: 16-40GB GPU

Pattern 4: Multi-Task Learning (Multiple LoRA Adapters)

When: One model for multiple tasks
Approach: Train separate LoRA adapters per task, swap at inference (see the sketch below)
Time: 1-3 hours per task
Memory: 16GB GPU; adapters are tiny (1-10MB each)
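
A hedged sketch of the adapter-swapping idea, assuming two LoRA adapters were trained against the same base model and saved separately (the adapter paths and names below are hypothetical):

from transformers import AutoModelForSequenceClassification
from peft import PeftModel

base_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

# Attach two separately trained adapters to the same base model
model = PeftModel.from_pretrained(base_model, './adapters/sentiment', adapter_name='sentiment')
model.load_adapter('./adapters/topic', adapter_name='topic')

# Switch the active adapter at inference time
model.set_adapter('sentiment')
# ... run sentiment predictions ...
model.set_adapter('topic')
# ... run topic predictions ...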

Troubleshooting

Out of Memory (OOM) Errors:

  • Reduce batch size

  • Enable gradient checkpointing

  • Use gradient accumulation

  • Switch to PEFT/LoRA

  • Use 8-bit quantization (see the sketch below)
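
For the 8-bit option, a minimal loading sketch with bitsandbytes (the model name is illustrative; combine with LoRA as in the QLoRA section above):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Loads weights in 8-bit, roughly halving memory versus fp16
model = AutoModelForCausalLM.from_pretrained(
    'gpt2',
    quantization_config=bnb_config,
    device_map='auto'
)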

Training Not Converging:

  • Lower learning rate

  • Increase warmup steps

  • Check data quality and preprocessing

  • Verify labels are correct

  • Try different optimizer (AdamW vs SGD)

Overfitting:

  • Add dropout

  • Use weight decay

  • Get more training data

  • Use data augmentation

  • Early stopping

Slow Training:

  • Enable fp16 mixed precision

  • Increase batch size

  • Reduce gradient accumulation steps

  • Disable gradient checkpointing

  • Use faster model variant (distilbert vs bert)

Poor Evaluation Metrics:

  • Check data distribution (train vs val vs test)

  • Verify preprocessing is consistent

  • Try different model architecture

  • Increase model size or training time

  • Check for label imbalance (see the sketch below)
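
For the label-imbalance check, a small sketch using pandas on the CSV layout shown in Dataset Formats (the column names assume the text,label schema above):

import pandas as pd

train = pd.read_csv('data/train.csv')

# Per-class counts and proportions -- large skews suggest re-sampling or class weights
counts = train['label'].value_counts()
print(counts)
print((counts / counts.sum()).round(3))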

Supported Models:

  • Classification: BERT, RoBERTa, DistilBERT, ALBERT, DeBERTa

  • Generation: T5, BART, GPT-2, Llama-2, Mistral, Phi

  • PEFT: Compatible with all transformer models

Requirements:

  • Python 3.11+

  • PyTorch 2.0+

  • Transformers 4.30+

  • PEFT 0.7+ (for LoRA)

  • Datasets 2.14+

Best Practice: Start with PEFT/LoRA for quick iteration, switch to full fine-tuning only when necessary
