
Model Merging: Combining Pre-trained Models


When to Use This Skill

Use Model Merging when you need to:

  • Combine capabilities from multiple fine-tuned models without retraining

  • Create specialized models by blending domain-specific expertise (math + coding + chat)

  • Improve performance beyond single models (often +5-10% on benchmarks)

  • Reduce training costs - no GPUs needed, merges run on CPU

  • Experiment rapidly - create new model variants in minutes, not days

  • Preserve multiple skills - merge without catastrophic forgetting

Success stories: Marcoro14-7B-slerp was a top model on the Open LLM Leaderboard (Feb 2024), and many leading models on the Hugging Face Hub are merges.

Tools: mergekit (Arcee AI), LazyMergekit, Model Soup

Installation

Install mergekit

git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .

Or via pip

pip install mergekit

Optional: Transformers library

pip install transformers torch

Quick Start

Simple Linear Merge

config.yml - Merge two models with equal weights

merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
dtype: bfloat16

Run merge

mergekit-yaml config.yml ./merged-model --cuda

Use merged model

python -c "from transformers import pipeline; print(pipeline('text-generation', model='./merged-model')('Hello, world')[0]['generated_text'])"

SLERP Merge (Best for 2 Models)

config.yml - Spherical interpolation

merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1   # slerp requires a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5   # Interpolation factor (0 = model1, 1 = model2)
dtype: bfloat16

Core Concepts

  1. Merge Methods

Linear (Model Soup)

  • Simple weighted average of parameters

  • Fast, works well for similar models

  • Can merge 2+ models

merged_weights = w1 * model1_weights + w2 * model2_weights + w3 * model3_weights

where w1 + w2 + w3 = 1
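
As a rough illustration, a linear merge is nothing more than a weighted average over matching state-dict tensors. The sketch below is illustrative only (mergekit additionally handles shard loading, dtype casting, and shape validation); linear_merge is a hypothetical helper, not part of mergekit:

import torch

def linear_merge(state_dicts, weights):
    # Weighted average of parameter tensors across models ("model soup").
    # Assumes all state dicts share identical keys and shapes.
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }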

SLERP (Spherical Linear Interpolation)

  • Interpolates along sphere in weight space

  • Preserves magnitude of weight vectors

  • Best for merging 2 models

  • Smoother than linear

SLERP formula

merged = (sin((1-t)θ) / sin(θ)) * model1 + (sin(tθ) / sin(θ)) * model2

where θ = arccos(dot(model1, model2) / (‖model1‖ · ‖model2‖))

t ∈ [0, 1]
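
A minimal per-tensor rendering of the formula above (an illustrative sketch, not mergekit's implementation; it flattens each tensor and falls back to linear interpolation when the two vectors are nearly parallel):

import torch

def slerp(t, v1, v2, eps=1e-8):
    # Spherical interpolation between two parameter tensors.
    a, b = v1.flatten().float(), v2.flatten().float()
    cos_theta = torch.dot(a, b) / (a.norm() * b.norm() + eps)
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    if theta.abs() < 1e-4:   # nearly parallel: linear is fine
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * theta) * a + torch.sin(t * theta) * b) / torch.sin(theta)
    return merged.reshape(v1.shape).to(v1.dtype)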

Task Arithmetic

  • Extract "task vectors" (fine-tuned - base)

  • Combine task vectors, add to base

  • Good for merging multiple specialized models

Task vector

task_vector = finetuned_model - base_model

Merge multiple task vectors

merged = base_model + α₁·task_vector₁ + α₂·task_vector₂
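
Expressed over state dicts, the same idea looks like this (a sketch; task_arithmetic is a hypothetical helper, and real merges also require matching architectures and tokenizers):

def task_arithmetic(base_sd, finetuned_sds, alphas):
    # Add scaled task vectors (finetuned - base) onto the base model.
    merged = {}
    for name, base_param in base_sd.items():
        delta = sum(a * (sd[name] - base_param) for a, sd in zip(alphas, finetuned_sds))
        merged[name] = base_param + delta
    return merged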

TIES-Merging

  • Task arithmetic + sparsification

  • Resolves sign conflicts in parameters

  • Best for merging many task-specific models
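
In simplified form, TIES trims each task vector to its largest-magnitude entries, elects a per-parameter sign by total mass, and averages only the deltas that agree with that sign. A single-tensor sketch (the published method and mergekit handle normalization and edge cases more carefully):

import torch

def ties_merge(deltas, density=0.5):
    # deltas: list of task-vector tensors with identical shape
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        thresh = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    sign = torch.sign(stacked.sum(dim=0))                   # elect majority sign
    agree = (torch.sign(stacked) == sign) & (stacked != 0)  # deltas matching it
    count = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / count             # mean of agreeing deltas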

DARE (Drop And REscale)

  • Randomly drops fine-tuned parameters

  • Rescales remaining parameters

  • Reduces redundancy, maintains performance
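
DARE's drop-and-rescale step on a single task vector, as a sketch (each entry is dropped with probability 1 - density; rescaling keeps the expected delta unchanged):

import torch

def dare(delta, density=0.5):
    # Keep each entry with probability `density`, zero the rest,
    # then rescale survivors so the expected value matches the original.
    mask = torch.bernoulli(torch.full_like(delta, density))
    return delta * mask / density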

  2. Configuration Structure

Basic structure

merge_method: <method>    # linear, slerp, ties, dare_ties, task_arithmetic
base_model: <path>        # Optional: base model for task arithmetic

models:
  - model: <path/to/model1>
    parameters:
      weight: <float>     # Merge weight
      density: <float>    # For TIES/DARE
  - model: <path/to/model2>
    parameters:
      weight: <float>

parameters:
  # Method-specific parameters

dtype: <dtype>            # bfloat16, float16, float32

# Optional
slices:      # Layer-wise merging
tokenizer:   # Tokenizer configuration

Merge Methods Guide

Linear Merge

Best for: Simple model combinations, equal weighting

merge_method: linear
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.4
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.3
dtype: bfloat16

SLERP Merge

Best for: Two models, smooth interpolation

merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1   # slerp requires a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5   # 0.0 = first model, 1.0 = second model
dtype: bfloat16

Layer-specific SLERP:

merge_method: slerp
base_model: model_a
slices:
  - sources:
      - model: model_a
        layer_range: [0, 32]
      - model: model_b
        layer_range: [0, 32]
parameters:
  t:
    - filter: self_attn   # Attention layers
      value: 0.3
    - filter: mlp         # MLP layers
      value: 0.7
    - value: 0.5          # Default for other layers
dtype: bfloat16

Task Arithmetic

Best for: Combining specialized skills

merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1         # Math
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B   # Chat
    parameters:
      weight: 0.3
  - model: ajibawa-2023/Code-Mistral-7B        # Code
    parameters:
      weight: 0.2
dtype: bfloat16

TIES-Merging

Best for: Many models, resolving conflicts

merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5   # Keep top 50% of parameters
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16

DARE Merge

Best for: Reducing redundancy

merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5   # Drop 50% of deltas
      weight: 0.6
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true   # Use int8 for masks (saves memory)
dtype: bfloat16

Advanced Patterns

Layer-wise Merging

Different models for different layers

merge_method: passthrough
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 16]    # First half
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [16, 32]   # Second half
dtype: bfloat16

MoE from Merged Models

Create Mixture of Experts

base_model: mistralai/Mistral-7B-v0.1
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
dtype: bfloat16

MoE configs use mergekit's dedicated entry point rather than mergekit-yaml (there is no merge_method for MoE):

mergekit-moe config.yml ./merged-moe

Tokenizer Merging

merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: custom/specialized-model

tokenizer:
  source: "union"   # Combine vocabularies from both models
  tokens:
    <|special_token|>:
      source: "custom/specialized-model"

Best Practices

  1. Model Compatibility

✅ Good: Same architecture

models = [ "mistralai/Mistral-7B-v0.1", "teknium/OpenHermes-2.5-Mistral-7B", # Both Mistral 7B ]

❌ Bad: Different architectures

models = [ "meta-llama/Llama-2-7b-hf", # Llama "mistralai/Mistral-7B-v0.1", # Mistral (incompatible!) ]

  2. Weight Selection

✅ Good: Weights sum to 1.0

models:
  - model: model_a
    parameters:
      weight: 0.6
  - model: model_b
    parameters:
      weight: 0.4   # 0.6 + 0.4 = 1.0

⚠️ Acceptable: Weights don't sum to 1 (for task arithmetic)

models:
  - model: model_a
    parameters:
      weight: 0.8
  - model: model_b
    parameters:
      weight: 0.8   # May boost performance

  3. Method Selection

Choose merge method based on use case:

2 models, smooth blend → SLERP

merge_method = "slerp"

3+ models, simple average → Linear

merge_method = "linear"

Multiple task-specific models → Task Arithmetic or TIES

merge_method = "ties"

Want to reduce redundancy → DARE

merge_method = "dare_ties"

  4. Density Tuning (TIES/DARE)

Start conservative (keep more parameters)

parameters:
  density: 0.8   # Keep 80%

If performance holds, increase sparsity

parameters:
  density: 0.5   # Keep 50%

If performance degrades, reduce sparsity

parameters:
  density: 0.9   # Keep 90%

  5. Layer-specific Merging

Preserve base model's beginning and end

merge_method: passthrough
slices:
  - sources:
      - model: base_model
        layer_range: [0, 2]     # Keep first layers
  - sources:
      - model: merged_middle    # Merged middle layers
        layer_range: [2, 30]
  - sources:
      - model: base_model
        layer_range: [30, 32]   # Keep last layers

Evaluation & Testing

Benchmark Merged Models

from transformers import AutoModelForCausalLM, AutoTokenizer

Load merged model

model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

Test on various tasks

test_prompts = {
    "math": "Calculate: 25 * 17 =",
    "code": "Write a Python function to reverse a string:",
    "chat": "What is the capital of France?",
}

for task, prompt in test_prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    print(f"{task}: {tokenizer.decode(outputs[0])}")

Common Benchmarks

  • Open LLM Leaderboard: General capabilities

  • MT-Bench: Multi-turn conversation

  • MMLU: Multitask accuracy

  • HumanEval: Code generation

  • GSM8K: Math reasoning
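
For reproducible scores on these benchmarks, one common route is EleutherAI's lm-evaluation-harness. A sketch assuming lm-eval (v0.4+) is installed via pip install lm-eval:

import lm_eval

# Evaluate the merged checkpoint on a subset of the benchmarks above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./merged-model",
    tasks=["gsm8k"],
    batch_size=8,
)
print(results["results"])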

Production Deployment

Save and Upload

from transformers import AutoModelForCausalLM, AutoTokenizer

Load merged model

model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

Upload to HuggingFace Hub

model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")

Quantize Merged Model

Quantize to GGUF (using llama.cpp's convert script)

python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf

Quantize with GPTQ

python quantize_gptq.py ./merged-model --bits 4 --group_size 128

Common Pitfalls

❌ Pitfall 1: Merging Incompatible Models

Wrong: Different architectures

models:
  - model: meta-llama/Llama-2-7b    # Llama architecture
  - model: mistralai/Mistral-7B     # Mistral architecture

Fix: Only merge models with same architecture

❌ Pitfall 2: Over-weighting One Model

Suboptimal: One model dominates

models:
  - model: model_a
    parameters:
      weight: 0.95   # Too high
  - model: model_b
    parameters:
      weight: 0.05   # Too low

Fix: Use more balanced weights (0.3-0.7 range)

❌ Pitfall 3: Not Evaluating

Wrong: Merge and deploy without testing

mergekit-yaml config.yml ./merged-model

Deploy immediately (risky!)

Fix: Always benchmark before deploying

Resources

See Also

  • references/methods.md - Deep dive into merge algorithms

  • references/examples.md - Real-world merge configurations

  • references/evaluation.md - Benchmarking and testing strategies
