gemma-tuner-multimodal

Fine-tune Gemma 4 and 3n models with audio, images, and text on Apple Silicon using PyTorch and Metal Performance Shaders.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following and send it to your AI assistant to install the skill

Install skill "gemma-tuner-multimodal" with this command: npx skills add aradotso/trending-skills/aradotso-trending-skills-gemma-tuner-multimodal

Gemma Multimodal Fine-Tuner

Skill by ara.so — Daily 2026 Skills collection.

Fine-tune Gemma 4 and Gemma 3n models on text, images, and audio data entirely on Apple Silicon (MPS), with support for streaming large datasets from GCS/BigQuery without filling local storage.


What It Does

  • Text LoRA: instruction-tuning or completion fine-tuning from local CSV
  • Image + Text LoRA: captioning and VQA from local CSV
  • Audio + Text LoRA: the only Apple-Silicon-native path for this modality
  • Cloud streaming: train on terabytes from GCS/BigQuery without local copy
  • MPS-native: no NVIDIA GPU required — runs on MacBook Pro/Air/Mac Studio

Installation

Prerequisites

  • macOS 12.3+ with Apple Silicon (arm64)
  • Python 3.10+ (native arm64, not Rosetta)
  • Hugging Face account with Gemma access
# Install Python 3.12 if needed
brew install python@3.12

# Create venv
python3.12 -m venv .venv
source .venv/bin/activate

# Verify arm64 (must show arm64, not x86_64)
python -c "import platform; print(platform.machine())"

# Install PyTorch
pip install torch torchaudio
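
# Optional sanity check: confirm the MPS backend is available (should print True)
python -c "import torch; print(torch.backends.mps.is_available())"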

# Clone and install
git clone https://github.com/mattmireles/gemma-tuner-multimodal
cd gemma-tuner-multimodal
pip install -e .

# For Gemma 4 support (separate venv recommended)
pip install -r requirements/requirements-gemma4.txt

Authenticate with Hugging Face

huggingface-cli login
# Or set environment variable:
export HF_TOKEN=your_token_here
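
To confirm the token is actually picked up before kicking off a long run, a quick check with huggingface_hub (installed alongside transformers) can help. This is a generic sketch, not part of the tuner:

# Sketch: verify Hugging Face authentication before training.
from huggingface_hub import whoami

try:
    info = whoami()  # uses HF_TOKEN or the cached login token
    print(f"Authenticated as {info['name']}")
except Exception as exc:
    print(f"Not authenticated: {exc}")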

CLI Commands

# Check system is ready
gemma-macos-tuner system-check

# Guided setup wizard (recommended for first run)
gemma-macos-tuner wizard

# Prepare dataset
gemma-macos-tuner prepare <dataset-profile>

# Fine-tune a model
gemma-macos-tuner finetune <profile> --json-logging

# Evaluate a run
gemma-macos-tuner evaluate <profile-or-run>

# Export merged HF/SafeTensors (merges LoRA when adapter_config.json present)
gemma-macos-tuner export <run-dir-or-profile>

# Blacklist bad samples flagged by run errors
gemma-macos-tuner blacklist <profile>

# List training runs
gemma-macos-tuner runs list

Configuration (config/config.ini)

The config is a hierarchical INI file: values cascade from [defaults] through group, model, and dataset sections down to profiles, with more specific sections overriding more general ones.

[defaults]
output_dir = output
batch_size = 2
gradient_accumulation_steps = 8
learning_rate = 2e-4
num_train_epochs = 3

[model:gemma-3n-e2b-it]
group = gemma
base_model = google/gemma-3n-E2B-it

[model:gemma-4-e2b-it]
group = gemma
base_model = google/gemma-4-E2B-it

[dataset:my-audio-dataset]
data_dir = data/datasets/my-audio-dataset
audio_column = audio_path
text_column = transcript

[profile:my-audio-profile]
model = gemma-3n-e2b-it
dataset = my-audio-dataset
modality = audio
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
max_seq_length = 512

Use the GEMMA_TUNER_CONFIG environment variable to point to a config file outside the repo root:

export GEMMA_TUNER_CONFIG=/path/to/my/config.ini
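
The cascade is worth internalizing: a key set on a profile wins over its dataset and model sections, which in turn win over [defaults]. As a rough illustration of that lookup order (a sketch, not the tool's actual resolver; group-level lookup omitted for brevity):

# Illustrative sketch of hierarchical INI resolution.
import configparser

def resolve(cfg: configparser.ConfigParser, profile: str, key: str) -> str | None:
    """Look up `key` for a profile: profile, then dataset/model sections, then defaults."""
    prof = cfg[f"profile:{profile}"]
    if key in prof:
        return prof[key]  # most specific level wins
    for ref in ("dataset", "model"):
        section = f"{ref}:{prof.get(ref, '')}"
        if cfg.has_section(section) and key in cfg[section]:
            return cfg[section][key]
    return cfg["defaults"].get(key)  # global fallback

cfg = configparser.ConfigParser()
cfg.read("config/config.ini")
print(resolve(cfg, "my-audio-profile", "learning_rate"))  # 2e-4 via [defaults]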

Modality Configuration

Text-Only Fine-Tuning

Instruction tuning (user/assistant pairs):

[profile:text-instruction]
model = gemma-3n-e2b-it
dataset = my-text-dataset
modality = text
text_sub_mode = instruction
prompt_column = prompt
text_column = response
max_seq_length = 2048
lora_r = 16
lora_alpha = 32

Completion tuning (loss computed over the full sequence):

[profile:text-completion]
model = gemma-3n-e2b-it
dataset = my-text-dataset
modality = text
text_sub_mode = completion
text_column = text
max_seq_length = 2048

CSV format for instruction tuning (data/datasets/my-text-dataset/train.csv):

prompt,response
"What is photosynthesis?","Photosynthesis is the process by which plants..."
"Explain LoRA fine-tuning","LoRA (Low-Rank Adaptation) is a parameter-efficient..."

Image Fine-Tuning

[profile:image-caption]
model = gemma-3n-e2b-it
dataset = my-image-dataset
modality = image
image_sub_mode = captioning
image_token_budget = 256
prompt_column = prompt
text_column = caption
max_seq_length = 512

CSV format (data/datasets/my-image-dataset/train.csv):

image_path,prompt,caption
/data/images/img1.jpg,Describe this image,A dog sitting on a green lawn...
/data/images/img2.jpg,What is shown here,A bar chart showing quarterly revenue...
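
Broken or unreadable image paths tend to surface mid-run, so validating the CSV up front can save a failed epoch. A minimal sketch using Pillow, assuming the image_path column from the example:

# Sketch: verify every image_path in the CSV opens cleanly before training.
import pandas as pd
from PIL import Image

df = pd.read_csv("data/datasets/my-image-dataset/train.csv")
for path in df["image_path"]:
    try:
        with Image.open(path) as img:
            img.verify()  # cheap integrity check, does not decode the full image
    except Exception as exc:
        print(f"Bad image: {path} ({exc})")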

Audio Fine-Tuning

[profile:audio-asr]
model = gemma-3n-e2b-it
dataset = my-audio-dataset
modality = audio
audio_column = audio_path
text_column = transcript
max_seq_length = 512
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05

CSV format (data/datasets/my-audio-dataset/train.csv):

audio_path,transcript
/data/audio/recording1.wav,The patient presents with acute respiratory symptoms
/data/audio/recording2.wav,Counsel objects to the characterization of the evidence
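
Speech front-ends generally expect a fixed sample rate; 16 kHz is typical, but confirm against the model card. A pre-flight sketch with torchaudio that resamples in place (assumes writable WAV files):

# Sketch: check sample rates and resample to 16 kHz where needed.
import pandas as pd
import torchaudio

TARGET_SR = 16_000  # assumption: confirm the expected rate for your model

df = pd.read_csv("data/datasets/my-audio-dataset/train.csv")
for path in df["audio_path"]:
    waveform, sr = torchaudio.load(path)
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
        torchaudio.save(path, waveform, TARGET_SR)
        print(f"Resampled {path}: {sr} Hz -> {TARGET_SR} Hz")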

Supported Models

Model Key        Hugging Face ID         Notes
gemma-3n-e2b-it  google/gemma-3n-E2B-it  Default, ~2B instruct
gemma-3n-e4b-it  google/gemma-3n-E4B-it  ~4B instruct
gemma-4-e2b-it   google/gemma-4-E2B-it   Needs requirements-gemma4.txt
gemma-4-e4b-it   google/gemma-4-E4B-it   Needs requirements-gemma4.txt
gemma-4-e2b      google/gemma-4-E2B      Base, needs Gemma 4 stack
gemma-4-e4b      google/gemma-4-E4B      Base, needs Gemma 4 stack
Add custom models with a [model:your-name] section using group = gemma.


Dataset Directory Layout

data/
└── datasets/
    └── <dataset-name>/
        ├── train.csv       # required
        ├── validation.csv  # optional
        └── test.csv        # optional
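
Scaffolding the layout takes two lines; only train.csv is required:

# Sketch: create the expected dataset directory for a new dataset.
from pathlib import Path

dataset = Path("data/datasets/my-new-dataset")  # hypothetical dataset name
dataset.mkdir(parents=True, exist_ok=True)
# drop train.csv here; validation.csv and test.csv are optional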

Output Layout

output/
└── {run-id}-{profile}/
    ├── metadata.json
    ├── metrics.json
    ├── checkpoint-*/
    └── adapter_model/      # LoRA artifacts
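
Because run artifacts are plain JSON, inspecting them needs only the standard library. A sketch that lists runs and prints their metrics (the exact keys in metrics.json depend on the run, so check a real file):

# Sketch: enumerate runs and dump their metrics.json contents.
import json
from pathlib import Path

for run_dir in sorted(Path("output").iterdir()):
    metrics_file = run_dir / "metrics.json"
    if metrics_file.exists():
        metrics = json.loads(metrics_file.read_text())
        print(run_dir.name, metrics)  # actual keys depend on the run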

Python API Examples

Running Fine-Tuning Programmatically

from gemma_tuner.core.config import load_config
from gemma_tuner.core.ops import run_finetune

# Load config
config = load_config("config/config.ini")

# Run fine-tuning for a profile
run_finetune(profile="my-audio-profile", config=config, json_logging=True)

Using Device Utilities

from gemma_tuner.utils.device import get_device, memory_hint

device = get_device()   # Returns "mps", "cuda", or "cpu"
print(f"Training on: {device}")

hint = memory_hint(model_key="gemma-3n-e2b-it")
print(hint)

Loading and Inspecting Datasets

from gemma_tuner.utils.dataset_utils import load_csv_dataset

train_df, val_df = load_csv_dataset(
    data_dir="data/datasets/my-text-dataset",
    text_column="response",
    prompt_column="prompt"
)
print(f"Train samples: {len(train_df)}, Val samples: {len(val_df)}")

Custom LoRA Config

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3n-E2B-it",
    torch_dtype="auto",
    device_map="mps"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
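
After training, the adapter can be saved on its own or merged into the base weights; both are standard PEFT calls (the output paths here are illustrative):

# Save only the LoRA adapter (small; reload it on top of the base model later)
model.save_pretrained("output/my-run/adapter_model")

# Or fold the adapter into the base weights for standalone export
merged = model.merge_and_unload()
merged.save_pretrained("output/my-run/merged")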

Common Patterns

Full Workflow: Text Instruction Tuning

# 1. Prepare your data
mkdir -p data/datasets/my-dataset
cp train.csv data/datasets/my-dataset/
cp validation.csv data/datasets/my-dataset/

# 2. Add profile to config/config.ini
cat >> config/config.ini << 'EOF'
[dataset:my-dataset]
data_dir = data/datasets/my-dataset

[profile:my-text-run]
model = gemma-3n-e2b-it
dataset = my-dataset
modality = text
text_sub_mode = instruction
prompt_column = prompt
text_column = response
max_seq_length = 2048
lora_r = 16
lora_alpha = 32
EOF

# 3. Prepare dataset
gemma-macos-tuner prepare my-dataset

# 4. Fine-tune
gemma-macos-tuner finetune my-text-run --json-logging

# 5. Export merged weights
gemma-macos-tuner export my-text-run

GCS Streaming for Large Datasets

[dataset:large-audio-gcs]
source = gcs
gcs_bucket = my-bucket
gcs_prefix = audio-training-data/
audio_column = audio_path
text_column = transcript

[profile:large-audio-run]
model = gemma-3n-e4b-it
dataset = large-audio-gcs
modality = audio
lora_r = 32
lora_alpha = 64

Set credentials:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
gemma-macos-tuner finetune large-audio-run
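
Under the hood, streaming from GCS means reading blobs lazily rather than downloading the bucket first. A rough sketch of the pattern with google-cloud-storage (not the tuner's actual loader; bucket and prefix match the dataset section above):

# Sketch: iterate GCS blobs under a prefix and stream bytes on demand.
from google.cloud import storage

client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
bucket = client.bucket("my-bucket")

for blob in client.list_blobs(bucket, prefix="audio-training-data/"):
    with blob.open("rb") as fh:   # streamed read, no local copy
        header = fh.read(44)      # e.g. peek at a WAV header
        print(blob.name, "first", len(header), "bytes read")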

Add a Custom Gemma Checkpoint

[model:my-custom-gemma]
group = gemma
base_model = my-org/my-gemma-checkpoint

[profile:custom-run]
model = my-custom-gemma
dataset = my-dataset
modality = text
text_sub_mode = instruction

Troubleshooting

Wrong architecture (x86_64 instead of arm64)

python -c "import platform; print(platform.machine())"
# Must be arm64 — if x86_64, reinstall Python natively:
brew install python@3.12
python3.12 -m venv .venv && source .venv/bin/activate

MPS out of memory

  • Reduce batch_size (try 1)
  • Increase gradient_accumulation_steps to compensate (effective batch = batch_size × gradient_accumulation_steps, so batch_size 1 with 16 accumulation steps keeps the effective batch at 16)
  • Use a smaller model (e2b instead of e4b)
  • Reduce max_seq_length

Gemma 4 model not loading

# Gemma 4 requires the updated Transformers stack
pip install -r requirements/requirements-gemma4.txt
# Use a separate venv if you also need Gemma 3n

Config not found outside repo root

export GEMMA_TUNER_CONFIG=/absolute/path/to/config/config.ini
gemma-macos-tuner finetune my-profile

Hugging Face auth errors

huggingface-cli login
# Or:
export HF_TOKEN=your_hf_token
# Accept Gemma license at: https://huggingface.co/google/gemma-3n-E2B-it

System check before debugging anything else

gemma-macos-tuner system-check

Audio tower loaded even for text-only runs

This is a known v1 issue — USM audio tower weights stay in memory even for modality = text. See README/KNOWN_ISSUES.md. Workaround: use a smaller model variant to stay within RAM budget.


Architecture Reference

File                                  Role
gemma_tuner/cli_typer.py              Main CLI entrypoint (gemma-macos-tuner)
gemma_tuner/core/ops.py               Dispatches prepare/finetune/evaluate/export
gemma_tuner/scripts/finetune.py       Router: Gemma models → models/gemma/finetune.py
gemma_tuner/models/gemma/finetune.py  Core training loop with LoRA
gemma_tuner/scripts/export.py         Merges LoRA → HF/SafeTensors tree
gemma_tuner/utils/device.py           MPS/CUDA/CPU selection and memory hints
gemma_tuner/utils/dataset_utils.py    CSV loading, blacklist/protection semantics
gemma_tuner/wizard/                   Interactive CLI wizard (questionary + Rich)
config/config.ini                     Hierarchical INI configuration

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. All four are listed under the General category, sourced from their repositories, flagged Needs Review, with no summary provided upstream:

  • openclaw-control-center
  • ui-ux-pro-max-skill
  • lightpanda-browser
  • chrome-cdp-live-browser