mlx-swift-lm Skill
- Overview & Triggers
mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, streaming generation, wired-memory coordination, tool calling, LoRA/DoRA fine-tuning, and embeddings.
When to Use This Skill
-
Running LLM/VLM inference on macOS/iOS with Apple Silicon
-
Streaming text generation from local models
-
Coordinating concurrent inference with wired-memory policies and tickets
-
Tool calling / function calling with models
-
LoRA adapter training and fine-tuning
-
Text embeddings for RAG/semantic search
-
Porting model architectures from Python MLX-LM to Swift
Architecture Overview
MLXLMCommon - Core infra (ModelContainer, ChatSession, Evaluate, KVCache, wired memory helpers) MLXLLM - Text-only LLM support (Llama, Qwen, Gemma, Phi, DeepSeek, etc.) MLXVLM - Vision-Language Models (Qwen-VL, PaliGemma, Gemma3, etc.) MLXEmbedders - Embedding models and pooling utilities
- Key File Reference
Purpose File Path
Thread-safe model wrapper Libraries/MLXLMCommon/ModelContainer.swift
Simplified chat API Libraries/MLXLMCommon/ChatSession.swift
Generation & streaming APIs Libraries/MLXLMCommon/Evaluate.swift
KV cache types Libraries/MLXLMCommon/KVCache.swift
Wired-memory policies Libraries/MLXLMCommon/WiredMemoryPolicies.swift
Wired-memory measurement helpers Libraries/MLXLMCommon/WiredMemoryUtils.swift
Model configuration Libraries/MLXLMCommon/ModelConfiguration.swift
Chat message types Libraries/MLXLMCommon/Chat.swift
Tool call processing Libraries/MLXLMCommon/Tool/ToolCallFormat.swift
Concurrency utilities Libraries/MLXLMCommon/Utilities/SerialAccessContainer.swift
LLM factory & registry Libraries/MLXLLM/LLMModelFactory.swift
VLM factory & registry Libraries/MLXVLM/VLMModelFactory.swift
LoRA configuration Libraries/MLXLMCommon/Adapters/LoRA/LoRAContainer.swift
LoRA training Libraries/MLXLLM/LoraTrain.swift
- Quick Start
LLM Chat (Simplest API)
import MLXLLM import MLXLMCommon
let modelContainer = try await LLMModelFactory.shared.loadContainer( configuration: .init(id: "mlx-community/Qwen3-4B-4bit") )
let session = ChatSession(modelContainer)
let response = try await session.respond(to: "What is Swift?") print(response)
for try await chunk in session.streamResponse(to: "Explain structured concurrency") { print(chunk, terminator: "") }
VLM with Image
import MLXVLM import MLXLMCommon
let modelContainer = try await VLMModelFactory.shared.loadContainer( configuration: .init(id: "mlx-community/Qwen2-VL-2B-Instruct-4bit") )
let session = ChatSession(modelContainer) let image = UserInput.Image.url(imageURL)
let response = try await session.respond( to: "Describe this image", image: image, video: nil )
Embeddings
import Embedders
let container = try await loadModelContainer( configuration: ModelConfiguration(id: "mlx-community/bge-small-en-v1.5-mlx") )
let embeddings = await container.perform { model, tokenizer, pooler in let tokens = tokenizer.encode(text: "Hello world") let input = MLXArray(tokens).expandedDimensions(axis: 0) let output = model(input) let pooled = pooler(output, normalize: true) eval(pooled) return pooled }
- Primary Workflow: LLM Inference
ChatSession API (Recommended)
ChatSession manages conversation history and KV cache automatically:
let session = ChatSession( modelContainer, instructions: "You are a helpful assistant", generateParameters: GenerateParameters(maxTokens: 500, temperature: 0.7) )
let r1 = try await session.respond(to: "What is 2+2?") let r2 = try await session.respond(to: "And if you multiply that by 3?")
await session.clear()
Streaming with ModelContainer.generate(...)
For lower-level control, prepare UserInput and generate directly:
let userInput = UserInput(prompt: "Hello") let lmInput = try await modelContainer.prepare(input: userInput)
let stream = try await modelContainer.generate( input: lmInput, parameters: GenerateParameters() )
for await generation in stream { switch generation { case .chunk(let text): print(text, terminator: "") case .toolCall(let call): print("Tool call: (call.function.name)") case .info(let info): print("\nStop reason: (info.stopReason)") print("(info.tokensPerSecond) tok/s") } }
Generation API Surface (Evaluate.swift)
Use these depending on your control needs:
-
generate(input:..., context:..., wiredMemoryTicket:) -> AsyncStream<Generation> : decoded text + tool calls.
-
generateTask(..., wiredMemoryTicket:) -> (AsyncStream<Generation>, Task<Void, Never>) : same output, plus task handle for deterministic cleanup when consumers stop early.
-
generateTokens(..., wiredMemoryTicket:) -> AsyncStream<TokenGeneration> : raw token IDs.
-
generateTokensTask(..., wiredMemoryTicket:) -> (AsyncStream<TokenGeneration>, Task<Void, Never>) : raw tokens + task handle.
-
GenerateStopReason : .stop , .length , .cancelled in final .info .
See references/generation.md for full patterns.
Tool Calling
struct WeatherInput: Codable { let location: String } struct WeatherOutput: Codable { let temperature: Double; let conditions: String }
let weatherTool = Tool<WeatherInput, WeatherOutput>( name: "get_weather", description: "Get current weather", parameters: [.required("location", type: .string, description: "City name")] ) { _ in WeatherOutput(temperature: 22.0, conditions: "Sunny") }
let userInput = UserInput( prompt: .text("What's the weather in Tokyo?"), tools: [weatherTool.schema] )
let lmInput = try await modelContainer.prepare(input: userInput) let stream = try await modelContainer.generate(input: lmInput, parameters: GenerateParameters())
for await generation in stream { switch generation { case .chunk(let text): print(text, terminator: "") case .toolCall(let call): let result = try await call.execute(with: weatherTool) print("\nWeather: (result.conditions)") case .info: break } }
See references/tool-calling.md for multi-turn tool loops.
GenerateParameters
let params = GenerateParameters( maxTokens: 1000, // nil = unlimited maxKVSize: 4096, // Sliding window (RotatingKVCache) kvBits: 4, // Quantized cache (4 or 8) kvGroupSize: 64, // Quantization group size quantizedKVStart: 0, // Token index to start KV quantization temperature: 0.7, // 0 = greedy / argmax topP: 0.9, // Nucleus sampling repetitionPenalty: 1.1, // Penalize repeats repetitionContextSize: 20, // Penalty window prefillStepSize: 512 // Prompt prefill chunk size )
Wired Memory (Optional)
Use policy tickets to coordinate concurrent inference memory:
let policy = WiredSumPolicy() let ticket = policy.ticket(size: estimatedBytes, kind: .active)
let userInput = UserInput(prompt: "Summarize this text") let lmInput = try await modelContainer.prepare(input: userInput)
let stream = try await modelContainer.generate( input: lmInput, parameters: GenerateParameters(), wiredMemoryTicket: ticket )
for await generation in stream { if case .chunk(let text) = generation { print(text, terminator: "") } }
For policy selection, reservations, and measurement-based budgeting, see references/wired-memory.md.
Prompt Caching / History Re-hydration
let history: [Chat.Message] = [ .system("You are helpful"), .user("Hello"), .assistant("Hi there!") ]
let session = ChatSession(modelContainer, history: history)
- Secondary Workflow: VLM Inference
Image Input Types
let imageFromURL = UserInput.Image.url(fileURL) let imageFromCI = UserInput.Image.ciImage(ciImage) let imageFromArray = UserInput.Image.array(mlxArray)
Video Input
let videoFromURL = UserInput.Video.url(videoURL) let videoFromAsset = UserInput.Video.avAsset(avAsset) let videoFromFrames = UserInput.Video.frames(videoFrames)
let response = try await session.respond(to: "What happens in this video?", video: videoFromURL)
Multiple Images
let images: [UserInput.Image] = [.url(url1), .url(url2)] let response = try await session.respond(to: "Compare these two images", images: images, videos: [])
VLM-Specific Processing
let session = ChatSession( modelContainer, processing: UserInput.Processing(resize: CGSize(width: 512, height: 512)) )
- Best Practices
DO
// DO: Prefer ChatSession for multi-turn chat UX let session = ChatSession(modelContainer)
// DO: Prepare UserInput before container-level generation let userInput = UserInput(prompt: "Hello") let lmInput = try await modelContainer.prepare(input: userInput)
// DO: Use task-handle variants for early-stop scenarios let (stream, task) = generateTask( promptTokenCount: lmInput.text.tokens.size, modelConfiguration: context.configuration, tokenizer: context.tokenizer, iterator: iterator ) for await item in stream { if shouldStop { break } } await task.value
// DO: Use wired tickets when coordinating concurrent workloads let ticket = WiredSumPolicy().ticket(size: estimatedBytes) let _ = try await modelContainer.generate(input: lmInput, parameters: params, wiredMemoryTicket: ticket)
DON'T
// DON'T: Skip prepare(input:) before container-level generation. // ModelContainer.generate expects LMInput, not UserInput.
// DON'T: Share MLXArray across tasks (not Sendable) let array = MLXArray(...) Task { _ = array.sum() } // wrong
// DON'T: Ignore task completion after early-break on low-level streams for await item in stream { if shouldStop { break } } // await task.value is required for deterministic cleanup
Thread Safety
-
ModelContainer is Sendable and thread-safe.
-
ChatSession is not thread-safe; use one session per task/flow.
-
MLXArray is not Sendable ; keep it inside one isolation domain or use SendableBox transfer patterns.
Memory Management
let slidingWindow = GenerateParameters(maxKVSize: 4096) let quantizedKV = GenerateParameters(kvBits: 4, kvGroupSize: 64) await session.clear()
- Reference Links
Reference When to Use
references/model-container.md Loading models, ModelContainer API, ModelConfiguration
references/generation.md generate , generateTask , raw token streaming APIs
references/wired-memory.md Wired tickets, policies, budgeting, reservations
references/kv-cache.md Cache types, memory optimization, cache serialization
references/concurrency.md Thread safety, SerialAccessContainer, async patterns
references/tool-calling.md Function calling, tool formats, ToolCallProcessor
references/tokenizer-chat.md Tokenizer, Chat.Message, EOS tokens
references/supported-models.md Model families, registries, model-specific config
references/lora-adapters.md LoRA/DoRA/QLoRA, loading adapters
references/training.md LoRATrain API, fine-tuning
references/embeddings.md EmbeddingModel, pooling, use cases
references/model-porting.md Porting models from Python MLX-LM to Swift
- Deprecated Patterns Summary
If you see... Use instead...
generate(... didGenerate:) callback AsyncStream-based generation APIs
perform { model, tokenizer in }
perform { context in }
TokenIterator(prompt: MLXArray)
TokenIterator(input: LMInput)
ModelRegistry typealias LLMRegistry or VLMRegistry
createAttentionMask(h:cache:[KVCache]?)
createAttentionMask(h:cache:KVCache?)
- Automatic vs Manual Configuration
Automatic Behaviors
Feature Details
EOS token loading Loaded from config.json
EOS override generation_config.json
config.json defaults
EOS merging All sources merged at generation time
EOS detection Stops generation when EOS encountered
Chat template application Applied by tokenizer / processor path
Tool call format detection Inferred from model_type in config.json
Cache type selection Driven by GenerateParameters (maxKVSize , kvBits )
Tokenizer loading Loaded automatically from model assets
Model weight loading Downloaded and loaded from Hugging Face/local directory
Optional Configuration
Feature When to Configure
extraEOSTokens
Model has unlisted stop tokens
toolCallFormat
Override auto-detected tool parser format
maxKVSize
Enable sliding window cache
kvBits , kvGroupSize , quantizedKVStart
Enable and tune KV quantization
prefillStepSize
Tune prompt prefill chunking/perf tradeoff
wiredMemoryTicket
Coordinate policy-based wired-memory limits