axiom-foundation-models

Foundation Models — On-Device AI for Apple Platforms

When to Use This Skill

Use when:

Implementing on-device AI features with Foundation Models
Adding text summarization, classification, or extraction capabilities
Creating structured output from LLM responses
Building tool-calling patterns for external data integration
Streaming generated content for better UX
Debugging Foundation Models issues (context overflow, slow generation, wrong output)
Deciding between Foundation Models vs server LLMs (ChatGPT, Claude, etc.)

Related Skills

Use axiom-foundation-models-diag for systematic troubleshooting (context exceeded, guardrail violations, availability problems)
Use axiom-foundation-models-ref for complete API reference with all WWDC code examples

Red Flags — Anti-Patterns That Will Fail

❌ Using for World Knowledge

Why it fails: The on-device model is 3 billion parameters, optimized for summarization, extraction, classification — NOT world knowledge or complex reasoning.

Example of wrong use:

// ❌ BAD - Asking for world knowledge
let session = LanguageModelSession()
let response = try await session.respond(to: "What's the capital of France?")

Why: Model will hallucinate or give low-quality answers. It's trained for content generation, not encyclopedic knowledge.

Correct approach: Use server LLMs (ChatGPT, Claude) for world knowledge, or provide factual data through Tool calling.

❌ Blocking Main Thread

Why it fails: session.respond() is async, and generation takes seconds. Any code path that waits for the result on the main thread freezes the UI for that long.

Example of wrong use:

// ❌ BAD - Blocking main thread
Button("Generate") {
    let response = try await session.respond(to: prompt) // UI frozen!
}

Why: Generation takes 1-5 seconds. User sees frozen app, bad reviews follow.

Correct approach:

// ✅ GOOD - Run generation in a Task so the UI stays responsive
Button("Generate") {
    Task {
        let response = try await session.respond(to: prompt)
        // Update UI with response
    }
}

❌ Manual JSON Parsing

Why it fails: Prompting for JSON and parsing with JSONDecoder leads to hallucinated keys, invalid JSON, no type safety.

Example of wrong use:

// ❌ BAD - Manual JSON parsing
let prompt = "Generate a person with name and age as JSON"
let response = try await session.respond(to: prompt)
let data = response.content.data(using: .utf8)!
let person = try JSONDecoder().decode(Person.self, from: data) // CRASHES!

Why: Model might output {firstName: "John"} when you expect {name: "John"}. Or invalid JSON entirely.

Correct approach:

// ✅ GOOD - @Generable guarantees structure
@Generable
struct Person {
    let name: String
    let age: Int
}

let response = try await session.respond(
    to: "Generate a person",
    generating: Person.self
)
// response.content is type-safe Person instance

❌ Ignoring Availability Check

Why it fails: Foundation Models only runs on Apple Intelligence devices in supported regions. App crashes or shows errors without check.

Example of wrong use:

// ❌ BAD - No availability check let session = LanguageModelSession() // Might fail!

Correct approach:

// ✅ GOOD - Check first
switch SystemLanguageModel.default.availability {
case .available:
    let session = LanguageModelSession()
    // proceed
case .unavailable(let reason):
    // Show graceful UI: "AI features require Apple Intelligence"
    break
}

❌ Single Huge Prompt

Why it fails: 4096 token context window (input + output). One massive prompt hits limit, gives poor results.

Example of wrong use:

// ❌ BAD - Everything in one prompt
let prompt = """
    Generate a 7-day itinerary for Tokyo including hotels, restaurants,
    activities for each day, transportation details, budget breakdown...
    """
// Exceeds context, poor quality

Correct approach: Break into smaller tasks, use tools for external data, multi-turn conversation.

❌ Not Handling Context Overflow

Why it fails: Multi-turn conversations grow transcript. Eventually exceeds 4096 tokens, throws error, conversation ends.

Must handle:

// ✅ GOOD - Handle overflow
do {
    let response = try await session.respond(to: prompt)
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // Condense transcript and create new session
    session = condensedSession(from: session)
}

❌ Not Handling Guardrail Violations

Why it fails: Model has content policy. Certain prompts trigger guardrails, throw error.

Must handle:

// ✅ GOOD - Handle guardrails
do {
    let response = try await session.respond(to: userInput)
} catch LanguageModelSession.GenerationError.guardrailViolation {
    // Show message: "I can't help with that request"
}

❌ Not Handling Unsupported Language

Why it fails: Model supports specific languages. User input might be unsupported, throws error.

Must check:

// ✅ GOOD - Check supported languages
let supported = SystemLanguageModel.default.supportedLanguages
guard supported.contains(Locale.current.language) else {
    // Show disclaimer
    return
}

Mandatory First Steps

Before writing any Foundation Models code, complete these steps:

Check Availability

switch SystemLanguageModel.default.availability {
case .available:
    // Proceed with implementation
    print("✅ Foundation Models available")
case .unavailable(let reason):
    // Handle gracefully - show UI message
    print("❌ Unavailable: \(reason)")
}

Why: Foundation Models requires:

Apple Intelligence-enabled device
Supported region
User opted in to Apple Intelligence

Failure mode: App crashes or shows confusing errors without check.

Identify Use Case

Ask yourself: What is my primary goal?

| Use Case | Foundation Models? | Alternative |
|---|---|---|
| Summarization | ✅ YES | |
| Extraction (key info from text) | ✅ YES | |
| Classification (categorize content) | ✅ YES | |
| Content tagging | ✅ YES (built-in adapter!) | |
| World knowledge | ❌ NO | ChatGPT, Claude, Gemini |
| Complex reasoning | ❌ NO | Server LLMs |
| Mathematical computation | ❌ NO | Calculator, symbolic math |

Critical: If your use case requires world knowledge or advanced reasoning, stop. Foundation Models is the wrong tool.

Design @Generable Schema

If you need structured output (not just plain text):

Bad approach: Prompt for "JSON" and parse manually
Good approach: Define @Generable type

@Generable
struct SearchSuggestions {
    @Guide(description: "Suggested search terms", .count(4))
    var searchTerms: [String]
}

Why: Constrained decoding guarantees structure. No parsing errors, no hallucinated keys.

Consider Tools for External Data

If your feature needs external information:

Weather → WeatherKit tool
Locations → MapKit tool
Contacts → Contacts API tool
Calendar → EventKit tool

Don't try to get this information from the model (it will hallucinate). Do define Tool protocol implementations.

Plan Streaming for Long Generations

If generation takes >1 second, use streaming:

let stream = session.streamResponse(
    to: prompt,
    generating: Itinerary.self
)

for try await partial in stream {
    // Update UI incrementally
    self.itinerary = partial
}

Why: Users see progress immediately, perceived latency drops dramatically.

Decision Tree

Need on-device AI?
│
├─ World knowledge/reasoning?
│   └─ ❌ NOT Foundation Models
│       → Use ChatGPT, Claude, Gemini, etc.
│       → Reason: 3B parameter model, not trained for encyclopedic knowledge
│
├─ Summarization?
│   └─ ✅ YES → Pattern 1 (Basic Session)
│       → Example: Summarize article, condense email
│       → Time: 10-15 minutes
│
├─ Structured extraction?
│   └─ ✅ YES → Pattern 2 (@Generable)
│       → Example: Extract name, date, amount from invoice
│       → Time: 15-20 minutes
│
├─ Content tagging?
│   └─ ✅ YES → contentTagging use case
│       → Example: Tag article topics, extract entities
│       → Time: 10 minutes
│
├─ Need external data?
│   └─ ✅ YES → Pattern 4 (Tool Calling)
│       → Example: Fetch weather, query contacts, get locations
│       → Time: 20-30 minutes
│
├─ Long generation?
│   └─ ✅ YES → Pattern 3 (Streaming)
│       → Example: Generate itinerary, create story
│       → Time: 15-20 minutes
│
└─ Dynamic schemas (runtime-defined structure)?
    └─ ✅ YES → DynamicGenerationSchema
        → Example: Level creator, user-defined forms
        → Time: 30-40 minutes

Pattern 1: Basic Session

Use when: Simple text generation, summarization, or content analysis.

Core Concepts

LanguageModelSession:

Stateful — retains transcript of all interactions
Instructions vs prompts:
Instructions (from developer): Define model's role, static guidance
Prompts (from user): Dynamic input for generation
Model trained to obey instructions over prompts (security feature)

Implementation

import FoundationModels

func respond(userInput: String) async throws -> String {
    let session = LanguageModelSession(instructions: """
        You are a friendly barista in a pixel art coffee shop.
        Respond to the player's question concisely.
        """
    )
    let response = try await session.respond(to: userInput)
    return response.content
}

// WWDC 301:1:05

Key Points

Instructions are optional — Reasonable defaults if omitted
Never interpolate user input into instructions — Security risk (prompt injection)
Keep instructions concise — Each token adds latency

Multi-Turn Interactions

let session = LanguageModelSession()

// First turn
let first = try await session.respond(to: "Write a haiku about fishing")
print(first.content)
// "Silent waters gleam,
// Casting lines in morning mist—
// Hope in every cast."

// Second turn - model remembers context
let second = try await session.respond(to: "Do another one about golf")
print(second.content)
// "Silent morning dew,
// Caddies guide with gentle words—
// Paths of patience tread."

// Inspect full transcript
print(session.transcript)

// WWDC 286:17:46

Why this works: Session retains transcript automatically. Model uses context from previous turns.

Transcript Inspection

let transcript = session.transcript
// Use for:
// - Debugging generation issues
// - Showing conversation history in UI
// - Exporting chat logs

Error Handling (Basic)

do {
    let response = try await session.respond(to: prompt)
} catch LanguageModelSession.GenerationError.guardrailViolation {
    // Content policy triggered
    print("Cannot generate that content")
} catch LanguageModelSession.GenerationError.unsupportedLanguageOrLocale {
    // Language not supported
    print("Please use English or another supported language")
}

When to Use This Pattern

✅ Good for:

Simple Q&A
Text summarization
Content analysis
Single-turn generation

❌ Not good for:

Structured output (use Pattern 2)
Long conversations (will hit context limit)
External data needs (use Pattern 4)

Time Cost

Implementation: 10-15 minutes for basic usage
Debugging: +5-10 minutes if hitting errors

Pattern 2: @Generable Structured Output

Use when: You need structured data from model, not just plain text.

The Problem

Without @Generable:

// ❌ BAD - Unreliable
let prompt = "Generate a person with name and age as JSON"
let response = try await session.respond(to: prompt)
// Might get: {"firstName": "John"} when you expect {"name": "John"}
// Might get invalid JSON entirely
// Must parse manually, prone to crashes

The Solution: @Generable

@Generable
struct Person {
    let name: String
    let age: Int
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Generate a person",
    generating: Person.self
)

let person = response.content // Type-safe Person instance!

// WWDC 301:8:14

How It Works (Constrained Decoding)

@Generable macro generates schema at compile-time
Schema passed to model automatically
Model generates tokens constrained by schema
Framework parses output into Swift type
Guaranteed structural correctness — No hallucinated keys, no parsing errors
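
The masking step can be pictured with a toy sketch in plain Swift. This is not how the framework is implemented; `pickNextToken`, the candidate scores, and the allowed-key set are hypothetical names and values chosen only to show the idea of filtering candidates against a schema before choosing one.

```swift
// Toy illustration of constrained decoding: mask out tokens the schema
// does not allow, then pick the best remaining one.
// All names and scores here are hypothetical, not framework API.

/// Returns the highest-scoring candidate that the schema allows,
/// or nil if every candidate is masked out.
func pickNextToken(candidates: [String: Double], allowed: Set<String>) -> String? {
    let unmasked = candidates.filter { allowed.contains($0.key) }
    return unmasked.max { $0.value < $1.value }?.key
}

// The "model" prefers the hallucinated key "firstName"...
let candidates = ["firstName": 0.6, "name": 0.3, "age": 0.1]
// ...but the schema for Person only permits these property names.
let personKeys: Set<String> = ["name", "age"]

let next = pickNextToken(candidates: candidates, allowed: personKeys)
print(next ?? "none")
```

Because "firstName" is removed before selection, the preferred but invalid key can never be emitted, which is the property the list above describes.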

"Constrained decoding masks out invalid tokens. Model can only pick tokens valid according to schema."

Supported Types

Primitives:

String, Int, Float, Double, Bool

Arrays:

@Generable
struct SearchSuggestions {
    var searchTerms: [String]
}

Nested/Composed:

@Generable
struct Itinerary {
    var destination: String
    var days: [DayPlan] // Composed type
}

@Generable
struct DayPlan {
    var activities: [String]
}

// WWDC 286:6:18

Enums with Associated Values:

@Generable
struct NPC {
    let name: String
    let encounter: Encounter

    @Generable
    enum Encounter {
        case orderCoffee(String)
        case wantToTalkToManager(complaint: String)
    }
}

// WWDC 301:10:49

Recursive Types:

@Generable
struct Itinerary {
    var destination: String
    var relatedItineraries: [Itinerary] // Recursive!
}

@Guide Constraints

Control generated values with @Guide:

Natural Language Description:

@Generable
struct NPC {
    @Guide(description: "A full name with first and last")
    let name: String
}

Numeric Ranges:

@Generable
struct Character {
    @Guide(.range(1...10))
    let level: Int
}

// WWDC 301:11:20

Array Count:

@Generable
struct Suggestions {
    @Guide(description: "Suggested search terms", .count(4))
    var searchTerms: [String]
}

// WWDC 286:5:32

Maximum Count:

@Generable
struct Result {
    @Guide(.maximumCount(3))
    let topics: [String]
}

Regex Patterns:

@Generable
struct NPC {
    @Guide(Regex {
        Capture {
            ChoiceOf {
                "Mr"
                "Mrs"
            }
        }
        ". "
        OneOrMore(.word)
    })
    let name: String
}

// Output: {name: "Mrs. Brewster"}

// WWDC 301:13:40

Property Order Matters

Properties generated in declaration order:

@Generable
struct Itinerary {
    var destination: String // Generated first
    var days: [DayPlan]     // Generated second
    var summary: String     // Generated last
}

"You may find model produces best summaries when they're last property."

Why: Later properties can reference earlier ones. Put most important properties first for streaming.

Pattern 3: Streaming with PartiallyGenerated

Use when: Generation takes >1 second and you want progressive UI updates.

The Problem

Without streaming:

// User waits 3-5 seconds seeing nothing
let response = try await session.respond(to: prompt, generating: Itinerary.self)
// Then entire result appears at once

User experience: Feels slow, frozen UI.

The Solution: Streaming

@Generable
struct Itinerary {
    var name: String
    var days: [DayPlan]
}

let stream = session.streamResponse(
    to: "Generate a 3-day itinerary to Mt. Fuji",
    generating: Itinerary.self
)

for try await partial in stream {
    print(partial) // Incrementally updated
}

// WWDC 286:9:40

PartiallyGenerated Type

@Generable macro automatically creates PartiallyGenerated type:

// Compiler generates:
extension Itinerary {
    struct PartiallyGenerated {
        var name: String? // All properties optional!
        var days: [DayPlan]?
    }
}

Why optional: Properties fill in as model generates them.
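
To see why optional properties suit incremental rendering, here is a small framework-free sketch. `PartialItinerary`, `render`, and the snapshot values are hand-written stand-ins invented for illustration; the real partial type is produced by the @Generable macro and delivered by the stream.

```swift
// Hand-written stand-in for a generated partial type.
// Names are hypothetical; the real type comes from @Generable.
struct PartialItinerary: Equatable {
    var name: String?
    var days: [String]?
}

/// Describes what a UI could show for one snapshot,
/// falling back to placeholders for properties that are still nil.
func render(_ partial: PartialItinerary) -> String {
    let title = partial.name ?? "(generating title...)"
    let dayCount = partial.days?.count ?? 0
    return "\(title): \(dayCount) day(s) so far"
}

// Simulated snapshots, in the order a stream might deliver them.
let snapshots = [
    PartialItinerary(name: nil, days: nil),
    PartialItinerary(name: "Mt. Fuji Trip", days: nil),
    PartialItinerary(name: "Mt. Fuji Trip", days: ["Day 1"]),
    PartialItinerary(name: "Mt. Fuji Trip", days: ["Day 1", "Day 2"]),
]

for snapshot in snapshots {
    print(render(snapshot))
}
```

Each snapshot is a complete value on its own, so the UI can simply re-render with whatever has arrived and never needs to track which fields are pending.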

SwiftUI Integration

struct ItineraryView: View {
    let session: LanguageModelSession
    @State private var itinerary: Itinerary.PartiallyGenerated?

    var body: some View {
        VStack {
            if let name = itinerary?.name {
                Text(name)
                    .font(.title)
            }

            if let days = itinerary?.days {
                ForEach(days, id: \.self) { day in
                    DayView(day: day)
                }
            }

            Button("Generate") {
                Task {
                    let stream = session.streamResponse(
                        to: "Generate 3-day itinerary to Tokyo",
                        generating: Itinerary.self
                    )

                    for try await partial in stream {
                        self.itinerary = partial
                    }
                }
            }
        }
    }
}

// WWDC 286:10:05

Animations & Transitions

Add polish:

if let name = itinerary?.name {
    Text(name)
        .transition(.opacity)
}

if let days = itinerary?.days {
    ForEach(days, id: \.self) { day in
        DayView(day: day)
            .transition(.slide)
    }
}

"Get creative with SwiftUI animations to hide latency. Turn waiting into delight."

View Identity

Critical for arrays:

// ✅ GOOD - Stable identity
ForEach(days, id: \.id) { day in
    DayView(day: day)
}

// ❌ BAD - Identity changes, animations break
ForEach(days.indices, id: \.self) { index in
    DayView(day: days[index])
}

Property Order for Streaming UX

// ✅ GOOD - Title appears first, summary last
@Generable
struct Itinerary {
    var name: String     // Shows first
    var days: [DayPlan]  // Shows second
    var summary: String  // Shows last (can reference days)
}

// ❌ BAD - Summary before content
@Generable
struct Itinerary {
    var summary: String  // Doesn't make sense before days!
    var days: [DayPlan]
}

// WWDC 286:11:00

When to Use Streaming

✅ Use for:

Itineraries
Stories
Long descriptions
Multi-section content

❌ Skip for:

Simple Q&A (< 1 sentence)
Quick classification
Content tagging

Time Cost

Implementation: 15-20 minutes with SwiftUI
Polish (animations): +5-10 minutes

Pattern 4: Tool Calling

Use when: Model needs external data (weather, locations, contacts) to generate response.

The Problem

// ❌ BAD - Model will hallucinate
let response = try await session.respond(
    to: "What's the temperature in Cupertino?"
)
// Output: "It's about 72°F" (completely made up!)

Why: 3B parameter model doesn't have real-time weather data.

The Solution: Tool Calling

Let model autonomously call your code to fetch external data.

import FoundationModels
import WeatherKit
import CoreLocation

struct GetWeatherTool: Tool {
    let name = "getWeather"
    let description = "Retrieve latest weather for a city"

    @Generable
    struct Arguments {
        @Guide(description: "The city to fetch weather for")
        var city: String
    }

    func call(arguments: Arguments) async throws -> ToolOutput {
        let places = try await CLGeocoder().geocodeAddressString(arguments.city)
        let weather = try await WeatherService.shared.weather(for: places.first!.location!)
        let temp = weather.currentWeather.temperature.value

        return ToolOutput("\(arguments.city)'s temperature is \(temp) degrees.")
    }
}

// WWDC 286:13:42

Attaching Tool to Session

let session = LanguageModelSession(
    tools: [GetWeatherTool()],
    instructions: "Help user with weather forecasts."
)

let response = try await session.respond(
    to: "What's the temperature in Cupertino?"
)

print(response.content) // "It's 71°F in Cupertino!"

// WWDC 286:15:03

Model autonomously:

Recognizes it needs weather data
Calls GetWeatherTool
Receives real temperature
Incorporates into natural response

Tool Protocol Requirements

protocol Tool {
    var name: String { get }
    var description: String { get }

    associatedtype Arguments: Generable

    func call(arguments: Arguments) async throws -> ToolOutput
}

Name: Short, verb-based (e.g. getWeather, findContact)
Description: One sentence explaining purpose
Arguments: Must be @Generable (guarantees valid input)
call: Your code that fetches data, processes it, and returns the result

ToolOutput

Two forms:

Natural language (String):

return ToolOutput("Temperature is 71°F")

Structured (GeneratedContent):

let content = GeneratedContent(properties: ["temperature": 71])
return ToolOutput(content)

Multiple Tools Example

let session = LanguageModelSession(
    tools: [
        GetWeatherTool(),
        FindRestaurantTool(),
        FindHotelTool()
    ],
    instructions: "Plan travel itineraries."
)

let response = try await session.respond(
    to: "Create a 2-day plan for Tokyo"
)

// Model autonomously decides:
// - Calls FindRestaurantTool for dining
// - Calls FindHotelTool for accommodation
// - Calls GetWeatherTool to suggest activities

Stateful Tools

Tools can maintain state across calls:

class FindContactTool: Tool {
    let name = "findContact"
    let description = "Find contact from age generation"

    var pickedContacts = Set<String>() // State!

    @Generable
    struct Arguments {
        let generation: Generation

        @Generable
        enum Generation {
            case babyBoomers
            case genX
            case millennial
            case genZ
        }
    }

    func call(arguments: Arguments) async throws -> ToolOutput {
        // Use Contacts API
        var contacts = fetchContacts(for: arguments.generation)

        // Remove already picked
        contacts.removeAll(where: { pickedContacts.contains($0.name) })

        guard let picked = contacts.randomElement() else {
            return ToolOutput("No more contacts")
        }

        pickedContacts.insert(picked.name) // Update state
        return ToolOutput(picked.name)
    }
}

// WWDC 301:21:55

Why class, not struct: Need to mutate state from call method.

Tool Calling Flow

Session initialized with tools
User prompt: "What's Tokyo's weather?"
Model analyzes: "Need weather data"
Model generates tool call: getWeather(city: "Tokyo")
Framework calls your tool's call() method
Your tool fetches real data from API
Tool output inserted into transcript
Model generates final response using tool output
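
The eight steps above can be traced with a toy simulation in plain Swift. This is only a sketch of the control flow: `ToyToolCall`, `runToolCall`, and the canned weather string are hypothetical, and in a real app the framework performs these steps for you.

```swift
// Toy walk-through of the tool-calling flow.
// `ToyToolCall`, `runToolCall`, and the canned data are hypothetical.

struct ToyToolCall {
    let name: String
    let argument: String
}

/// Looks up the requested tool, runs it, and appends its output to the transcript.
/// Unknown tool names are rejected instead of being executed.
func runToolCall(
    _ call: ToyToolCall,
    tools: [String: (String) -> String],
    transcript: inout [String]
) -> Bool {
    guard let tool = tools[call.name] else { return false }
    let output = tool(call.argument)
    transcript.append("tool(\(call.name)): \(output)")
    return true
}

// Step 1: session initialized with tools.
let tools: [String: (String) -> String] = [
    "getWeather": { city in "\(city): 18 degrees (canned value)" }
]
// Step 2: the user prompt enters the transcript.
var transcript = ["user: What's Tokyo's weather?"]
// Steps 3-4: the model decides it needs data and emits a tool call.
let call = ToyToolCall(name: "getWeather", argument: "Tokyo")
// Steps 5-7: the tool runs and its output lands in the transcript.
let handled = runToolCall(call, tools: tools, transcript: &transcript)
// Step 8: the model would now answer using the latest transcript entry.
print(handled, transcript)
```

The lookup by name mirrors the guarantee that only registered tools can be called, while the transcript append shows why tool output counts toward the context window.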

"Model decides autonomously when and how often to call tools. Can call multiple tools per request, even in parallel."

Tool Calling Guarantees

✅ Guaranteed:

Valid tool names (no hallucinated tools)
Valid arguments (via @Generable)
Structural correctness

❌ Not guaranteed:

Tool will be called (model might not need it)
Specific argument values (model decides based on context)

Real-World Example: Itinerary Planner

import FoundationModels
import MapKit

struct FindPointsOfInterestTool: Tool {
    let name = "findPointsOfInterest"
    let description = "Find restaurants, museums, parks near a landmark"

    let landmark: String

    @Generable
    struct Arguments {
        let category: Category

        @Generable
        enum Category {
            case restaurant
            case museum
            case park
            case marina
        }
    }

    func call(arguments: Arguments) async throws -> ToolOutput {
        // Use MapKit
        let request = MKLocalSearch.Request()
        request.naturalLanguageQuery = "\(arguments.category) near \(landmark)"

        let search = MKLocalSearch(request: request)
        let response = try await search.start()

        let names = response.mapItems.prefix(5).map { $0.name ?? "" }
        return ToolOutput(names.joined(separator: ", "))
    }
}

From WWDC 259 summary: "Tool fetches points of interest from MapKit. Model uses world knowledge to determine promising categories."

When to Use Tools

✅ Use for:

Weather data
Map/location queries
Contact information
Calendar events
External APIs

❌ Don't use for:

Data the model already has
Information in prompt/instructions
Simple calculations (model can do these)

Time Cost

Simple tool: 20-25 minutes
Complex tool with state: 30-40 minutes

Pattern 5: Context Management

Use when: Multi-turn conversations that might exceed 4096 token limit.

The Problem

// Long conversation...
for i in 1...100 {
    let response = try await session.respond(to: "Question \(i)")
    // Eventually...
    // Error: exceededContextWindowSize
}

Context window: 4096 tokens (input + output combined)
Average: ~3 characters per token in English

Rough calculation:

4096 tokens ≈ 12,000 characters
≈ 2,000-3,000 words total

Long conversation or verbose prompts/responses → Exceed limit
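
The rough calculation above can be turned into a quick pre-flight check. The sketch below uses only the ~3 characters per token rule of thumb from this section; `estimatedTokens`, `likelyFits`, and the reserved output budget are hypothetical helpers, and real token counts depend on the tokenizer, so treat the result as an estimate.

```swift
// Rough token budgeting using the ~3 characters per token rule of thumb.
// Heuristic only; real token counts depend on the tokenizer.

let contextWindowTokens = 4096
let charactersPerToken = 3

/// Estimates tokens by rounding the character count up to whole tokens.
func estimatedTokens(for text: String) -> Int {
    (text.count + charactersPerToken - 1) / charactersPerToken
}

/// Returns true if the transcript plus a reserved output budget
/// is estimated to fit in the context window.
func likelyFits(transcript: [String], reservedForOutput: Int) -> Bool {
    let used = transcript.reduce(0) { $0 + estimatedTokens(for: $1) }
    return used + reservedForOutput <= contextWindowTokens
}

// 30 entries of 300 characters each: about 100 tokens apiece, 3000 in total.
let history = [String](repeating: String(repeating: "a", count: 300), count: 30)
print(estimatedTokens(for: history[0]))                          // 100
print(likelyFits(transcript: history, reservedForOutput: 500))   // true
print(likelyFits(transcript: history, reservedForOutput: 1500))  // false
```

A check like this cannot replace handling exceededContextWindowSize, but it can prompt condensing the transcript before the error occurs.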

Handling Context Overflow

Basic: Start fresh session

var session = LanguageModelSession()

do {
    let response = try await session.respond(to: prompt)
    print(response.content)
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // New session, no history
    session = LanguageModelSession()
}

// WWDC 301:3:37

Problem: Loses entire conversation history.

Better: Condense Transcript

var session = LanguageModelSession()

do {
    let response = try await session.respond(to: prompt)
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // New session with condensed history
    session = condensedSession(from: session)
}

func condensedSession(from previous: LanguageModelSession) -> LanguageModelSession {
    let allEntries = previous.transcript.entries
    var condensedEntries = [Transcript.Entry]()

    // Always include first entry (instructions)
    if let first = allEntries.first {
        condensedEntries.append(first)

        // Include last entry (most recent context)
        if allEntries.count > 1, let last = allEntries.last {
            condensedEntries.append(last)
        }
    }

    let condensedTranscript = Transcript(entries: condensedEntries)
    return LanguageModelSession(transcript: condensedTranscript)
}

// WWDC 301:3:55

Why this works:

Instructions always preserved
Recent context retained
Total tokens drastically reduced
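
The keep-first-and-last rule is easy to check in isolation. The generic helper below is a hypothetical, framework-free restatement of that rule (`condense` is not part of Foundation Models); with the real API the elements would be Transcript.Entry values.

```swift
/// Keeps the first entry (instructions) and the last entry (most recent turn).
/// Generic so the rule can be tried without the framework.
func condense<Entry>(_ entries: [Entry]) -> [Entry] {
    guard let first = entries.first else { return [] }
    guard entries.count > 1, let last = entries.last else { return [first] }
    return [first, last]
}

print(condense(["instructions", "turn 1", "turn 2", "turn 3"]))
// ["instructions", "turn 3"]
print(condense(["instructions"]))
// ["instructions"]
```

The single-entry case matters: without the count check, a transcript that holds only the instructions would have that entry duplicated.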

Advanced: Summarize Middle Entries

For long conversations where recent context isn't enough:

// Marked async throws because it calls the model to produce the summary.
func condensedSession(from previous: LanguageModelSession) async throws -> LanguageModelSession {
    let entries = previous.transcript.entries

    guard entries.count > 3 else {
        return LanguageModelSession(transcript: previous.transcript)
    }

    // Keep first (instructions) and last (recent)
    var condensedEntries = [entries.first!]

    // Summarize middle entries
    let middleEntries = Array(entries[1..<entries.count-1])
    let summaryPrompt = """
        Summarize this conversation in 2-3 sentences:
        \(middleEntries.map { $0.content }.joined(separator: "\n"))
        """

    // Use Foundation Models itself to summarize!
    let summarySession = LanguageModelSession()
    let summary = try await summarySession.respond(to: summaryPrompt)

    // Simplified: how an entry's text is read and constructed depends on the SDK version
    condensedEntries.append(Transcript.Entry(content: summary.content))
    condensedEntries.append(entries.last!)

    return LanguageModelSession(transcript: Transcript(entries: condensedEntries))
}

"You could summarize parts of transcript with Foundation Models itself."

Preventing Context Overflow

Keep prompts concise:

// ❌ BAD
let prompt = """
    I want you to generate a comprehensive detailed analysis of this article
    with multiple sections including summary, key points, sentiment analysis,
    main arguments, counter arguments, logical fallacies, and conclusions...
    """

// ✅ GOOD
let prompt = "Summarize this article's key points"

Use tools for data: Instead of putting entire dataset in prompt, use tools to fetch on-demand.
Break complex tasks into steps:

// ❌ BAD - One massive generation
let response = try await session.respond(
    to: "Create 7-day itinerary with hotels, restaurants, activities..."
)

// ✅ GOOD - Multiple smaller generations
let overview = try await session.respond(to: "Create high-level 7-day plan")
for day in 1...7 {
    let details = try await session.respond(to: "Detail activities for day \(day)")
}

Monitoring Context Usage

"Each token in instructions and prompt adds latency. Longer outputs take longer."

Use Instruments (Foundation Models template) to:

See token counts
Identify verbose prompts
Optimize context usage

Time Cost

Basic overflow handling: 5-10 minutes
Condensing strategy: 15-20 minutes
Advanced summarization: 30-40 minutes

Pattern 6: Sampling & Generation Options

Use when: You need control over output randomness/determinism.

Understanding Sampling

Model generates output one token at a time:

Creates probability distribution for next token
Samples from distribution
Picks token
Repeats

Default: Random sampling → Different output each time
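
A toy version of this loop's core step helps show what greedy and temperature change. The sketch below is a generic illustration of softmax sampling, not the framework's implementation; the three-word vocabulary, the logits, and the function names are made up for the example.

```swift
import Foundation

// Toy next-token selection over a tiny vocabulary.
// The vocabulary and logits are made up for illustration.

/// Converts raw scores into probabilities; lower temperature sharpens the distribution.
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Greedy decoding: always take the most probable token.
func greedyIndex(_ probabilities: [Double]) -> Int? {
    probabilities.indices.max { probabilities[$0] < probabilities[$1] }
}

/// Random sampling: pick an index in proportion to its probability.
func sampleIndex(_ probabilities: [Double]) -> Int? {
    guard !probabilities.isEmpty else { return nil }
    var remaining = Double.random(in: 0..<1)
    for (index, p) in probabilities.enumerated() {
        remaining -= p
        if remaining < 0 { return index }
    }
    return probabilities.count - 1 // fallback for floating-point rounding
}

let vocabulary = ["latte", "mocha", "tea"]
let logits = [2.0, 1.0, 0.5]

let focused = softmax(logits, temperature: 0.5)
let varied = softmax(logits, temperature: 2.0)

if let g = greedyIndex(focused), let s = sampleIndex(varied) {
    print("greedy:", vocabulary[g], "sampled:", vocabulary[s])
}
```

Greedy selection always returns the same index for the same distribution, while the sampled index varies from run to run; raising the temperature flattens the distribution, so less likely tokens are picked more often.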

Deterministic Output (Greedy)

let response = try await session.respond(
    to: prompt,
    options: GenerationOptions(sampling: .greedy)
)

// WWDC 301:6:14

Use cases:

Repeatable demos
Testing/debugging
Consistent results required

Caveat: Only holds for same model version. OS updates may change output.

Temperature Control

Low variance (conservative, focused):

let response = try await session.respond(
    to: prompt,
    options: GenerationOptions(temperature: 0.5)
)

High variance (creative, diverse):

let response = try await session.respond(
    to: prompt,
    options: GenerationOptions(temperature: 2.0)
)

// WWDC 301:6:14

Temperature scale:

0.1-0.5: Very focused, predictable
1.0 (default): Balanced
1.5-2.0: Creative, varied

Example use cases:

Low temp: Fact extraction, classification
High temp: Creative writing, brainstorming

When to Adjust Sampling

✅ Greedy for:

Unit tests
Demos
Consistency critical

✅ Low temperature for:

Factual tasks
Classification
Extraction

✅ High temperature for:

Creative content
Story generation
Varied NPC dialog

Time Cost

Implementation: 2-3 minutes (one line change)

Pressure Scenarios

Scenario 1: "Just Use ChatGPT API"

Context: You're implementing a new AI feature. PM suggests using ChatGPT API for "better results."

Pressure signals:

👔 Authority: PM outranks you
💸 Existing integration: Team already uses OpenAI for other features
⏰ Speed: "ChatGPT is proven, Foundation Models is new"

Rationalization traps:

"PM knows best"
"ChatGPT gives better answers"
"Faster to implement with existing code"

Why this fails:

Privacy violation: User data sent to external server

Medical notes, financial docs, personal messages
Violates user expectation of on-device privacy
Potential GDPR/privacy law issues

Cost: Every API call costs money

Foundation Models is free
Scale to millions of users = massive costs

Offline unavailable: Requires internet

Airplane mode, poor signal → feature broken
Foundation Models works offline

Latency: Network round-trip adds 500-2000ms

Foundation Models: On-device, <100ms startup

When ChatGPT IS appropriate:

World knowledge required (e.g. "Who is the president of France?")
Complex reasoning (multi-step logic, math proofs)
Very long context (>4096 tokens)

Mandatory response:

"I understand ChatGPT delivers great results for certain tasks. However, for this feature, Foundation Models is the right choice for three critical reasons:

Privacy: This feature processes [medical notes/financial data/personal content]. Users expect this data stays on-device. Sending to external API violates that trust and may have compliance issues.
Cost: At scale, ChatGPT API calls cost $X per 1000 requests. Foundation Models is free. For Y million users, that's $Z annually we can avoid.
Offline capability: Foundation Models works without internet. Users in airplane mode or with poor signal still get full functionality.

When to use ChatGPT: If this feature required world knowledge or complex reasoning, ChatGPT would be the right choice. But this is [summarization/extraction/classification], which is exactly what Foundation Models is optimized for.

Time estimate: Foundation Models implementation: 15-20 minutes. Privacy compliance review for ChatGPT: 2-4 weeks."

Time saved: Privacy compliance review vs correct implementation: 2-4 weeks vs 20 minutes
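
The cost point in the response above leaves $X, Y, and $Z as placeholders. A minimal sketch of the arithmetic follows; every figure in it is hypothetical and must be replaced with your provider's actual pricing and your own usage data, and `annualCost` is an illustrative helper, not an existing API.

```swift
// Back-of-envelope API cost estimate. Every number below is a placeholder.

/// Annual cost in dollars for a server API priced per request.
func annualCost(users: Int, requestsPerUserPerDay: Double, dollarsPer1000Requests: Double) -> Double {
    let requestsPerYear = Double(users) * requestsPerUserPerDay * 365
    return requestsPerYear / 1000 * dollarsPer1000Requests
}

// Hypothetical inputs: 1 million users, 2 requests a day each, $1.00 per 1000 requests.
let estimate = annualCost(users: 1_000_000, requestsPerUserPerDay: 2, dollarsPer1000Requests: 1.0)
print(estimate) // 730000.0
```

Filling in real numbers before the conversation makes the cost argument concrete instead of rhetorical.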

Scenario 2: "Parse JSON Manually"

Context: Teammate suggests prompting for JSON, parsing with JSONDecoder. Claims it's "simple and familiar."

Pressure signals:

⏰ Deadline: Ship in 2 days
📚 Familiarity: "Everyone knows JSON"
🔧 Existing code: Already have JSON parsing utilities

Rationalization traps:

"JSON is standard"
"We parse JSON everywhere already"
"Faster than learning new API"

Why this fails:

Hallucinated keys: Model outputs {firstName: "John"} when you expect {name: "John"}

JSONDecoder crashes: keyNotFound
No compile-time safety

Invalid JSON: Model might output:

Here's the person: {name: "John", age: 30}

Not valid JSON (preamble text)
Parsing fails

No type safety: Manual string parsing, prone to errors

Real-world example:

// ❌ BAD - Will fail
let prompt = "Generate a person with name and age as JSON"
let response = try await session.respond(to: prompt)

// Model outputs: {"firstName": "John Smith", "years": 30}
// Your code expects: {"name": ..., "age": ...}
// CRASH: keyNotFound(name)

Debugging time: 2-4 hours finding edge cases, writing parsing hacks

Correct approach:

// ✅ GOOD - 15 minutes, guaranteed to work
@Generable
struct Person {
    let name: String
    let age: Int
}

let response = try await session.respond(
    to: "Generate a person",
    generating: Person.self
)
// response.content is type-safe Person, always valid

Mandatory response:

"I understand JSON parsing feels familiar, but for LLM output, @Generable is objectively better for three technical reasons:

Constrained decoding guarantees structure: Model can ONLY generate valid Person instances. Impossible to get wrong keys, invalid JSON, or missing fields.
No parsing code needed: Framework handles parsing automatically. Zero chance of parsing bugs.
Compile-time safety: If we change Person struct, compiler catches all issues. Manual JSON parsing = runtime crashes.

Real cost: Manual JSON approach will hit edge cases. Debugging 'keyNotFound' crashes takes 2-4 hours. @Generable implementation takes 15 minutes and has zero parsing bugs.

Analogy: This is like choosing Swift over Objective-C for new code. Both work, but Swift's type safety prevents entire categories of bugs."

Time saved: 2-4 hours of debugging vs 15 minutes of correct implementation

Scenario 3: "One Big Prompt" (~1000 words)

Context: Feature requires extracting name, date, amount, category from invoice. Teammate suggests one prompt: "Extract all information."

Pressure signals:

🏗️ Architecture: "Simpler with one API call"
⏰ Speed: "Why make it complicated?"
📉 Complexity: "More prompts = more code"

Rationalization traps:

"Simpler is better"
"One prompt means less code"
"Model is smart enough"

Why this fails:

Context overflow: Complex prompt + large invoice → Exceeds 4096 tokens
Poor results: Model tries to do too much at once, quality suffers
Slow generation: One massive response takes 5-8 seconds
All-or-nothing: If one field fails, entire generation fails

Better approach: Break into tasks + use tools

// ❌ BAD - One massive prompt
let prompt = """
Extract from this invoice:
- Vendor name
- Invoice date
- Total amount
- Line items (description, quantity, price each)
- Payment terms
- Due date
- Tax amount
...
"""
// 5-8 seconds, poor quality, might exceed context

// ✅ GOOD - Structured extraction with focused prompts
@Generable
struct InvoiceBasics {
    let vendor: String
    let date: String
    let amount: Double
}

let basics = try await session.respond(
    to: "Extract vendor, date, and amount",
    generating: InvoiceBasics.self
)
// 0.5 seconds, high quality

@Generable
struct LineItem {
    let description: String
    let quantity: Int
    let price: Double
}

let items = try await session.respond(
    to: "Extract line items",
    generating: [LineItem].self
)
// 1 second, high quality

// Total: 1.5 seconds, better quality, graceful partial failures

Mandatory response:

"I understand the appeal of one simple API call. However, this specific task requires a different approach:

Context limits: Invoice + complex extraction prompt will likely exceed 4096 token limit. Multiple focused prompts stay well under limit.
Better quality: Model performs better with focused tasks. 'Extract vendor name' gets 95%+ accuracy. 'Extract everything' gets 60-70%.
Faster perceived performance: Multiple prompts with streaming show progressive results. Users see vendor name in 0.5s, not waiting 5s for everything.
Graceful degradation: If line items fail, we still have basics. All-or-nothing approach means total failure.

Implementation: Breaking into 3-4 focused extractions takes 30 minutes. One big prompt takes 2-3 hours debugging why it hits context limit and produces poor results."

Time saved: 2-3 hours debugging vs 30 minutes proper design
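The graceful-degradation point in this scenario can be sketched in plain Swift. This is a minimal illustration, not the framework's API: InvoiceExtraction, ExtractionError, and extractInvoice are hypothetical names, and the closures stand in for the focused session.respond(...) calls shown earlier. Each step is wrapped in its own try?, so a failure leaves one field nil instead of discarding the whole result.

```swift
// Hypothetical result container: every part is optional so that one
// failed step does not discard the others.
struct InvoiceExtraction {
    var vendor: String?
    var lineItemCount: Int?
}

// Hypothetical error used by the stub below.
struct ExtractionError: Error {}

/// Runs each focused extraction independently. A thrown error in one step
/// leaves that field nil instead of failing the whole extraction.
func extractInvoice(
    basics: () async throws -> String,
    lineItems: () async throws -> [String]
) async -> InvoiceExtraction {
    var result = InvoiceExtraction()
    result.vendor = try? await basics()
    result.lineItemCount = (try? await lineItems())?.count
    return result
}

// Usage with stub closures standing in for real model calls:
// the basics step succeeds, the line-item step fails.
let partial = await extractInvoice(
    basics: { "Acme Corp" },
    lineItems: { throw ExtractionError() }
)
// partial.vendor is "Acme Corp"; partial.lineItemCount is nil
```

The UI can then show whatever fields are present and offer a retry for the missing ones, which a single all-or-nothing prompt cannot do.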

Performance Optimization

Prewarm Session (~200 words)

Problem: First generation takes 1-2 seconds just to load model.

Solution: Create session before user interaction.

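import Combine
import FoundationModels

final class ViewModel: ObservableObject {
    private let session: LanguageModelSession

    init() {
        // Create the session on init, not when the user taps the button.
        // Creating it synchronously means generate() never sees a missing session.
        session = LanguageModelSession(instructions: "...")
        // Ask the framework to load model resources ahead of the first request
        session.prewarm()
    }

    func generate(prompt: String) async throws -> String {
        let response = try await session.respond(to: prompt)
        return response.content
    }
}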

"Prewarming session before user interaction reduces initial latency."

Time saved: 1-2 seconds off first generation

includeSchemaInPrompt: false (~200 words)

Problem: The @Generable schema is inserted into the prompt, which increases the token count.

Solution: For subsequent requests with same schema, skip insertion.

let firstResponse = try await session.respond(
    to: "Generate first person",
    generating: Person.self
    // Schema inserted automatically
)

// Subsequent requests with SAME schema
// includeSchemaInPrompt is a parameter of respond, not of GenerationOptions
let secondResponse = try await session.respond(
    to: "Generate another person",
    generating: Person.self,
    includeSchemaInPrompt: false
)

"Setting includeSchemaInPrompt to false decreases token count and latency for subsequent requests."

When to use: Multi-turn with same @Generable type

Time saved: 10-20% latency reduction per request

Property Order for Streaming UX (~200 words)

Problem: User waits for entire generation.

Solution: Put important properties first, stream to show early.

// ✅ GOOD - Title shows immediately
@Generable
struct Article {
    var title: String     // Shows in 0.2s
    var summary: String   // Shows in 0.8s
    var fullText: String  // Shows in 2.5s
}

// ❌ BAD - Wait for everything
@Generable
struct Article {
    var fullText: String  // User waits 2.5s
    var title: String
    var summary: String
}

UX impact: Perceived latency drops from 2.5s to 0.2s

Foundation Models Instrument (~100 words)

Use Instruments app with Foundation Models template to:

Profile latency of each request
See token counts (input/output)
Identify optimization opportunities
Quantify improvements

"New Instruments profiling template lets you observe areas of optimization and quantify improvements."

Access: Instruments → Create → Foundation Models template

Checklist

Before shipping Foundation Models features:

Required Checks

Availability checked before creating session
Using @Generable for structured output (not manual JSON)
Handling context overflow (exceededContextWindowSize)
Handling guardrail violations (guardrailViolation)
Handling unsupported language (unsupportedLanguageOrLocale)
Streaming for long generations (>1 second)
Not blocking UI (using Task {} for async)
Tools for external data (not prompting for weather/locations)
Prewarmed session if latency-sensitive

Best Practices

Instructions are concise (not verbose)
Never interpolating user input into instructions
Property order optimized for streaming UX
Using appropriate temperature/sampling
Tested on real device (not just simulator)
Profiled with Instruments (Foundation Models template)
Error handling shows graceful UI messages
Tested offline (airplane mode)
Tested with long conversations (context handling)

Model Capability

Not using for world knowledge
Not using for complex reasoning
Use case is: summarization, extraction, classification, or generation
Have fallback if unavailable (show message, disable feature)

Resources

WWDC: 286, 259, 301

Skills: axiom-foundation-models-diag, axiom-foundation-models-ref

Last Updated: 2025-12-03
Version: 1.0.0
Target: iOS 26+, macOS 26+, iPadOS 26+, visionOS 26+
