groq-api

Groq API integration for building AI-powered applications with ultra-fast LLM inference. Use when working with Groq's Chat Completions API, Python SDK (groq), TypeScript SDK (groq-sdk), tool use/function calling, vision/image processing, audio transcription with Whisper, streaming responses, text-to-speech, content moderation with Llama Guard, batch processing, or any Groq API integration task. Triggers on mentions of Groq, GroqCloud, or fast LLM inference needs.

Install the "groq-api" skill with: npx skills add diskd-ai/groq-api/diskd-ai-groq-api-groq-api

Groq API

Build applications with Groq's ultra-fast LLM inference (300-1000+ tokens/sec).

Quick Start

Installation

# Python
pip install groq

# TypeScript/JavaScript
npm install groq-sdk

Environment Setup

export GROQ_API_KEY=<your-api-key>

Basic Chat Completion

Python:

from groq import Groq

client = Groq()  # Uses GROQ_API_KEY env var

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

TypeScript:

import Groq from "groq-sdk";

const client = new Groq();

const response = await client.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);

Model Selection

Use Case        | Model                                      | Notes
Fast + cheap    | llama-3.1-8b-instant                       | Best for simple tasks
Balanced        | llama-3.3-70b-versatile                    | Quality/cost balance
Highest quality | openai/gpt-oss-120b                        | Built-in tools + reasoning
Agentic         | groq/compound                              | Web search + code exec
Reasoning       | openai/gpt-oss-20b                         | Fast reasoning (low/med/high)
Vision/OCR      | meta-llama/llama-4-scout-17b-16e-instruct  | Image understanding
Audio STT       | whisper-large-v3-turbo                     | Transcription
TTS             | playai-tts                                 | Text-to-speech

See references/models.md for full model list and pricing.

Common Patterns

Streaming Responses

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

System Messages

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
)

Async Client (Python)

import asyncio
from groq import AsyncGroq

async def main():
    client = AsyncGroq()
    response = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return response.choices[0].message.content

print(asyncio.run(main()))

JSON Mode

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "List 3 colors as JSON array"}],
    response_format={"type": "json_object"}
)
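
JSON mode guarantees syntactically valid JSON but not a particular shape, so parse the string and validate the keys yourself. A minimal sketch (the exact keys depend on the prompt and model):

import json

data = json.loads(response.choices[0].message.content)
print(data)  # e.g. {"colors": ["red", "green", "blue"]} -- actual keys vary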

Structured Outputs (JSON Schema)

Force output to match a JSON schema. Two modes are available:

Mode          | Guarantee              | Models
strict: true  | 100% schema compliance | openai/gpt-oss-20b, openai/gpt-oss-120b
strict: false | Best-effort compliance | All supported models

Strict Mode (guaranteed compliance):

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"],
                "additionalProperties": False
            }
        }
    }
)

With Pydantic:

import json

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Extract: John is 30"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": Person.model_json_schema()
        }
    }
)
person = Person.model_validate(json.loads(response.choices[0].message.content))

See references/structured-outputs.md for schema requirements, validation libraries, and examples.

Audio

Transcription (Speech-to-Text)

with open("audio.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        language="en",  # Optional: ISO-639-1 code
        response_format="verbose_json",  # json, text, verbose_json
        timestamp_granularities=["word", "segment"]
    )
print(transcription.text)
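
With response_format="verbose_json", the result also carries timing metadata. A sketch of reading segment timestamps, assuming the OpenAI-compatible verbose_json shape (access style may differ by SDK version):

segments = transcription.segments  # list of dicts in current SDK versions
for seg in segments:
    # Each segment carries start/end times in seconds plus its text
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")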

Translation (to English)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f
    )
print(translation.text)  # English text

Text-to-Speech

response = client.audio.speech.create(
    model="playai-tts",
    input="Hello, world!",
    voice="Fritz-PlayAI",
    response_format="wav",  # flac, mp3, mulaw, ogg, wav
    speed=1.0  # 0.5 to 5
)
response.write_to_file("output.wav")

See references/audio.md for the complete audio API reference, including file handling, metadata fields, and prompting guidelines.

Vision

Process images with Llama 4 multimodal models. Supports up to 5 images per request.

Models: meta-llama/llama-4-scout-17b-16e-instruct (faster), meta-llama/llama-4-maverick-17b-128e-instruct (higher quality)

Image from URL

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)

Local Image (Base64)

import base64

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('photo.jpg')}"}}
        ]
    }]
)

OCR / Extract Data as JSON

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text and data as JSON"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    response_format={"type": "json_object"}
)

See references/vision.md for multi-image, tool use with images, and multi-turn conversations.

Tool Use

For tool calling patterns and examples, see references/tool-use.md.

Quick example:

import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        args = json.loads(tc.function.arguments)
        # Execute function and continue conversation
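
To complete the round trip, execute the function, append the assistant message and a tool-role result, then call the model again. A minimal sketch, assuming a hypothetical local get_weather implementation:

def get_weather(location: str) -> str:
    # Hypothetical stand-in; call a real weather service here
    return f"Sunny, 22C in {location}"

messages = [{"role": "user", "content": "Weather in Paris?"}]
msg = response.choices[0].message
messages.append(msg)  # assistant turn with tool_calls; use msg.model_dump() if your SDK version needs dicts

for tc in msg.tool_calls:
    args = json.loads(tc.function.arguments)
    messages.append({
        "role": "tool",
        "tool_call_id": tc.id,
        "content": get_weather(**args),
    })

final = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages,
)
print(final.choices[0].message.content)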

Built-In Tools (Agentic)

Use groq/compound or openai/gpt-oss-120b for built-in web search and code execution:

response = client.chat.completions.create(
    model="groq/compound",
    messages=[{"role": "user", "content": "Search for latest Python news"}]
)
# Model automatically uses web search

MCP (Remote Tools)

Connect to third-party MCP servers for tools like Stripe, GitHub, web scraping. Use the Responses API:

import os

import openai

client = openai.OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

response = client.responses.create(
    model="openai/gpt-oss-120b",
    input="What models are trending on Huggingface?",
    tools=[{
        "type": "mcp",
        "server_label": "Huggingface",
        "server_url": "https://huggingface.co/mcp"
    }]
)

See references/tool-use.md for MCP configuration and popular servers.

Reasoning Models

Control how models think through complex problems.

Models: openai/gpt-oss-20b, openai/gpt-oss-120b (low/medium/high), qwen/qwen3-32b (none/default)

GPT-OSS with Reasoning Effort

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "How many r's in strawberry?"}],
    reasoning_effort="high",  # low, medium, high
    temperature=0.6,
    max_completion_tokens=1024
)

print(response.choices[0].message.content)
print("Reasoning:", response.choices[0].message.reasoning)

Qwen3 with Parsed Reasoning

response = client.chat.completions.create(
    model="qwen/qwen3-32b",
    messages=[{"role": "user", "content": "Solve: x + 5 = 12"}],
    reasoning_format="parsed"  # raw, parsed, hidden
)

print("Answer:", response.choices[0].message.content)
print("Reasoning:", response.choices[0].message.reasoning)

Hide Reasoning (GPT-OSS)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What is 15% of 80?"}],
    include_reasoning=False  # Hide reasoning in response
)

See references/reasoning.md for streaming, tool use with reasoning, and best practices.

Batch Processing

For high-volume async processing (24h-7d completion window):

# 1. Create JSONL file with requests
# 2. Upload file
# 3. Create batch
batch = client.batches.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# 4. Check status
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
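
Steps 1-2 expanded: each JSONL line is a self-contained request with a custom_id, and the file is uploaded with purpose "batch". A sketch assuming the OpenAI-compatible batch request shape:

import json

requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "llama-3.1-8b-instant",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Summarize doc A", "Summarize doc B"])
]

# 1. Create JSONL file with requests
with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# 2. Upload file
with open("batch.jsonl", "rb") as f:
    file_id = client.files.create(file=f, purpose="batch").id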

See references/api-reference.md for full batch API details.

Prompt Caching

Prompt caching automatically reduces latency and cuts input-token cost by 50% for repeated prompt prefixes. No code changes required.

Supported models: moonshotai/kimi-k2-instruct-0905, openai/gpt-oss-20b, openai/gpt-oss-120b, openai/gpt-oss-safeguard-20b

How it works:

  • Place static content (system prompts, tools, examples) at the beginning
  • Place dynamic content (user queries) at the end
  • Cache automatically matches prefixes and applies 50% discount
  • Cache expires after 2 hours without use

Track cache usage:

response = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct-0905",
    messages=[{"role": "system", "content": large_system_prompt}, ...]
)

cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached}")  # 50% discount applied to these

See references/prompt-caching.md for optimization strategies and examples.

Content Moderation

Detect and filter harmful content using safeguard models.

Llama Guard 4

General content safety classification. Returns safe, or unsafe followed by a hazard category code (e.g. S1) on the next line.

response = client.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    messages=[{"role": "user", "content": user_input}]
)

if response.choices[0].message.content.startswith("unsafe"):
    # Block or handle unsafe content
    pass
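
A sketch of splitting the verdict into a flag and category, assuming the two-line format described above:

verdict = response.choices[0].message.content.strip()
if verdict.startswith("unsafe"):
    lines = verdict.splitlines()
    # Second line holds the hazard category code, e.g. "S1"
    category = lines[1] if len(lines) > 1 else "unknown"
    print(f"Blocked: category {category}")
else:
    print("Content is safe")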

GPT-OSS Safeguard 20B

Prompt injection detection with custom policies. Returns structured JSON.

response = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": injection_detection_policy},
        {"role": "user", "content": user_input}
    ]
)
# Returns: {"violation": 1, "category": "Direct Override", "rationale": "..."}
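
Because the model returns structured JSON, parse it and branch on the violation flag. A minimal sketch, assuming the response shape shown in the comment above:

import json

verdict = json.loads(response.choices[0].message.content)
if verdict.get("violation") == 1:
    print(f"Flagged: {verdict.get('category')} - {verdict.get('rationale')}")
else:
    pass  # no policy violation detected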

See references/moderation.md for complete policies, harm taxonomy, and integration patterns.

Error Handling

from groq import Groq, RateLimitError, APIConnectionError, APIStatusError

client = Groq()

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError:
    # Wait and retry with exponential backoff
    pass
except APIConnectionError:
    # Network issue
    pass
except APIStatusError as e:
    # API error (check e.status_code)
    pass
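
The rate-limit branch above is where exponential backoff belongs. A minimal retry wrapper sketch, reusing the client from the block above (the SDK client also accepts a max_retries option for built-in retries):

import random
import time

def chat_with_retry(messages, retries: int = 5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            # Back off 1s, 2s, 4s, ... plus jitter before retrying
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Still rate limited after all retries")

response = chat_with_retry([{"role": "user", "content": "Hello"}])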

