groq-api

Groq API integration for building AI-powered applications with ultra-fast LLM inference. Use when working with Groq's Chat Completions API, Python SDK (groq), TypeScript SDK (groq-sdk), tool use/function calling, vision/image processing, audio transcription with Whisper, streaming responses, text-to-speech, content moderation with Llama Guard, batch processing, or any Groq API integration task. Triggers on mentions of Groq, GroqCloud, or fast LLM inference needs.

Install the "groq-api" skill with: npx skills add diskd-ai/groq-api/diskd-ai-groq-api-groq-api

Groq API

Build applications with Groq's ultra-fast LLM inference (300-1000+ tokens/sec).

Quick Start

Installation

# Python
pip install groq

# TypeScript/JavaScript
npm install groq-sdk

Environment Setup

export GROQ_API_KEY=<your-api-key>

Basic Chat Completion

Python:

from groq import Groq

client = Groq()  # Uses GROQ_API_KEY env var

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

TypeScript:

import Groq from "groq-sdk";

const client = new Groq();

const response = await client.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);

Model Selection

Use Case        | Model                                      | Notes
Fast + cheap    | llama-3.1-8b-instant                       | Best for simple tasks
Balanced        | llama-3.3-70b-versatile                    | Quality/cost balance
Highest quality | openai/gpt-oss-120b                        | Built-in tools + reasoning
Agentic         | groq/compound                              | Web search + code exec
Reasoning       | openai/gpt-oss-20b                         | Fast reasoning (low/med/high)
Vision/OCR      | meta-llama/llama-4-scout-17b-16e-instruct  | Image understanding
Audio STT       | whisper-large-v3-turbo                     | Transcription
TTS             | playai-tts                                 | Text-to-speech

See references/models.md for full model list and pricing.

Common Patterns

Streaming Responses

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

System Messages

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
)

Async Client (Python)

import asyncio
from groq import AsyncGroq

async def main():
    client = AsyncGroq()
    response = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return response.choices[0].message.content

print(asyncio.run(main()))

JSON Mode

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "List 3 colors as JSON array"}],
    response_format={"type": "json_object"}
)
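
JSON mode guarantees syntactically valid JSON but not a particular shape, so parse the string and validate the keys yourself. A minimal sketch (the exact keys depend on the prompt and model):

import json

data = json.loads(response.choices[0].message.content)
print(data)  # e.g. {"colors": ["red", "green", "blue"]} -- actual keys vary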

Structured Outputs (JSON Schema)

Force output to match a JSON schema. Two modes are available:

Mode          | Guarantee              | Models
strict: true  | 100% schema compliance | openai/gpt-oss-20b, openai/gpt-oss-120b
strict: false | Best-effort compliance | All supported models

Strict Mode (guaranteed compliance):

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"],
                "additionalProperties": False
            }
        }
    }
)

With Pydantic:

import json

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Extract: John is 30"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": Person.model_json_schema()
        }
    }
)
person = Person.model_validate(json.loads(response.choices[0].message.content))

See references/structured-outputs.md for schema requirements, validation libraries, and examples.

Audio

Transcription (Speech-to-Text)

with open("audio.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        language="en",  # Optional: ISO-639-1 code
        response_format="verbose_json",  # json, text, verbose_json
        timestamp_granularities=["word", "segment"]
    )
print(transcription.text)
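
With response_format="verbose_json", the result also carries timing metadata. A sketch of reading segment timestamps, assuming the OpenAI-compatible verbose_json shape (access style may differ by SDK version):

segments = transcription.segments  # list of dicts in current SDK versions
for seg in segments:
    # Each segment carries start/end times in seconds plus its text
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")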

Translation (to English)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f
    )
print(translation.text)  # English text

Text-to-Speech

response = client.audio.speech.create(
    model="playai-tts",
    input="Hello, world!",
    voice="Fritz-PlayAI",
    response_format="wav",  # flac, mp3, mulaw, ogg, wav
    speed=1.0  # 0.5 to 5
)
response.write_to_file("output.wav")

See references/audio.md for the complete audio API reference, including file handling, metadata fields, and prompting guidelines.

Vision

Process images with Llama 4 multimodal models. Supports up to 5 images per request.

Models: meta-llama/llama-4-scout-17b-16e-instruct (faster), meta-llama/llama-4-maverick-17b-128e-instruct (higher quality)

Image from URL

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)

Local Image (Base64)

import base64

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('photo.jpg')}"}}
        ]
    }]
)

OCR / Extract Data as JSON

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text and data as JSON"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    response_format={"type": "json_object"}
)

See references/vision.md for multi-image, tool use with images, and multi-turn conversations.

Tool Use

For tool calling patterns and examples, see references/tool-use.md.

Quick example:

import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        args = json.loads(tc.function.arguments)
        # Execute function and continue conversation
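
To complete the round trip, execute the function, append the assistant message and a tool-role result, then call the model again. A minimal sketch, assuming a hypothetical local get_weather implementation:

def get_weather(location: str) -> str:
    # Hypothetical stand-in; call a real weather service here
    return f"Sunny, 22C in {location}"

messages = [{"role": "user", "content": "Weather in Paris?"}]
msg = response.choices[0].message
messages.append(msg)  # assistant turn with tool_calls; use msg.model_dump() if your SDK version needs dicts

for tc in msg.tool_calls:
    args = json.loads(tc.function.arguments)
    messages.append({
        "role": "tool",
        "tool_call_id": tc.id,
        "content": get_weather(**args),
    })

final = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages,
)
print(final.choices[0].message.content)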

Built-In Tools (Agentic)

Use groq/compound or openai/gpt-oss-120b for built-in web search and code execution:

response = client.chat.completions.create(
    model="groq/compound",
    messages=[{"role": "user", "content": "Search for latest Python news"}]
)
# Model automatically uses web search

MCP (Remote Tools)

Connect to third-party MCP servers for tools like Stripe, GitHub, web scraping. Use the Responses API:

import os

import openai

client = openai.OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

response = client.responses.create(
    model="openai/gpt-oss-120b",
    input="What models are trending on Huggingface?",
    tools=[{
        "type": "mcp",
        "server_label": "Huggingface",
        "server_url": "https://huggingface.co/mcp"
    }]
)

See references/tool-use.md for MCP configuration and popular servers.

Reasoning Models

Control how models think through complex problems.

Models: openai/gpt-oss-20b, openai/gpt-oss-120b (low/medium/high), qwen/qwen3-32b (none/default)

GPT-OSS with Reasoning Effort

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "How many r's in strawberry?"}],
    reasoning_effort="high",  # low, medium, high
    temperature=0.6,
    max_completion_tokens=1024
)

print(response.choices[0].message.content)
print("Reasoning:", response.choices[0].message.reasoning)

Qwen3 with Parsed Reasoning

response = client.chat.completions.create(
    model="qwen/qwen3-32b",
    messages=[{"role": "user", "content": "Solve: x + 5 = 12"}],
    reasoning_format="parsed"  # raw, parsed, hidden
)

print("Answer:", response.choices[0].message.content)
print("Reasoning:", response.choices[0].message.reasoning)

Hide Reasoning (GPT-OSS)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What is 15% of 80?"}],
    include_reasoning=False  # Hide reasoning in response
)

See references/reasoning.md for streaming, tool use with reasoning, and best practices.

Batch Processing

For high-volume async processing (24h-7d completion window):

# 1. Create JSONL file with requests
# 2. Upload file
# 3. Create batch
batch = client.batches.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# 4. Check status
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
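
Steps 1-2 expanded: each JSONL line is a self-contained request with a custom_id, and the file is uploaded with purpose "batch". A sketch assuming the OpenAI-compatible batch request shape:

import json

requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "llama-3.1-8b-instant",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Summarize doc A", "Summarize doc B"])
]

# 1. Create JSONL file with requests
with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# 2. Upload file
with open("batch.jsonl", "rb") as f:
    file_id = client.files.create(file=f, purpose="batch").id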

See references/api-reference.md for full batch API details.

Prompt Caching

Prompt caching automatically reduces latency and cuts input-token cost by 50% for repeated prompt prefixes. No code changes required.

Supported models: moonshotai/kimi-k2-instruct-0905, openai/gpt-oss-20b, openai/gpt-oss-120b, openai/gpt-oss-safeguard-20b

How it works:

  • Place static content (system prompts, tools, examples) at the beginning
  • Place dynamic content (user queries) at the end
  • Cache automatically matches prefixes and applies 50% discount
  • Cache expires after 2 hours without use

Track cache usage:

response = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct-0905",
    messages=[{"role": "system", "content": large_system_prompt}, ...]
)

cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached}")  # 50% discount applied to these

See references/prompt-caching.md for optimization strategies and examples.

Content Moderation

Detect and filter harmful content using safeguard models.

Llama Guard 4

General content safety classification. Returns safe, or unsafe followed by a hazard category code (e.g. S1) on the next line.

response = client.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    messages=[{"role": "user", "content": user_input}]
)

if response.choices[0].message.content.startswith("unsafe"):
    # Block or handle unsafe content
    pass
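
A sketch of splitting the verdict into a flag and category, assuming the two-line format described above:

verdict = response.choices[0].message.content.strip()
if verdict.startswith("unsafe"):
    lines = verdict.splitlines()
    # Second line holds the hazard category code, e.g. "S1"
    category = lines[1] if len(lines) > 1 else "unknown"
    print(f"Blocked: category {category}")
else:
    print("Content is safe")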

GPT-OSS Safeguard 20B

Prompt injection detection with custom policies. Returns structured JSON.

response = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": injection_detection_policy},
        {"role": "user", "content": user_input}
    ]
)
# Returns: {"violation": 1, "category": "Direct Override", "rationale": "..."}
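
Because the model returns structured JSON, parse it and branch on the violation flag. A minimal sketch, assuming the response shape shown in the comment above:

import json

verdict = json.loads(response.choices[0].message.content)
if verdict.get("violation") == 1:
    print(f"Flagged: {verdict.get('category')} - {verdict.get('rationale')}")
else:
    pass  # no policy violation detected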

See references/moderation.md for complete policies, harm taxonomy, and integration patterns.

Error Handling

from groq import Groq, RateLimitError, APIConnectionError, APIStatusError

client = Groq()

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError:
    # Wait and retry with exponential backoff
    pass
except APIConnectionError:
    # Network issue
    pass
except APIStatusError as e:
    # API error (check e.status_code)
    pass
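
The rate-limit branch above is where exponential backoff belongs. A minimal retry wrapper sketch, reusing the client from the block above (the SDK client also accepts a max_retries option for built-in retries):

import random
import time

def chat_with_retry(messages, retries: int = 5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            # Back off 1s, 2s, 4s, ... plus jitter before retrying
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Still rate limited after all retries")

response = chat_with_retry([{"role": "user", "content": "Hello"}])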

