Cerebras API
Cerebras provides the world's fastest AI inference (2,000+ tokens/s) through an OpenAI-compatible API with official Python and TypeScript SDKs.
Quick Reference
| Resource | Location |
|---|---|
| API Base URL | https://api.cerebras.ai/v1 |
| Get API Key | https://cloud.cerebras.ai |
| Python SDK | pip install cerebras_cloud_sdk |
| TypeScript SDK | npm install @cerebras/cerebras_cloud_sdk |
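Because the API is OpenAI-compatible, the openai Python package can typically be pointed at the Cerebras base URL as well. A minimal sketch under that assumption (the Cerebras SDK shown in Basic Usage below is the documented path):
import os
from openai import OpenAI  # alternative client, relying on OpenAI compatibility

client = OpenAI(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",  # base URL from the table above
)
response = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)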
Available Models
Deprecation Notice:
qwen-3-32b and llama-3.3-70b are scheduled for deprecation on February 16, 2026.
Production Models
Fully supported for production use.
| Model | Model ID | Parameters | Speed |
|---|---|---|---|
| Llama 3.1 8B | llama3.1-8b | 8B | ~2200 tok/s |
| Llama 3.3 70B | llama-3.3-70b | 70B | ~2100 tok/s |
| OpenAI GPT OSS | gpt-oss-120b | 120B | ~3000 tok/s |
| Qwen 3 32B | qwen-3-32b | 32B | ~2600 tok/s |
Preview Models
For evaluation only - may be discontinued on short notice.
| Model | Model ID | Parameters | Speed |
|---|---|---|---|
| Qwen 3 235B Instruct | qwen-3-235b-a22b-instruct-2507 | 235B | ~1400 tok/s |
| Z.ai GLM 4.7 | zai-glm-4.7 | 355B | ~1000 tok/s |
Migrating to GLM? See GLM 4.7 Migration Guide.
Model Selection Guide
| Use Case | Recommended Model |
|---|---|
| Speed-critical (real-time chat) | llama3.1-8b |
| Balanced (chat, coding, math) | llama-3.3-70b |
| Hybrid reasoning | qwen-3-32b |
| Multilingual, instruction following | qwen-3-235b-a22b-instruct-2507 |
| Science, math, complex reasoning | gpt-oss-120b |
| Agents, superior tool use | zai-glm-4.7 |
Model Compression
All models are unpruned original versions. Precision varies:
| Model | Precision | Weights |
|---|---|---|
| llama3.1-8b | FP16 | HuggingFace |
| llama-3.3-70b | FP16 | HuggingFace |
| gpt-oss-120b | FP16/FP8 (weights only) | HuggingFace |
| qwen-3-32b | FP16 | HuggingFace |
| qwen-3-235b-a22b-instruct-2507 | FP16/FP8 (weights only) | HuggingFace |
| zai-glm-4.7 | FP16/FP8 (weights only) | HuggingFace |
Note: FP16/FP8 models use selective weight-only quantization for storage. Sensitive layers remain at full precision, and quantized weights are dequantized on the fly. Activations and the KV cache remain unquantized.
Basic Usage
Python
import os
from cerebras.cloud.sdk import Cerebras
client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))
response = client.chat.completions.create(
model="llama-3.3-70b",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)
TypeScript
import Cerebras from '@cerebras/cerebras_cloud_sdk';
const client = new Cerebras({ apiKey: process.env.CEREBRAS_API_KEY });
const response = await client.chat.completions.create({
model: 'llama-3.3-70b',
messages: [{ role: 'user', content: 'Explain quantum computing' }]
});
console.log(response.choices[0].message.content);
Streaming
The Cerebras API supports streaming responses, allowing messages to be sent back in chunks and displayed incrementally as they are generated. Set stream=True to receive an iterable of chunks.
Python
stream = client.chat.completions.create(
model="llama-3.3-70b",
messages=[{"role": "user", "content": "Why is fast inference important?"}],
stream=True
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
TypeScript
const stream = await client.chat.completions.create({
model: 'llama-3.3-70b',
messages: [{ role: 'user', content: 'Why is fast inference important?' }],
stream: true
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
Streaming Notes
- Each chunk contains a delta object with incremental content
- usage and time_info are only available in the final chunk
- Use flush=True in Python print for real-time display: print(..., end="", flush=True)
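For example, a sketch that accumulates the streamed text and reads usage from the final chunk (this assumes usage is exposed as an attribute on that last chunk, as noted above; the getattr guard keeps it safe if it is absent):
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Why is fast inference important?"}],
    stream=True,
)
full_text = []
final_usage = None
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        full_text.append(delta)
        print(delta, end="", flush=True)
    # usage and time_info only appear on the final chunk
    if getattr(chunk, "usage", None):
        final_usage = chunk.usage
print()
if final_usage:
    print("Prompt tokens:", final_usage.prompt_tokens)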
Cancel Streaming (TypeScript)
const stream = await client.chat.completions.create({
model: 'llama-3.3-70b',
messages: [{ role: 'user', content: 'Long response' }],
stream: true
});
for await (const chunk of stream) {
if (shouldStop) {
stream.controller.abort();
break;
}
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
Async Streaming (Python)
from cerebras.cloud.sdk import AsyncCerebras
client = AsyncCerebras()
async def stream_response():
stream = await client.chat.completions.create(
model="llama-3.3-70b",
messages=[{"role": "user", "content": "Tell me a joke"}],
stream=True
)
async for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
Tool Calling
Tool calling (also known as function calling) enables models to interact with external tools, APIs, or applications to perform actions and access real-time information.
Supported models: gpt-oss-120b, qwen-3-32b, zai-glm-4.7
How It Works
- Define tools - Provide name, description, and parameters for each tool
- Send request - Include tool definitions with your API call
- Model decides - Model analyzes if a tool can help answer the question
- Execute & respond - Your code executes the tool and returns results to the model
Define Tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"strict": True,
"description": "Get temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Toronto, Canada"
}
},
"required": ["location"],
"additionalProperties": False
}
}
}
]
Make API Call
response = client.chat.completions.create(
model="zai-glm-4.7",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=tools
)
Handle Tool Calls
import json
choice = response.choices[0].message
if choice.tool_calls:
# Add assistant message with tool_calls
messages.append(choice)
for tool_call in choice.tool_calls:
# Execute your tool
arguments = json.loads(tool_call.function.arguments)
result = get_weather(arguments["location"])
# Append tool result
messages.append({
"role": "tool",
"content": json.dumps(result),
"tool_call_id": tool_call.id
})
# Get final response
final_response = client.chat.completions.create(
model="zai-glm-4.7",
messages=messages
)
print(final_response.choices[0].message.content)
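The handler above assumes a get_weather helper exists in your application. A hypothetical stub for local testing (the name and return shape are illustrative, not part of the API):
def get_weather(location: str) -> dict:
    # Hypothetical stand-in; replace with a real weather API call
    return {"location": location, "temperature_c": 21, "conditions": "clear"}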
Parallel Tool Calling
When a query requires multiple independent data points (e.g., comparing weather in different cities), the model can request multiple tools at once.
response = client.chat.completions.create(
model="zai-glm-4.7",
messages=[{"role": "user", "content": "Is Toronto warmer than Montreal?"}],
tools=tools,
parallel_tool_calls=True # Default: enabled
)
# Response may contain multiple tool_calls
for tool_call in response.choices[0].message.tool_calls:
print(f"Tool: {tool_call.function.name}, Args: {tool_call.function.arguments}")
To disable parallel calling:
response = client.chat.completions.create(
model="zai-glm-4.7",
messages=messages,
tools=tools,
parallel_tool_calls=False # Force sequential execution
)
TypeScript Example
const tools: Cerebras.Chat.ChatCompletionTool[] = [{
type: 'function',
function: {
name: 'get_weather',
strict: true,
description: 'Get temperature for a given location.',
parameters: {
type: 'object',
properties: {
location: { type: 'string', description: 'City name' }
},
required: ['location'],
additionalProperties: false
}
}
}];
const response = await client.chat.completions.create({
model: 'zai-glm-4.7',
messages: [{ role: 'user', content: 'Weather in Paris?' }],
tools
});
if (response.choices[0].message.tool_calls) {
for (const toolCall of response.choices[0].message.tool_calls) {
const args = JSON.parse(toolCall.function.arguments);
// Execute tool and continue conversation
}
}
Best Practices
- Use strict: true for reliable JSON argument parsing
- Always set additionalProperties: false in parameter schemas
- Provide clear, descriptive tool descriptions
- Handle cases where the model doesn't call any tools
Structured Outputs
Generate structured data with enforced JSON schema compliance. Key benefits:
- Reduced Variability - Consistent outputs adhering to predefined fields
- Type Safety - Enforces correct data types, preventing mismatches
- Easier Parsing - Direct use in applications without extra processing
Defining the Schema
Define a JSON schema specifying fields, types, and required properties.
For every required array, you must set additionalProperties: false.
Python (with Pydantic)
from pydantic import BaseModel
class Movie(BaseModel):
title: str
director: str
year: int
response = client.chat.completions.create(
model="llama-3.3-70b",
messages=[
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Suggest a sci-fi movie"}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "movie",
"strict": True,
"schema": Movie.model_json_schema()
}
}
)
import json
movie = json.loads(response.choices[0].message.content)
Python (with raw schema)
movie_schema = {
"type": "object",
"properties": {
"title": {"type": "string"},
"director": {"type": "string"},
"year": {"type": "integer"}
},
"required": ["title", "director", "year"],
"additionalProperties": False
}
response = client.chat.completions.create(
model="llama-3.3-70b",
messages=[{"role": "user", "content": "Suggest a sci-fi movie"}],
response_format={
"type": "json_schema",
"json_schema": {
"name": "movie",
"strict": True,
"schema": movie_schema
}
}
)
TypeScript (with Zod)
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
const MovieSchema = z.object({
title: z.string(),
director: z.string(),
year: z.number().int()
});
const response = await client.chat.completions.create({
model: 'llama-3.3-70b',
messages: [{ role: 'user', content: 'Suggest a sci-fi movie' }],
response_format: {
type: 'json_schema',
json_schema: {
name: 'movie',
strict: true,
schema: zodToJsonSchema(MovieSchema)
}
}
});
const movie = MovieSchema.parse(JSON.parse(response.choices[0].message.content || '{}'));
Response Format Modes
| Mode | Valid JSON | Adheres to Schema | Extra Fields | Constrained Decoding |
|---|---|---|---|---|
| json_schema (strict: true) | Yes | Yes (guaranteed) | No | Yes |
| json_schema (strict: false) | Yes (best-effort) | Yes | Yes | No |
| json_object | Yes | No (flexible) | No | No |
Enabling each mode:
- Strict: response_format: { type: "json_schema", json_schema: { strict: true, schema: ... } }
- Non-strict: response_format: { type: "json_schema", json_schema: { strict: false, schema: ... } }
- JSON object: response_format: { type: "json_object" }
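Only the json_schema modes are demonstrated above; a minimal json_object sketch follows. Since no schema is enforced in this mode, the prompt itself should ask for JSON:
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "Reply only with a JSON object."},
        {"role": "user", "content": "Suggest a sci-fi movie with title and year."}
    ],
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)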
Schema Requirements
- Set "additionalProperties": false for all objects with required fields
- Max nesting: 5 levels
- Max schema length: 5,000 chars
- No recursive schemas
Note: tools and response_format cannot be used in the same request.
Reasoning Models
Reasoning models generate intermediate thinking tokens before their final response, enabling better problem-solving and allowing inspection of the model's thought process.
Supported models: qwen-3-32b, gpt-oss-120b, zai-glm-4.7
Reasoning Format Options
| Format | Behavior | Use Case |
|---|---|---|
| parsed | Reasoning in separate reasoning field, logprobs split into reasoning_logprobs | When you need structured access to thinking |
| raw | Reasoning prepended to content with wrapper tokens (<think>...</think> for GLM/Qwen) | When you want full visibility |
| hidden | Reasoning text dropped from response (tokens still counted/billed) | When you want benefits without exposing thinking |
| none | Uses model's default behavior | Default |
Default behaviors by model:
- Qwen3: raw (or hidden for JSON output)
- GLM: text_parsed
- GPT-OSS: text_parsed
Basic Usage
response = client.chat.completions.create(
model="qwen-3-32b",
messages=[{"role": "user", "content": "Solve: 15% of 240"}],
reasoning_format="parsed"
)
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)
const response = await client.chat.completions.create({
model: 'qwen-3-32b',
messages: [{ role: 'user', content: 'Solve: 15% of 240' }],
reasoning_format: 'parsed'
});
console.log('Thinking:', response.choices[0].message.reasoning);
console.log('Answer:', response.choices[0].message.content);
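If you use the raw format instead, the reasoning arrives inline in content. A sketch that splits it out, assuming the <think>...</think> wrapper tokens described in the table above:
import re

response = client.chat.completions.create(
    model="qwen-3-32b",
    messages=[{"role": "user", "content": "Solve: 15% of 240"}],
    reasoning_format="raw"
)
text = response.choices[0].message.content
match = re.search(r"<think>(.*?)</think>(.*)", text, re.DOTALL)
if match:
    print("Thinking:", match.group(1).strip())
    print("Answer:", match.group(2).strip())
else:
    print(text)  # no wrapper tokens found; print as-is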
GPT-OSS: Reasoning Effort
Control reasoning intensity with reasoning_effort:
response = client.chat.completions.create(
model="gpt-oss-120b",
messages=[{"role": "user", "content": "Prove the Pythagorean theorem"}],
reasoning_effort="high" # "low", "medium" (default), "high"
)
GLM: Disable Reasoning
Toggle reasoning on/off for GLM:
response = client.chat.completions.create(
model="zai-glm-4.7",
messages=[{"role": "user", "content": "Quick factual question"}],
disable_reasoning=True # Skip thinking for simple queries
)
Multi-Turn Reasoning Context
To retain reasoning awareness across conversation turns, include prior reasoning in assistant messages using the model's native format.
GPT-OSS (reasoning prepended directly):
messages = [
{"role": "user", "content": "What is 25 * 4?"},
{"role": "assistant", "content": "Multiply 25 times 4 equals 100. The answer is 100."},
{"role": "user", "content": "Now divide that by 2."}
]
response = client.chat.completions.create(model="gpt-oss-120b", messages=messages)
GLM/Qwen (reasoning in <think> tags):
messages = [
{"role": "user", "content": "What is 25 * 4?"},
{"role": "assistant", "content": "<think>Multiply 25 times 4 equals 100.</think>The answer is 100."},
{"role": "user", "content": "Now divide that by 2."}
]
response = client.chat.completions.create(model="zai-glm-4.7", messages=messages)
Predicted Outputs
Reduce latency by specifying parts of the response that are already known. (Public Preview)
Supported models: gpt-oss-120b, llama3.1-8b, zai-glm-4.7
Predicted Outputs speed up response generation when parts of the output are already known. This is most useful when regenerating text or code that requires only minor changes.
Python
code = """
html {
margin: 0;
padding: 0;
box-sizing: border-box;
color: #00FF00;
}
"""
response = client.chat.completions.create(
model="gpt-oss-120b",
messages=[
{"role": "user", "content": "Change the color to blue. Respond only with code."},
{"role": "user", "content": code}
],
prediction={"type": "content", "content": code}
)
TypeScript
const code = `
html {
margin: 0;
padding: 0;
box-sizing: border-box;
color: #00FF00;
}
`;
const response = await client.chat.completions.create({
model: 'gpt-oss-120b',
messages: [
{ role: 'user', content: "Change the color to blue. Respond only with code." },
{ role: 'user', content: code }
],
prediction: { type: 'content', content: code }
});
Token-Reuse Metrics
The response includes usage metrics showing prediction efficiency:
{
"usage": {
"completion_tokens": 224,
"prompt_tokens": 204,
"completion_tokens_details": {
"accepted_prediction_tokens": 76,
"rejected_prediction_tokens": 20
}
}
}
A high ratio of accepted to rejected tokens indicates efficient prediction reuse.
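A small sketch that computes the acceptance ratio from the usage block above (attribute names mirror the JSON fields shown; adjust if your SDK version exposes them differently):
details = response.usage.completion_tokens_details
accepted = details.accepted_prediction_tokens
rejected = details.rejected_prediction_tokens
total = accepted + rejected
ratio = accepted / total if total else 0.0
print(f"Prediction reuse: {accepted} accepted, {rejected} rejected ({ratio:.0%})")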
Best Practices
- Use when most output is known - The larger the known section, the greater the efficiency gain
- Set temperature=0 - Reduces randomness and increases token acceptance
- Keep predictions accurate - Misaligned predictions increase rejected tokens
- Monitor metrics - Track accepted vs rejected tokens to evaluate effectiveness
Limitations
- Rejected tokens are billed at completion-token rates
- Not compatible with: logprobs, n > 1, tools
- Reasoning tokens may generate additional rejected_prediction_tokens
Prompt Caching
Store and reuse previously processed prompts to reduce latency. Designed to significantly reduce Time to First Token (TTFT) for long-context workloads like multi-turn conversations, RAG, and agentic workflows.
How It Works
Automatic - No code changes required. Works on all supported API requests.
- Prefix Matching - System analyzes the beginning of your prompt (system prompts, tool definitions, few-shot examples)
- Block-Based Caching - Prompts processed in blocks (100-600 tokens). Matching blocks reuse cached computation
- Cache Hit - Cached blocks skip processing, resulting in lower latency
- Cache Miss - Prompt processed normally, prefix stored for future matches
- Auto Expiration - TTL guaranteed 5 minutes, may persist up to 1 hour
The entire beginning of your prompt must match exactly with a cached prefix. Even a single character difference causes a cache miss.
Checking Cache Usage
Check the usage.prompt_tokens_details.cached_tokens field in your response:
response = client.chat.completions.create(
model="llama-3.3-70b",
messages=messages
)
cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached}")
{
"usage": {
"prompt_tokens": 1500,
"prompt_tokens_details": {
"cached_tokens": 1200
}
}
}
Best Practices for Cache Hits
- Keep prefixes consistent - System prompts, tool definitions, and few-shot examples should be identical across requests
- Order matters - Place stable content (system prompt, tools) before dynamic content (user messages)
- Multi-turn conversations - Cache naturally builds as conversation history grows
- RAG workflows - Place frequently-used context at the beginning
Example: Multi-Turn with Tools
# System message and tools are cached across turns
messages = [
{"role": "system", "content": "You are a shopping assistant."},
{"role": "user", "content": "Where is my order ORD-123456?"}
]
# Turn 1 - creates cache for system + tools
response = client.chat.completions.create(
model="qwen-3-32b",
messages=messages,
tools=tools
)
print(f"Turn 1 cached: {response.usage.prompt_tokens_details.cached_tokens}")
# Turn 2 - reuses cached system + tools
messages.append(response.choices[0].message)
messages.append({"role": "user", "content": "Please cancel it."})
response = client.chat.completions.create(
model="qwen-3-32b",
messages=messages,
tools=tools
)
print(f"Turn 2 cached: {response.usage.prompt_tokens_details.cached_tokens}")
FAQ
- Pricing: No additional cost. Standard token rates apply
- Quality: Caching only affects input processing. Output generation unchanged
- Manual clear: Not available. System manages cache automatically
- TTL: Guaranteed 5 minutes, up to 1 hour depending on load
Async Usage (Python)
import asyncio
from cerebras.cloud.sdk import AsyncCerebras
client = AsyncCerebras()
async def main():
response = await client.chat.completions.create(
model="llama-3.3-70b",
messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
asyncio.run(main())
Error Handling
All errors inherit from cerebras.cloud.sdk.APIError. Main categories:
- APIConnectionError - Unable to connect to the API
- APIStatusError - API returns a non-success status code (4xx or 5xx)
Error Codes
| Status | Exception | Description |
|---|---|---|
| 400 | BadRequestError | Invalid request parameters |
| 401 | AuthenticationError | Invalid or missing API key |
| 402 | PaymentRequired | Payment required |
| 403 | PermissionDeniedError | Insufficient permissions |
| 404 | NotFoundError | Resource not found |
| 422 | UnprocessableEntityError | Validation error |
| 429 | RateLimitError | Too many requests |
| 500 | InternalServerError | Server error |
| 503 | ServiceUnavailable | Service temporarily unavailable |
| N/A | APIConnectionError | Network/connection issue |
Python Example
import cerebras.cloud.sdk
from cerebras.cloud.sdk import Cerebras
client = Cerebras()
try:
response = client.chat.completions.create(
model="llama-3.3-70b",
messages=[{"role": "user", "content": "Hello"}]
)
except cerebras.cloud.sdk.APIConnectionError as e:
print("Server could not be reached")
print(e.__cause__)
except cerebras.cloud.sdk.RateLimitError as e:
print("Rate limited - implement backoff")
except cerebras.cloud.sdk.APIStatusError as e:
print(f"Error {e.status_code}: {e.response}")
TypeScript Example
try {
const response = await client.chat.completions.create({
model: 'llama-3.3-70b',
messages: [{ role: 'user', content: 'Hello' }]
});
} catch (err) {
if (err instanceof Cerebras.APIError) {
console.log(err.status); // 400
console.log(err.name); // BadRequestError
console.log(err.headers); // Response headers
} else {
throw err;
}
}
Retries & Timeouts
Automatic Retries
By default, these errors are retried 2 times with exponential backoff:
- Connection errors
- 408 Request Timeout
- 429 Rate Limit
- >= 500 Internal errors
# Python - configure retries
client = Cerebras(max_retries=0) # Disable retries
# Per-request override
client.with_options(max_retries=5).chat.completions.create(...)
// TypeScript - configure retries
const client = new Cerebras({ maxRetries: 0 });
// Per-request override
await client.chat.completions.create(params, { maxRetries: 5 });
Timeouts
Default timeout is 60 seconds. On timeout, APITimeoutError is thrown.
# Python - configure timeout
client = Cerebras(timeout=20.0) # 20 seconds
# Granular control
import httpx
client = Cerebras(
timeout=httpx.Timeout(60.0, read=5.0, write=10.0, connect=2.0)
)
# Per-request override
client.with_options(timeout=5.0).chat.completions.create(...)
// TypeScript - configure timeout
const client = new Cerebras({ timeout: 20 * 1000 });
// Per-request override
await client.chat.completions.create(params, { timeout: 5 * 1000 });
TCP Warming
SDK sends warmup requests on init to reduce first-token latency. Disable if needed:
client = Cerebras(warm_tcp_connection=False)
const client = new Cerebras({ warmTCPConnection: false });
Key Parameters
| Parameter | Description |
|---|---|
| model | Model identifier (required) |
| messages | Conversation history (required) |
| temperature | Randomness 0-1.5 (default varies) |
| max_completion_tokens | Max output tokens |
| stop | Up to 4 stop sequences |
| stream | Enable streaming |
| response_format | text, json_object, or json_schema |
| tools | Function definitions for tool calling |
| reasoning_format | parsed, raw, hidden |
| reasoning_effort | low, medium, high (gpt-oss only) |
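A sketch combining several of these parameters in one request (the values are illustrative, not recommendations):
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "List three uses of fast inference."}],
    temperature=0.2,               # low randomness
    max_completion_tokens=256,     # cap output length
    stop=["\n\n"],                 # stop at the first blank line
    stream=False,
)
print(response.choices[0].message.content)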
References
For detailed SDK documentation:
- Python: See references/python.md
- TypeScript: See references/typescript.md