together-chat-completions

Serverless chat and text completion inference via Together AI's OpenAI-compatible API. Access 100+ open-source models with pay-per-token pricing. Includes function calling (tool use) with 6 calling patterns, structured outputs (JSON mode, json_schema, regex), and reasoning/thinking models (DeepSeek R1, Qwen3 Thinking, Kimi K2). Use when building chat applications, text generation, multi-turn conversations, function calling, structured JSON outputs, reasoning/chain-of-thought, thinking mode toggle, or any LLM inference task using Together AI.

Install the skill:

npx skills add zainhas/togetherai-skills/zainhas-togetherai-skills-together-chat-completions

Together Chat Completions

Overview

Send inference requests to 100+ open-source models via Together AI's serverless API. OpenAI-compatible — swap the base URL and API key to migrate existing code.

  • Base URL: https://api.together.xyz/v1
  • Auth: Authorization: Bearer $TOGETHER_API_KEY
  • Endpoints: /v1/chat/completions (chat), /v1/completions (text)
  • SDKs: pip install together (Python), npm install together-ai (TypeScript)

Installation

# Python (recommended)
uv init  # optional, if starting a new project
uv add together
# or with pip
pip install together
# TypeScript / JavaScript
npm install together-ai

Set your API key:

export TOGETHER_API_KEY=<your-api-key>

Quick Start

Basic Chat Completion

Python:

from together import Together

client = Together()

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What are some fun things to do in NYC?"}],
)
print(response.choices[0].message.content)

TypeScript:

import Together from "together-ai";
const together = new Together();

const response = await together.chat.completions.create({
  model: "openai/gpt-oss-20b",
  messages: [{ role: "user", content: "What are some fun things to do in NYC?" }],
});
console.log(response.choices[0].message.content);

cURL:

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"What are some fun things to do in NYC?"}]}'

Streaming

Set stream=True to receive tokens incrementally:

Python:

stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

TypeScript:

import Together from "together-ai";
const together = new Together();

const stream = await together.chat.completions.create({
  model: "openai/gpt-oss-20b",
  messages: [
    { role: "user", content: "What are some fun things to do in New York?" },
  ],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

cURL:

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "What are some fun things to do in New York?"}
    ],
    "stream": true
  }'

Multi-Turn Conversation

Pass conversation history in the messages array with alternating user/assistant roles:

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful travel guide."},
        {"role": "user", "content": "What should I do in Paris?"},
        {"role": "assistant", "content": "Visit the Eiffel Tower and the Louvre!"},
        {"role": "user", "content": "How about food recommendations?"},
    ],
)
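
To build an interactive session, maintain the history yourself and append each assistant reply before the next turn. A minimal loop sketch (reuses the client from above; the exit handling is illustrative):

history = [{"role": "system", "content": "You are a helpful travel guide."}]
while True:
    user_input = input("> ")
    if user_input in ("quit", "exit"):
        break
    history.append({"role": "user", "content": user_input})
    reply = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=history,
    )
    answer = reply.choices[0].message.content
    # Append the assistant's reply so the next turn sees full context
    history.append({"role": "assistant", "content": answer})
    print(answer)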

Async (Python)

Use AsyncTogether for parallel requests:

import asyncio
from together import AsyncTogether

async def main():
    client = AsyncTogether()
    tasks = [
        client.chat.completions.create(
            model="openai/gpt-oss-20b",
            messages=[{"role": "user", "content": msg}],
        )
        for msg in ["Hello", "How are you?", "Tell me a joke"]
    ]
    responses = await asyncio.gather(*tasks)
    for r in responses:
        print(r.choices[0].message.content)

asyncio.run(main())

Key Parameters

Parameter          | Type     | Description
model              | string   | Model ID (required)
messages           | array    | Conversation messages with role and content (required for chat)
max_tokens         | int      | Max tokens to generate
temperature        | float    | Sampling temperature (0-1, default ~0.7)
top_p              | float    | Nucleus sampling threshold (0-1)
top_k              | int      | Top-k sampling
repetition_penalty | float    | Penalize repeated tokens (>1.0 = more penalty)
stop               | string[] | Stop sequences
stream             | bool     | Enable streaming
response_format    | object   | Force JSON output or schema (see Structured Outputs below)
logprobs           | int      | Return log probabilities for top N tokens
n                  | int      | Number of completions to generate
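
For illustration, a single request combining several of these parameters (the specific values here are arbitrary):

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "List three sorting algorithms"}],
    max_tokens=200,          # cap the response length
    temperature=0.3,         # lower = more deterministic
    top_p=0.9,               # nucleus sampling threshold
    repetition_penalty=1.1,  # discourage repeated tokens
    stop=["\n\n"],           # stop at the first blank line
)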

Message Roles

  • system: Set model behavior and context (first message)
  • user: End-user input
  • assistant: Model responses (for conversation history)
  • tool: Tool/function call results
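
Schematically, a history exercising all four roles might look like the following. Note that a tool message must reference a prior assistant tool call; the id below is a placeholder:

messages = [
    {"role": "system", "content": "You are a weather bot."},
    {"role": "user", "content": "What's the weather in NYC?"},
    {
        "role": "assistant",
        "content": None,  # assistant turn that requests a tool instead of answering
        "tool_calls": [{
            "id": "call_1",  # placeholder id
            "type": "function",
            "function": {"name": "get_current_weather", "arguments": '{"location": "New York, NY"}'},
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_f": 54}'},
]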

OpenAI Compatibility

Migrate from OpenAI by changing base URL and API key:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",
)
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello!"}],
)

Text Completions

For raw prompt-in, text-out generation (no chat template), use /v1/completions:

response = client.completions.create(
    model="openai/gpt-oss-20b",
    prompt="The quick brown fox",
    max_tokens=50,
)
print(response.choices[0].text)

Rate Limits & Build Tiers

Rate limits depend on your Build Tier (based on lifetime spend):

Tier   | Lifetime Spend | RPM (most models)
Tier 1 | $5+            | 60
Tier 2 | $50+           | 600
Tier 3 | $200+          | 600
Tier 4 | $500+          | 600
Tier 5 | $1000+         | 600

Larger models (>100B) have separate, lower limits. See references/models.md for the full model catalog.
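
If you exceed your tier's limit, back off and retry. A minimal sketch over raw HTTP, assuming rate-limited requests come back with a standard 429 status:

import os
import time

import requests

def chat_with_backoff(payload, max_retries=5):
    url = "https://api.together.xyz/v1/chat/completions"
    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Exponential backoff: 1s, 2s, 4s, ...
        time.sleep(2 ** attempt)
    raise RuntimeError("still rate limited after retries")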

Function Calling (Tool Use)

Define tools the model can call, then execute them and pass results back:

Python:

import json
from together import Together

client = Together()

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful assistant that can access external functions."},
    {"role": "user", "content": "What is the current temperature of New York?"},
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-Turbo",
    messages=messages,
    tools=tools,
)

# Append the assistant turn once, then execute each tool call and pass results back
messages.append(response.choices[0].message)
for tc in response.choices[0].message.tool_calls:
    args = json.loads(tc.function.arguments)
    result = get_current_weather(**args)  # your function
    messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})

final = client.chat.completions.create(model="Qwen/Qwen2.5-7B-Instruct-Turbo", messages=messages, tools=tools)

TypeScript:

import Together from "together-ai";
const together = new Together();

const response = await together.chat.completions.create({
  model: "Qwen/Qwen2.5-7B-Instruct-Turbo",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant that can access external functions.",
    },
    { role: "user", content: "What is the current temperature of New York?" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "getCurrentWeather",
        description: "Get the current weather in a given location",
        parameters: {
          type: "object",
          properties: {
            location: {
              type: "string",
              description: "The city and state, e.g. San Francisco, CA",
            },
            unit: {
              type: "string",
              description: "The unit of temperature",
              enum: ["celsius", "fahrenheit"],
            },
          },
        },
      },
    },
  ],
});

console.log(JSON.stringify(response.choices[0].message?.tool_calls, null, 2));

cURL:

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-Turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant that can access external functions."
      },
      {
        "role": "user",
        "content": "What is the current temperature of New York?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather in a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
              }
            }
          }
        }
      }
    ]
  }'

tool_choice Parameter

  • "auto" (default): Model decides whether to call functions
  • "required": Model must call at least one function
  • "none": Never call functions
  • {"type": "function", "function": {"name": "fn_name"}}: Force specific function

6 Calling Patterns

  1. Simple: Single function, single call
  2. Multiple functions: Multiple tools available, model picks one
  3. Parallel: Same function called multiple times in one turn
  4. Parallel multiple: Different functions called in one turn
  5. Multi-step: Chained calls (call -> result -> call -> result; see the loop sketch below)
  6. Multi-turn: Function calls across conversation turns

Supported models: Qwen2.5 family, Llama 3.x/4, DeepSeek V3, Mistral, GLM, Kimi K2, and most chat models.
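
Pattern 5 (multi-step) generalizes the basic flow into a loop: keep sending the conversation back until the model stops requesting tools. A minimal sketch, assuming json, client, tools, and get_current_weather from the example above:

messages = [{"role": "user", "content": "Compare the current weather in NYC and LA."}]
while True:
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct-Turbo",
        messages=messages,
        tools=tools,
    )
    message = response.choices[0].message
    if not message.tool_calls:
        break  # no more tool requests; message.content holds the final answer
    messages.append(message)
    for tc in message.tool_calls:
        result = get_current_weather(**json.loads(tc.function.arguments))
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
print(message.content)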

Structured Outputs (JSON Mode)

json_schema (Recommended)

Constrain output to match your JSON schema exactly:

Python:

import json

from pydantic import BaseModel, Field

class VoiceNote(BaseModel):
    title: str = Field(description="A title")
    summary: str = Field(description="Short summary")
    actionItems: list[str] = Field(description="Action items")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": f"Respond in JSON: {json.dumps(VoiceNote.model_json_schema())}"},
        {"role": "user", "content": "Summarize: Meeting about Q4 planning..."},
    ],
    response_format={"type": "json_schema", "schema": VoiceNote.model_json_schema()},
)
result = VoiceNote.model_validate_json(response.choices[0].message.content)

cURL:

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "The following is a voice message transcript. Only answer in JSON."
      },
      {
        "role": "user",
        "content": "Good morning! Today is going to be a busy day. First, I need to make a quick breakfast. While cooking, I will also check my emails."
      }
    ],
    "model": "openai/gpt-oss-20b",
    "response_format": {
      "type": "json_schema",
      "schema": {
        "properties": {
          "title": { "type": "string", "description": "A title for the voice note" },
          "summary": { "type": "string", "description": "A short one sentence summary" },
          "actionItems": { "items": { "type": "string" }, "type": "array", "description": "Action items" }
        },
        "required": ["title", "summary", "actionItems"],
        "type": "object"
      }
    }
  }'

json_object (Simple)

Model outputs valid JSON, structure guided by prompt only:

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "Respond in JSON with keys: name, age, city"},
        {"role": "user", "content": "Tell me about yourself"},
    ],
    response_format={"type": "json_object"},
)

regex (Pattern Matching)

Constrain output to match a regex:

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Classify sentiment: I loved the movie!"}],
    response_format={"type": "regex", "pattern": "(positive|neutral|negative)"},
)

JSON mode supported models: DeepSeek R1/V3, GLM-5, Kimi K2, Llama 4, Qwen3/2.5, and many more.

Reasoning Models

Reasoning models think step-by-step before answering. Best for complex math, code, planning, and logic tasks.

How reasoning output is returned:

  • Most reasoning models (Kimi K2.5, GLM-5, GPT-OSS, Qwen3 Thinking, etc.) return reasoning in a separate reasoning field: response.choices[0].message.reasoning
  • DeepSeek R1 is a special case that outputs reasoning inside <think> tags within the content field

Quick Start

Python:

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Which is bigger: 9.9 or 9.11?"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

TypeScript:

import Together from "together-ai";
const together = new Together();

const stream = await together.chat.completions.create({
  model: "deepseek-ai/DeepSeek-R1",
  messages: [{ role: "user", content: "Which number is bigger 9.9 or 9.11?" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

cURL:

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "Which number is bigger 9.9 or 9.11?"}
    ]
  }'

Output:

<think>
Let me compare 9.9 and 9.11...
9.9 = 9.90, and 9.90 > 9.11
</think>

**Answer:** 9.9 is bigger.

Available Reasoning Models

Model                | API String                                | Strengths
DeepSeek R1          | deepseek-ai/DeepSeek-R1                   | Math, code, complex reasoning
DeepSeek V3.1        | deepseek-ai/DeepSeek-V3.1                 | General + reasoning
GPT-OSS 120B         | openai/gpt-oss-120b                       | Reasoning only (adjustable effort)
GPT-OSS 20B          | openai/gpt-oss-20b                        | Reasoning only (adjustable effort)
Kimi K2 Thinking     | moonshotai/Kimi-K2-Thinking               | Extended reasoning
Kimi K2.5            | moonshotai/Kimi-K2.5                      | Hybrid (reasoning + general)
GLM-5                | zai-org/GLM-5                             | Hybrid
MiniMax M2.5         | MiniMaxAI/MiniMax-M2.5                    | Reasoning only
Qwen3 235B Thinking  | Qwen/Qwen3-235B-A22B-Thinking-2507        | Thinking mode
Qwen3.5 397B         | Qwen/Qwen3.5-397B-A17B                    | Hybrid
QwQ 32B              | Qwen/QwQ-32B                              | Compact reasoning
R1 Distill Llama 70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Distilled reasoning
R1 Distill Qwen 14B  | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B  | Compact distilled

Reasoning Effort

Control how much thinking the model does ("low", "medium", "high"):

Python:

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Prove the infinitude of primes"}],
    reasoning_effort="high",
)

TypeScript:

import Together from "together-ai";
const together = new Together();

const stream = await together.chat.completions.create({
  model: "openai/gpt-oss-120b",
  messages: [{ role: "user", content: "Prove the infinitude of primes" }],
  reasoning_effort: "high",
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

cURL:

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Prove the infinitude of primes"}
    ],
    "reasoning_effort": "high"
  }'

Qwen3 Thinking Toggle

Toggle thinking on/off by choosing the model variant:

  • Thinking enabled: Qwen/Qwen3-235B-A22B-Thinking-2507
  • Thinking disabled (faster, cheaper): Qwen/Qwen3-235B-A22B-Instruct-2507-tput

Accessing Reasoning Output

Most models (Kimi K2.5, GLM-5, GPT-OSS, Qwen3, etc.) -- use the reasoning field:

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Which is bigger: 9.9 or 9.11?"}],
)
reasoning = response.choices[0].message.reasoning  # step-by-step thinking
answer = response.choices[0].message.content         # final answer
print(f"Reasoning: {reasoning}")
print(f"Answer: {answer}")

DeepSeek R1 -- parse <think> tags from content:

import re
content = response.choices[0].message.content
think_match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
thinking = think_match.group(1).strip() if think_match else ""
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

Best Practices

  • DeepSeek R1: Use temperature 0.5-0.7, omit system prompts, put instructions in user message
  • Use streaming — reasoning outputs are long
  • Use reasoning_effort="low" for simple questions, "high" for complex math/code/logic
  • Reasoning models cost more (more tokens) — use standard models for simple tasks
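
Putting the DeepSeek R1 tips together, a minimal sketch: instructions in the user message, no system prompt, temperature in the recommended band, streaming on:

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Think step by step: is 1009 prime?"}],
    temperature=0.6,  # within the recommended 0.5-0.7 band
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)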


Related Skills

  • together-code-interpreter
  • together-images
  • together-audio
  • together-dedicated-endpoints