Nebius Dedicated Endpoints
Dedicated endpoints give you an isolated, GPU-backed deployment of a supported model template with per-region data residency, configurable autoscaling, and OpenAI-compatible inference.
Prerequisites
pip install requests openai
export NEBIUS_API_KEY="your-key"
Control plane (manage endpoints): https://api.tokenfactory.nebius.com
Data plane (inference), pick by region:
| Region | Inference base URL |
|---|---|
| eu-north1 | https://api.tokenfactory.nebius.com/v1/ |
| eu-west1 | https://api.tokenfactory.eu-west1.nebius.com/v1/ |
| us-central1 | https://api.tokenfactory.us-central1.nebius.com/v1/ |
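Because the inference base URL depends on the region, a small lookup helps keep the two in sync. The sketch below copies the table above into a dict; the names INFERENCE_BASE_URLS and inference_base_url are illustrative and not part of any Nebius SDK.

```python
# Region -> inference base URL, copied from the table above.
INFERENCE_BASE_URLS = {
    "eu-north1": "https://api.tokenfactory.nebius.com/v1/",
    "eu-west1": "https://api.tokenfactory.eu-west1.nebius.com/v1/",
    "us-central1": "https://api.tokenfactory.us-central1.nebius.com/v1/",
}


def inference_base_url(region: str) -> str:
    """Return the data-plane base URL for a region, or raise ValueError."""
    try:
        return INFERENCE_BASE_URLS[region]
    except KeyError:
        known = ", ".join(sorted(INFERENCE_BASE_URLS))
        raise ValueError(f"unknown region {region!r}; expected one of: {known}") from None
```

Passing the endpoint's region through this helper when constructing the inference client avoids sending requests to a base URL in a different region.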
Key concepts
- Template: deployable blueprint (model + supported GPU types/regions)
- Flavor: base (throughput-optimized) or fast (low-latency, speculative decoding)
- Endpoint: your live deployment, identified by endpoint_id
- routing_key: the model name to pass in inference calls
Operations
List available templates
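import os
import requests

API_KEY = os.environ["NEBIUS_API_KEY"]  # exported in Prerequisites
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
r = requests.get("https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/templates",
                 headers=HEADERS)
r.raise_for_status()  # surface auth or HTTP errors before parsing
templates = r.json().get("templates", [])
for t in templates:
    print(t["template_name"], [f["flavor_name"] for f in t.get("flavors", [])])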
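To narrow the listing down to templates that offer a particular flavor, a small filter over the parsed response may be enough. This is a minimal sketch that assumes only the template_name, flavors, and flavor_name fields used above; the helper name templates_with_flavor and the sample data are made up for illustration.

```python
def templates_with_flavor(templates, flavor_name):
    """Return the names of templates that list the given flavor."""
    return [
        t["template_name"]
        for t in templates
        if any(f.get("flavor_name") == flavor_name for f in t.get("flavors", []))
    ]


# Invented sample shaped like the "templates" list parsed above (not real API output).
sample = [
    {"template_name": "example/model-a",
     "flavors": [{"flavor_name": "base"}, {"flavor_name": "fast"}]},
    {"template_name": "example/model-b",
     "flavors": [{"flavor_name": "base"}]},
]
print(templates_with_flavor(sample, "fast"))  # ['example/model-a']
```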
Create an endpoint
payload = {
    "name": "my-endpoint",
    "template": "openai/gpt-oss-20b",  # a template_name from the listing above
    "flavor": "base",
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}
# HEADERS is the {"Authorization": f"Bearer {API_KEY}"} dict sent with every control-plane call
r = requests.post("https://api.tokenfactory.nebius.com/v0/dedicated_endpoints",
                  headers=HEADERS, json=payload)
r.raise_for_status()  # fail early instead of hitting a KeyError below
endpoint = r.json()
endpoint_id = endpoint["endpoint_id"]
routing_key = endpoint["routing_key"]
Poll GET /v0/dedicated_endpoints/{endpoint_id} until status == "ready".
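The polling step can be sketched as a loop with a timeout. The helper below is an assumption-laden outline rather than an official client: wait_until_ready is a hypothetical name, the default timeout and interval are arbitrary, and the status lookup is injected as a callable so the loop itself makes no assumptions about the response shape.

```python
import time


def wait_until_ready(get_status, timeout_s=1800.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns "ready" or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "ready":
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"endpoint not ready after {timeout_s}s (last status: {status!r})"
            )
        sleep(interval_s)
```

In practice get_status would wrap the GET request above and return the status value from its JSON body; where exactly that field sits in the response is an assumption to verify against the API docs. The source does not list failure states, so this sketch only distinguishes "ready" from everything else and relies on the timeout to stop.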
Run inference
from openai import OpenAI
# Base URL for eu-north1; use the inference base URL of your endpoint's region.
client = OpenAI(base_url="https://api.tokenfactory.nebius.com/v1/", api_key=API_KEY)
resp = client.chat.completions.create(
    model=routing_key,  # the routing_key from endpoint creation
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
Update autoscaling (live, no downtime)
requests.patch(
    f"https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/{endpoint_id}",
    headers=HEADERS,
    json={"scaling": {"min_replicas": 2, "max_replicas": 8}},
)
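Before sending the PATCH, it can help to sanity-check the replica bounds locally. The sketch below builds the same scaling body as above; the helper name scaling_patch is hypothetical, and the rules it enforces (non-negative minimum, maximum not below minimum) are assumptions about what a sensible request looks like, not documented API constraints. Whether a minimum of zero is accepted is not stated in the source.

```python
def scaling_patch(min_replicas: int, max_replicas: int) -> dict:
    """Build the JSON body for a scaling update, rejecting inconsistent bounds."""
    if min_replicas < 0:
        raise ValueError("min_replicas must not be negative")
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be >= min_replicas")
    return {"scaling": {"min_replicas": min_replicas, "max_replicas": max_replicas}}


print(scaling_patch(2, 8))
# {'scaling': {'min_replicas': 2, 'max_replicas': 8}}
```

The returned dict can be passed as the json argument of the PATCH call shown above.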
Delete endpoint
requests.delete(
    f"https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/{endpoint_id}",
    headers=HEADERS,
)
Choosing flavor
| Need | Use |
|---|---|
| High throughput, cost-efficient | base |
| Low latency, real-time UX | fast (uses speculative decoding + smaller batches) |
Data residency
Choose a region to control where inference runs. Note that metrics are collected locally but stored in eu-north1, whichever region the endpoint runs in.
Bundled reference
Read references/templates-regions.md when the user asks about available templates, GPU types, regions, or flavor differences.
Reference script
Full working script: scripts/02_dedicated_endpoints.py
Docs: https://docs.tokenfactory.nebius.com/ai-models-inference/dedicated-endpoints