Mistral Rate Limits
Overview
Rate limit management for Mistral AI API. Mistral enforces per-minute request and token limits that vary by model tier and subscription plan.
Prerequisites
- Mistral API key configured
- Understanding of token vs request rate limits
- Retry infrastructure
Mistral Rate Limits by Tier
| Model | Requests/min | Tokens/min | Tokens/month |
|---|---|---|---|
| mistral-small | 120 | 500,000 | Varies by plan |
| mistral-medium | 60 | 500,000 | Varies by plan |
| mistral-large | 30 | 200,000 | Varies by plan |
| mistral-embed | 300 | 1,000,000 | Varies by plan |
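For configuration, the table above can be mirrored as a plain mapping. The values are copied from the table; `MODEL_LIMITS` is an illustrative name, and the actual limits depend on your subscription plan, so verify them in the Mistral console:

```python
# Per-model limits from the table above. These vary by plan;
# treat them as defaults to override from your own account settings.
MODEL_LIMITS = {
    "mistral-small":  {"rpm": 120, "tpm": 500_000},
    "mistral-medium": {"rpm": 60,  "tpm": 500_000},
    "mistral-large":  {"rpm": 30,  "tpm": 200_000},
    "mistral-embed":  {"rpm": 300, "tpm": 1_000_000},
}
```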
Instructions
Step 1: Token-Aware Rate Limiter
Mistral limits both requests and tokens per minute. Track both.
```python
import time
import threading

class MistralRateLimiter:
    def __init__(self, rpm: int = 60, tpm: int = 200_000):
        self.rpm = rpm
        self.tpm = tpm
        self.request_times = []  # timestamps of recent requests
        self.token_usage = []    # (timestamp, tokens) pairs
        self.lock = threading.Lock()

    def _prune(self, cutoff: float):
        # Drop entries older than the sliding 60-second window
        self.request_times = [t for t in self.request_times if t > cutoff]
        self.token_usage = [(t, n) for t, n in self.token_usage if t > cutoff]

    def wait_if_needed(self, estimated_tokens: int = 1000):
        with self.lock:
            now = time.time()
            cutoff = now - 60
            self._prune(cutoff)
            current_rpm = len(self.request_times)
            current_tpm = sum(n for _, n in self.token_usage)
            if current_rpm >= self.rpm:
                # Sleep until the oldest request falls out of the window
                sleep_time = self.request_times[0] - cutoff
                time.sleep(sleep_time + 0.1)
            if current_tpm + estimated_tokens > self.tpm and self.token_usage:
                sleep_time = self.token_usage[0][0] - cutoff
                time.sleep(sleep_time + 0.1)
            self.request_times.append(time.time())

    def record_usage(self, tokens: int):
        with self.lock:
            self.token_usage.append((time.time(), tokens))

    def has_capacity(self, estimated_tokens: int = 1000) -> bool:
        # Non-blocking check, used by the ModelRouter in Step 3
        with self.lock:
            self._prune(time.time() - 60)
            return (len(self.request_times) < self.rpm
                    and sum(n for _, n in self.token_usage) + estimated_tokens <= self.tpm)
```
Usage:

```python
limiter = MistralRateLimiter(rpm=30, tpm=200_000)

def rate_limited_chat(client, messages, model="mistral-large-latest"):
    # Rough estimate: ~4 characters per token
    estimated = sum(len(m["content"]) // 4 for m in messages)
    limiter.wait_if_needed(estimated)
    response = client.chat.complete(model=model, messages=messages)
    limiter.record_usage(response.usage.total_tokens)
    return response
```
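The usage above estimates tokens with a rough 4-characters-per-token heuristic. Pulled out as a standalone helper (`estimate_tokens` is an illustrative name, and the ratio is an approximation, not Mistral's actual tokenizer):

```python
def estimate_tokens(messages, chars_per_token=4):
    """Rough token estimate: ~4 characters per token for English text.
    For exact counts, use Mistral's own tokenizer instead."""
    # Floor at 1 so the limiter never reserves zero tokens for a request
    return sum(len(m["content"]) // chars_per_token for m in messages) or 1
```

This undercounts for code-heavy or non-English text; pad the estimate if your traffic skews that way.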
Step 2: Handle 429 Responses
```python
import time

def chat_with_retry(client, messages, model, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.complete(model=model, messages=messages)
        except Exception as e:
            if getattr(e, "status_code", None) == 429:  # HTTP 429 Too Many Requests
                wait = min(2 ** attempt + 1, 60)  # capped exponential backoff
                print(f"Rate limited, waiting {wait}s (attempt {attempt + 1})")
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")
```
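The retry above uses a capped exponential schedule. Factoring the delay computation out makes it testable in isolation; `backoff_delay` is an illustrative helper, and the jitter option is an addition not present in the original loop:

```python
import random

def backoff_delay(attempt, base=2, cap=60, jitter=False):
    """Exponential backoff: base**attempt + 1 seconds, capped at `cap`.
    Full jitter (random delay in [0, computed]) spreads out concurrent
    retriers so they don't hammer the API in lockstep."""
    delay = min(base ** attempt + 1, cap)
    if jitter:
        delay = random.uniform(0, delay)
    return delay
```

Jitter matters most when many workers share one API key and hit the limit simultaneously.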
Step 3: Model-Tier Routing for Throughput
Route requests to cheaper models when premium capacity is exhausted.
```python
class ModelRouter:
    def __init__(self):
        self.limiters = {
            "mistral-large-latest": MistralRateLimiter(rpm=30, tpm=200_000),
            "mistral-small-latest": MistralRateLimiter(rpm=120, tpm=500_000),
        }

    def get_available_model(self, preferred: str = "mistral-large-latest") -> str:
        limiter = self.limiters[preferred]
        if limiter.has_capacity():
            return preferred
        # Fall back to the smaller model, which has more headroom
        return "mistral-small-latest"
```
Step 4: Batch Embedding with Rate Awareness
```python
def batch_embed(client, texts: list[str], batch_size: int = 32):
    # mistral-embed tier: 300 requests/min, 1,000,000 tokens/min
    limiter = MistralRateLimiter(rpm=300, tpm=1_000_000)
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        estimated_tokens = sum(len(t) // 4 for t in batch)  # ~4 chars per token
        limiter.wait_if_needed(estimated_tokens)
        response = client.embeddings.create(model="mistral-embed", inputs=batch)
        all_embeddings.extend([d.embedding for d in response.data])
        limiter.record_usage(response.usage.total_tokens)
    return all_embeddings
```
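A fixed `batch_size` can still overshoot the token budget when individual texts are long. One way to bound both dimensions is to split on an estimated token cap as well; `make_batches` is an illustrative helper, and the `max_tokens=8000` default is an assumed per-request budget, not a documented Mistral limit:

```python
def make_batches(texts, max_items=32, max_tokens=8000, chars_per_token=4):
    """Group texts into batches bounded by item count AND a rough token
    budget (~4 chars/token), so one batch of long texts can't blow the TPM."""
    batches, current, current_tokens = [], [], 0
    for text in texts:
        tokens = max(1, len(text) // chars_per_token)
        # Flush the current batch if adding this text would exceed either bound
        if current and (len(current) >= max_items or current_tokens + tokens > max_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

`batch_embed` can then iterate over `make_batches(texts)` instead of fixed slices.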
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| 429 errors | Exceeded RPM or TPM | Use rate limiter, exponential backoff |
| Inconsistent limits | Different limits per model | Configure limiter per model tier |
| Batch embedding failures | Too many tokens per batch | Reduce batch size |
| Spike traffic blocked | No smoothing | Queue requests, spread over time |
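The "queue requests, spread over time" fix for spike traffic can be sketched as a simple pacer that spaces calls evenly instead of letting them burst. `RequestPacer` is an illustrative sketch, not part of the Mistral SDK; the `now` parameter exists only to make the timing logic testable:

```python
import time

class RequestPacer:
    """Spread requests evenly: at most `rpm` calls per minute,
    spaced 60/rpm seconds apart, instead of bursting then stalling."""
    def __init__(self, rpm):
        self.interval = 60.0 / rpm
        self.next_slot = 0.0

    def wait(self, now=None):
        """Return how long the caller should sleep before sending."""
        now = time.monotonic() if now is None else now
        delay = max(0.0, self.next_slot - now)
        self.next_slot = max(now, self.next_slot) + self.interval
        return delay
```

In production, call `time.sleep(pacer.wait())` before each request; under spikes this queues callers at one-interval spacing rather than letting the API reject them.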
Examples
Rate Limit Dashboard
```python
# Note: these counters include entries up to 60 seconds old unless the
# limiter's lists were pruned recently, so treat the reading as approximate.
status = {
    "rpm_used": len(limiter.request_times),
    "rpm_limit": limiter.rpm,
    "tpm_used": sum(n for _, n in limiter.token_usage),
    "tpm_limit": limiter.tpm,
    "utilization_pct": len(limiter.request_times) / limiter.rpm * 100,
}
```
Resources
- Mistral Rate Limits
- Mistral API Reference
Output
- Configuration files or code changes applied to the project
- Validation report confirming correct implementation
- Summary of changes made and their rationale