azure-speech-to-text-rest-py

Azure Speech to Text REST API for Short Audio

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "azure-speech-to-text-rest-py" with this command: npx skills add claudedjale/skillset/claudedjale-skillset-azure-speech-to-text-rest-py

Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.

Prerequisites

  • Azure subscription - Create one free

  • Speech resource - Create in Azure Portal

  • Get credentials - After deployment, go to resource > Keys and Endpoint

Environment Variables

Required

AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region>  # e.g., eastus, westus2, westeurope

Alternative: Use endpoint directly

AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
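
The two configuration styles above can be reconciled in one place. The following is a minimal sketch, assuming the endpoint variable (when set) takes precedence over the region; `build_recognition_url` and `RECOGNITION_PATH` are illustrative names, not part of any Azure library.

```python
import os
from typing import Mapping

# Path used by the short-audio recognition examples in this document.
RECOGNITION_PATH = "/speech/recognition/conversation/cognitiveservices/v1"


def build_recognition_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the recognition URL from AZURE_SPEECH_ENDPOINT or AZURE_SPEECH_REGION."""
    endpoint = env.get("AZURE_SPEECH_ENDPOINT")
    if endpoint:
        # Assumption: an explicit endpoint wins over the region.
        base = endpoint.rstrip("/")
    else:
        region = env.get("AZURE_SPEECH_REGION")
        if not region:
            raise KeyError("Set AZURE_SPEECH_ENDPOINT or AZURE_SPEECH_REGION")
        base = f"https://{region}.stt.speech.microsoft.com"
    return base + RECOGNITION_PATH
```

Passing the environment as a parameter keeps the helper easy to test without touching real environment variables.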

Installation

pip install requests

Quick Start

import os
import requests


def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
    """Transcribe short audio file (max 60 seconds) using REST API."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {
        "language": language,
        "format": "detailed"  # or "simple"
    }

    with open(audio_file_path, "rb") as audio_file:
        response = requests.post(url, headers=headers, params=params, data=audio_file)

    response.raise_for_status()
    return response.json()

Usage

result = transcribe_audio("audio.wav", "en-US")
print(result["DisplayText"])

Audio Requirements

Format  Codec  Sample Rate   Notes
WAV     PCM    16 kHz, mono  Recommended
OGG     OPUS   16 kHz, mono  Smaller file size

Limitations:

  • Maximum 60 seconds of audio

  • For pronunciation assessment: maximum 30 seconds

  • No partial/interim results (final only)
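
The requirements and limits above can be checked locally before any request is sent. This is a rough pre-flight sketch using only the standard library; `content_type_for` and `check_wav` are hypothetical helper names, the header strings are the two this document uses for WAV and OGG, and the check only inspects PCM WAV files (the `wave` module raises `wave.Error` for other WAV encodings).

```python
import wave
from pathlib import Path

MAX_SECONDS = 60.0      # short-audio limit stated above
EXPECTED_RATE = 16_000  # 16 kHz

# Content-Type values as used elsewhere in this document.
CONTENT_TYPES = {
    ".wav": "audio/wav; codecs=audio/pcm; samplerate=16000",
    ".ogg": "audio/ogg; codecs=opus",
}


def content_type_for(path: str) -> str:
    """Pick the Content-Type header from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"Unsupported audio extension: {suffix!r}")
    return CONTENT_TYPES[suffix]


def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks suitable."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate

    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE}")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if seconds > MAX_SECONDS:
        problems.append(f"duration is {seconds:.1f} s, limit is {MAX_SECONDS:.0f} s")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report everything wrong with a file at once.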

Content-Type Headers

WAV PCM 16kHz

"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"

OGG OPUS

"Content-Type": "audio/ogg; codecs=opus"

Response Formats

Simple Format (default)

params = {"language": "en-US", "format": "simple"}

{
  "RecognitionStatus": "Success",
  "DisplayText": "Remind me to buy 5 pencils.",
  "Offset": "1236645672289",
  "Duration": "1236645672289"
}

Detailed Format

params = {"language": "en-US", "format": "detailed"}

{
  "RecognitionStatus": "Success",
  "Offset": "1236645672289",
  "Duration": "1236645672289",
  "NBest": [
    {
      "Confidence": 0.9052885,
      "Display": "What's the weather like?",
      "ITN": "what's the weather like",
      "Lexical": "what's the weather like",
      "MaskedITN": "what's the weather like"
    }
  ]
}

Chunked Transfer (Recommended)

For lower latency, stream audio in chunks:

import os
import requests


def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
    """Stream audio in chunks for lower latency."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
        "Transfer-Encoding": "chunked",
        "Expect": "100-continue"
    }

    params = {"language": language, "format": "detailed"}

    def generate_chunks(file_path: str, chunk_size: int = 1024):
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    response = requests.post(
        url,
        headers=headers,
        params=params,
        data=generate_chunks(audio_file_path)
    )

    response.raise_for_status()
    return response.json()

Authentication Options

Option 1: Subscription Key (Simple)

headers = { "Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"] }

Option 2: Bearer Token

import os
import requests


def get_access_token() -> str:
    """Get access token from the token endpoint."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

    response = requests.post(
        token_url,
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/x-www-form-urlencoded",
            "Content-Length": "0"
        }
    )
    response.raise_for_status()
    return response.text

Use token in requests (valid for 10 minutes)

token = get_access_token()
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json"
}
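
Because a token is valid for 10 minutes, requesting a new one for every call is wasteful. One possible approach is a small cache that refreshes slightly early; the sketch below assumes a 9-minute lifetime, and `TokenCache` is a hypothetical helper, not part of any Azure library. The fetch function and clock are injected so the class can be exercised without network access.

```python
import time
from typing import Callable


class TokenCache:
    """Reuse an access token until a conservative time-to-live has passed."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 9 * 60,  # assumption: refresh 1 minute before expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: str | None = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if it is missing or stale."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

In practice this would be created once, for example `cache = TokenCache(get_access_token)`, and each request would build its header from `cache.get()`.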

Query Parameters

Parameter  Required  Values                Description
language   Yes       en-US, de-DE, etc.    Language of speech
format     No        simple, detailed      Result format (default: simple)
profanity  No        masked, removed, raw  Profanity handling (default: masked)
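
A typo in a query parameter is easier to catch locally than from a 400 response. The following sketch validates the values listed in the table above before building the `params` dictionary; `build_params` is an illustrative helper and its argument names are assumptions.

```python
ALLOWED_FORMATS = ("simple", "detailed")
ALLOWED_PROFANITY = ("masked", "removed", "raw")


def build_params(
    language: str,
    result_format: str = "simple",
    profanity: str = "masked",
) -> dict[str, str]:
    """Build the query parameters, rejecting values the table does not list."""
    if not language:
        raise ValueError("language is required, e.g. 'en-US'")
    if result_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    if profanity not in ALLOWED_PROFANITY:
        raise ValueError(f"profanity must be one of {ALLOWED_PROFANITY}")
    return {"language": language, "format": result_format, "profanity": profanity}
```

The result can be passed directly as `params=` to `requests.post`.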

Recognition Status Values

Status                 Description
Success                Recognition succeeded
NoMatch                Speech detected but no words matched
InitialSilenceTimeout  Only silence detected
BabbleTimeout          Only noise detected
Error                  Internal service error
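
An HTTP 200 response can still carry any of the statuses above, so callers usually need to branch on `RecognitionStatus`. This is a minimal sketch of one way to do that; `summarize_status` is a hypothetical helper, and the messages simply restate the descriptions in the table.

```python
STATUS_MESSAGES = {
    "NoMatch": "Speech detected but no words matched",
    "InitialSilenceTimeout": "Only silence detected",
    "BabbleTimeout": "Only noise detected",
    "Error": "Internal service error",
}


def summarize_status(result: dict) -> tuple[bool, str]:
    """Return (ok, text): the transcript on success, otherwise a reason."""
    status = result.get("RecognitionStatus", "Error")
    if status == "Success":
        return True, result.get("DisplayText", "")
    # Fall back to a generic message for statuses not listed in the table.
    return False, STATUS_MESSAGES.get(status, f"Unexpected status: {status}")
```

A caller can then decide per status whether to retry, prompt the user to speak again, or give up.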

Profanity Handling

Mask profanity with asterisks (default)

params = {"language": "en-US", "profanity": "masked"}

Remove profanity entirely

params = {"language": "en-US", "profanity": "removed"}

Include profanity as-is

params = {"language": "en-US", "profanity": "raw"}

Error Handling

import os
import requests


def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
    """Transcribe with proper error handling."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    try:
        with open(audio_path, "rb") as audio_file:
            response = requests.post(
                url,
                headers={
                    "Ocp-Apim-Subscription-Key": api_key,
                    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
                    "Accept": "application/json"
                },
                params={"language": language, "format": "detailed"},
                data=audio_file
            )

        if response.status_code == 200:
            result = response.json()
            if result.get("RecognitionStatus") == "Success":
                return result
            # HTTP 200 can still report a non-success recognition status.
            print(f"Recognition failed: {result.get('RecognitionStatus')}")
            return None
        elif response.status_code == 400:
            print("Bad request: Check language code or audio format")
        elif response.status_code == 401:
            print("Unauthorized: Check API key or token")
        elif response.status_code == 403:
            print("Forbidden: Missing authorization header")
        else:
            print(f"Error {response.status_code}: {response.text}")

        return None

    except requests.exceptions.RequestException as e:
        # Covers connection errors, timeouts, and other transport failures.
        print(f"Request failed: {e}")
        return None

Async Version

import os
import aiohttp
import asyncio


async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
    """Async version using aiohttp."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {"language": language, "format": "detailed"}

    async with aiohttp.ClientSession() as session:
        with open(audio_file_path, "rb") as f:
            audio_data = f.read()

        async with session.post(url, headers=headers, params=params, data=audio_data) as response:
            response.raise_for_status()
            return await response.json()

Usage

result = asyncio.run(transcribe_async("audio.wav", "en-US"))
print(result["DisplayText"])

Supported Languages

Common language codes (see full list):

Code   Language
en-US  English (US)
en-GB  English (UK)
de-DE  German
fr-FR  French
es-ES  Spanish (Spain)
es-MX  Spanish (Mexico)
zh-CN  Chinese (Mandarin)
ja-JP  Japanese
ko-KR  Korean
pt-BR  Portuguese (Brazil)

Best Practices

  • Use WAV PCM 16kHz mono for best compatibility

  • Enable chunked transfer for lower latency

  • Cache access tokens for 9 minutes (valid for 10)

  • Specify the correct language for accurate recognition

  • Use detailed format when you need confidence scores

  • Handle all RecognitionStatus values in production code

When NOT to Use This API

Use the Speech SDK or Batch Transcription API instead when you need:

  • Audio longer than 60 seconds

  • Real-time streaming transcription

  • Partial/interim results

  • Speech translation

  • Custom speech models

  • Batch transcription of many files

Reference Files

File                                    Contents
references/pronunciation-assessment.md  Pronunciation assessment parameters and scoring

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
