Podcast Generation with GPT Realtime Mini
Generate real audio narratives from text content using Azure OpenAI's Realtime API.
Quick Start
- Configure environment variables for the Realtime API
- Connect via WebSocket to the Azure OpenAI Realtime endpoint
- Send the text prompt, then collect PCM audio chunks and the transcript
- Convert PCM to WAV format
- Return base64-encoded audio to the frontend for playback
Environment Configuration
AZURE_OPENAI_AUDIO_API_KEY=your_realtime_api_key
AZURE_OPENAI_AUDIO_ENDPOINT=https://your-resource.cognitiveservices.azure.com
AZURE_OPENAI_AUDIO_DEPLOYMENT=gpt-realtime-mini
Note: The endpoint should NOT include /openai/v1/; use just the base URL.
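The settings above could be loaded and validated along these lines. This is a minimal sketch, not the project's actual loader: the names `RealtimeConfig`, `load_realtime_config`, and `to_websocket_url` are hypothetical, and rejecting an endpoint that already contains /openai/v1 is an assumption that follows from the note above.

```python
import os
from typing import Mapping, NamedTuple


class RealtimeConfig(NamedTuple):
    api_key: str
    endpoint: str
    deployment: str


def load_realtime_config(env: Mapping[str, str] = os.environ) -> RealtimeConfig:
    """Read the three Realtime settings and reject a malformed endpoint."""
    try:
        api_key = env["AZURE_OPENAI_AUDIO_API_KEY"]
        endpoint = env["AZURE_OPENAI_AUDIO_ENDPOINT"].rstrip("/")
        deployment = env["AZURE_OPENAI_AUDIO_DEPLOYMENT"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing.args[0]}") from None
    # Assumption: the base URL must not already carry the API path.
    if "/openai/v1" in endpoint:
        raise ValueError("Endpoint must be the base URL, without /openai/v1")
    return RealtimeConfig(api_key, endpoint, deployment)


def to_websocket_url(endpoint: str) -> str:
    """Swap the scheme to wss:// and append the /openai/v1 path."""
    if not endpoint.startswith("https://"):
        raise ValueError("Endpoint must start with https://")
    return "wss://" + endpoint[len("https://"):].rstrip("/") + "/openai/v1"
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.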
Core Workflow
Backend Audio Generation
import base64
import os

from openai import AsyncOpenAI

endpoint = os.environ["AZURE_OPENAI_AUDIO_ENDPOINT"]
api_key = os.environ["AZURE_OPENAI_AUDIO_API_KEY"]
deployment = os.environ["AZURE_OPENAI_AUDIO_DEPLOYMENT"]

# Convert HTTPS endpoint to WebSocket URL
ws_url = endpoint.rstrip("/").replace("https://", "wss://") + "/openai/v1"

client = AsyncOpenAI(websocket_base_url=ws_url, api_key=api_key)


async def generate_audio(prompt: str) -> tuple[bytes, str]:
    audio_chunks = []
    transcript_parts = []

    # Pass the deployment name (gpt-realtime-mini in the configuration above)
    async with client.realtime.connect(model=deployment) as conn:
        # Configure for audio-only output
        await conn.session.update(session={
            "output_modalities": ["audio"],
            "instructions": "You are a narrator. Speak naturally.",
        })

        # Send text to narrate
        await conn.conversation.item.create(item={
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": prompt}],
        })
        await conn.response.create()

        # Collect streaming events
        async for event in conn:
            if event.type == "response.output_audio.delta":
                audio_chunks.append(base64.b64decode(event.delta))
            elif event.type == "response.output_audio_transcript.delta":
                transcript_parts.append(event.delta)
            elif event.type == "error":
                raise RuntimeError(event.error.message)
            elif event.type == "response.done":
                break

    # Convert PCM to WAV (pcm_to_wav comes from scripts/pcm_to_wav.py)
    pcm_audio = b"".join(audio_chunks)
    wav_audio = pcm_to_wav(pcm_audio, sample_rate=24000)
    return wav_audio, "".join(transcript_parts)
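The conversion helper lives in scripts/pcm_to_wav.py, which is not reproduced here. As a rough stand-in, assuming the 24 kHz, 16-bit, mono format this document describes, a function with the same call shape could be written with Python's standard `wave` module. The `channels` and `sample_width` parameters are illustrative additions and may not match the real script.

```python
import io
import wave


def pcm_to_wav(pcm_audio: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV (RIFF) container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # 1 = mono
        wav_file.setsampwidth(sample_width)  # bytes per sample: 2 = 16-bit
        wav_file.setframerate(sample_rate)   # samples per second
        wav_file.writeframes(pcm_audio)      # header sizes are patched on close
    return buffer.getvalue()
```

Because the WAV container only prepends a header, the sample data passes through unchanged; no resampling or re-encoding takes place.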
Frontend Audio Playback
// Convert base64 WAV to playable blob
const base64ToBlob = (base64, mimeType) => {
  const bytes = atob(base64);
  const arr = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
  return new Blob([arr], { type: mimeType });
};

const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
const audioUrl = URL.createObjectURL(audioBlob);
new Audio(audioUrl).play();
Voice Options
| Voice | Character |
|---------|------------|
| alloy | Neutral |
| echo | Warm |
| fable | Expressive |
| onyx | Deep |
| nova | Friendly |
| shimmer | Clear |
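One way to apply a voice is to build the session configuration from a validated voice name before sending it with `conn.session.update(session=...)`. The sketch below is an assumption-laden illustration: `build_session` is a hypothetical helper, the voice list is copied from the table above without verifying that every entry is available on a given deployment, and the nested `audio.output.voice` location reflects our reading of the Realtime session schema and should be checked against the API reference.

```python
# Voice names and characters copied from the table above (not verified
# against any particular deployment).
VOICE_CHARACTERS = {
    "alloy": "Neutral",
    "echo": "Warm",
    "fable": "Expressive",
    "onyx": "Deep",
    "nova": "Friendly",
    "shimmer": "Clear",
}


def build_session(voice: str = "alloy",
                  instructions: str = "You are a narrator. Speak naturally.") -> dict:
    """Return a session payload with audio-only output and the chosen voice."""
    if voice not in VOICE_CHARACTERS:
        options = ", ".join(sorted(VOICE_CHARACTERS))
        raise ValueError(f"Unknown voice {voice!r}; expected one of: {options}")
    return {
        "output_modalities": ["audio"],
        "instructions": instructions,
        # Assumed field location for the voice setting; confirm in the API docs.
        "audio": {"output": {"voice": voice}},
    }
```

Validating the name locally surfaces a typo immediately instead of as a server-side error after the connection is open.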
Realtime API Events
- response.output_audio.delta: Base64 audio chunk
- response.output_audio_transcript.delta: Transcript text
- response.done: Generation complete
- error: Handle with event.error.message
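The four event types above can be routed through a single function, which keeps the receive loop short and makes the logic testable without a live connection. This is a sketch only: `handle_event` is a hypothetical helper, raising `RuntimeError` on an error event is one possible policy, and the `SimpleNamespace` objects merely imitate the attributes (`type`, `delta`, `error.message`) that the document uses.

```python
import base64
from types import SimpleNamespace


def handle_event(event, audio_chunks: list, transcript_parts: list) -> bool:
    """Process one event; return True once the response is complete."""
    if event.type == "response.output_audio.delta":
        audio_chunks.append(base64.b64decode(event.delta))
    elif event.type == "response.output_audio_transcript.delta":
        transcript_parts.append(event.delta)
    elif event.type == "error":
        raise RuntimeError(f"Realtime API error: {event.error.message}")
    elif event.type == "response.done":
        return True
    return False  # any other event type is ignored


# Simulated stream standing in for `async for event in conn`.
simulated = [
    SimpleNamespace(type="response.output_audio.delta",
                    delta=base64.b64encode(b"\x01\x02").decode("ascii")),
    SimpleNamespace(type="response.output_audio_transcript.delta", delta="Hello"),
    SimpleNamespace(type="response.done"),
]

chunks, parts = [], []
for ev in simulated:
    if handle_event(ev, chunks, parts):
        break
```

In the real loop the same call would sit inside `async for event in conn:`, with `break` on a `True` return.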
Audio Format
- Input: Text prompt
- Output: PCM audio (24 kHz, 16-bit, mono)
- Storage: Base64-encoded WAV
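These figures imply some useful arithmetic: at 24,000 samples per second and 2 bytes per sample on one channel, one second of audio is 48,000 bytes of PCM, and base64 encoding grows the stored payload by roughly a third. The helpers below are a hypothetical illustration of that arithmetic and of the encode/decode round trip; their names are not taken from the project.

```python
import base64

SAMPLE_RATE = 24000    # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit
CHANNELS = 1           # mono


def pcm_duration_seconds(pcm_audio: bytes) -> float:
    """Duration of raw PCM in the format listed above."""
    return len(pcm_audio) / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)


def encode_for_storage(wav_audio: bytes) -> str:
    """Base64-encode WAV bytes into an ASCII string for JSON or storage."""
    return base64.b64encode(wav_audio).decode("ascii")


def decode_from_storage(audio_data: str) -> bytes:
    """Reverse encode_for_storage."""
    return base64.b64decode(audio_data)
```

A duration estimate like this can help decide whether a long narration should be split before its base64 form is returned in a single response.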
References
- Full architecture: See references/architecture.md for complete stack design
- Code examples: See references/code-examples.md for production patterns
- PCM conversion: Use scripts/pcm_to_wav.py for audio format conversion