gemini-stt

Transcribe audio files using Google's Gemini API or Vertex AI

Install skill "gemini-stt" with this command: npx skills add araa47/gemini-stt

Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. The default model, gemini-2.0-flash-lite, is chosen for transcription speed.

Authentication (choose one)

Option 1: Vertex AI with Application Default Credentials (Recommended)

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

The script will automatically detect and use ADC when available.
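The detection order can be sketched in Python as follows. This is an illustration, not the script's actual code: the function names are hypothetical, and it assumes ADC is found either through the GOOGLE_APPLICATION_CREDENTIALS variable or in the default file that gcloud writes on Linux and macOS.

```python
from pathlib import Path


def adc_file_candidates(env: dict[str, str], home: Path) -> list[Path]:
    """Return the paths where an ADC credentials file may live (assumed layout)."""
    candidates = []
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        candidates.append(Path(explicit))
    # Default file written by `gcloud auth application-default login`
    # on Linux and macOS.
    candidates.append(
        home / ".config" / "gcloud" / "application_default_credentials.json"
    )
    return candidates


def choose_auth(env: dict[str, str], home: Path, force_vertex: bool = False) -> str:
    """Return "vertex" or "api_key"; raise if nothing usable is configured."""
    if any(path.is_file() for path in adc_file_candidates(env, home)):
        return "vertex"
    if force_vertex:
        raise RuntimeError("--vertex was requested but no ADC credentials were found")
    if env.get("GEMINI_API_KEY"):
        return "api_key"
    raise RuntimeError("no authentication available: configure ADC or GEMINI_API_KEY")
```

A caller would pass dict(os.environ) and Path.home(). Taking both as parameters keeps the function easy to test.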

Option 2: Direct Gemini API Key

Set GEMINI_API_KEY in your environment (for example, in ~/.env or ~/.clawdbot/.env).
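If the key is kept in a .env file, it can be read with the standard library alone. The parser below is a minimal sketch under the assumption that the file holds simple KEY=value lines. The name parse_env_text is hypothetical, and the listing does not say how, or whether, the script itself reads these files.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines; skip blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Tolerate shell-style `export KEY=value` lines.
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes, if any.
        values[key.strip()] = value.strip().strip("'\"")
    return values
```

The result can be merged over os.environ before checking for GEMINI_API_KEY.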

Requirements

  • Python 3.10+ (no external dependencies)
  • Either GEMINI_API_KEY or gcloud CLI with ADC configured

Supported Formats

  • .ogg / .opus (Telegram voice messages)
  • .mp3
  • .wav
  • .m4a
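Mapping these extensions to MIME types might look like the sketch below. The table values are common conventions and an assumption, not taken from the script. Check the Gemini audio documentation for the MIME types it accepts.

```python
from pathlib import Path

# Assumed extension-to-MIME mapping; the script's real table may differ.
AUDIO_MIME_TYPES = {
    ".ogg": "audio/ogg",
    ".opus": "audio/ogg",
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".m4a": "audio/mp4",
}


def guess_audio_mime(path: str) -> str:
    """Return the MIME type for a supported audio file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported audio format: {suffix or '(none)'}") from None
```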

Usage

# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg

Options

Option           Description
<audio_file>     Path to the audio file (required)
--model, -m      Gemini model to use (default: gemini-2.0-flash-lite)
--vertex, -v     Force use of Vertex AI with ADC
--project, -p    GCP project ID (for Vertex, defaults to gcloud config)
--region, -r     GCP region (for Vertex, default: us-central1)
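For reference, an argparse parser that accepts the same options could be written as below. This is a sketch built from the options listed here, not the script's source. The function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="transcribe.py",
        description="Transcribe an audio file with Gemini.",
    )
    parser.add_argument("audio_file", help="Path to the audio file")
    parser.add_argument(
        "--model", "-m",
        default="gemini-2.0-flash-lite",
        help="Gemini model to use",
    )
    parser.add_argument(
        "--vertex", "-v",
        action="store_true",
        help="Force use of Vertex AI with ADC",
    )
    parser.add_argument(
        "--project", "-p",
        default=None,
        help="GCP project ID (Vertex only)",
    )
    parser.add_argument(
        "--region", "-r",
        default="us-central1",
        help="GCP region (Vertex only)",
    )
    return parser
```

Calling build_parser().parse_args(["audio.ogg", "--vertex"]) yields a namespace with vertex set to True and the documented defaults for the remaining options.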

Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

Model                     Notes
gemini-2.0-flash-lite     Default. Fastest transcription speed.
gemini-2.0-flash          Fast and cost-effective.
gemini-2.5-flash-lite     Lightweight 2.5 model.
gemini-2.5-flash          Balanced speed and quality.
gemini-2.5-pro            Higher quality, slower.
gemini-3-flash-preview    Latest flash model.
gemini-3-pro-preview      Latest pro model, best quality.

See Gemini API Models for the latest list.

How It Works

  1. Reads the audio file and base64-encodes it
  2. Auto-detects authentication:
    • If ADC is available (gcloud), uses the Vertex AI endpoint
    • Otherwise, uses GEMINI_API_KEY with the direct Gemini API
  3. Sends the encoded audio to the selected Gemini model with a transcription prompt
  4. Returns the transcribed text

Example Integration

For Clawdbot voice message handling:

# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"

Error Handling

The script prints an error to stderr and exits with code 1 in these cases:

  • No authentication available (neither ADC nor GEMINI_API_KEY)
  • File not found
  • API errors
  • Missing GCP project (when using Vertex)
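A Python caller can rely on this exit-code contract. The wrapper below is a sketch: run_transcriber and transcribe are hypothetical names, and the script path is the one used in the Usage section.

```python
import subprocess
import sys
from pathlib import Path

# Install path as shown in the Usage section.
TRANSCRIBE_SCRIPT = Path.home() / ".claude" / "skills" / "gemini-stt" / "transcribe.py"


def run_transcriber(command: list[str]) -> str:
    """Run a command and return its stdout; raise with stderr on a non-zero exit."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"transcription failed (exit {result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout.strip()


def transcribe(audio_path: str) -> str:
    """Transcribe one file with the skill's script and default options."""
    return run_transcriber([sys.executable, str(TRANSCRIBE_SCRIPT), audio_path])
```

Raising an exception rather than returning an empty string keeps authentication and API failures from being mistaken for silent audio.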

Notes

  • Uses Gemini 2.0 Flash Lite by default for fastest transcription
  • No external Python dependencies (uses stdlib only)
  • Automatically detects MIME type from file extension
  • Prefers Vertex AI with ADC when available (no API key management needed)
