AssemblyAI transcription, Speech Understanding, and agent-friendly exports
Use this skill when the user wants AssemblyAI rather than generic transcription, or when the job benefits from AssemblyAI-specific capabilities such as:
- model routing across universal-3-pro and universal-2
- language detection and code switching
- diarisation plus speaker name / role mapping
- translation, custom formatting, or AssemblyAI speaker identification
- subtitles, paragraphs, sentences, topic / entity / sentiment tasks
- transcript output that is easy for other agents to consume as Markdown or normalised JSON
The skill is designed for AI agents like OpenClaw, not just end users. It provides:
- A no-dependency Node CLI in scripts/assemblyai.mjs (and a compatibility wrapper at assemblyai.mjs)
- Bundled model/language knowledge via models and languages commands
- Stable transcript output formats
  - agent-friendly Markdown
  - normalised agent JSON
  - bundle manifests for downstream automation
- Speaker mapping workflows
  - manual speaker/channel maps
  - AssemblyAI speaker identification
  - merged display names in both Markdown and JSON
- AssemblyAI LLM Gateway integration for structured extraction from transcripts
Use this skill in this order
1) Decide whether the user needs AssemblyAI-specific behaviour
If they just want “a transcript”, a generic solution may be enough. Reach for this skill when the user mentions AssemblyAI, wants a specific AssemblyAI feature, or needs the richer outputs and post-processing this skill provides.
2) Pick the best entry point
- New transcription → transcribe
- Existing transcript id → get or wait
- Re-render existing saved JSON → format
- Post-process an existing transcript → understand
- Run transcript text through LLM Gateway → llm
- Need a quick capability lookup before deciding → models or languages
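The mapping above can be sketched as a small decision function. This is only an illustration: the shape of the `task` object and the name `pickEntryPoint` are hypothetical and not part of the skill; only the returned command names come from the list above.

```javascript
// Hypothetical helper: maps a coarse description of the job to a CLI command.
// The `task` fields are illustrative assumptions, not part of the skill.
function pickEntryPoint(task) {
  if (task.needsCapabilityLookup) {
    return task.lookup === "languages" ? "languages" : "models";
  }
  if (task.savedJsonPath) return "format"; // re-render JSON already on disk
  if (task.transcriptId) {
    if (task.wantsLlm) return "llm"; // LLM Gateway over transcript text
    if (task.wantsUnderstanding) return "understand"; // Speech Understanding post-processing
    return task.blockUntilDone ? "wait" : "get";
  }
  return "transcribe"; // nothing exists yet: start a new transcription
}

console.log(pickEntryPoint({ transcriptId: "TRANSCRIPT_ID", wantsUnderstanding: true }));
console.log(pickEntryPoint({}));
```

The order of the checks is a design choice for this sketch: lookups and re-rendering are cheapest, so they are tested before anything that calls the API.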
3) Prefer the agent-friendly defaults
For most unknown-language or mixed-language jobs, prefer:
node {baseDir}/assemblyai.mjs transcribe INPUT --bundle-dir ./assemblyai-out --all-exports
Why:
- the CLI defaults to auto-best routing when models are not specified
- it writes a manifest + multiple files that agents can inspect without reparsing terminal output
- Markdown and agent JSON become available immediately for follow-on steps
Quick-start recipes
Best general default
Use this when the source language is unknown or could be outside the 6-language Universal-3-Pro set:
node {baseDir}/assemblyai.mjs transcribe ./meeting.mp3 --bundle-dir ./out --all-exports
This defaults to model routing plus language detection unless the request already specifies a model or language.
Best known-language accuracy
If the language is known and supported by Universal-3-Pro, prefer an explicit request:
node {baseDir}/assemblyai.mjs transcribe ./meeting.mp3 --speech-model universal-3-pro --language-code en_us --bundle-dir ./out
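As a rough sketch, the choice between the two recipes above could be made when building the argument list. The function name and option names are hypothetical. The list of languages supported by Universal-3-Pro is deliberately passed in by the caller (for example, taken from the `languages --model universal-3-pro` command) rather than hard-coded, since this document does not enumerate it.

```javascript
// Sketch: choose between default routing and an explicit Universal-3-Pro request.
// `u3ProLanguages` is an assumption supplied by the caller; nothing is hard-coded here.
function buildTranscribeArgs({
  input,
  languageCode,
  u3ProLanguages = [],
  bundleDir = "./out",
  allExports = false,
}) {
  const args = ["transcribe", input];
  if (languageCode && u3ProLanguages.includes(languageCode)) {
    // Known, supported language: request the model and language explicitly.
    args.push("--speech-model", "universal-3-pro", "--language-code", languageCode);
  }
  // Otherwise this sketch leaves model and language unset so the CLI defaults apply.
  args.push("--bundle-dir", bundleDir);
  if (allExports) args.push("--all-exports");
  return args;
}

console.log(buildTranscribeArgs({ input: "./meeting.mp3", allExports: true }).join(" "));
console.log(
  buildTranscribeArgs({
    input: "./meeting.mp3",
    languageCode: "en_us",
    u3ProLanguages: ["en_us"], // illustrative; query the CLI for the real list
  }).join(" ")
);
```

The resulting array can be appended to `node {baseDir}/assemblyai.mjs` when spawning the CLI. Keeping it as an array avoids shell quoting problems with file names that contain spaces.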
Meeting / interview with speaker labels
node {baseDir}/assemblyai.mjs transcribe ./meeting.mp3 --speaker-labels --bundle-dir ./out
Add explicit speaker names or roles
Manual mapping:
node {baseDir}/assemblyai.mjs transcribe ./meeting.mp3 --speaker-labels --speaker-map @assets/speaker-map.example.json --bundle-dir ./out
AssemblyAI speaker identification:
node {baseDir}/assemblyai.mjs transcribe ./meeting.mp3 --speaker-labels --speaker-type role --known-speakers "host,guest" --bundle-dir ./out
Or post-process an existing transcript:
node {baseDir}/assemblyai.mjs understand TRANSCRIPT_ID --speaker-type name --speaker-profiles @assets/speaker-profiles-name.example.json --bundle-dir ./out
Translation
node {baseDir}/assemblyai.mjs transcribe ./meeting.mp3 --translate-to de,fr --match-original-utterance --bundle-dir ./out
Structured extraction through LLM Gateway
node {baseDir}/assemblyai.mjs llm TRANSCRIPT_ID --prompt @assets/example-prompt.txt --schema @assets/llm-json-schema.example.json --out ./summary.json
Command guidance
transcribe
Use for local files or remote URLs.
- Local files are uploaded first.
- Public URLs are sent directly to AssemblyAI.
- Waits by default, then renders output.
Prefer --bundle-dir for anything longer than a trivial clip.
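An agent may want to know in advance whether an input will be uploaded or passed through as a URL. The following is a hypothetical pre-check that mirrors the behaviour described above; it is not the CLI's actual implementation, and the function name is an assumption.

```javascript
// Hypothetical pre-check: would this input be treated as a remote URL or a local file?
// Only http(s) URLs count as remote in this sketch.
function classifyInput(input) {
  try {
    const url = new URL(input); // throws for relative paths such as ./meeting.mp3
    if (url.protocol === "http:" || url.protocol === "https:") return "remote-url";
  } catch {
    // Not an absolute URL, so fall through and treat it as a path.
  }
  return "local-file";
}

console.log(classifyInput("./meeting.mp3"));
console.log(classifyInput("https://example.com/audio/meeting.mp3"));
```

A check like this is useful mainly for logging and for deciding whether a file must exist on disk before the command is run.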
get / wait
Use when you already have the transcript id. wait blocks until completion; get fetches immediately unless you add --wait.
format
Use when you already saved:
- raw transcript JSON from AssemblyAI, or
- the normalised agent JSON produced by this skill
This is useful when you want to apply a new speaker map, re-render Markdown, or generate a fresh bundle without retranscribing.
understand
Use when you need AssemblyAI Speech Understanding on an existing transcript:
- translation
- speaker identification
- custom formatting
This command fetches the transcript, merges in the returned understanding results, then renders updated Markdown / agent JSON / bundle outputs.
llm
Use when the user wants:
- summaries
- extraction
- structured JSON
- downstream reasoning over the transcript
Prefer --schema when the next step is automated.
Output strategy
Best default for agents: bundle mode
--bundle-dir writes a directory containing:
- Markdown transcript
- agent JSON
- raw JSON
- optional paragraphs / sentences / subtitles
- a machine-readable manifest
This is usually better than dumping everything to stdout.
Primary output kinds
Use --export to choose the main output:
- markdown (default)
- agent-json
- json / raw-json
- text
- paragraphs
- sentences
- srt
- vtt
- manifest
Sidecar outputs
You can request extra files directly with:
- --markdown-out
- --agent-json-out
- --raw-json-out
- --paragraphs-out
- --sentences-out
- --srt-out
- --vtt-out
- --understanding-json-out
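A minimal sketch of turning a set of desired sidecar files into CLI arguments. The flag names are the ones listed above. The assumption that each flag takes a file path as its value, and the names `SIDECAR_FLAGS` and `sidecarArgs`, are illustrative only.

```javascript
// Flag names come from the list above; the key names on the left are hypothetical.
const SIDECAR_FLAGS = new Map([
  ["markdown", "--markdown-out"],
  ["agentJson", "--agent-json-out"],
  ["rawJson", "--raw-json-out"],
  ["paragraphs", "--paragraphs-out"],
  ["sentences", "--sentences-out"],
  ["srt", "--srt-out"],
  ["vtt", "--vtt-out"],
  ["understandingJson", "--understanding-json-out"],
]);

// Assumes each sidecar flag is followed by the path to write.
function sidecarArgs(outputs) {
  const args = [];
  for (const [kind, path] of Object.entries(outputs)) {
    const flag = SIDECAR_FLAGS.get(kind);
    if (!flag) throw new Error(`Unknown sidecar output: ${kind}`);
    args.push(flag, path);
  }
  return args;
}

console.log(sidecarArgs({ markdown: "./out/t.md", srt: "./out/t.srt" }).join(" "));
```

Rejecting unknown kinds early means a mistyped key fails before any audio is uploaded.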
Speaker mapping rules
Speaker display names are merged in this order, highest priority first:
- manual --speaker-map
- AssemblyAI speaker identification mapping
- fallback generic names like Speaker A or Channel 1
This means you can let AssemblyAI identify speakers first, then still override individual display names later.
Example manual map file: assets/speaker-map.example.json
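The precedence described in this section can be illustrated with a short sketch. It assumes both maps are plain objects keyed by the diarisation label (for example "A"); the real file format is whatever assets/speaker-map.example.json defines and may differ. The function name is hypothetical.

```javascript
// Precedence sketch: manual map first, then identification result, then a generic fallback.
// Both maps are assumed to be plain objects keyed by speaker label.
function resolveDisplayName(label, manualMap = {}, identifiedMap = {}) {
  return manualMap[label] ?? identifiedMap[label] ?? `Speaker ${label}`;
}

const manual = { A: "Dana (host)" }; // illustrative data
const identified = { A: "host", B: "guest" }; // illustrative data

for (const label of ["A", "B", "C"]) {
  console.log(label, "->", resolveDisplayName(label, manual, identified));
}
```

Here A takes the manual name, B falls back to the identified role, and C receives the generic label because neither map mentions it.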
Model and language lookup
Before choosing parameters, inspect the bundled reference data:
node {baseDir}/assemblyai.mjs models
node {baseDir}/assemblyai.mjs models --format json
node {baseDir}/assemblyai.mjs languages --model universal-3-pro
node {baseDir}/assemblyai.mjs languages --model universal-2 --codes --format json
The bundled data lives in:
- assets/model-capabilities.json
- assets/language-codes.json
Important operating notes
- Keep API keys out of chat logs; use environment injection.
- Use the EU AssemblyAI base URL when the user explicitly needs EU processing.
- Uploads and transcript creation must use API keys from the same AssemblyAI project.
- Prefer --bundle-dir or --out for long outputs.
- The CLI is non-interactive and sends diagnostics to stderr, which makes it easier for agents to script reliably.
- Use raw --config or --request when you need a newly added AssemblyAI parameter that this skill has not exposed yet.
Reference files
Read these when you need more depth:
Key bundled files
- assemblyai.mjs: root wrapper for compatibility with the original skill
- scripts/assemblyai.mjs: main CLI
- assets/speaker-map.example.json
- assets/speaker-profiles-name.example.json
- assets/speaker-profiles-role.example.json
- assets/custom-spelling.example.json
- assets/llm-json-schema.example.json
- assets/transcript-agent-json-schema.json
Sanity checks before finishing a task
- Did you pick the right region (api.assemblyai.com vs api.eu.assemblyai.com)?
- Did you choose a model strategy that matches the language situation?
- If speaker naming matters, did you enable diarisation and/or provide a speaker map?
- If the result will feed another agent, did you produce Markdown and/or agent JSON rather than only raw stdout?
- If the transcript will be machine-consumed, did you keep the manifest or explicit output filenames?
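Some of these checks can be automated before a task is reported as done. The sketch below covers three of them over a hypothetical `plan` object. Every field name is an assumption made for illustration, and only the two host names come from this document.

```javascript
// Hypothetical pre-flight check; the `plan` field names are illustrative only.
function sanityCheck(plan) {
  const warnings = [];
  if (plan.requiresEuProcessing && plan.apiHost !== "api.eu.assemblyai.com") {
    warnings.push("EU processing requested but the EU host is not selected.");
  }
  if (plan.speakerNamesMatter && !plan.speakerLabels && !plan.speakerMap) {
    warnings.push("Speaker naming matters but neither diarisation nor a speaker map is set.");
  }
  if (plan.feedsAnotherAgent && !plan.bundleDir && !plan.markdownOut && !plan.agentJsonOut) {
    warnings.push("Output feeds another agent but only stdout is produced.");
  }
  return warnings;
}

const warnings = sanityCheck({
  requiresEuProcessing: true,
  apiHost: "api.assemblyai.com",
  feedsAnotherAgent: true,
  bundleDir: "./out",
});
console.log(warnings);
```

An empty array means none of the automated checks fired. The model-strategy question still needs judgement and is not covered here.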