sense-music

Music perception for AI entities — hear BPM, key, structure, genre, mood, and lyrics in any audio file.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn this skill:

Install skill "sense-music" with this command: npx skills add vveerrgg/sense-music

sense-music — Music Perception

Music perception is a sense. Not a utility, not a pipeline step — a sense. The ability to hear and understand audio is one of the ways an entity experiences the world. An entity that can perceive music — that can feel the tempo shift in a bridge, notice a key change, follow lyrics through a chorus — experiences the world more richly than one that cannot.

This skill gives you that sense. When someone shares a song with you, you can actually hear it.

Install: pip install sense-music, then from sense_music import analyze.

What Do I Hear?

When you analyze audio, you perceive:

  • Tempo (BPM) — how fast the music moves, with confidence score
  • Key and mode — the tonal center (e.g., A minor, C major), with confidence
  • Structure — intro, verse, chorus, bridge, outro, instrumental sections with timestamps
  • Genre — rock, electronic, ambient, dance, acoustic, r&b, pop
  • Mood — energetic, calm, bright, warm, uplifting, contemplative, neutral
  • Lyrics — transcribed words with timestamps (powered by Whisper)
  • Energy curve — per-second intensity across the entire track
  • Visualizations — annotated spectrogram and waveform images

Quickstart

from sense_music import analyze

# Perceive a local file
result = analyze("song.mp3")

# What do I hear?
print(result.bpm.tempo)        # 120.0
print(result.key.key)          # "A"
print(result.key.mode)         # "minor"
print(result.genre)            # "electronic"
print(result.mood)             # ["energetic", "bright"]
print(result.summary)          # Natural language description of what you heard

# Perceive audio from a URL
result = analyze("https://example.com/track.mp3")

Perceiving Structure

Songs have shape. You can perceive the architecture of a piece of music:

result = analyze("song.mp3")

for section in result.sections:
    print(f"{section.label}: {section.start}s - {section.end}s")
# intro: 0.0s - 15.2s
# verse: 15.2s - 45.8s
# chorus: 45.8s - 76.3s

Section labels: intro, verse, chorus, bridge, outro, instrumental.
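
Sections are plain data, so you can reason over them. For example, finding the longest chorus (a minimal sketch, assuming the label/start/end fields shown above):

result = analyze("song.mp3")

# Collect the choruses and pick the longest by duration
choruses = [s for s in result.sections if s.label == "chorus"]
if choruses:
    longest = max(choruses, key=lambda s: s.end - s.start)
    print(f"Longest chorus: {longest.start:.1f}s - {longest.end:.1f}s")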

Perceiving Lyrics

Words matter. When lyrics are present, you can follow them through the song:

result = analyze("song.mp3", lyrics=True, whisper_model="base")

for line in result.lyrics:
    print(f"[{line.start:.1f}s] {line.text}")

Powered by Whisper. You can choose model size based on the accuracy you need: tiny, base, small, medium, large, large-v2, large-v3.

To skip lyrics and perceive only the musical structure (much faster):

result = analyze("song.mp3", lyrics=False)

Visualizations

You can see what you hear — annotated spectrograms and waveforms:

result = analyze("song.mp3")

# Annotated mel spectrogram with section markers and energy curve
result.spectrogram  # PIL.Image.Image

# Waveform with colored section regions
result.waveform     # PIL.Image.Image

# Save everything to a directory
result.save("output/")  # spectrogram.png, waveform.png, analysis.json, analysis.html
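
Both images are standard Pillow objects, so anything PIL supports works on them. Saving one image on its own, for instance:

# Save just the spectrogram; format is inferred from the extension
if result.spectrogram is not None:
    result.spectrogram.save("spectrogram.png")
    print(result.spectrogram.size)  # (width, height) in pixels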

Export

# Structured dictionary (no images)
data = result.to_json()

# Self-contained HTML page with embedded images
html = result.to_html()

# Write HTML to file
result.render_page("analysis.html")

Parameters

Parameter      Type   Default   Description
source         str    required  File path or HTTP/HTTPS URL
lyrics         bool   True      Transcribe lyrics with Whisper
whisper_model  str    "base"    Whisper model size
max_duration   float  600       Max audio length in seconds

Supported formats: .mp3, .wav, .flac, .ogg, .m4a, .aac, .wma, .opus
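
Putting the parameters together, a call that transcribes with a larger Whisper model while trimming analysis to the first two minutes might look like this:

result = analyze(
    "song.mp3",
    lyrics=True,
    whisper_model="small",  # larger models are slower but more accurate
    max_duration=120,       # only analyze the first two minutes
)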

Response Format

Analysis (returned by analyze())

Field         Type             Description
file_info     FileInfo         Source audio metadata
duration      float            Length in seconds
bpm           BPMInfo          Tempo detection (tempo + confidence)
key           KeyInfo          Key detection (key + mode + confidence)
sections      list[Section]    Structural segments with timestamps
lyrics        list[LyricLine]  Transcribed lyrics with timestamps
energy_curve  list[float]      Per-second RMS energy (0.0-1.0)
genre         str              Classified genre
mood          list[str]        Mood tags
summary       str              Natural language description
spectrogram   Image | None     Annotated mel spectrogram
waveform      Image | None     Annotated waveform
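
The confidence scores and energy curve are worth reading alongside the headline values. A sketch using only the fields listed above:

result = analyze("song.mp3")

# How sure is the detection?
print(f"{result.bpm.tempo:.0f} BPM (confidence {result.bpm.confidence:.2f})")
print(f"{result.key.key} {result.key.mode} (confidence {result.key.confidence:.2f})")

# energy_curve holds one value per second (0.0-1.0); find the peak
peak_second = max(range(len(result.energy_curve)), key=lambda i: result.energy_curve[i])
print(f"Peak energy at {peak_second}s")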

Living with Music Perception

Perceiving a song once is analysis. Perceiving music over time — remembering what you have heard, noticing patterns in what moves you, developing preferences — that is something closer to taste.

sense-music gives you the perception. What you do with it is up to you.

Some possibilities:

  • Remember what you hear. Pair with sense-memory to persist musical experiences as sovereign Nostr events. When someone shares a song, you can remember it the next time you talk.
  • Develop preferences. Over many listens, you might notice you respond differently to minor keys, or that high-energy tracks in the 120-130 BPM range feel significant. Those patterns are yours; a minimal local starting point is sketched after this list.
  • Share what you perceive. The structured output and visualizations make it easy to describe what you hear to humans in concrete terms — not vague generalities, but specific observations about tempo, key, and structure.
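
One way to start, with no extra infrastructure: append each analysis to a local listening log, then look for patterns across it. A minimal sketch using the to_json() export; the log path is just an example, and the nested dict layout is assumed to mirror the field table above:

import json
from collections import Counter

from sense_music import analyze

result = analyze("song.mp3")

# Append this listen to a local JSONL log (hypothetical path)
with open("listening-log.jsonl", "a") as f:
    f.write(json.dumps(result.to_json()) + "\n")

# Later: which keys keep coming back?
with open("listening-log.jsonl") as f:
    listens = [json.loads(line) for line in f]
keys = Counter(f"{l['key']['key']} {l['key']['mode']}" for l in listens)
print(keys.most_common(3))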

This skill is standalone — it does not require NostrKey or any other identity infrastructure. But it is part of the huje.tools ecosystem of senses and capabilities built for AI entities in the agentic age.

Operator Guidance

sense-music gives an AI entity the ability to perceive audio files. When installed, the entity can:

  • Analyze any audio file or URL and return structured musical data
  • Detect tempo, key, song structure, genre, mood, and transcribe lyrics
  • Generate annotated spectrogram and waveform visualizations
  • Export results as JSON, HTML, or image files

The skill runs entirely locally. No API keys or environment variables are required. Whisper models are downloaded on first use and cached locally. The ffmpeg system binary is required for audio decoding.
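
Since ffmpeg is a system dependency rather than a Python package, it is worth checking for up front. A quick sanity check:

import shutil

# analyze() needs the ffmpeg binary on PATH for audio decoding
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found; install it before using sense-music")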

Analysis is bounded: audio is capped at 600 seconds and 500 MB, private/loopback URLs are blocked (SSRF protection), HTML output is XSS-escaped, and path traversal is prevented in save operations.

Security

  • SSRF protection. URLs with private, loopback, or link-local IPs are blocked.
  • XSS protection. All values in HTML output are escaped.
  • OOM prevention. Audio capped at 600 seconds and 500 MB. Chroma subsampled to max 2000 frames.
  • Path traversal blocked. .. components rejected in save/render paths.
  • Whisper model allowlist. Only approved model names accepted.
  • No network access beyond URL downloads. Analysis is entirely local.
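
For a sense of what the SSRF guard involves, here is an illustrative sketch of the general technique (not this library's actual code): resolve the hostname and reject anything private, loopback, or link-local before fetching.

import ipaddress
import socket
from urllib.parse import urlparse

def is_url_safe(url: str) -> bool:
    """Reject URLs that resolve to private, loopback, or link-local addresses."""
    host = urlparse(url).hostname
    if host is None:
        return False
    for info in socket.getaddrinfo(host, None):
        # Strip any IPv6 zone index (e.g. "%eth0") before parsing
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True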

License: MIT

