Chapter 04 · 10 min

Audio & music

Audio generation is quieter in the headlines than images, but it's arguably further along in practical use — synthetic voices and transcription are everywhere already. This chapter covers how machines hear and speak, and where music generation really stands.

Sound is just a wiggling line in time. Teach a machine the shapes of the wiggles, and it can draw new ones.

Turning sound into something a model can learn

Sound is a waveform: air pressure over time, typically captured as tens of thousands of samples a second. That's too fine-grained to model directly and efficiently, so audio AI usually works on a more compact representation: a spectrogram (a picture of which frequencies are present over time, which lets image-style techniques apply) or learned audio tokens (chunks of sound treated like the tokens of a language model).
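To make the spectrogram idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how production systems do it: the sample rate, frame size, and function names are made up for the example, and the slow direct Fourier sum stands in for the fast FFT routines a real pipeline would use.

```python
import cmath
import math

def sine_wave(freq_hz, sample_rate, n_samples):
    """Synthesise a pure tone as a list of floats in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def spectrogram(samples, frame_size, hop):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # A Hann window tapers each frame's edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        # Direct discrete Fourier transform, keeping non-negative frequencies only.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            row.append(abs(acc))
        rows.append(row)
    return rows

# Illustrative values (assumptions, chosen so the tone lands exactly on one bin).
SAMPLE_RATE = 8000
FRAME_SIZE = 64

tone = sine_wave(1000, SAMPLE_RATE, 512)
spec = spectrogram(tone, FRAME_SIZE, hop=32)

# The brightest column in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames x", len(spec[0]), "bins; peak at", peak_hz, "Hz")
```

The result is a small grid of numbers, time along one axis and frequency along the other, which is exactly the kind of "picture" an image-style model can consume.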

Once audio is tokens or a spectrogram, the familiar machinery applies: transformers and diffusion models generate it in much the same way they generate text or images. This is the recurring theme of the course: find the right representation, and one set of tools handles a new modality.
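As a rough illustration of what "audio as tokens" means, the sketch below turns waveform samples into integers between 0 and 255 using mu-law companding, a classic hand-designed quantiser from telephony. This is an assumption-laden stand-in: learned audio tokens come from neural codecs trained on data, not from a fixed formula like this one, and the function names here are hypothetical. The point is only that sound ends up as a sequence of integers, the same shape of input a language-model-style transformer reads.

```python
import math

MU = 255  # 256 discrete levels, i.e. a "vocabulary" of 256 audio tokens

def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer token in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    # Logarithmic compression gives quiet sounds finer resolution than loud ones.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))

def mu_law_decode(token, mu=MU):
    """Approximately invert mu_law_encode, returning a sample in [-1, 1]."""
    compressed = 2 * token / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu,
                         compressed)

# A short 440 Hz tone at an illustrative 8 kHz sample rate (assumed values).
wave = [0.8 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(16)]
tokens = [mu_law_encode(s) for s in wave]
restored = [mu_law_decode(t) for t in tokens]
print(tokens)
```

Decoding the tokens recovers the waveform only approximately; that small loss is the price of the compact representation, and learned codecs aim to spend it where listeners are least likely to notice.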

Speech: the mature workhorse

Two speech capabilities are genuinely production-grade. Speech-to-text (transcription) is reliable enough to power captions, meeting notes, and voice interfaces across many languages. Text-to-speech (synthesis) has crossed from robotic to often indistinguishable from human, with natural intonation and emotion.

Voice cloning is the capability that deserves a flag: from a short sample of someone's voice, a model can synthesise new speech in that voice. This powers wonderful things (accessibility, dubbing, restoring lost voices) and obvious harms (fraud, impersonation, non-consensual audio). The technology doesn't distinguish; the use does.

Music: impressive, complicated

Music generation has advanced fast — models can produce coherent instrumental and vocal tracks from a text description. The technical achievement is real. The complications are mostly not technical: music is dense with copyright and licensing questions, because models trained on recorded music can produce output uncomfortably close to their training data, and the rights landscape is contested and evolving.

For a business, the practical caution is that the legal status of AI-generated music — who owns it, whether it infringes, whether it can be used commercially — is genuinely unsettled and varies by jurisdiction. The capability is ahead of the rules, more so than in most modalities. Tread carefully and get specifics from counsel before commercial use.

Where audio AI pays off now

The reliable, low-controversy wins are on the understanding and synthesis side: transcription and captioning, voiceover and narration (with consent), accessibility, voice interfaces, and audio search. These are mature and broadly safe to build on. Generative music and voice cloning are powerful but carry the legal and ethical weight above — match your appetite for that to the use.

In one line each

  • Audio is modelled via compact representations — spectrograms or learned tokens — so image- and text-style tools apply.
  • Speech-to-text and text-to-speech are production-grade; synthetic voices are often indistinguishable from human.
  • Voice cloning is powerful and dangerous: assume a voice alone is no longer a reliable authentication factor.
  • Music generation is technically impressive but legally unsettled; the safe wins are transcription, narration, and accessibility.