I wonder if a linear-space, constant-time model like RWKV or S4 would work better here. For audio, I wouldn't think you'd need long range context, and all-to-all mapping seems like overkill.
Maybe a transformer could run in parallel but at a much lower frequency, with the linear model feeding it "summary" tokens once per second, whose information would mostly be "text" but also some hint of emotion and other cues. The output of the transformer could then be fed back to the linear model so that it would know what it was saying and with what emotion. Basically, the transformer would be the low-frequency, long-range-context thinker (and feeler), and the linear model would translate that to and from phonetics.
They'd be trained jointly, so those transformer tokens would acquire meaning at training time rather than having to be pre-defined. It would still be purely phonetic end-to-end, with no direct translation to text. It could even end up being a good way to compress text for LLMs, since low-value words might get a smaller share of the token's representation.
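A rough sketch of the shape I have in mind, with a GRU standing in for the RWKV/S4-style model; all names, rates, and dimensions are made up, and it ignores causality and streaming entirely:

    # Toy two-rate model: a fast "phonetic" recurrent model at frame rate, and a
    # slow transformer that only sees one pooled "summary" token per second.
    import torch
    import torch.nn as nn

    FRAME_RATE = 50              # fast steps per second (e.g. 20 ms audio frames)
    D_FAST, D_SLOW = 256, 512

    class TwoRateModel(nn.Module):
        def __init__(self):
            super().__init__()
            # GRU as a placeholder for an RWKV/S4-style linear-time model.
            self.fast = nn.GRU(D_FAST, D_FAST, batch_first=True)
            self.summarize = nn.Linear(D_FAST, D_SLOW)
            layer = nn.TransformerEncoderLayer(D_SLOW, nhead=8, batch_first=True)
            self.slow = nn.TransformerEncoder(layer, num_layers=4)
            # Feed the slow context back into the fast model's stream.
            self.feedback = nn.Linear(D_SLOW, D_FAST)

        def forward(self, frames):             # frames: (B, T, D_FAST), T % FRAME_RATE == 0
            fast_states, _ = self.fast(frames)
            B, T, _ = fast_states.shape
            # One summary token per second: mean-pool each one-second chunk.
            chunks = fast_states.reshape(B, T // FRAME_RATE, FRAME_RATE, D_FAST)
            summaries = self.summarize(chunks.mean(dim=2))    # (B, T/50, D_SLOW)
            context = self.slow(summaries)                    # long-range "thinking"
            # Broadcast the slow context back down to frame rate.
            up = self.feedback(context).repeat_interleave(FRAME_RATE, dim=1)
            return fast_states + up                           # conditioning for phonetics

    x = torch.randn(2, 500, D_FAST)        # 10 seconds of fake frame features
    print(TwoRateModel()(x).shape)         # torch.Size([2, 500, 256])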
It probably would never reach the level of text-based LLMs for logic and code and such, but that somewhat parallels humans anyway; it's pretty hard to explain an algorithm in detail in plain conversation.
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to tokenize spatial and time-domain signals for these kinds of models?
Each MP3 frame is essentially self-contained (bit reservoir aside) and can reconstruct a few tens of milliseconds of the original audio without requiring other frames. I think this is the most important element. At 128 kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time, a 10-11x reduction over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
I don't know if it's possible to use 400-byte tokens in a transformer model, but I would be very tempted to try.
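If I did try it, the lazy first version would just project each raw frame into the model width and see what happens; a toy sketch, where the frame size and model width are nothing more than the numbers above:

    # Naive "MP3 frames as tokens": embed each byte, flatten the frame, project to
    # one token vector. Purely illustrative; not a tested recipe.
    import torch
    import torch.nn as nn

    FRAME_BYTES = 418     # ~26 ms of audio at 128 kbps CBR
    D_MODEL = 1024

    class Mp3FrameEmbedder(nn.Module):
        def __init__(self):
            super().__init__()
            self.byte_emb = nn.Embedding(256, 8)
            self.proj = nn.Linear(FRAME_BYTES * 8, D_MODEL)

        def forward(self, frames):              # frames: (B, T, FRAME_BYTES), uint8
            x = self.byte_emb(frames.long())    # (B, T, FRAME_BYTES, 8)
            return self.proj(x.flatten(-2))     # (B, T, D_MODEL): one token per frame

    frames = torch.randint(0, 256, (1, 100, FRAME_BYTES), dtype=torch.uint8)  # ~2.6 s
    print(Mp3FrameEmbedder()(frames).shape)     # torch.Size([1, 100, 1024])

The output side is where I'd expect the real pain: the model would have to emit 418 valid bytes per step, and a single wrong bit in a header could wreck the whole frame.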
This has got to be one of the most visually pleasing explanations I have seen of these concepts. Congrats!
I attempted some similar VQ-VAE work, but trying to tokenize rendered text instead. I was curious whether I could make a visual LLM working on 10 pt rendered font, and I also tried using PDF sources. The basic idea was to do what the more advanced diffusion image models can already do when they generate images of text: make a dedicated image-of-text diffusion model that does completions. I also wondered if I could embed things like document type and language, so you could have a latent representation of text more abstracted than current dictionary tokenizers. Learned a lot, and thought it was all beautifully displayed in this post.
An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
Thanks for posting, I wasn't aware of Kyutai and it seems your work is perfect for something I'm working on.
This is fascinating.
Obviously working directly with audio is vastly more complex than with text.
But it is very exciting to see that part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.
I even have to wonder if, at some point, we ultimately create a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on some kind of set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.
I can imagine such a model being arrived at statistically (determining the necessary number of parameters), and then almost becoming "hard-coded" as a standard since human anatomy doesn't change much there, beyond certain ranges.
I think it's called formant speech encoding, and it would be interesting if LLMs wound up massively advancing that field, since historically it's had more to do with speech synthesis than with audio compression.
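For anyone curious, the crudest possible version of that idea is a pulse train (the vocal cords) pushed through a few resonant filters (the vocal tract); the formant frequencies and bandwidths below are rough textbook-ish values for an /a/-like vowel, not anything calibrated:

    # Crude formant synthesis: excite a cascade of second-order resonators with an
    # impulse train. Writes a one-second /a/-ish vowel to vowel_a.wav.
    import numpy as np
    from scipy.signal import lfilter
    from scipy.io import wavfile

    fs = 16000
    f0 = 120                                            # pitch of the pulse train, Hz
    formants = [(730, 90), (1090, 110), (2440, 170)]    # (frequency, bandwidth) in Hz

    source = np.zeros(fs)                               # one second of silence...
    source[::fs // f0] = 1.0                            # ...with one impulse per pitch period

    signal = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * freq / fs
        b = [1 - 2 * r * np.cos(theta) + r ** 2]        # unity gain at DC
        a = [1, -2 * r * np.cos(theta), r ** 2]
        signal = lfilter(b, a, signal)

    signal /= np.max(np.abs(signal))
    wavfile.write("vowel_a.wav", fs, (signal * 32767).astype(np.int16))

A real codec along these lines would have to estimate parameters like these from audio, which is the hard direction, but it shows how few numbers per frame the representation needs.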
Thanks for sharing this well-written post, which I will share with my team; we recently started using audio/voice in our AI suite, and the details herein will be helpful and informative.
I've been messing around with Higgs Audio, which actually uses the delay pattern. It has to apply the pattern and then unapply it after generation. I noticed it's really hard to chunk and stream audio correctly when you essentially need to apply and unapply these patterns over the "entire" output.
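For anyone who hasn't run into it, the pattern itself is simple: codebook row k is shifted right by k steps going in and shifted back coming out. A minimal numpy illustration (PAD and the shapes are made up; this is not Higgs Audio's actual code):

    # MusicGen-style delay pattern over RVQ codebooks: shift row k by k steps,
    # then undo the shift after generation.
    import numpy as np

    PAD = -1

    def apply_delay(codes):                        # codes: (K, T) codebook tokens
        K, T = codes.shape
        out = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
        for k in range(K):
            out[k, k:k + T] = codes[k]
        return out

    def unapply_delay(delayed):                    # delayed: (K, T + K - 1)
        K = delayed.shape[0]
        T = delayed.shape[1] - (K - 1)
        return np.stack([delayed[k, k:k + T] for k in range(K)])

    codes = np.arange(12).reshape(4, 3)            # 4 codebooks, 3 frames
    assert np.array_equal(unapply_delay(apply_delay(codes)), codes)

The streaming pain falls out of this directly: with K codebooks, audio frame t isn't complete until generation step t + K - 1, so chunk boundaries can't be decoded naively.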
I can't wait for LLMs to actually understand how they and you are speaking. It's going to be so cool when an AI can correct your second-language pronunciation or laugh at you for making a silly sound. The use cases and value will explode when that happens, 100%.
Out of curiosity, would it be possible to attach pitch, emotion, and tone info as text-based metadata to each word during ASR, so that the ASR output retains this metadata?
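For pitch at least, it seems mechanically straightforward given word timestamps; a rough sketch of what I mean (Whisper for timestamps, librosa's pYIN for pitch, the file name made up, and emotion/tone would need a separate classifier entirely):

    # Attach per-word pitch metadata to ASR output: run a pitch tracker over the
    # audio, then take the median F0 inside each word's time span.
    import librosa
    import numpy as np
    import whisper

    audio_path = "utterance.wav"                   # hypothetical input file
    y, sr = librosa.load(audio_path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    hop_s = 512 / sr                               # librosa's default hop length

    model = whisper.load_model("base")
    result = model.transcribe(audio_path, word_timestamps=True)

    annotated = []
    for seg in result["segments"]:
        for w in seg["words"]:
            lo, hi = int(w["start"] / hop_s), int(w["end"] / hop_s)
            pitch = float(np.nanmedian(f0[lo:hi])) if hi > lo else float("nan")
            annotated.append({"word": w["word"].strip(), "median_f0_hz": round(pitch, 1)})

    print(annotated)   # e.g. [{'word': 'hello', 'median_f0_hz': 210.3}, ...]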
I wouldn't mind so much if they cheat on the way back but listen in earnest. There are use cases like teaching language where having the AI understand the sounds carefully matters a ton.
the OP is quite an interesting team to watch regarding open-weights* voice-related efforts. this is a nice read to understand the core of their approach.
quite unfortunate, however, is their approach to accessibility. unmute [1], which uses the approach discussed in this post, runs quite well, with a claimed feature of adapting to any voice provided you have a 10-second recording. this has not been made available to the public at all, despite an issue raised since july. [2]
given the pace of the industry, it is a shame that we need to look elsewhere for using an otherwise well-designed tooling.
[1] https://news.ycombinator.com/item?id=44109610 [2] https://github.com/kyutai-labs/unmute/issues/99
Typo: "not even a the length of one word"
> Many LLMs have voice interfaces, but they usually work by transcribing your speech, generating the answer in text, and using text-to-speech to read the response out loud. That’s perfectly fine in many cases (...), but it’s a wrapper, not real speech understanding.
But I can say the same about tokenization. LLMs first convert groups of characters to tokens, then use that to generate tokens, and then convert the tokens back to characters. That's not real understanding! If LLMs are so smart, we should be able to skip the tokenization step.
How many epochs did you train with? 100k hours is not a lot for an LLM. Feels like the bitter lesson.
Another interesting thing here is that the model presumably has some understanding of the passage of time. That's one thing that can be odd about chat models, in that they will respond the same no matter whether you reply a second later or a month later.
I think even for text models, "streams" could be useful. Perhaps if the LLM sees too long a pause after explaining something and asking a question, it could interject a "do you need help?" or something. Pure chat GPTs don't have that ability.
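Even with today's turn-based models you can fake part of this client-side with nothing smarter than a timeout; a toy sketch (the 5-second pause and the nudge text are arbitrary, and the queue just simulates an incoming message stream):

    # Toy "stream-aware" chat loop: if the user stays silent past a timeout,
    # the assistant interjects instead of waiting forever.
    import asyncio

    PAUSE_SECONDS = 5

    async def fake_user(queue: asyncio.Queue):
        # Simulated user: one quick message, a long silence, then goodbye.
        await asyncio.sleep(1)
        await queue.put("explain tokenizers to me")
        await asyncio.sleep(12)
        await queue.put("bye")

    async def chat_loop(queue: asyncio.Queue):
        while True:
            try:
                msg = await asyncio.wait_for(queue.get(), timeout=PAUSE_SECONDS)
            except asyncio.TimeoutError:
                print("assistant: Still with me? Do you need help?")
                continue
            print(f"assistant: (answer to {msg!r} goes here)")
            if msg == "bye":
                return

    async def main():
        queue = asyncio.Queue()
        await asyncio.gather(fake_user(queue), chat_loop(queue))

    asyncio.run(main())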
Awesome post!
Man, one of the best uses of all those AI algorithms built around finding similarities between stuff would be to give you actually relevant recommendations for music.
All the streaming services are shit at it. They can't do much beyond shallow similarities or hardcoded recommendations that are probably just based on manually entered keywords like the genre, etc.
Has that already been done?
Or is it yet another of those what-could-have-been utopian things that got crippled before it was born because of corporate overcontrol/overcaution (not being able to train on copyrighted music)?
Maybe some open-source project could do it?
(I don't even feel confident asking an AI whether a music-rec AI exists, because ChatGPT 5 didn't know ChatGPT 5 was out, and Claude still thinks iOS 26 isn't out yet... sigh)
Y’all need to learn about the history and development of spoken language and writing. Writing isn’t just a copy or derivation of speech. LLMs work because of the conceptual characteristics of writing (consider the distinctions between ideographic, logographic, alphabetical…). What a sloppy mess!
Read some Wittgenstein and Goodman, but especially Derrida, who calls this logocentrism.
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.
I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.
AFAIK, ChatGPT's voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.
It doesn't seem that impossible to me that some of these behaviors were aligned away out of an abundance of caution.