FFmpeg 8.0 adds Whisper support
Once local transcription is in more places, hopefully we can persuade content creators not to burn bouncing subtitles into their videos.
I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting subtitles with no way to disable them.
It seems so unnecessary if you're not making novelty videos about cats.
Also, local transcription allows for automatic translation, and again, overlaying subtitles on top of an existing burnt-in set is a really poor reading experience.
Does this have the ability to edit earlier words as more info becomes available?
E.g. if I say "I scream", it sounds phonetically identical to "ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to have both low latency and high accuracy; things like transcription on Android do it, and you can see the guesses adjusting as you talk.
Related, a blog article by the author of the patch:
Run Whisper audio transcriptions with one FFmpeg command
https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254
I wonder if Apple's upcoming speech APIs can be added too. Would be cool to have it just work out of the box on Macs, without needing to source a model.
https://developer.apple.com/documentation/speech/speechtrans...
https://developer.apple.com/documentation/speech/speechanaly...
https://www.macstories.net/stories/hands-on-how-apples-new-s...
Am I correct in understanding that Whisper is a speech recognition AI model originally created by OpenAI?
https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...
I hope this is the start of more ML filters in ffmpeg. They added the sr (super resolution) filter years ago, but it's old and it's difficult to get the weights to run it, since they're not included. They have added support for multiple inference libraries like libtorch, but again, it's difficult to even get started. Hopefully they can get behind a consistent ML strategy, ideally with a "models" directory of ready-to-use models for upscaling, temporal upscaling, noise cancelling, etc. A lot of audio and video filter research uses ML now, and new codecs will probably use it soon too.
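For context, this is roughly what invoking the existing sr filter looks like once you've sourced weights yourself (a sketch: espcn.pb stands in for a TensorFlow model you'd have to train or convert on your own, since none is shipped):
# hypothetical sr invocation; espcn.pb is a self-sourced TensorFlow model
ffmpeg -i input.mp4 -vf "sr=dnn_backend=tensorflow:scale_factor=2:model=espcn.pb" upscaled.mp4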
I had a small bash pipeline for doing this until now.
ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - |  # record 5s of 16 kHz mono PCM from the default mic
  ./main - |        # whisper.cpp binary, reading the WAV from stdin
  head -2 |         # take the first two lines of output...
  tail -1 |         # ...and keep just the second (the transcript)
  cut -d] -f2 |     # strip the leading "[timestamp]" prefix
  awk '{$1=$1};1'   # trim leading/trailing whitespace
The reading-from-mic part (-f pulse, pactl...) is Linux-specific; the rest should be cross-platform. The `main` executable is the whisper.cpp executable (see the whisper.cpp GitHub readme; it's just the output of `make base.en` from that). Edit: -t 5 controls the recording duration.
Oh, and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM, which then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc.) which I then....
I know nothing about Whisper; is this usable for automated translation?
I own a couple of very old and, as far as I'm aware, never-translated Japanese movies. I don't speak Japanese, but I'd love to watch them.
A couple of years ago I had been negotiating with a guy on Fiverr to translate them. At his usual rate per minute of footage it would have cost thousands of dollars, but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.
I wish they worked with the mpv folks instead of shoehorning this in. Based on the docs, it looks like getting live transcription for a video will involve running the demuxer/decoder on one thread and this whisper filter on another, using ffmpeg's AVIO (or a REST API [1]... shudders) to synchronize the two parallel jobs. It could have been way simpler.
Other than the "live transcription" use case (which they made unnecessarily complicated), I don't see how this is any better than running whisper.cpp directly. Other people in this thread are basically saying "ffmpeg's interface is better understood" [2], but LLMs make that point moot, since you can just ask them to do the drudgery for you.
[1] https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
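For comparison, running whisper.cpp directly is already a one-liner (flags per the whisper.cpp readme; the model path is just an example):
# direct whisper.cpp invocation, no ffmpeg filter involved
./main -m models/ggml-base.en.bin -f audio.wav -osrt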
I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.
"Making sure you're not a bot!" with no way to get to the actual document that is supposed to be at the URL. Anubis can be configured to be accessible for people without the latest computers by using the meta-refresh proof of work but very few people take any time to configure it and just deploy the defaults. Just like with cloudflare.
That said, I suppose I'm glad they're concentrating on making the ffmpeg code better rather than fixing bugs in the web interface for the development tracker. Having whisper integrated will be really useful. I'm already imagining automatic subtitle generation... imagining because I can't read the page or the code to know what it is.
The only problem with this PR/diff is that it creates just an avfilter wrapper around the whisper.cpp library and requires the user to manage the dependencies on their own. This is not helpful for novice users, who will first need to:
1. git clone whisper.cpp
2. Make sure they have all dependencies for `that` library
3. Hope the build passes
4. Download the actual model
AND only then be able to use the `-af "whisper=model...` filter.
If they try to use the filter without all the prereqs they'll fail and it'll create frustration.
It'd be better to create a native Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually get far more people to use it.
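Spelled out as shell commands, that prerequisite dance looks roughly like this (a sketch assuming whisper.cpp's standard CMake build and its bundled download script; the model is just an example):
# steps 1-4 from the list above, approximately:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build --config Release   # hope the build passes
./models/download-ggml-model.sh base.en                  # fetch ggml-base.en.bin
# ...and only then rebuild FFmpeg with --enable-whisper and point
# -af "whisper=model=..." at models/ggml-base.en.bin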
Does this mean that any software which uses ffmpeg can now add a transcription option? Audacity, Chrome, OBS etc
Shut off the broken bot filter so we can read it please
Annoyingly, something is broken with their anti-bot stuff, as it keeps refusing to let me see the page.
I wonder if they'll be satisfied there or add a chunk of others now that they've started. Parakeet is supposed to be good?
Should they add Voice Activity Detection? Are these separate filters or just making the whisper filter more fancy?
Not sure it will be packaged in Debian: with an external binary model, god knows how it was produced...
As an aside, my favorite Whisper 'hack' is that you can just speed up audio 10x to process it 10x faster, then adjust the timings afterwards.
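In ffmpeg terms the trick looks something like this (a sketch; recent FFmpeg lets atempo go up to 100, while older builds cap each instance at 2.0 and need a chain):
# speed the audio up 10x before transcribing
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -af "atempo=10" fast.wav
# transcribe fast.wav, then multiply every subtitle timestamp by 10
# to map it back onto the original timeline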
Is Whisper still SOTA 3 years later? It does not seem there is a clearly better open model. Alec Radford really is a genius!
How can I run Whisper or this software in Linux or Android as a non-technical user?
Basically a simple audio-to-text for personal use?
Can Whisper do multilingual yet? Last time I tried it on some mixed Dutch/English audio, it would spit out English translations for some of the Dutch speech. Strange bug/feature, since from all appearances it had understood the Dutch perfectly fine.
Does this finally enable dynamically generating subtitles for movies with AI?
May I ask: if there is a movie where English people speak English, French people speak French, and German people speak German, is there software that can generate subtitles in English, French, and German without translating anything? I mean, just record what it hears.
I've been playing with Whisper to try to do local transcription of long videos, but one issue I've found is that long (>15 seconds) spans without any speech tend to send it into hallucination loops that it often can't recover from. I wonder if, with direct integration into ffmpeg, they will be able to configure it in a way that improves that situation.
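From skimming the filter docs, it does seem to expose VAD options aimed at exactly this; something like the following sketch, where the VAD model filename is a placeholder for whatever whisper.cpp's Silero conversion produces:
# sketch: pre-filter silence with a VAD model before Whisper sees it
ffmpeg -i long_video.mp4 -vn \
  -af "whisper=model=ggml-base.en.bin:vad_model=silero-vad.bin:destination=out.srt:format=srt" \
  -f null -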
I have recently found that Parakeet from NVIDIA is way faster and pretty much as accurate as Whisper, but it only works with English.
Took me longer than I'd care to admit to figure out how to install whisper as a user/system package on macOS w/o brew (which pulls in all of llvm@16 during install):
brew install uv
uv tool install openai-whisper
then add ~/.local/bin/ to $PATH
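That last step is the usual shell-profile edit, e.g.:
# e.g. in ~/.zshrc (the macOS default shell)
export PATH="$HOME/.local/bin:$PATH"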
OH: "New changelog entries go to the bottom, @vpalmisano .. Didn't I tell you this once?"
I tried to use Whisper to generate non-English subs from English audio, but wasn't able to figure it out. I know it can do English subs from non-English audio, and that earlier (less precise) versions could do any-language audio -> any-language subs, but the latest Whisper only does English subs.
Anyone found a way?
It failed to identify me as a human twice before letting me access the page.
Fantastic! I am working on a speech-to-text GNOME extension that would immensely benefit from this.
Did ffmpeg move their bug tracker to Forgejo?
https://code.ffmpeg.org/FFmpeg/FFmpeg/issues
I still see their old one too, but the Forgejo one is nice.
I was expecting a lot more comments on whether this is a necessary feature, or whether it even belongs in a library like ffmpeg. I think this is bloat, especially when the feature doesn't work flawlessly; Whisper is very limited.
Anyone got this to compile on macOS yet? The homebrew binary doesn't yet (and probably won't ever) include the --enable-whisper compile option.
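A from-source build would presumably look something like this (a sketch; it assumes whisper.cpp is already installed somewhere pkg-config can find it):
# rough sketch: build FFmpeg with the new filter enabled
git clone https://git.ffmpeg.org/ffmpeg.git
cd ffmpeg
./configure --enable-whisper
make -j && sudo make install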
Is anyone able to get streaming audio to text conversion working with whisper.cpp?
I tried several times to get this into a reasonable shape, but all attempts have been failures. If anyone has pointers, I'd really appreciate it.
"multi-modal feature extraction → semantic translation → cross-modal feature transfer → precise temporal alignment," is all we need
More precisely, it's in libavfilter, so it will also soon be in mpv and other dependent players.
This is going to be great for real-time audio translation.
How could one, in theory, use this to train on a new language? Say, for a hobby project: I have recordings of some old folk stories in my local dialect.
I guess that there is no streaming option for sending generated tokens to, say, an LLM service to process the text in real-time.
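(Though per the patch author's blog post linked upthread, the filter's destination option takes an AVIO URL, so a sketch like this might get close; the endpoint and model path here are assumptions, and colons inside the filter argument need escaping:)
# sketch: stream transcription output to a local HTTP endpoint
ffmpeg -i input.mp4 -vn \
  -af "whisper=model=ggml-base.en.bin:queue=3:destination=http\\://localhost\\:8080/transcripts" \
  -f null -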
Unrelated, but can I use Whisper in DaVinci Resolve to automatically transcribe my videos and add subs?
Aww, I literally just implemented this using whisper.cpp and the ffmpeg libs; my code is even similar...
Labeling multiple people talking is something I found lacking with Whisper; is it better now?
Why would one use FFmpeg with Whisper support, instead of using Whisper directly?
What's the benefit VS using whisper as a separate tool?
As someone who has a live application using whisper and ffmpeg, this does seem like just feature creep. ffmpeg and whisper are both otherwise well-scoped CLI tools adhering to the Unix philosophy; this... idk.
Can't view site. Some sort of misconfigured CAPTCHA bullshit.
That's great. How does Whisper compare to Google Gemini's transcription capabilities?
Does this Whisper integration also do text-to-speech?
Now if only it did separate-speaker identification (diarization)
hell yeah
Very interesting to see this!
Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has genuinely turned my life upside-down in an unambiguously good way.
People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.
HOWTO:
Drop a video or audio file onto the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is made for cleaning up imperfect transcripts, with features like Tools > Fix common errors.
EDIT: Oh, and if you're on the current gen of Nvidia cards, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file or output, something like that.
EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index like something along these lines (depending on whether you use pip or uv):
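(Reconstructed from memory; the CUDA wheel index here is my guess, adjust it for your setup:)
# something along these lines; the cu121 index is an assumption
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu121
# or, with uv:
uv pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu121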
If you get the errors and the above fixes work, please type your error message in a reply with what worked, to help those who come after. Or at least the web crawlers for those searching for help.
https://www.nikse.dk/subtitleedit
https://www.nikse.dk/donate
https://github.com/SubtitleEdit/subtitleedit/releases