FFmpeg 8.0 adds Whisper support
Once local transcription is in more places, hopefully we can persuade content creators not to burn bouncing subtitles into their videos.
I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting subtitles with no way to disable them.
It seems so unnecessary if you're not making novelty videos about cats.
Also, local transcription allows for automatic translation, and again, overlaying subtitles on top of an existing burnt-in set is a really poor reading experience.
Does this have the ability to edit earlier words as more info becomes available?
E.g. if I say "I scream", it sounds phonetically identical to "ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to have both low latency and high accuracy; things like transcription on Android do it, and you can see the guesses adjusting as you talk.
Related, a blog article by the author of the patch:
Run Whisper audio transcriptions with one FFmpeg command
https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254
I wonder if Apple's upcoming speech APIs can be added too. Would be cool to have it just work out of the box on Macs, without needing to source a model.
https://developer.apple.com/documentation/speech/speechtrans...
https://developer.apple.com/documentation/speech/speechanaly...
https://www.macstories.net/stories/hands-on-how-apples-new-s...
Am I correct in understanding that Whisper is a speech recognition AI model originally created by OpenAI?
https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...
I hope this is the start of more ML filters in ffmpeg. They added the sr (super resolution) filter years ago, but it's old and it's difficult to get the weights to run it, since they're not included. They have added support for multiple inference libraries like libtorch, but again, it's difficult to even get started. Hopefully they can get behind a consistent ML strategy, ideally with a "models" directory of ready-to-use models for upscaling, temporal upscaling, noise cancelling, etc. A lot of audio and video filter research uses ML now, and new codecs will probably use it soon too.
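For context, this is roughly what invoking the existing sr filter looks like once you've sourced weights yourself (a sketch: espcn.pb stands in for a TensorFlow model you'd have to train or convert on your own, since none is shipped):
# hypothetical sr invocation; espcn.pb is a self-sourced TensorFlow model
ffmpeg -i input.mp4 -vf "sr=dnn_backend=tensorflow:scale_factor=2:model=espcn.pb" upscaled.mp4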
I had a small bash pipeline for doing this until now.
ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - |  # record 5s of 16 kHz mono PCM from the default mic
  ./main - |        # whisper.cpp binary, reading the WAV from stdin
  head -2 |         # take the first two lines of output...
  tail -1 |         # ...and keep just the second (the transcript)
  cut -d] -f2 |     # strip the leading "[timestamp]" prefix
  awk '{$1=$1};1'   # trim leading/trailing whitespace
The reading-from-mic part (-f pulse, pactl...) is Linux-specific; the rest should be cross-platform. The `main` executable is the whisper.cpp executable (see the whisper.cpp GitHub readme; it's just the output of `make base.en` from that). Edit: -t 5 controls the recording duration.
Oh, and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM, which then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc.) which I then....
I know nothing about Whisper; is this usable for automated translation?
I own a couple of very old and, as far as I'm aware, never-translated Japanese movies. I don't speak Japanese, but I'd love to watch them.
A couple of years ago I had been negotiating with a guy on Fiverr to translate them. At his usual rate per minute of footage it would have cost thousands of dollars, but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.
I wish they worked with the mpv folks instead of shoehorning this in. Based on the docs, it looks like getting live transcription for a video will involve running the demuxer/decoder on one thread and this whisper filter on another, using ffmpeg's AVIO (or a REST API [1]... shudders) to synchronize the two parallel jobs. It could have been way simpler.
Other than the "live transcription" use case (which they made unnecessarily complicated), I don't see how this is any better than running whisper.cpp directly. Other people in this thread are basically saying "ffmpeg's interface is better understood" [2], but LLMs make that point moot, since you can just ask them to do the drudgery for you.
[1] https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
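For comparison, running whisper.cpp directly is already a one-liner (flags per the whisper.cpp readme; the model path is just an example):
# direct whisper.cpp invocation, no ffmpeg filter involved
./main -m models/ggml-base.en.bin -f audio.wav -osrt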
I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.
"Making sure you're not a bot!" with no way to get to the actual document that is supposed to be at the URL. Anubis can be configured to be accessible for people without the latest computers by using the meta-refresh proof of work but very few people take any time to configure it and just deploy the defaults. Just like with cloudflare.
That said, I suppose I'm glad they're concentrating on making the ffmpeg code better rather than fixing bugs in the web interface for the development tracker. Having whisper integrated will be really useful. I'm already imagining automatic subtitle generation... imagining because I can't read the page or the code to know what it is.
The only problem with this PR/diff is that it creates just an avfilter wrapper around the whisper.cpp library and requires the user to manage the dependencies on their own. This is not helpful for novice users, who will first need to:
1. git clone whisper.cpp
2. Make sure they have all dependencies for `that` library
3. Hope the build passes
4. Download the actual model
AND only then be able to use the `-af "whisper=model...` filter.
If they try to use the filter without all the prereqs they'll fail and it'll create frustration.
It'd be better to create a native Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually get far more people to use it.
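Spelled out as shell commands, that prerequisite dance looks roughly like this (a sketch assuming whisper.cpp's standard CMake build and its bundled download script; the model is just an example):
# steps 1-4 from the list above, approximately:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build --config Release   # hope the build passes
./models/download-ggml-model.sh base.en                  # fetch ggml-base.en.bin
# ...and only then rebuild FFmpeg with --enable-whisper and point
# -af "whisper=model=..." at models/ggml-base.en.bin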
Does this mean that any software which uses ffmpeg can now add a transcription option? Audacity, Chrome, OBS etc
Shut off the broken bot filter so we can read it please
Annoyingly, something is broken with their anti-bot stuff, as it keeps refusing to let me see the page.
I wonder if they'll be satisfied there or add a chunk of others now that they've started. Parakeet is supposed to be good?
Should they add Voice Activity Detection? Are these separate filters or just making the whisper filter more fancy?
Not sure it will be packaged in Debian: with an external binary model, god knows how it was produced...
As an aside, my favorite Whisper 'hack' is that you can just speed up audio 10x to process it 10x faster, then adjust the timings afterwards.
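In ffmpeg terms the trick looks something like this (a sketch; recent FFmpeg lets atempo go up to 100, while older builds cap each instance at 2.0 and need a chain):
# speed the audio up 10x before transcribing
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -af "atempo=10" fast.wav
# transcribe fast.wav, then multiply every subtitle timestamp by 10
# to map it back onto the original timeline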
Is Whisper still SOTA 3 years later? It does not seem there is a clearly better open model. Alec Radford really is a genius!
How can I run Whisper or this software in Linux or Android as a non-technical user?
Basically a simple audio-to-text for personal use?
Can Whisper do multilingual yet? Last time I tried it on some mixed Dutch/English audio, it would spit out English translations for some of the Dutch speech. Strange bug/feature, since from all appearances it had understood the Dutch perfectly fine.
Does this finally enable dynamically generating subtitles for movies with AI?
May I ask: if there is a movie where English people speak English, French people speak French, and German people speak German, is there software that can generate subtitles in English, French, and German without translating anything? I mean, just record what it hears.
I've been playing with Whisper to try to do local transcription of long videos, but one issue I've found is that long (>15 seconds) spans without any speech tend to send it into hallucination loops that it often can't recover from. I wonder if, with direct integration into ffmpeg, they will be able to configure it in a way that improves that situation.
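From skimming the filter docs, it does seem to expose VAD options aimed at exactly this; something like the following sketch, where the VAD model filename is a placeholder for whatever whisper.cpp's Silero conversion produces:
# sketch: pre-filter silence with a VAD model before Whisper sees it
ffmpeg -i long_video.mp4 -vn \
  -af "whisper=model=ggml-base.en.bin:vad_model=silero-vad.bin:destination=out.srt:format=srt" \
  -f null -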
I have recently found that Parakeet from NVIDIA is way faster and pretty much as accurate as Whisper, but it only works with English.
Took me longer than I'd care to admit to figure out how to install whisper as a user/system package on macOS w/o brew (which pulls in all of llvm@16 during install):
brew install uv
uv tool install openai-whisper
then add ~/.local/bin/ to $PATH
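That last step is the usual shell-profile edit, e.g.:
# e.g. in ~/.zshrc (the macOS default shell)
export PATH="$HOME/.local/bin:$PATH"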
OH: "New changelog entries go to the bottom, @vpalmisano .. Didn't I tell you this once?"
I tried to use Whisper to generate non-English subs from English audio, but wasn't able to figure it out. I know it can do English subs from non-English audio, and that earlier (less precise) versions could do any-language audio -> any-language subs, but the latest Whisper only does English subs.
Anyone found a way?
It failed to identify me as a human twice before letting me access the page.
Fantastic! I am working on a speech-to-text GNOME extension that would immensely benefit from this.
Did ffmpeg move their bug tracker to Forgejo?
https://code.ffmpeg.org/FFmpeg/FFmpeg/issues
I still see their old one too, but the Forgejo one is nice.
I was expecting a lot more comments on whether this is a necessary feature, or whether it even belongs in a library like ffmpeg. I think this is bloat, especially when the feature doesn't work flawlessly; Whisper is very limited.
Anyone got this to compile on macOS yet? The homebrew binary doesn't yet (and probably won't ever) include the --enable-whisper compile option.
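A from-source build would presumably look something like this (a sketch; it assumes whisper.cpp is already installed somewhere pkg-config can find it):
# rough sketch: build FFmpeg with the new filter enabled
git clone https://git.ffmpeg.org/ffmpeg.git
cd ffmpeg
./configure --enable-whisper
make -j && sudo make install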
Is anyone able to get streaming audio to text conversion working with whisper.cpp?
I tried several times to get this into a reasonable shape, but all attempts have been failures. If anyone has pointers, I'd really appreciate it.
"multi-modal feature extraction → semantic translation → cross-modal feature transfer → precise temporal alignment," is all we need
More precisely, it's in libavfilter, so it will also soon be in mpv and other dependent players.
This is going to be great for real-time audio translation.
How could one, in theory, use this to train on a new language? Say, for a hobby project: I have recordings of some old folk stories in my local dialect.
I guess that there is no streaming option for sending generated tokens to, say, an LLM service to process the text in real-time.
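(Though per the patch author's blog post linked upthread, the filter's destination option takes an AVIO URL, so a sketch like this might get close; the endpoint and model path here are assumptions, and colons inside the filter argument need escaping:)
# sketch: stream transcription output to a local HTTP endpoint
ffmpeg -i input.mp4 -vn \
  -af "whisper=model=ggml-base.en.bin:queue=3:destination=http\\://localhost\\:8080/transcripts" \
  -f null -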
Unrelated, but can I use Whisper in DaVinci Resolve to automatically transcribe my videos and add subs?
Aww, I literally just implemented this using whisper.cpp and the ffmpeg libs; my code is even similar...
Labeling multiple people talking is something I found lacking with Whisper; is it better now?
Why would one use FFmpeg with Whisper support, instead of using Whisper directly?
What's the benefit VS using whisper as a separate tool?
As someone who has a live application using whisper and ffmpeg, this does seem like just feature creep. ffmpeg and whisper are both otherwise well-scoped CLI tools adhering to the Unix philosophy; this... idk.
Can't view site. Some sort of misconfigured CAPTCHA bullshit.
That's great. How does Whisper compare to Google Gemini's transcription capabilities?
Does this Whisper integration also do text-to-speech?
Now if only it did separate-speaker identification (diarization)
hell yeah
Very interesting to see this!
Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has genuinely turned my life upside-down in an unambiguously good way.
People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.
HOWTO:
Drop a video or audio file onto the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is made for cleaning up imperfect transcripts, with features like Tools > Fix common errors.
EDIT: Oh, and if you're on the current gen of Nvidia cards, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file or output, something like that.
EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index like something along these lines (depending on whether you use pip or uv):
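(Reconstructed from memory; the CUDA wheel index here is my guess, adjust it for your setup:)
# something along these lines; the cu121 index is an assumption
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu121
# or, with uv:
uv pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu121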
If you get the errors and the above fixes work, please type your error message in a reply with what worked, to help those who come after. Or at least the web crawlers for those searching for help.
https://www.nikse.dk/subtitleedit
https://www.nikse.dk/donate
https://github.com/SubtitleEdit/subtitleedit/releases