I don't understand why this doesn't actually do the transcription / translation locally. Sending the data to openAI for paid conversion makes no sense. Whisper can be legally run on your computer, for free.
Running it locally makes way more sense for an open source project, because why would you pay for and be dependent upon a third party if you don't have to be?
It also makes way more sense for a service, because then _you_ don't have to hand most of your revenue to OpenAI and live off whatever's left.
This is just... bewildering. I really wanted to use it, but I'm not going to pay OpenAI to transcribe podcasts for me when I can use the exact same model locally with free open source code.
I'm hoping someone will fork this and teach it to run whisper locally.
[edit: getting exactly the right versions of Python, PyTorch, and the other dependencies to make Whisper run was a pain, but now I've got it set up and it's a trivial command to transcribe every mp3 I feel like transcribing]
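For anyone curious, here's roughly what that looks like with the openai-whisper Python package once it's installed (the file name and model size are just examples):

    # pip install openai-whisper (it also needs ffmpeg on your PATH)
    import whisper

    # "large" is the slowest but most accurate; "base" or "small" run much faster
    model = whisper.load_model("large")

    # transcribe() decodes the mp3 via ffmpeg internally
    result = model.transcribe("episode.mp3")

    print(result["text"])           # full transcript
    for seg in result["segments"]:  # timestamped segments
        print(f'{seg["start"]:7.2f}s  {seg["text"]}')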
To do Whisper transcription for free locally, you can use AirCaption (www.aircaption.com). It's an Electron desktop app running Whisper.cpp (https://github.com/ggerganov/whisper.cpp). Just released a few days ago.
I dunno about OpenAI as a service, but on my M1 Mac I think Whisper took something on the order of 8x realtime to process with the "large" model. That is to say, 8 minutes of processing for every 1 minute of audio. It was surprisingly slow. I assume OpenAI's servers have more GPU at their disposal to make this go faster.
Are you using whisper.cpp? You really want to be using that if you care about speed. You should be able to get better than real-time transcription on an M1.
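For what it's worth, a rough sketch of driving whisper.cpp from a script (the ./main binary and model path are assumptions based on a default build; whisper.cpp wants 16 kHz mono WAV input, hence the ffmpeg step):

    import subprocess

    # whisper.cpp expects 16 kHz mono WAV, so convert the mp3 first
    subprocess.run(
        ["ffmpeg", "-i", "episode.mp3", "-ar", "16000", "-ac", "1", "episode.wav"],
        check=True,
    )

    # ./main and ggml-large.bin are what a default build plus the model
    # download script give you; -otxt writes episode.wav.txt next to the input
    subprocess.run(
        ["./main", "-m", "models/ggml-large.bin", "-f", "episode.wav", "-otxt"],
        check=True,
    )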
"cost and convenience":
cost: $57.60/month vs. $0
Why would you want to pay nearly $700 a year just to avoid running a program in the background on whatever computer you already have open?
convenience:
Yes, it's a nicer interface, but the current state of the "geeky" version is: type a command on the command line, with the path to the file. The end. Unless you're really afraid of the command line, it's not that much more convenient.
The text line being highlighted while you listen is nice, but a) we wrote something that did it at the word level (as opposed to the sentence-ish level) nearly 20 years ago, and b) in this context it's not actually that useful. With video, sure: you can click the text and jump to the right place in the video. With spoken audio (what this is best at), you click and go to the point... where they're saying what you just read. Unless you really want to hear what you just read, there's not a lot of added value.
Would it be good for podcasts to use an interface like this for playback? Absolutely. It'd be a massive upgrade, but that's not what this is offering.
Maybe someone will extract that code and let us combine the MP3 and a timestamped text file in a web page (if that doesn't already exist). That'd be cool.
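A minimal sketch of what that could look like, assuming you've saved Whisper's result["segments"] to a JSON file: an <audio> element plus a transcript where clicking a line seeks the player to that timestamp.

    import json, html

    # segments.json: a list of {"start": float, "text": str, ...} dicts
    with open("segments.json") as f:
        segments = json.load(f)

    lines = "\n".join(
        f"""<p style="cursor:pointer" """
        f"""onclick="document.getElementById('player').currentTime={s['start']}">"""
        f"""[{int(s['start']) // 60}:{int(s['start']) % 60:02d}] """
        f"""{html.escape(s['text'])}</p>"""
        for s in segments
    )

    with open("transcript.html", "w") as f:
        f.write(f"""<!doctype html>
    <audio id="player" src="episode.mp3" controls></audio>
    {lines}
    """)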
But the cost you propose is way too much for most people, especially in countries that aren't rich. In many places $400 a month is a really good salary. So yeah, if you're rich, $700 a year is not a big deal, but...
First, this $57.60 is a VERY pessimistic upper limit. Remember, that's based on having to transcribe every working hour of every day. The number of hours per month requiring transcription is probably Pareto-distributed among the workforce. I'd bet 90% of people would need to transcribe up to ~4 hours a month (one important 1-hour meeting per week), corresponding to an API cost of USD $1.44 per month.
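The arithmetic behind both numbers, assuming OpenAI's published Whisper API rate of $0.006 per audio minute:

    RATE_PER_MIN = 0.006                 # USD per audio minute (OpenAI's Whisper API rate)

    full_time = 160 * 60 * RATE_PER_MIN  # every working hour of the month -> $57.60
    light_use = 4 * 60 * RATE_PER_MIN    # ~1 hour per week                -> $1.44
    print(full_time, light_use)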
Second, don't underestimate the business value of a nice interface. IMO, the value of excellent UI/UX is part of why ChatGPT took off the way it did. The number of people willing to pay a few dollars per month in order to never have to see a command line is quite a bit larger than the number of people willing to host their own `whisper-large` inference.
Speaking of hosting, do you already own hardware that supports sufficiently fast inference? If not, how much would a good enough cloud instance cost you per month? It depends on how fast is fast enough, but more than $0, that's for sure.