How I OCR hundreds of hours of video (2011)

samirmenon · on Aug 10, 2014

This is awesome. Who cares that it's a little cumbersome - it works!

On a more national scale, could something like this be done for Congress? C-SPAN already does most of the hard work of filming and uploading to the web, so perhaps it won't be too difficult. I think it would certainly attract a lot of interest... maybe I'll give it a go.

ianstallings · on Aug 11, 2014

And one could combine it with closed-captioning of the audio track for a pretty cool searchable result.

ersii · on Aug 11, 2014

The lovely and fantastic Internet Archive has exactly this for US news TV (e.g. C-SPAN, IIRC), and they say they have clips from 594,000 shows since 2009 available. From what I've read and heard, they basically eat the subtitle/closed-caption track that's baked into the video/audio stream.

Feel free to check it out at https://archive.org/details/tv and search around for some fun terms/words.

If you really like it and if you like the Internet Archive, feel free to donate a one-time sum or set up a subscription at http://archive.org/donate/ - they're a US-based 501(c)(3) non-profit organisation, so donations are tax deductible if you're US-based.

waldoj · on Aug 10, 2014

Here's the code on GitHub: https://github.com/openva/video-indexer It's terrible (I wrote it for a very narrow use case, and only run it ~200 times each year), but it's enough to get the idea.

joe_bleau · on Aug 11, 2014

I ran across this while researching a way to OCR data from a video of a frequency counter and digital multimeter. I didn't use his exact workflow, but it got me pointed in the right direction. Much better than manually typing in data for every second of a 20-minute video.

PeterisP · on Aug 11, 2014

His OCR errors (Del. Jennifer L. McClellan -> Del. Jennifer L i\1cCie1ian) look like something that would be easily fixable at the right spot - the dictionaries and language models used by Tesseract.

While a spellchecker might fix Jenn1fer -> Jennifer, at the OCR stage there is much more information to do it properly; but Tesseract obviously doesn't know that McClellan is a valid word, and thus a much more likely alternative than i\1cCie1ian, so it needs to be told that. The list of speakers on those videos is limited, and their surnames can be added to the appropriate dictionaries to improve their recognition.
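
To make the "limited list of speakers" point concrete, here is a rough sketch. Note that it is not the dictionary fix described above: it is a cruder post-processing stand-in that snaps OCR output to the closest entry in a known roster, using Python's standard difflib. The roster contents, the function name `snap_to_known`, and the 0.6 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical roster; in practice this would be the real list of speakers.
KNOWN_SPEAKERS = [
    "Del. Jennifer L. McClellan",
    "Del. Example A. Person",    # placeholder entry
    "Sen. Placeholder B. Name",  # placeholder entry
]


def snap_to_known(ocr_text, candidates=KNOWN_SPEAKERS, cutoff=0.6):
    """Return the known name most similar to ocr_text, or None if none is close.

    cutoff is a similarity ratio between 0 and 1; 0.6 is an arbitrary guess.
    """
    matches = difflib.get_close_matches(ocr_text, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # The garbled caption quoted in the comment above.
    print(snap_to_known(r"Del. Jennifer L i\1cCie1ian"))
```

This only works because the set of valid answers is small and closed. It throws away the per-character confidence the OCR engine had, which is exactly why fixing it inside the engine would be the better option.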

robinhoodexe · on Aug 10, 2014

http://webcache.googleusercontent.com/search?q=cache:http://...

Google cache of the site if it's unavailable (I'm getting a database error).

waldoj · on Aug 10, 2014

Memcached wasn't running for some reason. I've just fired it up, and all's well now.

bajsejohannes · on Aug 11, 2014

I recommend monit for keeping processes like these running.

http://mmonit.com/monit/

burnte · on Aug 10, 2014

I would think the first few steps could be combined into one faster step by using Handbrake to rip DVDs directly to MP4. But I also don't see why that stage takes hours on his machine; even on my 2006 rig it took less than the playtime of the DVD.

waldoj · on Aug 10, 2014

I don't want to rip directly to MP4, because I retain a copy of the VOB files. I was doing all of this on a Mac mini that I bought in 2006.

Noctem · on Aug 10, 2014

Nevertheless, I would still strongly recommend using Handbrake or something else that uses x264, since that's the best H.264 encoder. x264 has good presets for various speeds/qualities, or I could help you optimize the settings for this type of content if you're interested.

Handbrake has a CLI in addition to the GUI, and ffmpeg is another good CLI option.

waldoj · on Aug 11, 2014

Yes, I've tried both Handbrake and ffmpeg (and mplayer), and spent a lot of time tweaking the settings. I prefer MPEG Streamclip for this process: its ability to add new files to the queue after I've already started the encoding process is particularly useful. Handbrake sometimes chokes on the DVD files generated by the legislature (I have no idea why), which is a dealbreaker. For basically any other video encoding/decoding task, I use ffmpeg or mplayer. Thank you, though!

pronoiac · on Aug 11, 2014

I don't understand why you have to upload the videos. Is there a part that requires a ton of processor power? It seems like you could install the right packages on the Mac, or just start a Linux VM on it.

MisterNegative · on Aug 10, 2014

The title is very misleading to me; I expected magic, but it was kind of disappointing. They don't even OCR actual video, instead they just take a few screenshots.

ianstallings · on Aug 11, 2014

I can't think of a better way to OCR video though. You have to render a frame to scan for text. And he automates it. You could hypothetically do that on more frames if you see it missing text.
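
A minimal sketch of what "do that on more frames" could look like, assuming ffmpeg is the tool used to grab frames (the article's actual workflow may differ). The helper names `sample_times` and `ffmpeg_grab_command` and the file names are made up for illustration. The code only builds the commands and does not run them.

```python
import math


def sample_times(duration_s, interval_s):
    """Timestamps (in seconds) at which to grab a frame, starting at 0."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]


def ffmpeg_grab_command(video_path, t, out_path):
    """Build an ffmpeg command that seeks to t seconds and writes one frame."""
    return ["ffmpeg", "-ss", str(t), "-i", video_path, "-frames:v", "1", out_path]


if __name__ == "__main__":
    # Hypothetical 10-second clip sampled every 4 seconds.
    for t in sample_times(10, 4):
        print(" ".join(ffmpeg_grab_command("session.mp4", t, f"frame_{t:05d}.png")))
```

If the OCR pass misses captions, shrinking `interval_s` gives it more frames to look at, at the cost of more OCR runs per video.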

yutah · on Aug 11, 2014

Maybe grabbing the frame before the video compression, and saving it as a 100% quality JPEG, would help the OCR.