Especially if it's meant to be an educational and in any case non-realtime encoder, intrinsics are not really necessary and make the code less portable.
I also wonder how much the compiler can do autovectorisation on code like this --- it's pretty much exactly the type of code that autovectorisation is intended for.
Edit: I noticed in the benchmark that it compressed the 10s foreman.cif (demo video) in half a second, so it's already 20x faster than realtime on that small resolution.
How are you supposed to get acceptable performance without intrinsics?