How is this possibly true? The argument that a 16-bit DCT somehow gives better precision AND doesn't change the size of the encode makes no sense. If you get better precision, you have to keep those extra-precise bits, which are no longer zeroed out by truncation.

I haven't seen this argued in terms of a 16-bit DCT, but in terms of color space conversion. The gist is that not all 8-bit RGB values can be represented exactly in 8-bit YUV 4:2:0, so you're supposed to use 10-bit to get "proper" YUV values. But if you start from an 8-bit encode, you've already thrown away that extra precision, so why spend the (considerable) extra compute on 10-bit just to avoid truncating values that are already truncated?
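That first half of the argument is easy to sanity-check. A quick numpy sketch (my own toy, assuming full-range BT.709 and ignoring 4:2:0 chroma subsampling entirely): convert 8-bit RGB to YCbCr, quantise the YCbCr to 8 or 10 bits, convert back, and look at the round-trip error:

    # Toy round-trip: 8-bit RGB -> full-range BT.709 YCbCr -> quantise -> back.
    # Ignores 4:2:0 chroma subsampling; purely illustrative.
    import numpy as np

    Kr, Kb = 0.2126, 0.0722          # BT.709 luma coefficients
    Kg = 1.0 - Kr - Kb

    def rgb_to_ycbcr(rgb):           # rgb in [0, 1]
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y  = Kr * r + Kg * g + Kb * b
        cb = (b - y) / (2 * (1 - Kb)) + 0.5
        cr = (r - y) / (2 * (1 - Kr)) + 0.5
        return np.stack([y, cb, cr], axis=-1)

    def ycbcr_to_rgb(ycc):
        y, cb, cr = ycc[..., 0], ycc[..., 1] - 0.5, ycc[..., 2] - 0.5
        r = y + 2 * (1 - Kr) * cr
        b = y + 2 * (1 - Kb) * cb
        g = (y - Kr * r - Kb * b) / Kg
        return np.stack([r, g, b], axis=-1)

    def quantise(x, bits):           # round to an n-bit integer grid in [0, 1]
        levels = (1 << bits) - 1
        return np.round(x * levels) / levels

    rng = np.random.default_rng(0)
    rgb8 = rng.integers(0, 256, size=(100_000, 3)) / 255.0   # exactly representable in 8 bits

    for bits in (8, 10):
        back = ycbcr_to_rgb(quantise(rgb_to_ycbcr(rgb8), bits))
        err = np.abs(back - rgb8) * 255.0                    # error in 8-bit RGB steps
        print(f"{bits}-bit YCbCr: max err {err.max():.3f}, mean err {err.mean():.4f}")

At 8 bits you should see worst-case round-trip errors on the order of an RGB code value; at 10 bits they shrink by roughly the extra factor of four in precision.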

I have a project in progress to measure all of the variations, but from quick testing, encoding at the same CRF value takes much longer AND produces a larger file at 10-bit than at 8-bit. The larger file has a slightly higher VMAF score, as you'd expect from spending more bits. The remaining work is finding a set of encoding parameters that lets me measure the quality difference at the same output size, and the relative improvement across CRF vs size vs bit depth.
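For anyone who wants to repeat the quick test, it's roughly this shape (placeholder filenames and CRF value; assumes an ffmpeg build with 10-bit libx265 and libvmaf):

    # Rough sketch of the 8-bit vs 10-bit CRF comparison.
    # Assumes ffmpeg built with 10-bit libx265 and libvmaf; names are placeholders.
    import os
    import subprocess

    SRC = "source.y4m"   # 8-bit source clip (placeholder)
    CRF = "22"

    def encode(pix_fmt, out):
        subprocess.run([
            "ffmpeg", "-y", "-i", SRC,
            "-c:v", "libx265", "-preset", "medium", "-crf", CRF,
            "-pix_fmt", pix_fmt, out,
        ], check=True)

    def vmaf(distorted):
        # libvmaf filter: first input is the distorted clip, second is the reference
        subprocess.run([
            "ffmpeg", "-i", distorted, "-i", SRC,
            "-lavfi", "libvmaf", "-f", "null", "-",
        ], check=True)

    for pix_fmt, out in [("yuv420p", "out_8bit.mp4"), ("yuv420p10le", "out_10bit.mp4")]:
        encode(pix_fmt, out)
        print(out, os.path.getsize(out), "bytes")
        vmaf(out)        # the VMAF score appears in ffmpeg's log output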



Replying to just your 1st paragraph:

The process is: Raw input pixels (8 or 10 bit) minus predicted pixels (8 or 10 bit) -> residual pixels (8 or 10 bit + 1 sign bit).

You take these residual pixels and pass them through a 2D DCT, then scale and quantise them. At the end of this, the quantised DCT residual values are signed 16-bit numbers - you don't get to choose the bit-depth here; it's part of the standard (section 8.6). For every 16x16 pixel input, you get a 16x16 array of signed 16-bit numbers.
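Purely to illustrate the shape of the data, here's a toy version with a floating-point DCT-II and a flat quantisation step standing in for the integer transform and QP-derived scaling the standard actually specifies:

    # Toy version of the forward path: residual -> 2D DCT -> scale/quantise.
    # A floating-point DCT-II and a flat step stand in for the integer
    # transform and QP-derived scaling of the actual standard.
    import numpy as np
    from scipy.fft import dctn

    rng = np.random.default_rng(1)

    # Made-up residual block: prediction is usually good, so residuals are small.
    residual = rng.normal(0, 4, size=(16, 16)).round()

    coeffs = dctn(residual, norm="ortho")                  # 2D DCT of the residual
    qstep = 10.0                                           # stand-in for the QP-derived step
    quantised = np.round(coeffs / qstep).astype(np.int16)  # signed 16-bit values

    print(quantised.dtype, quantised.shape)                # int16 (16, 16)
    print("non-zero:", np.count_nonzero(quantised), "of", quantised.size)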

The last step is to pass all non-zero quantised DCT residual values through an entropy coder (usually an arithmetic coder), then you get the final bitstream.

The key point is that it doesn't matter whether the original raw pixel input is 8-bit or 10-bit: the quantised DCT residual values become 16-bit before being compressed and transmitted. The same is true for 12-bit raw pixel inputs.

This seems impossible; for 8-bit inputs you've doubled the size of the data (slightly less than doubled for 10-bit), so you must be making things worse! The key is that after scaling and quantisation, most of those 16-bit words are zero, and the ones that aren't cluster close to zero, so the entropy encoder doesn't have to spend many bits signalling them.
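You can put a rough number on that with a zeroth-order entropy estimate. It's only a crude stand-in for what a context-adaptive coder like CABAC actually achieves (and the residual block below is made up), but it shows why 16-bit words don't cost 16 bits:

    # Zeroth-order entropy of the quantised coefficients from the toy above:
    # a crude estimate of coding cost (a real context-adaptive coder typically
    # does better), but it shows how cheap mostly-zero 16-bit words are.
    import numpy as np
    from scipy.fft import dctn

    rng = np.random.default_rng(1)
    residual = rng.normal(0, 4, size=(16, 16)).round()
    quantised = np.round(dctn(residual, norm="ortho") / 10.0).astype(np.int16)

    values, counts = np.unique(quantised, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log2(p)).sum()                      # bits per coefficient

    print(f"zero coefficients: {float(np.mean(quantised == 0)):.0%}")
    print(f"~{entropy:.2f} bits/coefficient vs a 16-bit storage width")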

The last part comes when you reverse this process. The mathematical losses from scaling and quantising 10-bit inputs into the transmitted 16-bit values are less than the losses for 8-bit inputs. When you run the inverse quant, scale and iDCT, you end up with values that are closer to the original residual values at 10-bit than you do at 8-bit.
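The decoder-side round trip in the same toy setup looks like this; note that a float sketch doesn't model the fixed-point intermediate rounding where the real 8-bit vs 10-bit difference comes in:

    # Decoder side of the toy: dequantise, inverse DCT, compare to the residual.
    # Float arithmetic only; the real 8-bit vs 10-bit gap comes from fixed-point
    # intermediate rounding that this sketch does not model.
    import numpy as np
    from scipy.fft import dctn, idctn

    rng = np.random.default_rng(1)
    residual = rng.normal(0, 4, size=(16, 16)).round()

    qstep = 10.0
    quantised = np.round(dctn(residual, norm="ortho") / qstep).astype(np.int16)

    # Inverse quant/scale, inverse transform; the decoder then adds the prediction back.
    reconstructed = idctn(quantised.astype(float) * qstep, norm="ortho")

    print("max reconstruction error:", np.abs(reconstructed - residual).max())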



