Well, keyframes are a thing, and the still image is likely to be encoded a few hundred times over an average-length song. Especially in an HD video (which is what you want, since the audio bitrate grows along with the video quality), the video part could be massive compared to the actual music.
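
To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names are made up, and the song length, keyframe interval, keyframe size and audio bitrate are placeholder values, not measurements from any real encoder or site.

```python
def keyframe_count(duration_s: float, keyframe_interval_s: float) -> int:
    """Keyframes emitted if one lands at t=0 and then one per interval."""
    return int(duration_s // keyframe_interval_s) + 1


def video_bytes(duration_s: float, keyframe_interval_s: float, keyframe_bytes: int) -> int:
    """Lower bound on video size: counts keyframes only and ignores
    the (small) in-between frames of a still image."""
    return keyframe_count(duration_s, keyframe_interval_s) * keyframe_bytes


def audio_bytes(duration_s: float, audio_kbps: float) -> int:
    """Audio size at a constant bitrate given in kilobits per second."""
    return int(duration_s * audio_kbps * 1000 / 8)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    duration = 240          # a 4-minute song, in seconds
    interval = 1            # assumed: one keyframe per second
    keyframe_size = 100_000 # assumed: bytes per HD keyframe
    bitrate = 128           # assumed: audio bitrate in kbps

    print("keyframes:", keyframe_count(duration, interval))
    print("video bytes:", video_bytes(duration, interval, keyframe_size))
    print("audio bytes:", audio_bytes(duration, bitrate))
```

Plug in whatever interval and keyframe size your encoder actually produces; the point is only that the same picture gets paid for once per keyframe, so the video total scales with song length even though nothing on screen changes.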