Now, as to what's happening: you're basically applying a FIR filter to each pixel, so that each one also depends on the frequency information of adjacent pixels (in 2 dimensions).
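A minimal sketch of that idea in plain NumPy (names are illustrative, not from any particular library): each output pixel is a weighted sum of the input pixel and its neighbors.

```python
import numpy as np

def filter2d(image, kernel):
    """Apply a 2D FIR filter: out[y, x] = weighted sum of the neighborhood."""
    # Flip the kernel so this is a true convolution; for symmetric
    # kernels (blur, Laplacian) the flip makes no difference.
    kernel = kernel[::-1, ::-1]
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Pad by replicating edge pixels so the output has the same shape.
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    out = np.empty(image.shape, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(kernel * padded[y:y + kh, x:x + kw])
    return out
```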
Minor nitpicks: it's "convolved" with the pixels, and the FIR filter doesn't depend on the frequency information in the adjacent pixels, but rather on their intensities. A short FIR filter must have large frequency support, so the filtering depends on frequency information contributed by all pixels.
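To illustrate that last point, a quick NumPy sketch (the 3-tap kernel is just an example I picked):

```python
import numpy as np

h = np.array([1.0, 2.0, 1.0]) / 4.0   # a short (3-tap) smoothing filter
H = np.abs(np.fft.fft(h, 256))        # its frequency magnitude response

# H decays smoothly from 1 at DC toward 0 at Nyquist: a filter this short
# cannot be sharply frequency-selective, so its output mixes frequency
# content from everything under its support.
```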
That's a much more profound explanation than the one given. You can actually come up with values yourself then, and it reveals why cases like "edge detection" and "blur" work so nicely (edge detection approximates differentiation; blur acts as a low-pass filter; ...).
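E.g., those two cases with the standard 3x3 textbook kernels (a sketch assuming NumPy/SciPy; the image is a stand-in):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(64, 64)  # stand-in grayscale image

# Blur: a normalized box kernel averages each neighborhood, attenuating
# high spatial frequencies -- a crude low-pass filter.
box = np.full((3, 3), 1.0 / 9.0)

# Edge detection: a discrete Laplacian approximates the second derivative,
# so it responds strongly where intensity changes sharply and is ~0 on
# flat regions.
laplacian = np.array([[0.0,  1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0,  1.0, 0.0]])

blurred = convolve2d(image, box, mode="same", boundary="symm")
edges = convolve2d(image, laplacian, mode="same", boundary="symm")
```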
It's referred to as "convolution" in the image processing community too.