I would assume it is something similar to joining multiple frames (or their attention maps?) along the channel dimension and then shifting values between them, so that the convolution has access to some channels from other video frames.
I worked on a similar idea a few years ago using this paper as a reference, and it worked extremely well for consistency and also helped with flicker.
https://arxiv.org/abs/1811.08383
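
To make the "moving values inside" part concrete, here is a rough NumPy sketch of shifting channels between neighbouring frames. It is only my guess at the general idea, not the code from the paper or from any particular repo. The function name `temporal_shift`, the `fold_div` parameter, the `(T, C, H, W)` layout and the zero padding at the first and last frame are all assumptions made for illustration.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis (hypothetical helper).

    x is assumed to have shape (T, C, H, W): frames, channels, height, width.
    One fold of C // fold_div channels is pulled from the next frame, a second
    fold from the previous frame, and the remaining channels stay in place.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)  # assumption: zero padding where no neighbour exists

    # Fold 1: frame t receives these channels from frame t + 1.
    out[:-1, :fold] = x[1:, :fold]
    # Fold 2: frame t receives these channels from frame t - 1.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
    # Everything else is left untouched.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out

# Tiny demo: 3 frames, 8 channels, 1x1 "images", value = frame * 8 + channel.
frames = np.arange(3 * 8).reshape(3, 8, 1, 1)
shifted = temporal_shift(frames, fold_div=4)
```

A 2D convolution applied per frame after this shift sees a mix of channels from the current, previous and next frame, which is how I would expect it to help with temporal consistency without adding any extra parameters.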