I have an idea for you to try. Instead of training a model to produce subsequent animation frames (which is tough), take a model trained on pixel art sprites in general and add a ControlNet, feeding it either a pose skeleton or renders of a higher-res generic 3D dummy character made in Blender. Then generate output frame by frame, keeping the prompt the same but advancing the ControlNet input one pose per frame.
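To make that loop concrete, here's a minimal sketch using Hugging Face diffusers with an OpenPose ControlNet. Treat it as an illustration of the idea, not a tested recipe: the checkpoint IDs, pose-image paths, frame count, and seed are all assumptions you'd swap for your own.

```python
# Minimal sketch: fixed prompt + fixed seed, ControlNet pose input varies per frame.
# Model IDs and file paths below are assumptions for illustration.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "full-body game character, knight in armor, plain background"

# One pose image per animation frame (e.g. exported from Blender or an
# OpenPose rig); only this input changes between frames.
pose_frames = [load_image(f"poses/walk_{i:02d}.png") for i in range(8)]

for i, pose in enumerate(pose_frames):
    # Re-seeding identically each frame helps keep the character's identity
    # stable while the pose moves.
    generator = torch.Generator("cuda").manual_seed(42)
    frame = pipe(prompt, image=pose, generator=generator).images[0]
    frame.save(f"frames/walk_{i:02d}.png")
```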
To get it down to small-pixel 'sprite' scale, the right move may be to output 'realistic' character animation frames this way first, and then 'de-res' them via img2img into pixel art. The whole pipeline could be automated so that your only inputs are a single set of varied walking/posing/jumping ControlNet poses plus the prompts describing the characters.
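A sketch of that de-res stage, assuming diffusers' img2img pipeline for the restyle plus a nearest-neighbor downscale and palette quantize via Pillow. The strength, sprite size, and palette count here are guesses you'd tune:

```python
# Sketch of the de-res step: restyle each rendered frame toward pixel art with
# img2img, then hard-downscale and quantize to get true sprite-scale output.
# All parameters (strength, size, colors) are assumed starting points.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def to_sprite(path: str, size: int = 64, colors: int = 16) -> Image.Image:
    frame = Image.open(path).convert("RGB")
    # Low strength preserves the pose/silhouette; the prompt pushes the style.
    styled = pipe(
        prompt="pixel art sprite, flat colors, limited palette",
        image=frame,
        strength=0.4,
    ).images[0]
    # Nearest-neighbor resampling keeps pixels crisp; quantize caps the palette.
    small = styled.resize((size, size), Image.NEAREST)
    return small.quantize(colors=colors).convert("RGB")

for i in range(8):
    to_sprite(f"frames/walk_{i:02d}.png").save(f"sprites/walk_{i:02d}.png")
```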
Something like how this posing works: https://www.youtube.com/watch?v=CiG_v61cLxI