Oh, I think you're right. The nested FOR loops of the "Micropoly software rasterizer" are running on the GPU, I think.[1] I thought that was CPU code at first. They're iterating over the one or two pixels a tiny triangle covers, not a big chunk of screen. For larger triangles, they use the GPU's hardware rasterizer. The key idea is to have triangles be about pixel sized. If triangles are bigger than 1-2 pixels, and there's more detail available, use smaller triangles.
[1] https://youtu.be/eviSykqSUUw?t=2134
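
To make the "nested FOR loops over one or two pixels" idea concrete, here is a rough sketch in Python of a bounding-box rasterizer using edge functions. This is not the code from the talk; the function names, the pixel-center sampling, and the inclusive edge test (no top-left fill rule) are my own assumptions, and it runs on the CPU purely for illustration.

```python
import math

def edge(ax, ay, bx, by, px, py):
    """Twice the signed area of triangle (a, b, p); the sign tells which side of a->b p is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize_small_triangle(v0, v1, v2, width, height):
    """Return the set of (x, y) pixels whose centers fall inside the triangle.

    Hypothetical helper: vertices are (x, y) tuples in pixel coordinates.
    """
    area = edge(*v0, *v1, *v2)
    if area == 0:
        return set()          # degenerate triangle covers nothing
    if area < 0:
        v1, v2 = v2, v1       # normalize winding so inside means all edges >= 0

    # Clamp the triangle's bounding box to the screen. For a pixel-sized
    # triangle this box is only one or two pixels on a side.
    min_x = max(0, math.floor(min(v0[0], v1[0], v2[0])))
    max_x = min(width - 1, math.floor(max(v0[0], v1[0], v2[0])))
    min_y = max(0, math.floor(min(v0[1], v1[1], v2[1])))
    max_y = min(height - 1, math.floor(max(v0[1], v1[1], v2[1])))

    covered = set()
    for y in range(min_y, max_y + 1):          # the nested loops
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5          # sample at the pixel center
            if (edge(*v0, *v1, px, py) >= 0 and
                    edge(*v1, *v2, px, py) >= 0 and
                    edge(*v2, *v0, px, py) >= 0):
                covered.add((x, y))
    return covered
```

With a sub-pixel triangle like `rasterize_small_triangle((1.2, 1.2), (1.9, 1.3), (1.4, 1.9), 4, 4)` the loops visit exactly one pixel, which is why looping like this can be cheap for tiny triangles, while a big triangle would have a large bounding box and is better left to the hardware.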