It's a ControlNet model trained on SAM segmentation maps. The final model takes a prompt and a segmentation map as input and generates an image conditioned on both.
ControlNet is a neural network added to an already-trained model so that it can be conditioned on new inputs such as Canny edges, depth maps, or segmentation maps. ControlNet lets you train on the new condition "easily", without catastrophic forgetting (the original weights stay frozen and only the added network is trained) and without a huge dataset. In the repo linked by OP, they trained a ControlNet model on the segmentation maps generated by SAM: https://segment-anything.com/
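
If it helps, here is a toy sketch of why adding the new condition doesn't wreck the base model. It is not the repo's code and not a real diffusion model: the "blocks" are just small dense layers in plain Python, and every name (`frozen_w`, `copy_w`, `zero_in`, `zero_out`, `controlled_block`) is made up for illustration. The only idea it borrows from ControlNet is that a trainable copy is wired to the frozen model through zero-initialized layers, so at the start of training the output is exactly what the base model would produce.

```python
import random

def linear(weights, x):
    """Apply a dense layer: weights is a list of rows, x is a list of floats."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def make_weights(rows, cols, rng, zero=False):
    """Random weights, or all zeros when zero=True."""
    return [[0.0 if zero else rng.uniform(-1, 1) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)
DIM = 4  # toy feature size (assumption, purely illustrative)

frozen_w = make_weights(DIM, DIM, rng)             # pretrained block, never updated
copy_w = [row[:] for row in frozen_w]              # trainable copy of that block
zero_in = make_weights(DIM, DIM, rng, zero=True)   # zero-init layer for the condition
zero_out = make_weights(DIM, DIM, rng, zero=True)  # zero-init layer back into the base

def base_block(x):
    """Output of the original, frozen model block."""
    return linear(frozen_w, x)

def controlled_block(x, cond):
    """Frozen block plus a residual from the trainable copy."""
    # The condition (e.g. a segmentation map) enters through a zero-init layer.
    cond_proj = linear(zero_in, cond)
    copy_out = linear(copy_w, [a + b for a, b in zip(x, cond_proj)])
    # The copy's output is added back through another zero-init layer.
    residual = linear(zero_out, copy_out)
    return [a + b for a, b in zip(base_block(x), residual)]

x = [rng.uniform(-1, 1) for _ in range(DIM)]
cond = [rng.uniform(-1, 1) for _ in range(DIM)]

# Before any training, the added network contributes nothing.
same_at_init = controlled_block(x, cond) == base_block(x)
print("identical to base model at init:", same_at_init)
```

Training then only moves `copy_w`, `zero_in` and `zero_out` away from their starting values while `frozen_w` is left alone, which is roughly why the base model's knowledge survives.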