The human-face training data was probably much more uniform: well-lit photos of faces, centered in the frame, looking directly at the camera. The training set for this, by contrast, was probably any photos of cats, in any lighting conditions, with the cat in any pose.
It looks to me like whole cats are simply more complex than human faces; I suspect we'd get similar results with full human bodies. Cats have huge variation in color and pose that faces do not. An additional factor is that most portraits have the background in a separate focal plane, while most cats are photographed against a complex, in-focus background.
It would be interesting to see how realistic a cat finds these, maybe by measuring brain activity or reactions. It's possible that a cat wouldn't be fooled by generated cats we think look real, or, perhaps more interestingly, that it would be fooled by a not-particularly-good image.
Common cuckoos lay their eggs in other birds' nests. The chicks don't necessarily look much like the host species to the human eye, but they fool their hosts along the right dimensions to get food from them. It's an interesting question to what degree ML models trained to fool human perception could be foiled by an animal whose brain is wired for different perceptions, how feasible it is to train an ML model on animal perception, or whether it's possible to make an algorithm that successfully fools, say, both man and dog.
On the last point, for example: to make fake sounds that fool animals with different hearing ranges, presumably you'd have to be able to output sounds across the union of those ranges and train on sound data covering that same union (a rough sketch of the arithmetic follows below).
(Note: I'm not a biologist; if someone more informed wants to correct me on anything here, you're welcome to do so.)
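To make the union-of-ranges arithmetic concrete, here's a minimal Python sketch. The hearing-range figures are approximate textbook values and the whole thing is illustrative, a back-of-the-envelope assumption rather than any real training pipeline:

    # Illustrative only: approximate hearing ranges, not measured values.
    HEARING_RANGES_HZ = {
        "human": (20, 20_000),   # roughly 20 Hz to 20 kHz
        "dog": (67, 45_000),     # roughly 67 Hz to 45 kHz
    }

    def union_range(ranges):
        """Smallest single band covering every listed hearing range."""
        lows, highs = zip(*ranges)
        return min(lows), max(highs)

    low, high = union_range(HEARING_RANGES_HZ.values())

    # By the Nyquist criterion, the output (and the training audio) would
    # need a sample rate of at least twice the highest covered frequency.
    min_sample_rate = 2 * high
    print(f"Cover {low} Hz to {high} Hz; sample rate >= {min_sample_rate} Hz")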
In plenty of the results, I'm getting an OK cat face with a cat-like blob attached to it. So I'd say it's difficult for the model to discern any features in a mass of color-splotched fur.