They can't publish their training databases because that would mean publishing copyrighted material, which is illegal. They can only train on it, which is potentially fair use.
It would be more accurate to say that they don't publish their training databases (including sanitized pointers to the copyrighted stuff) because they aren't sure that training is fair use.
They are sure, however, that it involves copying that would otherwise be infringement. Citing "fair use" is effectively an admission of that copying, paired with the claim that this specific kind of use is allowed.
Is it because people are violating copyrights to train these AIs?