Private third-party GitHub repos is another good example. If licenses don't apply to training data, as GitHub has asserted, why not use those too? Do they think they'll get in trouble over it? Why doesn't the same trouble apply to my publicly-readable GPL-licensed code?
I assume there's something in their terms of service about not poking around in private repos and using the code even for internal purposes except for necessary maintenance like backups, court orders, etc.
I am not a lawyer but I also assume Microsoft's position, at least in part, is that they can download and use code in GitHub public repos just like anyone else can and developing a public service based on training with that (and a lot of other) code isn't redistributing that code.
Copyright is not the only law. Something might be permitted by copyright law (as fair use, an implied license, etc)-yet simultaneously violate other laws-breach of contract, misappropriation of trade secrets, etc.