Photos: if you build a Facebook app, you can probably ask your app's users for permission to access their photos. Also, the open image datasets for machine learning, like the COCO dataset, are pretty big. Can you really handle a lot more than that? Even Hinton starts with MNIST for new ideas like capsules.
Language modeling: Hacker News, public mailing lists, Wikipedia, GitHub.
Health: you can usually get data if you work at a hospital as an MD or researcher. You just need a reasonable idea and IRB approval. If you want the pharmacy data, I imagine you could get at it by going to work as a researcher at a pharma company, an insurer, or a retailer.
AlphaGo was built using publicly available games played by Go pros. AlphaGo Zero didn't even depend on human game data at all; it learned from self-play.
For AI, the limiting factors are ideas, code, time, hardware.
AlphaGo and AG0 were built with ridiculous amounts of compute power that Google donated to the effort. Replicating their results would cost millions of dollars.