The layers of extreme complexity in this situation are astounding.
The app has some text internally; it renders it by rasterizing fonts to a bitmap, and the OS takes that bitmap and composites it into the wider UI. Google Assistant grabs a screenshot of the fully composited, post-processed OS UI and sends it over the internet to a server, which uses an OCR model to read all the text and a different model to work out which text is relevant; the result is then sent back over the internet and displayed on the device.
I used to be upset at such absurdities. Now I try to appreciate what a relatively universal interface images are instead! :D It doesn’t matter how text gets drawn if we OCR it. (Though I feel better when we do so on device.)
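For what it's worth, the on-device version of that OCR step is roughly what ML Kit's bundled text recognizer gives you. A minimal sketch, assuming you already have the composited screenshot as a Bitmap (the function name here is just a placeholder):

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

fun recognizeScreenText(bitmap: Bitmap) {
    // Wrap the already-composited bitmap for ML Kit; no rotation assumed.
    val image = InputImage.fromBitmap(bitmap, 0)

    // The default Latin-script recognizer runs entirely on device.
    val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)

    recognizer.process(image)
        .addOnSuccessListener { result ->
            // result.text is the full recognized string;
            // result.textBlocks gives blocks -> lines -> elements with bounding boxes.
            println(result.text)
        }
        .addOnFailureListener { e -> e.printStackTrace() }
}
```

Picking which of those blocks is the "relevant" text is of course the part that still needs another model (or the user tapping on it).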
On Android it more or less just uses the accessibility APIs to grab the actual text; you can do it even without Google Assistant by selecting text inside an app's thumbnail window on the Recent Apps screen.
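That path skips the bitmap round trip entirely, because the app's view hierarchy already exposes its strings. A rough sketch of an AccessibilityService walking the node tree; the service name and what you do with the collected text are placeholders, and the usual manifest declaration plus accessibility-service config are assumed:

```kotlin
import android.accessibilityservice.AccessibilityService
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

class TextGrabberService : AccessibilityService() {

    override fun onAccessibilityEvent(event: AccessibilityEvent?) {
        // rootInActiveWindow is the node tree the app itself reported --
        // the real strings, no rasterization or OCR involved.
        val root = rootInActiveWindow ?: return
        val collected = mutableListOf<CharSequence>()
        collectText(root, collected)
        // Placeholder: do something useful with the text.
        collected.forEach { println(it) }
    }

    override fun onInterrupt() {}

    private fun collectText(node: AccessibilityNodeInfo, out: MutableList<CharSequence>) {
        node.text?.let { out.add(it) }
        for (i in 0 until node.childCount) {
            node.getChild(i)?.let { collectText(it, out) }
        }
    }
}
```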
All so the user can copy some text.