Maybe a dumb question - how does a model that is trained to predict the next word answer questions, as shown in the reading comprehension example? Do you just feed it the question and watch it generate the answer, or is something else going on?
If I remember correctly, they say that since the training set contains extracts of question-answer sessions, the model will detect the pattern and follow it when you give it an appropriate prompt.
So yes, you just feed it the question and, detecting that it is a question, it answers.
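To make that concrete, here's a rough sketch using the Hugging Face transformers library. The passage, question, and "Q:/A:" template are just illustrative, GPT-2 wasn't trained on any fixed format; it simply tends to continue whatever Q&A-looking pattern you condition it on:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Made-up passage and question, formatted as a Q&A-style prompt
passage = ("The Amazon rainforest covers much of the Amazon basin "
           "of South America.")
prompt = passage + "\nQ: What does the Amazon rainforest cover?\nA:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                      # greedy decoding
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
# Keep only the continuation past the prompt, i.e. the generated "answer"
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
print(answer)
```

No QA-specific training happens here; the "answer" is just the most likely continuation of the prompt.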
You add a linear classifier on top to predict the start and end positions of the answer span. The augmented model is then trained on a QA dataset like SQuAD to actually learn how to answer questions.
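The head itself is tiny; something like this in PyTorch (a hypothetical module, with the hidden size assumed to be BERT-base's 768):

```python
import torch.nn as nn

class SpanHead(nn.Module):
    """Linear classifier over each token's final hidden state,
    producing a start score and an end score per token."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        logits = self.qa_outputs(hidden_states)           # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```

Fine-tuning then minimizes cross-entropy between these logits and the gold start/end token positions from SQuAD.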
Hugging Face has a simple implementation that augments BERT in this manner, and you can see the code there. Their BERT QA model gets about an 84 F1 on SQuAD 1.1, which is really strong performance. You can augment their GPT-2 implementation similarly.
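For example, running one of their pretrained SQuAD-fine-tuned checkpoints looks roughly like this (the model class and checkpoint name are from the transformers library; the passage and question are made up):

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "What does the Amazon rainforest cover?"
passage = ("The Amazon rainforest covers much of the Amazon basin "
           "of South America.")

# BERT takes the question and passage as a sentence pair
inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Argmax over the start and end logits picks the predicted answer span
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```

Note the contrast with the prompting approach above: here the answer is extracted as a span of the passage rather than generated token by token.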