Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The training data is so large that it incidentally includes basically anything that Google would index plus the contents of as many thousands of copyrighted works that they could get their hands on. So that would definitely include some test prep books.


They seem to be taking this into account: We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. (this is from the technical report itself: https://cdn.openai.com/papers/gpt-4.pdf, not the article).


By the same token, though, whatever test questions and answers it might have seen represent a tiny bit of the overall training data. It would be very surprising if it selectively "remembered" exact answers to all those questions, unless it was specifically trained repeatedly on them.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: