Not sure how you would do that without having the ground truth to compare to. It's also very hard to measure once you start messing with the formatting (like converting it to markdown or suppressing page numbers and repeated headers/footers). I think it would also vary a lot depending on the quality of the original scan and the format and content of the document. There's really no substitute for just trying it on your document and then quickly looking through the output by hand (at least for now-- probably in a year models will be good enough and have big enough context windows to do this really well, too!).
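If you did have a hand-checked transcript for a test document, a rough sketch like this (plain Python, made-up file names, and the normalization step is exactly the part that gets fuzzy once you convert to markdown and drop headers/footers) is about as far as I'd trust an automated number:

    import difflib
    import re

    def normalize(text: str) -> str:
        """Strip formatting so the comparison measures content, not layout."""
        # Drop common markdown markers the conversion may have introduced
        text = re.sub(r"[#*_`>|]", "", text)
        # Drop lines that are just page numbers
        lines = [ln for ln in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
        # Collapse whitespace so reflowed paragraphs still line up
        return " ".join(" ".join(lines).split()).lower()

    def similarity(ocr_text: str, truth_text: str) -> float:
        """Rough character-level similarity (1.0 = identical after normalization)."""
        return difflib.SequenceMatcher(None, normalize(ocr_text), normalize(truth_text)).ratio()

    if __name__ == "__main__":
        # Hypothetical file names -- substitute your own output and hand-made transcript
        ocr = open("ocr_output.md", encoding="utf-8").read()
        truth = open("ground_truth.txt", encoding="utf-8").read()
        print(f"similarity: {similarity(ocr, truth):.3f}")

Even then, one number from one scan tells you more about that scan than about the model.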
Standard datasets can no longer be used for benchmarking against LLMs, since they have already been fed into the training data and are thus too well known to be representative of how the models handle lesser-known documents.
Oh, you meant for just a single benchmarked document. I thought you meant to report that for every document you process. I wouldn't want to mislead people by giving stats on a particular kind of scan/document, because those numbers likely wouldn't carry over in general.