I think I take something different away from the article. Yes, tokenizers are important, but they're a means to get at something much, much bigger: how to clean up and normalize unstructured data. It's a current endeavor of mine at $dayjob to figure out how to do this in a way that works reasonably well even for badly mangled documents. I don't have any silver bullets, at least nothing worthy of a blog post yet, but since this is needed when dealing with OCR'd documents, searching for "post-OCR correction" turns up quite a few different approaches.
And this is an aside, but I see folks using LLMs to do this correction in the first place. I don't think using LLMs for correction in a multi-pass system is inherently bad, but I haven't been able to get good results out of "call/response" (i.e. a prompt to clean up this text). The best results I've gotten come from running an LLM locally and cleaning incrementally, using token probabilities to help guide you. You get some candidate words from your wordlist based on a fuzzy match of the text you do have, and candidate words predicted from the previous text, and when both align -- ding! It's (obviously) not the fastest method, however.
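To make the "both align" idea concrete, here's a rough sketch, not my actual pipeline. It assumes a toy bigram counter standing in for the LLM's next-token probabilities, and stdlib `difflib` for the fuzzy match; `WORDLIST`, `build_bigram_model`, `fuzzy_candidates`, `predicted_candidates`, and `correct` are all made-up names for illustration.

```python
import difflib
from collections import Counter, defaultdict

# Hypothetical wordlist; a real one would be a full dictionary or domain lexicon.
WORDLIST = ["the", "quick", "quack", "brown", "brawn", "fox",
            "jumps", "over", "lazy", "dog"]

def build_bigram_model(corpus):
    """Count which word follows which. A stand-in for LLM token probabilities."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def fuzzy_candidates(token, wordlist, n=5, cutoff=0.6):
    """Wordlist entries that look like the (possibly mangled) token."""
    return difflib.get_close_matches(token.lower(), wordlist, n=n, cutoff=cutoff)

def predicted_candidates(prev_word, model, n=5):
    """Words the context model expects after the previous (already cleaned) word."""
    return [word for word, _ in model.get(prev_word, Counter()).most_common(n)]

def correct(tokens, wordlist, model):
    """Clean left to right; replace a token only when both candidate sets agree."""
    vocab = set(wordlist)
    cleaned = []
    for token in tokens:
        word = token.lower()
        if word in vocab:
            cleaned.append(word)
            continue
        fuzzy = fuzzy_candidates(word, wordlist)
        predicted = predicted_candidates(cleaned[-1], model) if cleaned else []
        agreed = [w for w in fuzzy if w in predicted]
        # No agreement: leave the token alone rather than guess.
        cleaned.append(agreed[0] if agreed else word)
    return cleaned

if __name__ == "__main__":
    model = build_bigram_model(["the quick brown fox jumps over the lazy dog"])
    mangled = "the qu1ck br0wn fox jumps ovcr the 1azy dog".split()
    print(" ".join(correct(mangled, WORDLIST, model)))
```

Note that "qu1ck" fuzzy-matches both "quick" and "quack"; it's the context prediction that breaks the tie. With a real local LLM you'd swap `predicted_candidates` for the model's top-k next tokens, which is where the slowness comes from.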
You might have better luck giving the LLM the original document and having it generate its own OCR independently, then asking it to tiebreak between its own generation and the OCR output while the image is still in the context window, until it is satisfied that it got things correct.