
To establish bounds in this way, you have to start with some claim about the distribution of the input data. In this case, the data is natural human language, so it's difficult or impossible to directly state what distribution the input was drawn from. Even worse, the prize is for compressing a particular text, not a sample from a distribution, so tight bounds are actually not possible to compute.
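As a minimal sketch of what such a distributional bound looks like (Python, with a made-up i.i.d. symbol distribution purely for illustration - real text is not i.i.d., so this only bounds the average code length under that assumed source):

    import math

    # Assumed source distribution over symbols (illustrative only).
    p = {'e': 0.4, 't': 0.3, 'a': 0.2, 'z': 0.1}

    # Shannon entropy in bits per symbol: H = -sum p(x) * log2 p(x)
    H = -sum(px * math.log2(px) for px in p.values())

    n_symbols = 10**9  # length of the text, in symbols
    lower_bound_bytes = H * n_symbols / 8
    print(f"{H:.3f} bits/symbol -> no lossless code beats ~{lower_bound_bytes/1e6:.0f} MB on average")

The bound applies to the expected length over the assumed distribution, which is exactly the kind of claim you have to start from; for one fixed text the relevant quantity is its Kolmogorov complexity, which isn't computable.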

There is some discussion on the Hutter prize page, under "What is the ultimate compression of enwik9?"

http://prize.hutter1.net/hfaq.htm




From this article:

More empirically, Shannon's lower estimate suggests that humans might be able to compress enwik9 down to 75MB, and computers some day may do better.
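For what it's worth, the 75MB figure falls out of a back-of-the-envelope calculation if you take Shannon's lower estimate of roughly 0.6 bits of entropy per character of printed English and treat enwik9's 10^9 bytes as one character each (both assumptions mine, just to show the arithmetic):

    # rough check of the 75MB figure (assumptions above)
    chars = 10**9                 # enwik9 is 10^9 bytes, ~1 character per byte
    bits_per_char = 0.6           # Shannon's lower estimate for English
    print(chars * bits_per_char / 8 / 1e6)  # -> 75.0 MB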


Notable that enwik8/9 isn't really just human text - roughly half of the data is XML markup, which may not have the same statistical properties as natural-language prose.


The current best compresses it to <15MB - but that's a program, not a human, which is what that estimate refers to. :)


That's enwik8. enwik9 is an order of magnitude larger.



