pxx on June 9, 2021 | on: Text Classification by Data Compression
Aren't the block sizes too small? gzip uses 64k block sizes and it seems like the compressed sizes are several times larger.
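A rough sketch of the measurement being questioned here, assuming the usual compression-based setup the reply below alludes to (test text appended to the very end of each class corpus, classes compared by how much the compressed size grows). The corpus dictionary, function names, and scoring rule are illustrative assumptions, not the article's exact code:

    import gzip

    def gz_size(data: bytes) -> int:
        """Length of the gzip-compressed byte string."""
        return len(gzip.compress(data))

    def append_score(corpus: str, test: str) -> int:
        """Extra bytes needed to encode the test text after the corpus.

        Smaller is better: if the test text resembles the corpus, gzip can
        reuse recent matches, so the compressed size grows less.
        """
        corpus_b = corpus.encode("utf-8")
        test_b = test.encode("utf-8")
        return gz_size(corpus_b + test_b) - gz_size(corpus_b)

    def classify(corpora: dict, test: str) -> str:
        """Pick the class whose corpus yields the smallest size increase."""
        return min(corpora, key=lambda label: append_score(corpora[label], test))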
w-m on June 9, 2021
How about interleaving the test data then, instead of appending it to the very end? For gzip, if the block size is 64k (another comment says 32k?), split the corpus text into 32k blocks, and interleave it with 32k blocks of the test set.
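A rough sketch of that interleaving idea, assuming 32 KiB chunks and a scoring rule analogous to the append-and-compare one above (both are assumptions, not something the article specifies). The motivation is that DEFLATE back-references only reach 32 KiB back, so placing each test chunk right after corpus material keeps that corpus inside the window the compressor can actually use:

    import gzip
    from itertools import zip_longest

    BLOCK = 32 * 1024  # 32 KiB chunks, matching DEFLATE's match window

    def chunks(data: bytes, size: int = BLOCK) -> list:
        """Split a byte string into fixed-size chunks."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def interleave(corpus_b: bytes, test_b: bytes) -> bytes:
        """Alternate 32 KiB chunks of corpus and test text; whichever
        runs out first is padded with empty chunks."""
        pairs = zip_longest(chunks(corpus_b), chunks(test_b), fillvalue=b"")
        return b"".join(c + t for c, t in pairs)

    def interleaved_score(corpus: str, test: str) -> int:
        """Size increase from weaving the test text into the corpus.

        The same test text is woven into every class corpus, so a smaller
        increase means that corpus context helped gzip more.
        """
        corpus_b = corpus.encode("utf-8")
        test_b = test.encode("utf-8")
        return (len(gzip.compress(interleave(corpus_b, test_b)))
                - len(gzip.compress(corpus_b)))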