Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Article has the numbers for their input:

uncompressed: 327005

(gzip) zopfli --i100: 75882

zstd -22 --long --ultra: 69018

xz -9: 67940

brotli -Z: 67859

lzip -9: 67651

bzip2 -9: 63727

bzip3: 61067

 help



This matches my experience compressing structured text btw. Bzip2 will beat every other tool out there, both on compression ratio and, sadly, decompression time

OP says decompression time is so high because it has similar properties to a memory-hard password hash: it's bandwidth-bound due to the random access requirement. Even xz decompresses 2.5x faster, and I don't find it particularly fast

This is why I switched away, also for text compression; searching for anything that isn't near the beginning of a large file is tedious. My use-case for compression is generally not like OP's, that is, compressing 100KB so that it can fit into Minecraft (if I understood their purpose correctly); I compress files because they take too much disk space (gigabytes). But if I never wanted to access them, I'd not store them, so decompression speed matters. So I kinda do agree with GP that Bzip2 has limited purposes when Zstd is just a few % more bytes to store for over an order of magnitude more speed (1GB/s instead of 45MB/s)

Edit: And all that ignores non-json/xml/code/text compression tasks, where Bzip2/LZMA doesn't give you the best compression ratio. I'd argue it is premature optimization to use Bzip2 without a very specific use-case like OP has for very good code compression ratios and a simple decoder


I wonder what the combined speed would be for small to mid-sized text files if they were fully loaded into memory first? That swaps multiple random-accesses for single sequential read (even if sequential is not really with flash memory, it should still perform better than totally random access), and memory random access which should not be a bottleneck at these speeds.

Or perhaps this is already being done this way and it is a bottleneck?

This might not work for multi-gigabyte files, but most of text content is well under 1MiB.


This is essentially already being done. The kernel will keep in RAM as much of the file that the system is operating on as will fit

Code can care about cache locality and improve upon this default behavior, like if you find a heuristic where you can do another part of the decompression already while some of the data is still in a CPU cache and you don't have to hit main RAM, but whether that's possible will depend on the algorithm

Being in RAM sadly doesn't mean that random access doesn't matter anymore. Just like how hard drives (compare to, say, tape drives) are random access, you still have swapping times to load the data into the CPU (or arm-moving times in the case of an HDD)


I tested bzip3 vs xz compressing enwik8: bzip3 was 7x faster and had ~20% better compression

I have also tested bzip3 and initially I encountered a great number of cases where bzip3 was much better than zstd from all points of view, i.e. compression ratio and execution times, for any zstd options.

Unfortunately, later I have also found other cases where the bzip3 performance was not so good.

I spend a lot of time on compression/decompression, but sadly there is no algorithm that can be said to be superior to all others in all circumstances, so for optimum results you must still be prepared to use any of them.

Even the ancient gzip is preferable in certain cases, when speed is more important than compression ratio, because sometimes it happens to provide a better compromise than newer algorithms.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: