It's good, but is it "the future" when it's extra work?
Consider that you could hand-code an algorithm to recognize cats in images, but we would rather let the machine just figure it out for itself. We're kind of averse to manual work and complexity wherever we can brute-force or heuristic our way out of the problem. For the 80% of situations where piping it into zstd keeps you within budget (bandwidth, storage, CPU time, whatever your constraint is), it's not really worth about 5000% more effort to squeeze out three times the speed and a third less size.
It really is considerably better, but I wonder how many people will actually do it. Fewer users means less implicit marketing from seeing it everywhere, as happens with the other tools, which means even fewer people will know to do it, and so on.
This seems very cool. Was going to suggest submitting it, but I see there was a fairly popular thread 5 months ago for anyone interested: https://news.ycombinator.com/item?id=45492803
The biggest savings for a service like GMail are going to be based around deduplication - e.g. if you can recognize that a newsletter went out to a thousand subscribers and store those all as deltas from a "canonical" copy - congratulations, that's >1000:1 compression, better than you could achieve with any general-purpose compression. Similarly, if you can recognize that an email is an Amazon shipping confirmation or a Facebook message notification or some other commonly repeated "form letter", you can achieve huge savings by factoring out all the common elements in them, like images or stylesheets.
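A minimal sketch of the delta idea, using zlib's preset-dictionary support with the "canonical" copy as the dictionary (the newsletter contents and names here are made up for illustration):

```python
import zlib

# Hypothetical sketch: store each subscriber's near-identical newsletter
# as a delta against one "canonical" copy, via zlib's preset dictionary.
canonical = (b"Weekly newsletter\n"
             b"Here are this week's stories, offers, and updates.\n" * 40)

def delta_compress(msg: bytes, base: bytes) -> bytes:
    # Compress msg with base as a preset dictionary, so shared runs
    # become back-references into base instead of literal bytes.
    c = zlib.compressobj(level=9, zdict=base)
    return c.compress(msg) + c.flush()

def delta_decompress(blob: bytes, base: bytes) -> bytes:
    # Decompression needs the exact same dictionary to resolve the refs.
    d = zlib.decompressobj(zdict=base)
    return d.decompress(blob) + d.flush()

# Each copy differs only in the greeting line.
for name in (b"Alice", b"Bob"):
    msg = b"Hello " + name + b",\n" + canonical
    blob = delta_compress(msg, canonical)
    assert delta_decompress(blob, canonical) == msg
    print(len(msg), "->", len(blob))
```

Real systems would use something like zstd's dictionary mode or a binary-diff format instead, but the principle is the same: per-copy storage cost shrinks to roughly the size of the differences.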
I kind of doubt they would do this to be honest. Every near-copy of a message is going to have small differences in at least the envelope (not sure if encoding differences are also possible depending on the server), and possibly going to be under different guarantees or jurisdictions. And it would just take one mistake to screw things up and leak data from one person to another. All for saving a few gigabytes over an account's lifetime. Doesn't really seem worth it, does it?
That's why you'd use a base and a delta. The PP was talking about a general compression algorithm; my question was different.
In line with the original comment, I was asking about specialized "codecs" for gmail.
Humans do not read the same email many times. That makes it a good target for compression. I believe machines do read the same email many times, but that could be architected around.
These and other email-specific redundancies ought to be covered by any specialized compression scheme. Also note that a lot of standard compression is deduplication; fundamentally, they are not that different.
Given that one needs to support deletes, this will end up looking like a garbage-collected, deduplicating file system.
Looks similar to OpenZL ( https://openzl.org/ )
"OpenZL takes a description of your data and builds from it a specialized compressor optimized for your specific format."
Honestly, OpenZL looks even cooler! It would be great to have it integrated with Parquet and Avro encoders. If I understand correctly, the compressed files should be decompressible with standard tools.
3-bit is a bit ridiculous. From that page I'm unclear whether the current model is 3-bit or 4-bit.
If it's 4-bit… well, NVIDIA showed that a well-organized model can perform almost as well as at 8-bit.
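For reference, the basic 4-bit trick under discussion is roughly this (a toy sketch of symmetric per-tensor quantization, not any vendor's actual scheme):

```python
# Toy sketch of symmetric 4-bit weight quantization, to illustrate the
# precision trade-off being discussed; not NVIDIA's actual method.
def quantize_4bit(weights):
    # Map the largest magnitude onto the int4 maximum (7), keep one
    # float scale per tensor, and round everything else to that grid.
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.91, -0.42, 0.07, -1.30, 0.55]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)  # each value lands within scale/2 of the original
```

The whole question of 3-bit vs. 4-bit vs. 8-bit is how coarse that grid can get before rounding error starts visibly hurting model quality.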