It's fantastic to see so much fascinating work in compression these days after a fair amount of stagnation. I know there has been a lot of specialized codec work, but the recent discoveries are very much for wide use.
You now have LZ4, Brotli, zstd, Snappy, lzfse and lzma, all pretty useful, practical codecs.
Brotli is interesting though. At level 1 it can be an easy replacement for zlib, with fairly higher compression than zlib at a similar speed.
Compared with lzma, it can handily beat it, but with even slower compression (at the levels they published in the benchmark); on the other hand it has much higher decompression speed, meaning it's very good for distribution work. It would be interesting to see the compression ratio and time for the levels between 1 and 9.
It's actually a much easier replacement for zlib than lzma is, for some uses. The benchmark shows only levels 1, 9 and 11. It seems that it can handily beat lzma, but at the cost of compression speed (I wonder which uses more memory). Then again, its decompression speed is so much better, making it a perfect choice for distribution.
What truly surprises me though is the work of Yann Collet. A single person so handily beating Google's Snappy. Even his zstd work looks groundbreaking. When I read a couple of weeks ago that he was a part-time hobby programmer, I just didn't know how to be suitably impressed.
Am I reading right that Apple's lzfse is based on a mix of LZ + zstd?
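Out of curiosity, that level sweep is easy to check for yourself; a rough sketch, assuming the third-party Python brotli bindings and the standard zlib module, on whatever corpus file you care about:

```
# Rough sketch: compression ratio and time for brotli quality levels 1-9
# versus zlib, on an arbitrary corpus. Assumes "pip install brotli".
import time
import zlib
import brotli

with open("corpus.html", "rb") as f:   # any representative test file
    data = f.read()

def measure(name, compress):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:12s} ratio={len(data) / len(out):5.2f} time={elapsed * 1000:8.1f} ms")

measure("zlib -6", lambda d: zlib.compress(d, 6))
for q in range(1, 10):
    measure(f"brotli -q{q}", lambda d, q=q: brotli.compress(d, quality=q))
```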
Research in compression hardly ever stalled; it just wasn't getting much publicity because it wasn't coming from Google. For example, looking at Brotli, it appears to be based on ideas very similar to those explored by the PAQ project [1] not so many years ago.
The web has definitely had a problem with legacy restrictions driving thinking. We lost nearly a decade with Internet Explorer stagnation and I think that caused a lot of people to assume that the odds of something new going mainstream were too low.
Now we're in a very different place where people upgrade software much faster than in the past and that's likely to start showing up in projects like this where it's actually plausible to think of something like Brotli seeing widespread usage within 1-2 years.
Ohh, I know there has been a lot of work on exotic codecs for a long while. I'm not an expert, but I have looked up what's out there fairly often. What I'm saying is that most of them had been too slow or too specialised to be used by most people. But in recent times those exotic codecs have been transformed and tuned into a lot of practical codecs.
FSE is an implementation of the new ANS entropy coding, also used e.g. in LzTurbo 1.2. It's surprising that, while still using Huffman coding, Brotli is slowly approaching its performance:
I'm curious about the viability of the following approach. Could a generic static dictionary be held on the client side, such that the compressor/decompressor can use it efficiently? This would avoid the need to send the dictionary along with the compressed package every time. Even at an 80/20 or 90/10 success rate (with a fallback to the worst case, of course), wouldn't this be a great advancement and reduce massive network load? With modern hard drive sizes, many could spare a few gigabytes, which could be dialed up or down depending on how optimized you'd want it. I would think we could identify particular binary patterns based on the user's transfer habits (e.g. downloading text vs. application vs. audio vs. video) and have different dictionaries optimized to that usage (e.g. a Roku would just store dictionaries optimized for video).
Yup, that's called SDCH (shared dictionary compression for HTTP, colloquially "sandwich"), which was first proposed in 2009 by Google. Brotli supports SDCH, but it's an additional component.
Is there anything out there that lets you have "pluggable" dictionaries? I have a client-server communication that often sends 90% of the same words. I'd like to build this dictionary on the server and ship it down to the clients.
I prototyped something myself, and SDCH for the browser was the closest I could find.
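For prototyping that, plain zlib already supports preset dictionaries, so you don't strictly need SDCH for the server-to-client case; a minimal sketch using Python's standard zlib module (the dictionary bytes below are just a made-up illustration):

```
# Minimal sketch of a "pluggable" preset dictionary with plain zlib.
# Both sides must use the exact same dictionary bytes; building it from
# your most common words and shipping it to clients is up to you.
import zlib

# Hypothetical dictionary built offline from frequent protocol strings.
shared_dict = b'{"status": "ok", "user_id": , "timestamp": , "error": null}'

message = b'{"status": "ok", "user_id": 42, "timestamp": 1441234567}'

# Server side: compress with the preset dictionary.
comp = zlib.compressobj(level=9, zdict=shared_dict)
packed = comp.compress(message) + comp.flush()

# Client side: decompress with the same dictionary.
decomp = zlib.decompressobj(zdict=shared_dict)
assert decomp.decompress(packed) + decomp.flush() == message

print(len(message), "->", len(packed), "bytes with the shared dictionary")
```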
Note that this bug originally seems to have been called something like "Implement LZMA compression", and the early comments are about that; it's some way in before someone says "oh, and Google have this new Brotli thing which might be even better".
Mozilla can be quite slow with certain features; for example work on U2F support hasn't even begun one year after a ticket was made (https://bugzilla.mozilla.org/show_bug.cgi?id=1065729). I'm not exactly holding my breath.
They've already tried to land the code to add brotli support but had to back it out because of an Android build failure. They're moving pretty quick on this. It helps that they already had brotli for WOFF so there is less review required.
I would be in favor of any new technology for the web requiring TLS. It provides both a carrot (of new features) and a stick (of falling behind competitors) for people to get off their asses and secure the web from a whole host of attacks.
Even if your page doesn't have sensitive information on it, an insecurely loaded page provides an attacker the avenue to inject potentially malicious code. This will be the case until the entire web is HTTPS-enabled.
CRIME: TLS compression can reveal private headers, like auth cookies. Fixed by turning off TLS compression. Not applicable to HTTP because HTTP never had header compression.
BREACH: compression of a response body that contains (a) something attacker-controlled and (b) something private and unchanging can reveal that secret, provided (c) the response length is visible to an attacker. Doesn't require HTTPS.
If an attack applied, it would be one like BREACH. Which isn't surprising: this is a direct replacement for "Accept-Encoding: gzip / Content-Encoding: gzip" and so we should expect it to be in the same security situation.
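To make the BREACH-style length leak concrete, here's a tiny illustration of the principle (not an actual attack) using Python's zlib: a guess that matches a secret elsewhere in the same compressed body comes out measurably shorter.

```
# Illustration of the compression length oracle behind CRIME/BREACH:
# a correct(ish) guess duplicates the secret, so the body compresses better.
import zlib

secret = b"csrf_token=QJX7R2PKM4V8TZ1DWY6B"   # made-up secret in the page

def compressed_size(guess: bytes) -> int:
    page = b"<html>" + secret + b"<p>search: " + guess + b"</p></html>"
    return len(zlib.compress(page, 9))

for guess in (b"csrf_token=M3W9ZTA1FKBQ7XC5HJN2",   # wrong guess
              b"csrf_token=QJX7R2PKM4V8TZ1DWY6B"):  # matching guess
    print(guess.decode(), "->", compressed_size(guess), "bytes")
# The matching guess yields a smaller compressed response; observing such
# length differences, one guessed character at a time, is the essence of BREACH.
```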
> Unlike other algorithms compared here, brotli includes a static dictionary. It contains 13’504 words or syllables of English, Spanish, Chinese, Hindi, Russian and Arabic, as well as common phrases used in machine readable languages, particularly HTML and JavaScript.
They could at least have measured the other algorithms with the same dictionary for fairness.
"We hope that this format will be supported by major browsers" - if I'm not mistaken it should also be implemented by all major webservers. In that regard I'm hoping Opera open sources or sells their OMPD proxy platform. Someone made a chrome extension which interfaces with their network and the speed is amazing. OMPD is a tradeoff and breaks certain javascript and css and is different from Opera Turbo, I am not sure what compression Turbo uses currently. Does anyone know if Brotli gets implemented in the google server and proxy compression? http://browsingthenet.blogspot.nl/2014/09/chrome-data-compre...
> I'm hoping Opera open sources or sells their OMPD proxy platform.
They won't open source it: the compression tech has for a number of years accounted for a large proportion of Opera's (browser) income. I also don't actually know what OMPD is. Opera Mini server? That's tied sufficiently closely to Presto that it'll only ever get released if Presto does.
The more recent Opera Turbo and the Opera Mini 11 for Android "high-compression mode" (though I have no idea how that really differs from Opera Turbo! [edit: per @brucel it is Opera Turbo]) are certainly available for licensing; Yandex Browser supports Opera Turbo, for example.
Disappointing that they aren't dissolving the umlaut and are instead stripping it. 'Broetli' would be equivalent to 'Brötli', yet they remove it (and thus change the sound) to create 'Brotli'.
This way, very much umlaut-capable, but not at all swiss-capable Germans won't be tempted to pronounce it in a terribly wrong and insulting imitation of some swiss dialect. Serious international tensions will be avoided by clearly separating the name from the language it is derived from. If your company motto was "don't be evil", you surely would not want your compression algorithms to cause naval standoffs on Lake Constance!
Also, simply dropping those dots where they actually belong is refreshingly post-metal-umlaut.
Er, no, the transliteration of "ö" to "oe" does not change the sound. This is considered a feature of the orthography of German, and you see this kind of thing in, e.g., crossword puzzles.
I meant transliteration to English. No English speaker will pronounce the o with the umlaut, so you have to pick how you want English speakers to pronounce your word from the two alternatives above. Bro-li sounds much better to me than bro-ee-li.
Sometimes they do, sometimes they don't. It doesn't really have any consistent English pronunciation, since it appears in a bunch of different morphological contexts, and in loanwords from at least five or six different languages. Some examples: phoenix (1 syllable, 'ee'), Zoe (2 syllables, 'oh-ee'), Joe (1 syllable, 'oh'), Joel (either 1 or 2 syllables, depending on schwa insertion), canoe (1 syllable, 'oo'), Dostoevsky (2 syllables, 'oh-eh' or 'oh-yeh'), coed (2 syllables, 'oh-eh'), etc.
If I saw something like Broetli and didn't immediately recognize the Germanic origin, I could see myself easily misanalyzing it as Bro-etli and pronouncing it that way.
Really like how big and influential companies are working on foundational technologies like compression, which smaller companies would have no chance of popularising. Earlier this summer Apple introduced lzfse. But Brotli seems even more amazing, as its compression ratio seems to match that of lzma. Wonderful.
If Google wants me to personally adopt this outside of WOFF2 (WOFF2's major change is the switch to Brotli over deflate), then they should submit a patch to nginx adding modules named brotli and brotli_static, plus a gzip-like command-line tool for Brotli.
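In the meantime, the command-line half of that wish is only a few lines with the Python brotli bindings; a rough gzip-like sketch (the script name, the .br suffix and the quality-11 default are just assumptions):

```
#!/usr/bin/env python3
# Rough gzip-like wrapper around the Python brotli bindings:
#   brotlize.py file        -> writes file.br
#   brotlize.py -d file.br  -> writes file back out
import sys
import brotli

def main(argv):
    decompress = "-d" in argv
    for path in (a for a in argv if not a.startswith("-")):
        with open(path, "rb") as f:
            data = f.read()
        if decompress:
            out_path = path[:-3] if path.endswith(".br") else path + ".out"
            out = brotli.decompress(data)
        else:
            out_path = path + ".br"
            out = brotli.compress(data, quality=11)
        with open(out_path, "wb") as f:
            f.write(out)
        print(f"{path} -> {out_path} ({len(data)} -> {len(out)} bytes)")

if __name__ == "__main__":
    main(sys.argv[1:])
```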
Yes and no. It is very common to simply strip umlauts and dashes when converting to ASCII compatible writing.
I even do it with my name most of the times. A Chinese official that sees Ø in my passport won't understand why I write 'OE', and might even start questioning if it is the same name, but 'O' never fails.
Airlines can read the gibberish in the bottom of my passport and see that the transliteration is actually 'OE', but most places don't have a scanner for that.
I have an actual question.
Who can name all the decompressors that outperform Brotli both in compression ratio and decompression speed?
As I see it, that's the ultimate goal, to boost I/O via transparent decompression.
Reading the PDF paper, I see they use a Xeon with a price tag in the range of $560 for benchmarking; how can this be reconciled with hot phrases like boosting browsing on mobile devices?!
Internet Scenario Benchmark (browser plugin simulation):
html8 : 100MB random html pages from a 2GB Alexa Top sites corpus.
number of pages = 1178
average length = 84886 bytes.
The pages (length + content) are concatenated into a single html8 file, but compressed/decompressed separately.
This avoids the caching scenario seen in other benchmarks, where small files are processed repeatedly in the L1/L2 cache, showing unrealistic results.
size: 100,000,000 bytes.
Single thread in memory benchmark
cpu: Sandy Bridge i7-2600K at 4.2 GHz, all with gcc 5.1, Ubuntu 15.04
LzTurbo benefits a lot from bigger blocks; in the above example it uses 4 MB blocks on a single thread, and look how fast it is. If 16 threads were used, what would the outcome be... maybe 0.100 sec?!
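For anyone wanting to reproduce the shape of that methodology (per-page compression rather than one big block), a rough sketch, with zlib standing in for whichever codec is under test and made-up page contents:

```
# Sketch of the "concatenated corpus, processed per page" methodology:
# every page is compressed and decompressed on its own, so per-page
# overheads are included (with a real 100 MB corpus this also avoids
# everything staying hot in the L1/L2 cache).
import time
import zlib   # stand-in codec; swap in brotli/zstd/etc. for real tests

pages = [b"<html>page one</html>" * 500,   # made-up page contents
         b"<html>page two</html>" * 800]

t0 = time.perf_counter()
compressed = [zlib.compress(p, 9) for p in pages]
t1 = time.perf_counter()
restored = [zlib.decompress(c) for c in compressed]
t2 = time.perf_counter()

assert restored == pages
total_in = sum(len(p) for p in pages)
total_out = sum(len(c) for c in compressed)
print(f"ratio {total_in / total_out:.2f}, "
      f"compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s")
```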
You miss the point: your list is meaningful only for general compression cases, whereas the Brotli thread (where we are) is all about boosting textual decompression while using few resources (mostly RAM), as in the case of web browsers.
As one of the co-authors (Jyrki) commented:
```
For more clarity on the situation, you could compare LZMA, LZHAM and brotli at the same decoding memory use. Possibly values between 1-4 MB (window size 20-22) are the most relevant for the HTTP content encoding. Unlike in a compression benchmark, there are a lot of other things going on in a browser, and the allocated memory at decoding time is a critically scarce resource.
```
Source:
http://google-opensource.blogspot.bg/2015/09/introducing-bro...
The goal is to receive those 812,392,384 bytes in our browser as quickly as possible; in the first case the winner is the 77,286,010-byte compressor, in the second the winner is the 90,239,627-byte compressor, yes?
In the first case: transfer_time + decompression_time = 77 s + 7 s = 84 s
In the second case: transfer_time + decompression_time = 9 s + 1 s = 10 s
Now you see that even in the web-browsing scenario the best performer is not established, right?
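In other words, the winner flips with the assumed link speed; a small sketch of that trade-off arithmetic (the decompression speeds below are made-up illustrative figures, chosen only to roughly mirror the numbers above):

```
# Total delivery time = transfer time + decompression time. Which
# compressor "wins" depends on the link speed, so even for the
# web-browsing scenario there is no single best performer.
ORIGINAL = 812_392_384            # bytes we want to end up with in the browser

candidates = {                    # (compressed size, decode speed MB/s) - made up
    "stronger, slower-to-decode codec": (77_286_010, 120),
    "lighter, faster-to-decode codec":  (90_239_627, 900),
}

for link_mb_per_s in (1, 10, 100):
    print(f"link speed {link_mb_per_s} MB/s:")
    for name, (size, decode_mb_per_s) in candidates.items():
        total = size / (link_mb_per_s * 1e6) + ORIGINAL / (decode_mb_per_s * 1e6)
        print(f"  {name:34s} {total:7.1f} s")
```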
I'm just wondering what it's like for a data compression expert, whether in algorithms or coding. It must be really wonderful and energising to see such interest and momentum in one's field.
I wonder if anyone working in the field would have a comment on that.
It's strange to see an "introducing" post for an algorithm that's already included in a deployed technology (WOFF2), especially one that doesn't point that out.
The fundamentals of the CRIME attack work with any compression algorithm. That's why HTTP 2.0 doesn't use compression for headers and sends deltas instead.
>What's the point of mentioning a fictional compression startup as a comment on the algorithm made by Google?
A joke, or a pop culture reference. Stating a shared experience that others may also glean some enjoyment out of, or to create the illusion of a shared community/experience to help maintain the illusion that we are not just meaningless souls adrift in an indifferent, eternal void, whose incomprehensible size only reinforces the meaninglessness of our existence.
It's a kind of spam to me, as I still, even after your explanation, don't see it as any more related than spam is to eggs and bacon. So please, user csimy, "could you do the egg, bacon, spam and sausage without the spam then?" (1)
1) A reference to Monty Python, which in this context demonstrates the relativity of the "shared experience." Even though it gave electronic spam its name!
Then flag/downvote it. Commenting about how unproductive an unproductive discussion is just further propagates the original unproductive discussion [1].
On one hand I understand it; pop culture references have been a thing for ages and ages. On the other hand, I don't know, because many take it to an extreme (see: Reddit).
Considering that Google's biggest press release of the year included a reference to that particular fictional world[1], and that we're talking about compression technology, I would say it's appropriate.
> The higher data density is achieved by a 2nd order context modeling, re-use of entropy codes, larger memory window of past data and joint distribution codes.
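As a loose illustration of what second-order context modelling buys (this is a generic order-2 byte model, not Brotli's actual context mapping), compare the empirical entropy with and without conditioning on the previous two bytes:

```
# Loose illustration of order-2 context modelling (not Brotli's scheme):
# keeping symbol statistics per "previous two bytes" context gives much
# sharper distributions, which entropy-code into fewer bits per byte.
from collections import Counter, defaultdict
import math

data = open("corpus.html", "rb").read()    # any sample text

order0 = Counter(data)
order2 = defaultdict(Counter)
for i in range(2, len(data)):
    order2[data[i - 2:i]][data[i]] += 1

def entropy_bits(tables):
    """Bits needed if each symbol cost -log2 p(symbol | its context)."""
    bits = 0.0
    for counts in tables:
        total = sum(counts.values())
        bits += sum(c * -math.log2(c / total) for c in counts.values())
    return bits

print("order-0 bits/byte:", entropy_bits([order0]) / len(data))
print("order-2 bits/byte:", entropy_bits(order2.values()) / len(data))
# (This ignores the cost of describing the model itself, which real
# formats must pay via entropy-code descriptions or adaptive coding.)
```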
But it's actually, surprisingly, arguably, a valuable metric which, for whatever reason, had never been produced "in the real world" before, until the show requested it from actual researchers:
It's a terrible metric that fails basic dimensional analysis. Anytime you see a ratio of logs, you should already be suspicious. Say algorithm A is twice as fast as algorithm B. Then we'll have alpha * (r/R) * log(T)/log(t). The alpha * (r/R) part is constant, so the only interesting part is log(T)/log(t) = log(2t)/log(t) = log(2)/log(t) + 1. We can make this apparently unitless number take any value we want just by changing the unit used to measure time (or, if a unit were fixed, by changing the amount of data used to test).
It's sort of vaguely superficially sensible, but the idea of being able to reduce comparison of compression algorithms (which fundamentally trade off speed for compression ratio) to a 1-dimensional value is laughable. Charles Bloom's Pareto frontier charts (http://cbloomrants.blogspot.com/2015/03/03-02-15-oodle-lz-pa...) are one of the more reasonable options.
The metric divides compression ratio by log compression time. (Ignore for now the normalization to a standard compressor.) This is:
r / log(T)
That doesn't seem like a good metric, because it overvalues changes in T. For example, say we currently manage 10% compression (ratio = 100/90 = 1.11) and it takes us 16 ms. Using log base 2, that's
r / logT = 1.11 / log(16) = 0.2775
Now we have two proposals. One brings us from 10% compression to 55% compression (ratio = 100/45 = 2.22), while the other drops compression time to 4 ms:
r / logT = 2.22 / log(16) = 0.555
r / logT = 1.11 / log( 4) = 0.555
But improving compression by 5x matters more than improving speed by 4x. Take this to the extreme: a compressor that exits immediately leaving its input unchanged has the best possible score here, despite being useless.
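Both objections are easy to see numerically with the simplified r / log(T) form from the example above (made-up figures, log base 2 as above):

```
# The simplified metric r / log2(T): it scores a 4x speedup the same as
# doubling the ratio here, and its value depends on the time unit chosen.
import math

def score(ratio, time_value):
    return ratio / math.log2(time_value)

print(score(1.11, 16))     # 0.2775  baseline (16 ms)
print(score(2.22, 16))     # 0.555   double the ratio
print(score(1.11, 4))      # 0.555   same score for merely being 4x faster

# Same compressor, same physical speed, different unit for T:
print(score(1.11, 16))     # T measured in milliseconds
print(score(1.11, 0.016))  # T measured in seconds: the score even flips sign
```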
Sure, but "lol what's the weissman score HASHTAG SILICONVALLEYTHESHOW \m/" doesn't lead me to think that the poster was legitimately asking for the score so they could compare.
Yeah, it was a pretty painful post, but I would want to see the score. Its performance over gzip would be a more intuitive reference point than the '26% over Zopfli.'
"At Google, we think that internet users’ time is valuable, and that they shouldn’t have to wait long for a web page to load."
Says the company whose business model consists of convincing publishers to put ads on their websites, thus slowing down page load times. The same company that lets you integrate web elements like scripts and web fonts from their servers, once again making your webpage slower to load.
I know I'm exaggerating a bit, but I really hate these company-mission pitches that people in big companies constantly use to open their technical communications.
"At Walmart, we think that users' money is valuable, and that they shouldn't have to pay extra for their toilet paper."
Does the fact that Walmart charges money for their products make the above sentence disingenuous? Of course Walmart charges money, that's how their company works. That doesn't mean they can't try to reduce waste or inefficiency in their system.
Another thing too - I've found that other advertising platforms' ads are much more obnoxious. Google does a pretty decent job of keeping their ads from popping up in your face in technicolor, which I can appreciate.
You are exaggerating, and your rant is not fair or relevant. They are being honest with that sentence, and this is potentially a big contribution to the software world and the Internet.
I wasn't diminishing the technology's importance in any way.
My point was actually the opposite: why spoil a good communication about a technological breakthrough with such lazy and fake-sounding company propaganda?
"At Google, we think that internet users’ time is valuable, and that they shouldn’t have to wait long for a web page to load."
There is plenty of evidence of a significant drop in user numbers with increasing page load times.
An advertising company has to ensure that their ads are not only displayed but that there is also a reasonable chance that they will be noticed. Otherwise it becomes a pointless exercise, advertisers leave and revenues drop.
Therefore what is really going on here is Google trying to squeeze in more advertising before annoying the users to the point of leaving the page.
Yes, I agree, advertising in itself is inimical to users' time, and thus the above sentence can be seen as deeply hypocritical.
I dunno. Ads on Google search don't delay your results in any noticeable way. Wouldn't it be great if all ads were like that, given that ads are currently the only viable option for a wide variety of sites to stay in business?
The answer is a resounding yes, but I was just criticising the hypocritical nature of such a sentence, which is totally gratuitous in a post made to communicate a technological advancement. It sounds as fake as corporate bullshit can be.
Why would anyone trust this kind of corporate BS? It baffles me that you even try to take what they say at face value.
For me, it's akin to state propaganda in non-democratic countries: everyone knows that whatever authorities say and what the truth is are two different things, so there's little point to even analysing the official message (except maybe for humor factor etc.)
If they thought my time was valuable, they would not put colored strips at the top of every screen that say "Switch to chrome" or "Switch to GMail" or even after I've switched to Chrome, "Chrome is not your default browser." If they thought my time was valuable, they wouldn't be wasting it with their constant desperate pleas to browse the internet in precisely the way they want. If they really valued my time, they wouldn't throw random context switches into every single one of their web properties for whatever is the product du jour.
If they wanted my page to load faster, they would work as hard on making their stupid website add-ons load fast as they worked on getting everyone on earth to install them. Running local mirrors of ajax.googleapis.com and fonts.googleapis.com, along with hijacking analytics and doubleclick and returning 0-byte files, is the best thing I ever did for my poor parents' satellite internet connection.
Internet users' time (and site loading speeds) are very much second- or third-class items on Google's list of things to give a shit about. I'm absolutely not questioning these decisions; Google is an advertising agency and must prioritize this over my convenience! But pretending they're some kind of altruistic charity working for the common good is disingenuous and slightly offensive.
It also creates the problem where some of the more gullible people actually believe google, a huge faceless global company, gives a shit about anyone in particular, which is demonstrably untrue.
eh, between Chrome DevTools, PageSpeed Insights, SERP hits based on load time, SPDY/HTTP2, WebM, new compression algorithms, etc etc I would say Google has done more than a little to help reduce wait times for web pages.
Gmail has a loading bar because it has a massive storage backend, not because it's sending markup to your browser. Comparable products like HN favorite FastMail also have loading progress bars.
Yes, actually the goal of all companies is to make as much profit as possible. I know, it annoys me too; for example, Apple says their mission is to "leave the world better than how we found it", but I'm skeptical about that ;-)
It can be used in several situations; one could be:
The acts of a person imply that the person's motivation is not to use a product or comment on its qualities, but to bash the producer/creator using the context of the product. So the ulterior motive is to punish the creator; the actual product is only a medium. (grape: product - vineyard keeper: producer)
Sorry, it sounded a bit harsh. I also agree that a dry, more technical blog entry would be better, but I think it is not a big deal considering the importance of the product.
"We hope that this format will be supported by major browsers in the near future, as the smaller compressed size would give additional benefits to mobile users, such as lower data transfer fees and reduced battery use."
Is this the same story as every time G. invents something: implement it in Chrome and then gain an advantage, because it gets taken as a standard while other browsers haven't implemented it yet?
There doesn't seem to be any technical downside to implementing it, and likely significant upside. It's Apache-licensed, which grants both copyright and patent rights covering the code. The spec is published as an Internet Draft, scheduled to become an Informational RFC. What's not to like? I'm not one to drink the Google kool-aid, but seriously, what else would you like Google to do?
Firefox's Brotli support is already done. Chrome's isn't.
By the way, new features are generally created this way. They are added to browsers way before they are standardized. You see, convincing the other browser vendors isn't easy. You need very compelling arguments.
Take WebP, for example. It provides massive benefits. Especially if you can use lossy RGBA instead of PNG32 (easily 80% smaller). And yet, Mozilla shows little interest in implementing it.