Introducing Brotli: a new compression algorithm for the internet (google-opensource.blogspot.com)
364 points by rey12rey on Sept 22, 2015 | 145 comments



It's fantastic to see so much fascinating work in compression these days after a fair amount of stagnation. I know there has been a lot of specialized codec work, but the recent developments are very much aimed at wide use.

You now have LZ4, Brotli, zstd, snappy, lzfse and lzma, all pretty useful practical codecs.

Brotli is interesting though. It can be an easy replacement for zlib: at level 1 it gives noticeably higher compression than zlib at a similar speed.

Compared with lzma, it can handily beat lzma's ratio, but with an even slower compression speed (at the levels they published in the benchmark), while offering much higher decompression speed, which makes it very good for distribution work. It would be interesting to see the compression ratio and time for the levels between 1 and 9.

It's actually a much easier replacement for zlib than lzma for some uses. The benchmark shows only levels 1, 9 and 11. It seems that it can handily beat lzma, but at the cost of compression speed (I wonder which uses more memory). Then again, its decompression speed is so much better, making it a perfect choice for distribution.
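A quick way to sanity-check the level-1 claim yourself (a rough sketch, assuming the Python bindings from the brotli repo are installed and accept a quality argument; the input file name is just a placeholder):

    import time, zlib
    import brotli  # Python bindings built from github.com/google/brotli

    data = open("page.html", "rb").read()  # placeholder test input

    def bench(name, fn):
        t0 = time.time()
        out = fn(data)
        print(name, len(out), "bytes in", round(time.time() - t0, 3), "s")

    bench("zlib -1    ", lambda d: zlib.compress(d, 1))
    bench("zlib -9    ", lambda d: zlib.compress(d, 9))
    bench("brotli q=1 ", lambda d: brotli.compress(d, quality=1))
    bench("brotli q=11", lambda d: brotli.compress(d, quality=11))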

What truly surprises me though is the work of Yann Collet. A single person so handily beating Google's snappy. Even his zstd work looks groundbreaking. When I read a couple of weeks ago that he was a part-time hobby programmer, I just didn't know how to be suitably impressed.

Am I reading it right that Apple's lzfse is based on a mix of lz+zstd?


Research in compression hardly ever stalled; it just wasn't getting much publicity because it wasn't coming from Google. For example, looking at Brotli, it appears to be based on ideas very similar to those explored by the PAQ project [1] several years ago.

[1] https://en.wikipedia.org/wiki/PAQ


The web has definitely had a problem with legacy restrictions driving thinking. We lost nearly a decade with Internet Explorer stagnation and I think that caused a lot of people to assume that the odds of something new going mainstream were too low.

Now we're in a very different place where people upgrade software much faster than in the past and that's likely to start showing up in projects like this where it's actually plausible to think of something like Brotli seeing widespread usage within 1-2 years.


On top of that Google can unilaterally upgrade the servers that serve the most popular site on the internet and ~40% of all browsers at the same time.


Ohh, I know there has been a lot of work on exotic codecs for a long while. I'm not an expert, but I look at what's out there fairly often. What I'm saying is that most of them had been too slow or too specialised to be used by most people. But recently those exotic codecs have been transformed and tuned into a lot of practical codecs.


7zip's implementation of zip has been faster and more effective than zlib for a long time now.


I thought lzfse was deflate with finite state entropy instead of Huffman:

http://fastcompression.blogspot.de/2013/12/finite-state-entr...


FSE is an implementation of the new ANS entropy coding, also used e.g. in lzturbo 1.2. It's surprising that, while still using Huffman, brotli is slowly approaching its performance:

http://encode.ru/threads/2313-Brotli?p=44970&viewfull=1#post...


I'm curious about the viability of the following approach. Could a generic static dictionary be held on the client side in a form the compressor/decompressor can use efficiently? This would avoid having to send the dictionary along with the compressed payload every time. Even at an 80/20 or 90/10 hit rate (with a fallback to the worst case, of course), wouldn't this be a great advancement and reduce massive network load? With modern disk sizes, many people could spare a few gigabytes, which could be dialed up or down depending on how optimized you want it. I would think we could identify particular binary patterns based on the user's transfer habits (e.g. downloading text vs. application vs. audio vs. video) and keep different dictionaries optimized for that usage (e.g. a Roku would just store dictionaries optimized for video).


Yup, that's called SDCH (shared dictionary compression for HTTP, colloquially "sandwich"), which was first proposed in 2009 by Google [1]. Brotli supports SDCH, but it's an additional component.

1: https://engineering.linkedin.com/shared-dictionary-compressi...


That sounds a lot like sdch, which you can actually use today if you're serving to Chrome or Opera. (Look for "Accept-Encoding: sdch".)


Brotli has a static dictionary, an appendix of the draft: https://tools.ietf.org/html/draft-alakuijala-brotli-01. It comes up a few times in the comments here: https://news.ycombinator.com/item?id=7894299 . It has words and phrases in various languages, bits of HTML and JavaScript, and miscellaneous wacky stuff--looks like the kind of thing you could generate from a Web crawl, say.


Is there anything out there that lets you have "pluggable" dictionaries? I have a client-server communication that often sends 90% of the same words. I'd like to build this dictionary on the server and ship it down to the clients.

I prototyped something myself, and SDCH for the browser was the closest I could find.
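Outside the browser you can already get a crude version of a pluggable dictionary with zlib's preset-dictionary feature. A minimal sketch (the dictionary contents here are made up for illustration, and both sides must ship the exact same bytes):

    import zlib

    # Hypothetical shared dictionary: the most common substrings your protocol sends.
    SHARED_DICT = b'{"status":"ok","user_id":,"timestamp":,"error":null'

    def pack(payload: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=SHARED_DICT)
        return c.compress(payload) + c.flush()

    def unpack(blob: bytes) -> bytes:
        d = zlib.decompressobj(zdict=SHARED_DICT)
        return d.decompress(blob) + d.flush()

    msg = b'{"status":"ok","user_id":42,"timestamp":1442900000,"error":null}'
    assert unpack(pack(msg)) == msg
    print(len(msg), "->", len(pack(msg)))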


Additional pointers:

– a more detailed post about Brotli: http://textslashplain.com/2015/09/10/brotli/

– an MSDN article on “compressing the Web”, explaining the difference between deflate and gzip, the limitations of browsers, and mentioning zopfli and brotli: http://blogs.msdn.com/b/ieinternals/archive/2014/10/21/http-...


This one goes to 11!

Seriously, the Brotli quality setting goes to eleven -- pure genius.

http://www.gstatic.com/b/brotlidocs/brotli-2015-09-22.pdf (Fig. 1)


For those not in the know:

https://www.youtube.com/watch?v=4xgx4k83zzc :-)


Since the recent innovation of middle-out compression there's been a lot of activity in this space.


pigz (parallel gzip) also has a -11 option


The pigz -11 option uses zopfli which is the algorithm referenced in the original article.


Currently being implemented in Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=366559


Note that this bug originally seems to have been called something like "Implement LZMA compression", and the early comments are about that; it's some way in that someone says "oh, and Google have this new Brotli thing which might be even better".


Mozilla can be quite slow with certain features; for example work on U2F support hasn't even begun one year after a ticket was made (https://bugzilla.mozilla.org/show_bug.cgi?id=1065729). I'm not exactly holding my breath.


They've already tried to land the code to add brotli support but had to back it out because of an Android build failure. They're moving pretty quick on this. It helps that they already had brotli for WOFF so there is less review required.


Note that this will only be supported over https, due to broken proxies:

https://bugzilla.mozilla.org/show_bug.cgi?id=366559#c92


I would be in favor of any new technology for the web requiring TLS. It provides both a carrot (of new features) and a stick (of falling behind competitors) for people to get off their asses and secure the web from a whole host of attacks.

Even if your page doesn't have sensitive information on it, an insecurely loaded page provides an attacker the avenue to inject potentially malicious code. This will be the case until the entire web is HTTPS-enabled.


Aren't there any security risks when using this over HTTPS, considering past attacks like BREACH and CRIME?


CRIME: TLS compression can reveal private headers, like auth cookies. Fixed by turning off TLS compression. Not applicable to HTTP because HTTP never had header compression.

BREACH: Response-body compression can reveal a secret when (a) something in the page is attacker controlled, (b) something private and unchanging is in the body, and (c) the response length is visible to an attacker. Doesn't require HTTPS.

If an attack applied, it would be one like BREACH. Which isn't surprising: this is a direct replacement for "Accept-Encoding: gzip / Content-Encoding: gzip" and so we should expect it to be in the same security situation.
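A minimal sketch of that length side channel, using zlib with a made-up secret and page template (nothing Brotli-specific; any LZ-style compressor behaves the same way):

    import zlib

    SECRET = "csrf=1f8b08a2"  # hypothetical secret embedded in every response

    def observed_length(attacker_input: str) -> int:
        # The attacker only sees how large the compressed response is.
        body = "<input value='%s'> ... %s ..." % (attacker_input, SECRET)
        return len(zlib.compress(body.encode()))

    # A guess sharing a prefix with the secret typically compresses a bit better,
    # so its response is shorter; repeating this byte by byte leaks the secret.
    print(observed_length("csrf=1f8b"))   # usually shorter
    print(observed_length("qwzj=0x2p"))   # usually longer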


I think as long as you're not compressing secrets and attacker controlled data together you're fine.


Good incentive for people to use https IMO.


> Unlike other algorithms compared here, brotli includes a static dictionary. It contains 13’504 words or syllables of English, Spanish, Chinese, Hindi, Russian and Arabic, as well as common phrases used in machine readable languages, particularly HTML and JavaScript.

They could at least have measured the other algorithms with the same dictionary for fairness.


"We hope that this format will be supported by major browsers" - if I'm not mistaken it should also be implemented by all major webservers. In that regard I'm hoping Opera open sources or sells their OMPD proxy platform. Someone made a chrome extension which interfaces with their network and the speed is amazing. OMPD is a tradeoff and breaks certain javascript and css and is different from Opera Turbo, I am not sure what compression Turbo uses currently. Does anyone know if Brotli gets implemented in the google server and proxy compression? http://browsingthenet.blogspot.nl/2014/09/chrome-data-compre...


> I'm hoping Opera open sources or sells their OMPD proxy platform.

They won't open source it: the compression tech has for a number of years accounted for a large proportion of Opera's (browser) income. I also don't actually know what OMPD is. Opera Mini server? That's tied sufficiently closely to Presto that it'll only ever get released if Presto does.

The more recent Opera Turbo and the Opera Mini 11 for Android "high-compression mode" (though I have no idea how that really differs to Opera Turbo! [edit: per @brucel it is Opera Turbo]) are certainly available for licensing; Yandex Browser supports Opera Turbo, for example.


> Just like Zopfli, the new algorithm is named after Swiss bakery products. Brötli means ‘small bread’ in Swiss German.

Makes sense.


Disappointing they aren't dissolving the umlaut and instead stripping it. 'Broetil' would be equivalent to 'Brötil', yet they remove it (and thus change the sound) to create 'Brotil'.


Keeping that long human linguistic tradition of information loss. Pretty on point for a compression scheme.


Brilliant. I love it. I'll pretend they did that on purpose :)


Well, maybe not for lossless compression..


That was the joke.


This way, very much umlaut-capable, but not at all swiss-capable Germans won't be tempted to pronounce it in a terribly wrong and insulting imitation of some swiss dialect. Serious international tensions will be avoided by clearly separating the name from the language it is derived from. If your company motto was "don't be evil", you surely would not want your compression algorithms to cause naval standoffs on Lake Constance!

Also, simply dropping those dots where they actually belong is refreshingly post-metal-umlaut.


First I thought you made a typo, but then it was consistent: in Swiss German it is Brötli.


Oops, yes, L I not I L. I misread. My bad!


Muphry's Law strikes again.


They change the sound in transliterating it anyway. Would you prefer bro-tli or bro-ee-tli?


Er, no, the transliteration of "ö" to "oe" does not change the sound. This is considered a feature of the orthography of German, and you see this kind of thing in, e.g., crossword puzzles.


And even earlier the ö was transliterated from oe. The e was pulled on top of the o. The e was then changed to two dots. Probably because of laziness.

See: https://en.wikipedia.org/wiki/Germanic_umlaut#Orthography_an...


I meant transliteration to English. No English speaker will pronounce the o with the umlaut, so you have to pick how you want English speakers to pronounce your word from the two alternatives above. Bro-li sounds much better to me than bro-ee-li.


Except... 'oe' at the end of words is rare in English, but when we have it we pronounce it like 'o'

eg: hoe sloe (both rhyme with 'bro')

to get a 'bro-ee' effect you'd need a 'oey' or some other spelling


Are you saying you'd pronounce "broetli" as "brotli"? I think that puts you in a very small minority.


English speakers don't pronounce "oe" as a diphthong either.


Sometimes they do, sometimes they don't. It doesn't really have any consistent English pronunciation, since it appears in a bunch of different morphological contexts, and in loanwords from at least five or six different languages. Some examples: phoenix (1 syllable, 'ee'), Zoe (2 syllables, 'oh-ee'), Joe (1 syllable, 'oh'), Joel (either 1 or 2 syllables, depending on schwa insertion), canoe (1 syllable, 'oo'), Dostoevsky (2 syllables, 'oh-eh' or 'oh-yeh'), coed (2 syllables, 'oh-eh'), etc.

If I saw something like Broetli and didn't immediately recognize the Germanic origin, I could see myself easily misanalyzing it as Bro-etli and pronouncing it that way.


What is the pronunciation of Google's "brotli"?


It's a sensible name, considering the previous name, but there's still a part of me that's sad that they missed the opportunity to name it Nucleus.


Surely it's more like "breadlet"?


...and everyone knows breadlet compression > wavelet :)


I really like how big and influential companies are working on foundational technologies like compression, which smaller companies would have no chance of popularising. Earlier this summer Apple introduced lzfse. But this Brotli seems even more amazing, as its compression ratio seems to match that of lzma. Wonderful.


How does this do compressing trimesh binary data like this stream?

https://d3ijcvgxwtkjmf.cloudfront.net/a4c3c7313b7bdeb68ad46a...

Base: 6,779,000 bytes

GZip2 Normal: 2,296,362 bytes, Ultra: 2,258,967 bytes

LZMA2 Normal: 921,600 bytes, Ultra: 920,147 bytes


Default and "ultra" give the same number: 1,513,459 bytes.


I wonder how it compares to Apple’s new proprietary LZFSE codec. Here was Apple’s WWDC slide – http://forums.macrumors.com/attachments/lzfse_1-png.565004/ – but I don’t think they published full details anywhere.


If Google wants me to personally adopt this outside of woff2 (woff2's major change is a switch to Brotli over deflate), then they should submit a patch to nginx to add modules named brotli and brotli_static, and ship some command-line gzip-like tool that does Brotli.


For brotli_static, I think a header can be easily added for serving pre-compressed content. And a dynamic version may be too slow.

Zstd is better for dynamic compression. Or libslz today for very fast compression with zlib compatibility.
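To illustrate the brotli_static idea (serve a pre-compressed sibling file when the client advertises support), here's a rough Python sketch; the .br suffix, the br content-coding token and the port are assumptions for illustration, not something nginx ships today:

    import os
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class PrecompressedHandler(SimpleHTTPRequestHandler):
        def send_head(self):
            accept = self.headers.get("Accept-Encoding", "")
            tokens = [t.strip() for t in accept.split(",")]
            path = self.translate_path(self.path)
            # If the client accepts Brotli and a pre-compressed sibling exists,
            # serve that file with the matching Content-Encoding header.
            if "br" in tokens and os.path.isfile(path + ".br"):
                f = open(path + ".br", "rb")
                self.send_response(200)
                self.send_header("Content-Type", self.guess_type(path))
                self.send_header("Content-Encoding", "br")
                self.send_header("Content-Length", str(os.fstat(f.fileno()).st_size))
                self.end_headers()
                return f
            return super().send_head()

    HTTPServer(("", 8000), PrecompressedHandler).serve_forever()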


Brotli at compression level 1 is as fast as zstd.


So it should be "Broetli", then...?


Yes and no. It is very common to simply strip umlauts and dashes when converting to ASCII compatible writing.

I even do it with my name most of the times. A Chinese official that sees Ø in my passport won't understand why I write 'OE', and might even start questioning if it is the same name, but 'O' never fails.

Airlines can read the gibberish in the bottom of my passport and see that the transliteration is actually 'OE', but most places don't have a scanner for that.


Can't wait for Gipfeli



Ruebli.


Muesli


In the spec [1], they published the dictionary used in hexadecimal form. Can someone explain why there are so many 6's in the data?

For comparison here's a chart of hex character and approximate [2] number of occurrences.

(0 - 15k) (1 - 10k) (2 - 19k) (3 - 14k) (4 - 15k) (5 - 16k) (6 - 62k) (7 - 32k) (8 - 10k) (9 - 11k) (a - 11k) (b - 7k) (c - 8k) (d - 12k) (e - 20k) (f - 9k)

[1]: http://www.ietf.org/id/draft-alakuijala-brotli-05.txt [2]: counted with find-in-page, didn't bother to only search the dict


The 6s and 7s correspond to ASCII lowercase letters. Try decoding the hex strings as UTF-8...
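For example (an illustrative hex snippet, not necessarily taken from the actual dictionary):

    # 0x61-0x7a is ASCII 'a'-'z', so a dictionary full of lowercase words
    # shows up in hex as a wall of 6x and 7x bytes.
    chunk = "74696d65646f776e6c696665"
    print(bytes.fromhex(chunk).decode("utf-8"))  # -> timedownlife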


I don't like the name much so I'm calling it broccoli.


I have an actual question: who can name the decompressors that outperform Brotli in both compression ratio and decompression speed? As I see it, that's the ultimate goal: to boost I/O via transparent decompression.

Reading the PDF paper, I see they benchmark on a Xeon with a price tag in the range of $560; how can this be reconciled with hot phrases like boosting browsing on mobile devices?!


Here are some benchmarks and discussion of data compression specialists: http://encode.ru/threads/2313-Brotli?p=44970&viewfull=1#post...

"Input: 812,392,384 bytes, HTML top 10k Alexa crawled (8,998 with HTML response)

Output: 219,591,148 bytes, 1.799 sec., 1.355 sec., tor-small -1

218,018,012 bytes, 1.951 sec., 1.464 sec., tor-small -1 -b800mb

210,647,258 bytes, 1.736 sec., 1.328 sec., qpress64 -L1T1

210,194,996 bytes, 0.333 sec., 0.371 sec., lz4 -1

194,233,793 bytes, 2.818 sec., 1.348 sec., qpress64 -L2T1

187,766,706 bytes, 7.966 sec., 1.059 sec., qpress64 -L3T1

173,904,470 bytes, 2.995 sec., 2.721 sec., tor-small -2

173,418,150 bytes, 3.132 sec., 2.843 sec., tor-small -2 -b800mb

169,476,113 bytes, 2.072 sec., 0.352 sec., lz4 -9

165,571,040 bytes, 2.931 sec., 2.820 sec., NanoZip - f

158,213,503 bytes, 1.855 sec., 0.980 sec., zstd

154,673,082 bytes, 3.213 sec., 2.445 sec., tor-small -3

154,166,902 bytes, 3.364 sec., 2.555 sec., tor-small -3 -b800mb

152,477,067 bytes, 2.128 sec., 0.973 sec., lzturbo -30 -p1

152,477,067 bytes, 2.132 sec., 0.971 sec., lzturbo -30 -p1 -b4

150,773,269 bytes, 2.151 sec., 1.047 sec., lzturbo -30 -p1 -b16

150,219,553 bytes, 2.332 sec., 1.204 sec., lzturbo -30 -p1 -b800

149,670,044 bytes, 7.825 sec., 2.567 sec., WinRAR - 1

149,642,742 bytes, 2.069 sec., 0.646 sec., zhuff_beta -c0 -t1

145,770,266 bytes, 4.586 sec., 2.285 sec., tor-small -4

141,951,602 bytes, 4.484 sec., 2.360 sec., tor-small -4 -b800mb

141,215,050 bytes, 2.751 sec., 0.938 sec., lzturbo -31 -p1 -b4

140,657,806 bytes, 5.037 sec., 2.544 sec., FreeArc - 1

140,211,060 bytes, 6.970 sec., 1.775 sec., bro -q 1

138,103,483 bytes, 118.023 sec., 2.394 sec., cabarc -m LZX:15

138,051,401 bytes, 2.761 sec., 1.001 sec., lzturbo -31 -p1 -b16

137,564,310 bytes, 18.762 sec., 4.299 sec., NanoZip - dp

137,211,547 bytes, 7.808 sec., 1.712 sec., bro -q 2

137,000,208 bytes, 3.763 sec., 3.852 sec., NanoZip - F

136,523,335 bytes, 2.830 sec., 1.094 sec., lzturbo -31 -p1

136,445,854 bytes, 50.932 sec., 2.344 sec., lzhamtest_x64 -m0 -d24 -t0 -b

136,337,495 bytes, 14.823 sec., 4.259 sec., NanoZip - d

135,723,691 bytes, 8.318 sec., 1.677 sec., bro -q 3

135,656,476 bytes, 2.972 sec., 1.153 sec., lzturbo -31 -p1 -b800

135,315,388 bytes, 51.436 sec., 2.371 sec., lzhamtest_x64 -m0 -t0 -b

135,287,357 bytes, 51.937 sec., 2.418 sec., lzhamtest_x64 -m0 -d29 -t0 -b

135,071,576 bytes, 21.650 sec., 4.251 sec., NanoZip - dP

132,819,515 bytes, 20.102 sec., 6.278 sec., 7-Zip - 1

131,871,664 bytes, 14.052 sec., 0.899 sec., lzturbo -32 -p1 -b4

131,401,865 bytes, 9.917 sec., 1.677 sec., bro -q 4

129,184,341 bytes, 8.305 sec., 2.692 sec., tor-small -5

127,355,215 bytes, 20.866 sec., 5.825 sec., 7-Zip - 2

127,045,472 bytes, 9.549 sec., 0.957 sec., lzturbo -32 -p1 -b16

126,139,033 bytes, 8.025 sec., 2.751 sec., tor-small -5 -b800mb

125,732,647 bytes, 10.642 sec., 2.618 sec., tor-small -6

125,454,769 bytes, 140.513 sec., 2.281 sec., cabarc -m LZX:18

123,169,077 bytes, 8.472 sec., 1.090 sec., lzturbo -32 -p1

123,093,411 bytes, 22.468 sec., 5.508 sec., 7-Zip - 3

122,564,329 bytes, 10.074 sec., 2.680 sec., tor-small -6 -b800mb

122,480,456 bytes, 19.411 sec., 1.645 sec., bro -q 5

121,068,548 bytes, 14.536 sec., 3.289 sec., FreeArc - 2

120,653,755 bytes, 16.107 sec., 2.552 sec., tor-small -7

119,969,489 bytes, 27.663 sec., 1.602 sec., bro -q 6

119,740,393 bytes, 27.370 sec., 5.259 sec., 7-Zip - 4

118,343,545 bytes, 24.112 sec., 2.123 sec., WinRAR - 2

118,139,032 bytes, 35.361 sec., 4.371 sec., NanoZip - Dp

117,500,327 bytes, 15.517 sec., 2.594 sec., tor-small -7 -b800mb

117,388,039 bytes, 23.546 sec., 2.524 sec., tor-small -8

116,526,595 bytes, 37.847 sec., 4.383 sec., NanoZip - DP

116,454,906 bytes, 35.232 sec., 4.351 sec., NanoZip - D

116,269,246 bytes, 25.888 sec., 6.589 sec., FreeArc - 3

116,217,001 bytes, 40.748 sec., 1.630 sec., bro -q 7

115,993,125 bytes, 192.929 sec., 2.199 sec., cabarc -m LZX:21

115,985,847 bytes, 386.192 sec., 0.850 sec., lzturbo -39 -p1 -b4

115,729,606 bytes, 35.504 sec., 2.095 sec., WinRAR - 3

115,163,486 bytes, 55.523 sec., 1.614 sec., bro -q 8

115,022,074 bytes, 49.863 sec., 2.084 sec., WinRAR - 5

114,602,026 bytes, 8.403 sec., 1.218 sec., lzturbo -32 -p1 -b800

114,345,025 bytes, 78.418 sec., 1.594 sec., bro -q 9

114,281,170 bytes, 22.925 sec., 2.575 sec., tor-small -8 -b800mb

113,354,128 bytes, 29.519 sec., 2.474 sec., tor-small -9

112,376,531 bytes, 177.077 sec., 1.923 sec., lzhamtest_x64 -m1 -d24 -t0 -b

111,848,802 bytes, 29.046 sec., 2.515 sec., tor-small -9 -b800mb

110,532,234 bytes, 40.580 sec., 2.496 sec., tor-small -10

110,177,215 bytes, 54.632 sec., 6.398 sec., FreeArc - 4

109,908,468 bytes, 40.292 sec., 2.501 sec., tor-small -10 -b800mb

109,522,695 bytes, 208.748 sec., 1.898 sec., lzhamtest_x64 -m2 -d24 -t0 -b

109,425,530 bytes, 436.824 sec., 0.893 sec., lzturbo -39 -p1 -b16

108,520,934 bytes, 58.227 sec., 2.521 sec., tor-small -11

108,520,934 bytes, 58.329 sec., 2.518 sec., tor-small -11 -b800mb

107,850,398 bytes, 266.166 sec., 2.562 sec., tor-small -12

107,842,909 bytes, 267.559 sec., 2.550 sec., tor-small -12 -b800mb

106,128,420 bytes, 271.607 sec., 1.850 sec., lzhamtest_x64 -m3 -d24 -t0 -b

105,933,030 bytes, 571.280 sec., 5.168 sec., lzturbo -49 -p1 -b4

105,692,791 bytes, 193.962 sec., 1.919 sec., lzhamtest_x64 -m1 -t0 -b

104,539,771 bytes, 316.307 sec., 1.833 sec., lzhamtest_x64 -m4 -d24 -t0 -b

104,094,380 bytes, 2313.780 sec., 1.693 sec., bro -q 10

104,053,219 bytes, 195.503 sec., 1.977 sec., lzhamtest_x64 -m1 -d29 -t0 -b

102,895,078 bytes, 148.997 sec., 4.850 sec., 7-Zip - 5

101,941,653 bytes, 237.364 sec., 1.889 sec., lzhamtest_x64 -m2 -t0 -b

100,898,120 bytes, 534.627 sec., 1.097 sec., lzturbo -39 -p1

100,159,922 bytes, 239.813 sec., 1.933 sec., lzhamtest_x64 -m2 -d29 -t0 -b

99,699,129 bytes, 625.001 sec., 4.902 sec., lzturbo -49 -p1 -b16

96,239,572 bytes, 347.011 sec., 1.893 sec., lzhamtest_x64 -m3 -t0 -b

95,197,295 bytes, 236.139 sec., 4.587 sec., 7-Zip - 9

94,133,011 bytes, 356.440 sec., 1.933 sec., lzhamtest_x64 -m3 -d29 -t0 -b

93,601,386 bytes, 431.884 sec., 1.899 sec., lzhamtest_x64 -m4 -t0 -b

92,303,359 bytes, 727.475 sec., 4.729 sec., lzturbo -49 -p1

91,310,894 bytes, 449.102 sec., 1.923 sec., lzhamtest_x64 -m4 -d29 -t0 -b

90,239,627 bytes, 680.976 sec., 1.170 sec., lzturbo -39 -p1 -b800

87,715,022 bytes, 314.169 sec., 5.428 sec., FreeArc - 9

82,891,405 bytes, 882.513 sec., 4.597 sec., lzturbo -49 -p1 -b800

77,286,010 bytes, 6497.059 sec., 7.715 sec., glza

Used: 7z 15.07 beta - Sep 17, 2015 (one thread)

rar 5.40 beta 4 - Sep 21, 2015 (one thread)

arc 0.67 - Mar 15, 2014 (one thread)

nz 0.09 - Nov 4, 2011 (one thread)

zhuff_beta 0.99 - Aug 11, 2014

cabarc 6.2.9200.16521 - Feb 23, 2013

lz4 1.4 - Sep 17, 2013

qpress64 1.1 - Sep 23, 2010

zstd 0.0.1 - Jan 25, 2015

tor-small 0.4a - Jun 2, 2008

lzturbo 1.2 - Aug 11, 2014

lzhamtest_x64 1.x dev - Sept 25, 2015 (own VS2015 compile)

glza 0.3a - Jul 15, 2015 "

lzturbo seems essentially better, and a few others also happen to be better.


Internet Scenario Benchmark (browser plugin simulation):

html8 : 100MB random html pages from a 2GB Alexa Top sites corpus. number of pages = 1178 average length = 84886 bytes.

The pages (length + content) are concatenated into a single html8 file, but compressed/decompressed separately. This avoids the cache scenario like in other benchmarks, where small files are processed repeatedly in the L1/L2 cache, showing unrealistic results.

size: 100,000,000 bytes. Single thread in memory benchmark cpu: Sandy bridge i7-2600k at 4.2 Ghz, all with gcc 5.1, ubuntu 15.04

      size  ratio%   C MB/s       D MB/s     MB=1.000.000
  15180334    15.2     0.43       482.07    brotli 11 v0.2.0
  15309122    15.3     2.27       127.23    lzma 9  v15.08
  16541706    16.5     2.07      1463.39    lzturbo 39  v1.3
  16921859    16.9     2.96       230.54    lzham 4  v1.0
  17153795    17.2     0.13       474.63    zopfli  v15-05
  17860382    17.9    43.51       495.78    zlib 9  v1.2.8
  18033576    18.0   135.62      1454.31    lzturbo 32  v1.3
 100000000   100.0  5984.00      6043.00    libc memcpy
LzTurbo compresses 5 times and decompresses 3 times faster than brotli.

LzTurbo decompresses more than 6 times faster than lzham


Thanks, yes, a very good roster; I am aware of these performers. Maybe some hidden excellent tight/fast ones exist out there?!

In my view the two best are:

169,476,113 bytes, 2.072 sec., 0.352 sec., lz4 -9

115,985,847 bytes, 386.192 sec., 0.850 sec., lzturbo -39 -p1 -b4

LzTurbo benefits a lot from bigger blocks; in the above example it uses a 4MB block single-threaded, and look how fast it is. If 16 threads were used, what would the outcome be ... maybe 0.100 sec?!


Let's look at the most interesting there:

fastest encoding:

210,194,996 bytes, 0.333 sec., 0.371 sec., lz4 -1

fastest decoding:

169,476,113 bytes, 2.072 sec., 0.352 sec., lz4 -9

149,642,742 bytes, 2.069 sec., 0.646 sec., zhuff_beta -c0 -t1

gzip-alternative:

135,723,691 bytes, 8.318 sec., 1.677 sec., bro -q 3

114,602,026 bytes, 8.403 sec., 1.218 sec., lzturbo -32 -p1 -b800

prepacked:

104,094,380 bytes, 2313.780 sec., 1.693 sec., bro -q 10

90,239,627 bytes, 680.976 sec., 1.170 sec., lzturbo -39 -p1 -b800

best compression:

82,891,405 bytes, 882.513 sec., 4.597 sec., lzturbo -49 -p1 -b800

77,286,010 bytes, 6497.059 sec., 7.715 sec., glza

e.g. density and zpaq are missing, and the comments there suggest that brotli doesn't look that good for data other than text ...


You miss the point: your list is meaningful only for general compression cases, while the Brotli thread (where we are) is all about boosting textual decompression while using little in the way of resources (RAM mostly), as in the case of web browsers.

As one of the co-authors (Jyrki) commented: ``` For more clarity on the situation, you could compare LZMA, LZHAM and brotli at the same decoding memory use. Possibly values between 1-4 MB (window size 20-22) are the most relevant for the HTTP content encoding. Unlike in a compression benchmark, there are a lot of other things going on in a browser, and the allocated memory at decoding time is a critically scarce resource. ``` Source: http://google-opensource.blogspot.bg/2015/09/introducing-bro...


Let's dramatize the next 2 scenarios:

- 10Mbps or 1MB/s connection;

- 100Mbps or 10MB/s connection.

The goal is to receive those 812,392,384 bytes in our browser as quickly as possible; in the first case the winner is the 77,286,010-byte compressor, in the second the winner is the 90,239,627-byte compressor, yes?

In the first case transfer_time + decompression_time = 77s + 7s = 84s

In the second case transfer_time + decompression_time = 9s + 1s = 10s

Now, you see that even in the web browsing scenario the best performer is not established, right?


I'm just wondering what it's like for a data compression expert, whether working on algorithms or coding. It must be really wonderful and energising to see such interest and momentum in one's field.

I wonder if anyone working in the field would have a comment on that.


It's strange to see an "introducing" post for an algorithm that's already included in a deployed technology (WOFF2), especially one that doesn't point that out.


Call me a noob but I can't figure out how to build it...

https://github.com/google/brotli


$ git clone https://github.com/google/brotli.git

$ cd brotli

python extension:

$ python setup.py build

static lib:

$ cd enc

$ make

$ ar rvs brotli.a *.o
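Once the Python extension is built, a quick round-trip check might look like this (a sketch only; the exact defaults and keyword names have varied between versions of the bindings):

    import brotli

    data = open("index.html", "rb").read()   # any test file
    packed = brotli.compress(data)           # highest quality by default, I believe
    assert brotli.decompress(packed) == data
    print(len(data), "->", len(packed), "bytes")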


cd tools

make

Worked for me on Ubuntu 14.04. Not sure what packages are necessary. YMMV.


mhm, `cd tools; make` worked on OS X 10.10.5.


I wonder if the CRIME attack would still work under this compression algorithm


The fundamentals of the CRIME attack work with any compression algorithm. That's why HTTP/2 doesn't run a general-purpose compressor over headers and instead uses a delta/indexing scheme (HPACK) designed to avoid that leak.


There's a fantastic site with benchmarks for almost all compressors ever: https://quixdb.github.io/squash-benchmark/

Strangely enough, no one compressor is better in all situations. :)


It doesn't look like Google's paper or the results on this site consider CloudFlare's zlib fork:

https://github.com/cloudflare/zlib

Previously discussed here:

https://news.ycombinator.com/item?id=9857784

And a recent performance comparison with Intel's patches:

https://www.snellman.net/blog/archive/2014-08-04-comparison-...


Thanks, this is tremendous. I was doing my own comparisons recently, but without a nice aggregated set of benchmarks to check against.


Such a compressor can not exist.


[flagged]


No, it's Hooli beating them to the punch.


What is that Pied Piper thing I constantly see referenced somewhere? Some SF or startup meme?


It's from a TV series called Silicon Valley.

https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)


What's the point of mentioning a fictional compression startup as a comment on the algorithm made by Google?

To tell the world that he also watched the series?


>What's the point of mentioning a fictional compression startup as a comment on the algorithm made by Google?

A joke, or a pop culture reference. Stating a shared experience that others may also glean some enjoyment out of, or to create the illusion of a shared community/experience to help maintain the illusion that we are not just meaningless souls adrift in an indifferent, eternal void, whose incomprehensible size only reinforces the meaninglessness of our existence.

But like I said. A joke.


It's a kind of spam to me, as I still, even after your explanation, don't see it as any more related than spam is to eggs and bacon. So please, user csimy, "could you do the egg bacon spam and sausage without the spam then?" (1)

1) A reference to Monty Python, which in this context demonstrates the relativity of the "shared experience." Even though it gave its name to electronic spam!


>It's a kind of spam for me

Then flag/downvote it. Commenting about how unproductive an unproductive discussion is just further propagates the original unproductive discussion [1].

[1] Reference (read parents): https://news.ycombinator.com/edit?id=10258024


You must be fun at parties.


On one hand I understand it; pop culture references have been a thing for ages and ages. On the other hand, I don't know, because many take it to an extreme (see: reddit).


Yes, exactly. It's the sort of thing that tends to get downvoted here.


References to a shared culture are the universal language of camaraderie.

http://tailsteak.com/archive.php?num=29


Considering that Google's biggest press release of the year included a reference to that particular fictional world[1], and that we're talking about compression technology, I would say it's appropriate.

[1] http://bgr.com/2015/08/11/alphabet-easter-egg-google-hooli/


Being German I thought you meant the Pied Piper of Hamelin (https://en.wikipedia.org/wiki/Pied_Piper_of_Hamelin).


Any mention of "Pied Piper" in English is almost always a reference to that story.


I was not the one who wrote 'Pied Piper'. But I've seen the show, and I believe that is the one they named themselves after.




It's the name of the compression start-up in the show Silicon Valley.


Does it compress data middle-out?


> The higher data density is achieved by a 2nd order context modeling, re-use of entropy codes, larger memory window of past data and joint distribution codes.

They still haven't cracked middle out...


Russ Hanneman says: "ROI guys, ROI..."


[flagged]


Oh god, nobody can talk about compression any more without pop culture references. Can we stop it, please? We already know people here watch the show.


But it's actually, surprisingly, arguably, a valuable metric which, for whatever reason, was never done "in the real world" before, until the show requested it from actual researchers:

http://spectrum.ieee.org/view-from-the-valley/computing/soft...

I don't actually watch the show and this is the first I've heard of this "score", that article was surprising to me.


It's a terrible metric that fails basic dimensional analysis. Anytime you see a ratio of logs, you should already be suspicious. Say algorithm A is twice as fast as algorithm B. Then we'll have alpha * (r/R) * log(T)/log(t). The alpha * (r/R) factor is constant, so the only interesting part is log(T)/log(t) = log(2t)/log(t) = log(2)/log(t) + 1. We can make this apparently unitless number take any value we want just by changing the unit used to measure time (or if a unit were fixed, by changing the amount of data used to test).

It's sort of vaguely superficially sensible, but the idea of being able to reduce comparison of compression algorithms (which fundamentally trade off speed for compression ratio) to a 1-dimensional value is laughable. Charles Bloom's Pareto frontier charts (http://cbloomrants.blogspot.com/2015/03/03-02-15-oodle-lz-pa...) are one of the more reasonable options.


The metric divides compression ratio by log compression time. (Ignore for now the normalization to a standard compressor.) This is:

    r / log(T)
That doesn't seem like a good metric, because it overvalues changes in T. For example, say we currently can manage 10% compression (ratio = 100/90 = 1.11) and it takes us 16ms. That's

    r / logT = 1.11 / log(16) = 0.2775
Now we have two proposals. One brings us from 10% compression to 55% compression (ratio = 100/45 = 2.22), while the other one drops compression time to 4ms:

    r / logT = 2.22 / log(16) = 0.555
    r / logT = 1.11 / log( 4) = 0.555
But improving compression by 5x matters more than improving speed by 4x. Take this to the extreme: a compressor that exits immediately leaving its input unchanged has the best possible score here, despite being useless.


Sure, but "lol what's the weissman score HASHTAG SILICONVALLEYTHESHOW \m/" doesn't lead me to think that the poster was legitimately asking for the score so they could compare.


Yeah, it was a pretty painful post, but I would want to see the score. Its performance over gzip would be a more intuitive reference point than the '26% over Zopfli.'


"At Google, we think that internet users’ time is valuable, and that they shouldn’t have to wait long for a web page to load."

Says the company whose business model consists of convincing publishers to put ads on their websites, thus slowing down page load times. The same company that lets you integrate web elements like scripts and web fonts from their servers, once again making your webpage slower to load.

I know I'm exaggerating a bit, but I really hate these company-mission pitches that people in big companies constantly use to open their technical communications.


"At Walmart, we think that users' money is valuable, and that they shouldn't have to pay extra for their toilet paper."

Does the fact that Walmart charges money for their products make the above sentence disingenuous? Of course Walmart charges money, that's how their company works. That doesn't mean they can't try to reduce waste or inefficiency in their system.


Another thing too - I've found that other advertising platforms' ads are much more obnoxious. Google does a pretty decent job at keeping their ads from popping up in your face with technicolor, which I can appreciate.


You are exaggerating, and your rant is not fair or relevant. They are honest with that sentence, and this is potentially a big contribution to the software world and the Internet.


I wasn't diminishing the technology's importance in any way. My point was actually the opposite: why spoil a good communication about a technological breakthrough with such lazy and fake-sounding company propaganda?


Well, considering the discussion it seems to have sparked, I would say it's not that irrelevant either.


"At Google, we think that internet users’ time is valuable, and that they shouldn’t have to wait long for a web page to load."

There is plenty of evidence of a significant drop in user numbers with increasing page load time.

An advertising company has to ensure that their ads are not only displayed but that there is also a reasonable chance that they will be noticed. Otherwise it becomes a pointless exercise, advertisers leave and revenues drop.

Therefore what is really going on here is Google trying to squeeze in more advertising before annoying the users to the point of leaving the page.

Yes, I agree, advertising in itself is inimical to users' time, and thus the above sentence can be seen as deeply hypocritical.


I dunno. Ads on Google search don't delay your results in any noticeable way. Wouldn't it be great if all ads were like that, given that ads are currently the only viable option for a wide variety of sites to stay in business?


The answer is a resounding yes, but I was just criticising the hypocritical nature of such a sentence, which is totally gratuitous in a post made to communicate a technological advancement. That sounds as fake as corporate bullshit could be.


Why would anyone trust this kind of corporate BS? It baffles me that you even try to take what they say at face value.

For me, it's akin to state propaganda in non-democratic countries: everyone knows that whatever authorities say and what the truth is are two different things, so there's little point to even analysing the official message (except maybe for humor factor etc.)


Which part of the sentence is not true?


None of it?

If they thought my time was valuable, they would not put colored strips at the top of every screen that say "Switch to chrome" or "Switch to GMail" or even after I've switched to Chrome, "Chrome is not your default browser." If they thought my time was valuable, they wouldn't be wasting it with their constant desperate pleas to browse the internet in precisely the way they want. If they really valued my time, they wouldn't throw random context switches into every single one of their web properties for whatever is the product du jour.

If they wanted my page to load faster, they would work as hard on making their stupid website addons load fast as they worked on getting everyone on earth to install them. Running local mirrors of ajax.googleapis.com and fonts.googleapis.com, along with hijacking analytics and doubleclick and returning 0-byte files, is the best thing I ever did for my poor parents' satellite internet connection.

Internet users' time (and site loading speeds) are very much second- or third-class items on google's list of things to give a shit about. I'm absolutely not questioning these decisions; google is an advertising agency and must prioritize this over my convenience! But pretending like they're some kind of altruistic charity working for the common good is disingenuous and slightly offensive.

It also creates the problem where some of the more gullible people actually believe google, a huge faceless global company, gives a shit about anyone in particular, which is demonstrably untrue.


eh, between Chrome DevTools, PageSpeed Insights, SERP hits based on load time, SPDY/HTTP2, WebM, new compression algorithms, etc etc I would say Google has done more than a little to help reduce wait times for web pages.


Some Google pages now have load times upwards of ten seconds on Firefox; this, to me, is what makes this statement so hilarious.


I'm sure you are just about to produce a list of these pages.


Google+ on Firefox is a great example. If you use Firefox, it'll actually cause Firefox to seize and lock up for several seconds while it loads.


GMail is the first website I ever saw that had a 'loading' progress bar.


Gmail has a loading bar because it has a massive storage backend, not because it's sending markup to your browser. Comparable products like HN favorite FastMail also have loading progress bars.


Yes actually the goal of all companies is to make profit, as much as possible. I know, it annoys me too, for example Apple says their mission is to "leave the world better than how we found it", but I'm skeptical about that ;-)


Your comment reminded me of an old Turkish saying: "Looks like your intention is not to eat the grape but to beat the vineyard keeper."


Sorry but I don't get it :D


It can be used in several situations, one could be:

The acts of a person imply that person's motivation is not to use a product or to comment on the qualities of a product, but to bash the producer/creator using the context of the product. So the ulterior motive is to punish the creator; the actual product is only a medium. (grape: product - vineyard keeper: producer)

Sorry it sounded a bit harsh. I also agree that a dry and more technical blog entry would be better, but I think it is not a big deal considering the importance of the product.


Obviously the reality is to serve ads more efficiently. But is it bad if it helps users in the end as well?


Can't help but wonder if they stole this from some anxious college dropout who left their company to form their own compression startup.

;)


"We hope that this format will be supported by major browsers in the near future, as the smaller compressed size would give additional benefits to mobile users, such as lower data transfer fees and reduced battery use."

Is this the same story as every time Google invents something, implements it in Chrome and then gains an advantage, because it treats it as a standard and other browsers haven't implemented it yet?


No, this was first proposed for inclusion in Firefox two years ago: https://groups.google.com/forum/#!topic/mozilla.dev.platform...

This wasn't developed and deployed in secret. According to the comments in https://bugzilla.mozilla.org/show_bug.cgi?id=366559, the GitHub repository has been public since at least November 2014.

You may not like Google, but insinuating evil plans every time they do something cool isn't helping anyone.


Also, it's used in W3C Draft Standard for WOFF2 fonts:

http://www.w3.org/TR/WOFF2/


There doesn't seem to be any technical downside to implementing it and likely significant upside. It's Apache-licensed, which grants both copyright and patent rights that cover the code. The spec is published as an Internet Draft, scheduled to become an Informational RFC. What's not to like? I'm not one to drink the Google kool-aid, but seriously, what else would you like Google to do?


Firefox's Brotli support is already done. Chrome's isn't.

By the way, new features are generally created this way. They are added to browsers way before they are standardized. You see, convincing the other browser vendors isn't easy. You need very compelling arguments.

Take WebP, for example. It provides massive benefits. Especially if you can use lossy RGBA instead of PNG32 (easily 80% smaller). And yet, Mozilla shows little interest in implementing it.



