Updating the Git protocol for SHA-256 (lwn.net)
217 points by chmaynard on June 20, 2020 | 146 comments



One of the comments mentioned that there was a suggestion (presumably rejected) to "rotate" the first character of the hex string for the SHA256 hashes by 16 characters, so 0 becomes g, 1 becomes h, etc. (that way the SHA256 hashes would be unambiguously not SHA1 hashes, even when abbreviated).

This made me think... why are we using long, unwieldy base-16 hex strings at all? Why not use an alphabetic (non-numeric) base-46 string: 20 lowercase letters ([g-z]), 26 capital letters ([A-Z])? Then the new SHA256 hash strings end up being shorter than the old SHA1 strings, and there is no overlap with the [0-9a-f] range of the base-16 strings.

If you wanted to even it out to 64 characters, you could create a "modified base-64" that doesn't use [0-9a-f] and instead uses more special characters (though for convenience you'd want to choose characters that are shell-safe and possibly even URL-safe, which might make this not work). Alternatively you could use a subset for a base-32 representation.

The downside -- perhaps a significant one? -- is that you can't use standard tools like `sha256sum` or the representation conversion functions in the stdlib of many languages to generate these hashes; it would require custom code. Not sure if that's a concern, though.


> This made me think... why are we using long, unwieldy base-16 hex strings at all? Why not use an alphabetic (non-numeric) base-46 string: 20 lowercase letters ([g-z]), 26 capital letters ([A-Z])? Then the new SHA256 hash strings end up being shorter than the old SHA1 strings, and there is no overlap with the [0-9a-f] range of the base-16 strings.

Are you sure? SHA1 hashes are 40 hex digits, and SHA256 hashes are 64 hex digits. But in base 46, 2^256-1 = 3ZS4A7V5Ki0LWg1f3Of06YNfgQXCA2P0Q6RACKhEIWQXe07, which is 47 base-46 digits long, so still longer than SHA1. (This is not your base 46, since it's using 0..9, A..Z, a..j.)

  import gmpy2
  gmpy2.digits(2**256-1, 46)
    -> '3ZS4A7V5Ki0LWg1f3Of06YNfgQXCA2P0Q6RACKhEIWQXe07'
  gmpy2.digits(gmpy2.mpz('f'*64, 16), 46) #same, but more clearly the "maximum" hash
edit: more code to try kelnos's proposed digits:

  gmpy_digits = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
  new_digits = 'ghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
  ''.join(new_digits[gmpy_digits.index(d)] for d in gmpy2.digits(gmpy2.mpz('f'*64, 16), 46))
    -> 'jPIkqnLlAYgBMWhVjEVgmODVWGNsqiFgGmHqsAXuyMGNUgn'
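
For what it's worth, the same mapping can be reproduced without gmpy2; a rough sketch in plain Python:

  new_digits = 'ghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
  def to_base46(n):
      out = ''
      while n:
          n, r = divmod(n, 46)
          out = new_digits[r] + out
      return out or new_digits[0]
  to_base46(int('f' * 64, 16))  # same string as the gmpy2 result above
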
I think hex strings are probably still better, as there's less ambiguity, and 47 characters isn't much shorter than 64 for practical purposes.


One of the mistakes Git made was that the hashes don't describe what algorithm was used to generate them. That makes backward compatibility and incremental upgrades harder. MultiHash is a solution for such issues: https://github.com/multiformats/multihash
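
For the curious, a minimal sketch of the multihash idea in Python (the leading bytes name the hash function and digest length; 0x12 is the registered code for SHA-256 and 0x20 its 32-byte length, assuming single-byte varints):

  import hashlib
  digest = hashlib.sha256(b'hello world').digest()
  multihash = bytes([0x12, len(digest)]) + digest  # 0x12 = sha2-256, 0x20 = 32 bytes
  print(multihash.hex())  # prints '1220...', so the hash names its own algorithm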


You don't need a specialised solution for this; just prefix v1 of your hashing algorithm with some constant, and then you can easily extend it.


Right. To get 256 bits down to 40 digits, to match the length of the current hashes, you need to use base 85. To get it shorter you need base 95. There are 94 non-space printing characters in ASCII, and I doubt anyone wants to allow spaces, so that's probably out.


Yes, good point, my mental math does not check out.

Not sure what you mean by there being "less ambiguity" with hex strings though. Less ambiguous than what, and how?


If the current hashes are case insensitive (I have never seen a parser that wouldn't accept both abad1dea and ABAD1DEA, and doing so would be a bad idea imho), then the uppercase A-F would overlap with the SHA-1 hashes.

A “g” prefix and then either hex or base35 or something similar should do it. It's pretty annoying without a prefix, since you can't tell whether a numeric string such as a date actually indicates a hash.


>>> This made me think... why are we using long, unwieldy base-16 hex strings at all?

Surprised you got ten answers to this comment and nobody was able to explain.

Hexadecimal encoding is widely used because computers work with bytes (8 bits). A hexadecimal character encodes 4 bits, so every byte is exactly two hex characters.

It's trivial to encode (even by hand) and it has no padding issues. It can truly transit through anything that accepts alphanumeric characters. It's awesome.

edit: The most common alternative is base64, which encodes blocks of 3 bytes to 4 characters. Base64 has a bazillion issues due to padding and the special characters +/=. If I had a dollar for every system I've seen using broken base64, I could buy a car. Saving a bit of space (hex is 200% of the binary size vs. ~134% for base64) is almost never worth an order of magnitude more complexity. Do not invent your own encoding algorithm.
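
To make the size trade-off concrete, a quick sketch using only the Python standard library:

  import base64, hashlib
  digest = hashlib.sha256(b'example').digest()  # 32 raw bytes
  print(digest.hex())                           # 64 chars: 200% of the binary size
  print(base64.b64encode(digest).decode())      # 44 chars incl. '=' padding: ~137% here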


In this specific context though, nobody is going to be manually calculating the binary hash of the source tree and encoding it, so this advantage doesn't seem very relevant.

The proposed solution would still be alphanumeric, so it would share that advantage.


The problem isn't with the algorithm encoding it. The problem is with other software mangling it.

Base 64 has punctuation, it's case sensitive, and trimming a = off the end will (counterintuitively) result in a longer payload. All of this makes screwing around with commit hashes more complicated.


> Hexadecimal encoding is widely used because computers work with bytes (8 bits). An hexadecimal character is 4 bits, so every byte is two hex characters.

Yes, I'm aware (that electrical & computer engineering degree I have is occasionally good for something ;) ).

But the dominating factor in hashing a bunch of data is the hash function itself, not converting it into something displayable. Converting between binary and hex vs. binary and some other base string is optimizing the wrong part of the process.

And who is converting SHA1 hashes between hex and binary by hand? That doesn't seem like a thing people do.


I think you've kind of missed the point, which is that a larger base would mean shorter hash strings.

Using a small base like 16 means the hashes are very long, and awkward to use in full. Using a larger base will make the hash shorter when it's in text form. A text representation of a 256 bit hash in base 16 is 64 characters long, but in base 46 it is only 47 characters long. Using the whole alphabet in lower and upper case, and the digits, would produce base 62, where a 256 bit hash would be 43 digits long.

Still a bit big and unwieldy IMHO, so sticking with hexadecimal seems sensible given its other advantages.
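
For reference, the digit counts are easy to check: an alphabet of b symbols carries log2(b) bits per character, so a 256-bit hash needs ceil(256 / log2(b)) characters. A quick sketch:

  import math
  for base in (16, 32, 46, 62, 64, 85):
      print(base, math.ceil(256 / math.log2(base)))
  # 16 -> 64, 32 -> 52, 46 -> 47, 62 -> 43, 64 -> 43, 85 -> 40 characters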


Case is really annoying to deal with. The string form is for humans to read and type, among other things.

So that leaves you somewhere around base 32, which is fine but isn't a huge benefit over hex, and hex has the benefit of being standard.


> Case is really annoying to deal with. The string form is for humans to read and type, among other things.

If anyone is frequently typing out SHA1 hex strings by hand, that would be news to me.


Good idea!

Your downside is a major concern, but it would take all of 5 minutes to patch sha256sum to have a new output format, and another 3 months to have a PR approved. Ditto for most libraries, except where written into standards (and I'm not sure if that's anywhere). I'd expect Python 3.9 to not be a problem either. So it's not a big deal to address the concern.

If you want to do this well, though, the trick would be to come up with a human-friendly format. You can't use both '0' and 'O'. You can't use '1', 'l', and 'I'. Invest the time to do HCI work.

And, ideally, you'd have some form of internal check-summing, so you can easily flag most typos.
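
A sketch of what such a check might look like: treat the digest as a big integer and append one check character computed modulo a small prime (37 here), similar in spirit to the optional check symbol in Crockford's Base32. The alphabet below is purely illustrative:

  # Hypothetical 37-symbol check alphabet; only the mod-37 idea matters here.
  CHECK_SYMBOLS = '0123456789abcdefghjkmnpqrstvwxyz*~$=u'
  def with_check(digest_bytes, encoded):
      value = int.from_bytes(digest_bytes, 'big')
      return encoded + CHECK_SYMBOLS[value % 37]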

Performance is NOT a concern. Some folks here mentioned that some encoding formats are simple table lookups. In most cases, generating or verifying a cryptographically-secure hash is many orders of magnitude slower than an encode/decode of the key into any format. And if you're not generating/verifying, in most cases, it doesn't make any sense to encode/decode (just pass around the encoded key).


The human-friendly format you're suggesting is Douglas Crockford's Base32 encoding.


I disagree that his encoding is friendly.

I understand the compromises he made in order to make it less ambiguous, but regular users come across hashes sometimes, and unless they know it's his encoding, they're (often) smart enough to know that some characters look like others. So, when they go to recite the encoded value, they're still going to struggle with "is that a 0 or an O?" etc.

We may know it doesn't make a difference, but someone who doesn't know it's explicitly his encoding (and the parameters he transposes) is still going to get hung up.

Lastly, the removal of "u" only prevents 1.5 offensive words from showing up, but I guess he chose it because he had to get rid of some letter to make it base32.

I think a base ~32 by removing all ambiguous characters and adding a non-separator punctuation mark would be ideal, frankly.

If we're worried about swear words then also remove the vowels and add 4 more punctuation marks.

IMHO


That's a !@#$ good idea for avoiding swear words :)


The larger-alphabet encodings are unwieldy to type/read out loud. Past 10+26=36 you run out of decent symbols (making case significant is not user friendly), and even in that range you have confusing pairs like 1iI and 0o.

My personal favorite is z-base-32. It has none of those issues, and is much more compact than hex. It's the perfect in-between.

http://philzimmermann.com/docs/human-oriented-base-32-encodi...


Base-32 just encodes one more bit per symbol (5 instead of the 4 per symbol that hex does), so the improvement is rather minor.


52 characters vs 64 characters for a 32-byte public key is a decent improvement if you ask me. Or, more to the point, aiming for anything denser makes the alphabet less friendly.


the more characters you have in your alphabet, the higher the risk of pairs that are difficult to tell apart:

  0Oo
  1l

etc



Base58 has certain issues (lowercase-uppercase ambiguity) which is why we now have https://github.com/bitcoin/bips/blob/master/bip-0173.mediawi...


Can you elaborate on the ambiguity problem of Base58? I always thought that base58 was designed with this in mind, as it does not include visually ambiguous characters such as "0", "O", "I", "l", etc.


That is true, but if somebody unfamiliar with the encoding writes down the characters or communicates them over the phone, they have to consider the case of every letter. That is a poor user experience.


Neither of the two examples you showed would apply to the proposed non-numeric base 46, since 0 and 1 would not be allowed.

Although it is true the potential risk may be higher, it is easy to find counterexamples where this potential risk does not matter. The 26 capital letters are easy to distinguish and allow you to use a base-26 system, while a base-2 system where the only two elements are 1 and l (or 0 and O) would be more difficult to use, for example.


The proposed base46 encoding has

  Oo
  Il


> If you wanted to even it out to 64 characters, you could create a "modified base-64"...

My go-to encoding is base64url in RFC 4648. It's simply base64 with - instead of + and _ instead of /. In a case like this where the length is known, you can omit the trailing "="s so no escaping is necessary.

https://tools.ietf.org/html/rfc4648#section-5
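
A quick sketch of that with Python's standard library (padding stripped, which is safe here since a SHA-256 digest has a fixed length):

  import base64, hashlib
  digest = hashlib.sha256(b'example').digest()
  print(base64.urlsafe_b64encode(digest).rstrip(b'=').decode())  # 43 chars, only [A-Za-z0-9_-]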


Great suggestion. This is one of my design pet peeves for anything related to banking / bill payments: numbers that are long and unwieldy to type. The (e.g. invoice) number 563466130 in base 36 is "9bh18y". The same number, written as 9BH18Y, is so much easier to type. Base 36 for SHA-256 would however add 10 characters compared to the SHA-1 output length.
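
As a quick check of that conversion (a sketch relying on Python's int parsing, which handles bases up to 36 and is case-insensitive):

  int('9bh18y', 36)   # -> 563466130
  int('9BH18Y', 36)   # same value; case doesn't matter when decoding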


Same reason we use other "inefficient" protocols like json or HTTP and not "superior" binary alternatives.


HTTP/2 is in fact binary; it's only converted to a text representation for human convenience when needed. The same can be done with other protocols.


HTTP 2 only recently reached 50% adoption.


So what do you think of CBOR being used in new IETF standards?


I don't think anything about it because I do not care. The mere existence of binary protocols does not undermine the convenience of text protocols.

You are free to use CBOR for your configuration files for your next application. Then come back and tell us of your experiences.

Git uses text hashes in many, many places. This makes it convenient for users, which is a very important design goal. I am pretty sure that replacing this with binary hashes would be an insignificant performance improvement.


My gut instinct is to use a power-of-2 base because it makes decoding a lot easier (and can be done in parallel). But I doubt that the hash codec is a big enough overhead to be concerned about.


Alternatively, you can prepend the name of the hash used.


And now you basically reinvented multihashes :)

https://multiformats.io/multihash/


I do not like multihash. I personally prefer base64 hashes and I also do not like how they include the length.


> the SHA256 hashes would be unambiguously not SHA1 hashes, even when abbreviated

Not if your hash happens to not include an 'h'.


Not sure I follow. If the SHA1 hashes use [0-9a-f], and the SHA256 hashes use [h-zA-Z] (or something along those lines), there is never any overlap in characters.


The current SHA-1 representation uses hexadecimal, so only contains 0-9 and a-f.


What is unwieldy about hex strings? If you want to solve this, perhaps a prefix would be easier?


If you store them as strings instead of binary (which I'm guessing 99% of hashes are), they take up more DB space currently, and passing them around is slower, e.g. as etags etc.


I would guess that once compressed, binary represented as hex strings doesn't take much more space than as base64 strings. Probably still a bit more than the compressed binary, though.


You won't gain anything by compressing (binary) hashes.


Oh yes indeed.


This is about printing, not storage.


I wish hashes were more human-readable/human-writable.

maybe use a scheme where the prefix of the hash could be mapped to a common word.

So with a contrived example based on 100 and:

https://en.wikipedia.org/wiki/Most_common_words_in_English

the former hash 9221920 would break off "92" from 21920 and look in the word list to allow any of the following:

  9221920            old style hash could say just 92
  92-new-21920       or use 92 + 92nd most common word, "new"
  new-21920          or skip 92 which is redundant with "new"
  new                or just use new alone
obviously 100 and base 10 would have collisions, so use a larger number and constricted word lengths to find 256 or 4096 or 65535 words.
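
A sketch of the lookup being described, with the word list left as a parameter (the comment above assumes words[92] == 'new'):

  def word_prefix(hash_str, words):
      index = int(hash_str[:2])                 # '92' from '9221920'
      return words[index] + '-' + hash_str[2:]  # -> 'new-21920'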


If you used a dictionary of 100 words, you would get a prefix collision every 100 commits, i.e., within days or weeks for even moderately busy projects. Using a larger list would just shift the problem by a bit. You could use tuples of words (like https://en.wikipedia.org/wiki/What3words for example) to avoid the problem. But then you'd be much longer than current Git hash prefixes.

And why? What would be the use? The two things I do with Git hashes: (a) copy and paste, for searching for something; (b) occasionally visually comparing hashes. Copy and paste works well, in fact in pretty much any terminal or other tool double click will select the whole hash, which wouldn't be as simple if the hash were multiple words with separators. Visual comparison might be marginally easier with familiar words, but for me it's a niche use case, so I don't think optimizing for it would be useful.


my last line addressed 100 words.

I also was trying to augment the current system, not replace it. So the "92" and the word "new" would be equivalent, and you could use one, or the other, or both.

The basic idea I'm putting forward is that I think it would be easier, on the command line, than copy-paste (which requires a mouse).

I can type words pretty fast, and I can keep a word in memory easily.

Remember it extends the current model without changing it.


A way, way bigger downside is that you've made encoding and decoding much more complicated than a simple table lookup.


Base32 and base64 both can easily be implemented using a lookup table.


Not the suggested base-46.


This article by John Coggeshall is an example of great technical writing. I look forward to reading more of his work.

The more I learn about Git, the more it blows my mind. Torvalds must have designed Git in his head during the time he was using BitKeeper. When he decided to implement Git, he went to work and in a few weeks he had it working well enough to use for Linux kernel source control. Torvalds is either a genius or merely a superstar computer scientist and software engineer. Choose any two.


Everyone is forgetting “monotone”, a DVCS developed by Graydon Hoare (who later went on to create Rust), of which the Git model is essentially a simplification (and much, much less related to BitKeeper).

There was a lot of experimentation with VCSes at the time: arch -> tla -> baz ; bzr ; monotone ; cdv ; mercurial ; git ; even sun was finally getting merge tracking.

Monotone had the best structure but a slow and clunky implementation. It was adopted and simplified (in different ways) by git and hg; eventually git won by virtue of being way faster and demonstrably effective for a project the size of the Linux kernel (and Linus’ celebrity status didn’t hurt either)

But Graydon Hoare deserves a lot more credit than he received - Linus did credit him at the time, but no one seems to remember.


Definitely monotone; Linus acknowledges it. But it was way too slow.

Surely Git is influenced by Venti and Fossil from Plan9. Though I'm not sure it's ever mentioned.

https://en.wikipedia.org/wiki/Venti

https://en.wikipedia.org/wiki/Fossil_(file_system)


>Everyone is forgetting “monotone”

I miss monotone! What a beautiful single-file (SQLite) VCS. If it were still maintained I would use it for small projects, but nowadays I use Fossil and BitKeeper... and yeah, Git if I have to ;)


Author of SQLite and Fossil here - I have not forgotten monotone! Indeed, Fossil itself is inspired by monotone documentation (though not the implementation). This is credited at https://fossil-scm.org/fossil/doc/trunk/www/history.md


Dr. Richard Hipp? Thanks for all your work. The world is a significantly better place thanks to you.


Thank you very much for those great pieces of software!! Both Fossil and SQLite.


> but nowadays i use fossil and bitkeeper

Do you know of any HTTP-accessible BitKeeper host (commercial or self-hostable)? I tried out BK some months ago but wasn't able to find any way to put my code up on the internet, aside from BkBits.net, which appears to be abandoned and gave an HTTP 5xx error when I tried to sign up.


It works again... but for production? Nah, probably not. They just need more people to work on BK itself and on a hosting platform like BkBits... it's a bit of a sad state for such a great product.

But yes, you can host BkBits yourself; I have no clue which repo it is in, though. Maybe ask here:

https://users.bitkeeper.org/

Sorry for not being a big help.


I just want to chip in to say that I think darcs also deserves an honorable mention :)


Oh, most definitely - forgot to put darcs in that list - it was definitely influential in the huge “diffs vs snapshots” debate.


Reading the first Git version, i.e. its initial commit, makes learning Git easier for me. Albeit very simple and barebones, its ideas are pretty fundamental to Git and differentiate it from many other VCSs.


I wrote a series of blog posts covering how the internals of Git work; I posted a summary here for reference:

https://alblue.bandlem.com/2011/12/git-tip-of-week-finale.ht...

If you’re interested in how commits, trees and objects are represented then it could be interesting for you.


“The thing that differentiates scientists is purely an artistic ability to discern what is a good idea, what is a beautiful idea, what is worth spending time on, and most importantly, what is a problem that is sufficiently interesting, yet sufficiently difficult, that it hasn't yet been solved, but the time for solving it has come now.”

-- Savas Dimopoulos


I'm almost certain that if it weren't Torvalds, someone else would have created something like Git. He just had the need and the skills to create it. Like Bill Gates, he was just a competent person at the right place and time.


> I'm almost certain that if it weren't Torvalds, someone else would have created something like Git. He just had the need and the skills to create it. Like Bill Gates, he was just a competent person at the right place and time.

I think it's fair to point out that Torvalds didn't come up with Git on his own - he was influenced by a good amount of prior art, not the least of which was Bitkeeper.

However, I think calling him a "competent person, at the right place and time" undersells him to a substantial degree.

1. He's a world class software developer.

2. He's a world class tech lead (There are certainly criticisms about his communication style, which I don't think are unfair. Here, I am judging him by the artifacts created under his leadership).

3. He was the write lock for an extremely large distributed team for more than a decade.

It is not surprising to me that the 2nd most famous distributed source control tool - Mercurial - was also initially created by a Linux kernel developer.


> It is not surprising to me that the 2nd most famous distributed source control tool - Mercurial - was also initially created by a Linux kernel developer.

Because Linux was in a crisis and hot off an excellent experience w bitkeeper? Or something more profound?


> Because Linux was in a crisis and hot off an excellent experience w bitkeeper? Or something more profound?

I'm sure the crisis was part of it.

But also, the way Linux development is organized is very weird compared to how code is developed in most places. "Distributed" is not quite right, but it's the best word I've got. A source control tool like Perforce just wouldn't fit Linux development very well, whereas large companies at the time were perfectly happy with it.


“Decentralized”?

I agree with you that Linux development is weird compared to most places. Most places have a designated central repo that acts as the source of truth, whereas the Linux developers don’t really have this.

This is possibly an unpopular opinion, but I think that when there is an official main repo, git is far more complicated than it needs to be. It’s confusing to a lot of developers because of all this unnecessary complexity. Most teams could probably be served better by a simpler system.


I share this opinion as well. For centralized-repo style, which is what most teams need, the Git model has three problems:

- It is hard to use. There's a reason most teams have one or two Git "experts" to help everyone else resolve repository / merge / pull / whatever problems that happen regularly. The onboarding experience for developers unfamiliar with Git is very bad, and many of them end up never able to use it. If you have non-developers (e.g. artists), it is even worse.

- Binary files. Yes, there is Git LFS, but there are many small integration problems when you try to use it, and the simple fact that Git needs something special to manage binary files is a problem. And no, storing binary files outside the repository is NOT a good solution. In games for example, the binary files and the engine usually evolve together, and if you want to check out an old version, the code and the assets need to stay in sync. You want to version the game, not just the engine.

- Partial (subtree) clones and permissions. There's a reason monorepos are popular: they're much simpler to manage. Unfortunately, because Git barely does partial clones and cannot manage permissions (inside a repository) at all, monorepos are much more limited than in SVN or P4. That's why many projects end up divided into multiple repos, but this brings all sorts of problems and a large administrative overhead for all developers.


The point about being hard to use is valid. But I would focus more on it being hard to learn instead. Once you know it, you forget about it 99% of the time; for the other 1% it's good to have this "git expert" on a team ;)

For the other points there seems to be an ongoing process [0][1], and I think in a year or more having huge monorepos with binary files should not be such a big problem or require additional tools. It should be built into core Git.

Permissions seem to be partially "solved" by Git hosting sites like GitHub. But I agree it's a pity we don't have anything built into core Git for this :(

[0] https://github.blog/2020-01-17-bring-your-monorepo-down-to-s... [1] https://git-scm.com/docs/partial-clone


Give BitKeeper a try; it's now FOSS:

- Simple, clean interface

- You can work the Linux way or with a central-repo style

- Binary Asset Manager built in

- Nested repositories (submodules done right)


Interesting, thanks for the tip! I'm gonna give it a go during the weekend.


Thanks for the thoughtful reply and for kicking off a great discussion about Git's strong points and pain points. I see too many discussions that are just "fans" who seem to think Linus/Linux can do no wrong, without acknowledging what Linux development is, and how probably 99% of us will never be there, but still live with the pain points. Like suffering through supercar gas mileage and repair costs for a commuter/grocery-getter, but "McLaren can't be wrong, can they?"


Maybe their extensive experience with both big-system code and distributed team problems?


If? Is Mercurial not like Git?


Yes, and according to Wikipedia their initial releases were only 2 weeks apart.

Further, you also had Bazaar whose initial release was a month earlier and was itself based on another tool that was released a year before.

For whatever reason, creating distributed alternatives to SVN was a very popular thing to do at the time.


They were really meant to be FLOSS alternatives to BitKeeper, not SVN.


> ... he was just a competent person, at the right place and time.

Right skill set, right place, right time. The decentralized team workflow of Open Source projects demanded a decentralized version control system (DVCS). A number of DVCS systems emerged with varying trade-offs.

It is interesting, but non-obvious, which set of trade-offs gains greater adoption. Git versus Mercurial is a case study of this process.


> "He just had the need and skills to create it"

Yeah, lucky Linus

/sarcasm

A bit undeserving, don't you think?


I'd love to see an in-depth comparison of the contributions to technology from Jobs, Gates, and Torvalds.


I don't think there is much:

- Jobs was management

- Gates made BASIC and did management

- Torvalds made a VCS, a kernel, a dive-computer program, and a new way to develop software


> he was just a competent person, at the right place and time.

If you honestly have that PoV then it must be very hard to explain how no one else in the whole history of the world has ever managed to pull off anything similar to Git.

It's one thing to point out we all stand on the shoulders of giants, but it's an entirely different thing to claim any random person with average skills could pull this off. I mean, where's your Git?


10x engineer? :)


This is going to be such a pain for clients to support. Especially because there's not really a good specification for Git - the official implementation is the specification. For my client I think I will just have to wait until this is officially enabled before implementing it. Guess it has to be done though.

I don't understand the bit about distinguishing hashes by length only. If you're going to change it, why not add an identifier prefix? Then the logic can be "if length == 128 { it's sha-1 } else { read first byte to determine hash algorithm }".


Wouldn't an easier solution be to keep SHA-1 working exactly as is, but add additional hashes to verify the files haven't been maliciously modified while keeping the same SHA-1? It seems almost impossible to craft a file where multiple different hashing algorithms all produce colliding values. Assuming older versions don't explode when new data is added, it should stay backwards compatible.

The sha-1 and short hashes would be the name that is displayed and used for checking out commits, but not relied upon for security.


I've never seen an adequate answer to this but it boggles my mind that this problem wasn't foreseen.

When Git came out we'd already seen a move from MD5 hashes to SHA-1 because MD5 was no longer deemed secure. Of course this was going to happen again.

So why wasn't the protocol and the repository designed to have multiple (hash algorithm, hash) pairs?

At this point it becomes a client and server issue whether you support a given protocol or not, so you have a handshake to agree on what you're using (e.g. a server may reject SHA-1), much like we have with SSH.

This seems like it would've been trivial to allow for in the beginning. Or am I missing something?


It would have been possible to include the ability to change algorithms in a handshake (agility).

But general current consensus in the cryptography and security community (which I don't agree with) is: "Agility is bad, hmkay". The reasoning is largely because there have been plenty of attacks based on downgrading protocols like TLS to broken crypto by an attacker. Also, the handshake mechanism necessarily enlarges the attack surface of your software, usually in a pre-auth part which is very problematic.

The counter-argument is of course that a lack of agility leads to lots of pain later on, when you need agility because your one chosen algorithm is broken.

There is also the added argument for not using SHA-1 in git: SHA-1 was already starting to "smell" when git was first implemented. Picking something else back then would at least have moved the current problems farther into the future, when SHA-2 might start to be problematic.


There are good reasons why agility is bad. It never works the way people imagine.

The fantasy is people think "in case of a sudden algorithm break we can quickly switch to a better algorithm".

This is wrong because:

a) there are no sudden algorithm breaks. You always know way in advance before something is broken. (In the case of SHA-1 there was more than a decade between early warnings and the first attack.) In the case of Git the warnings about SHA-1 are older than Git itself. The problem wasn't lack of agility, the problem was choosing an algorithm that was already known to be bad.

b) if you have agility it means people will continue supporting bad algorithms and use agility as an excuse.

b) happened exactly in TLS. Padding oracles were known, but people chose to ignore it. In TLS 1.2 they included better algorithms (GCM), but chose to keep the bad ones (CBC/HMAC). Then more padding oracle attacks happened. Then people said "oh we have agility, we can disable CBC and ... well nobody supports GCM yet... let's use this other broken thing we have called RC4". Then attacks on RC4 came in.

If, by the time people designed TLS 1.2, they had said "we know RC4 is weak, we know CBC/HMAC (as used in TLS) is weak, we will just support GCM", none of that mess would've happened.


The changeover to TLS1.2 only ever could have worked because of agility. Without at least protocol version agility (which implies cryptographic agility), one couldn't talk TLS and its successor on the same port, because there wouldn't be any other way to distinguish them. A new TCP port assignment would have been possible, but a total and utter disaster. Countless firewalls, appliances, routers, middleboxes, rulesets and similar stuff assume that HTTPS is on port 443, changing that would at least take a decade.

That the allowed set of ciphers didn't change enough with TLS1.2 may be right, but doesn't matter for why we do need agility.

Agility is necessary, and TLS is the poster-child for why we need agility in such protocols. Whether the same applies to Git, where the network protocol isn't that important or ingrained, is of course debatable; I think you are right there. But what Git has to do now is create agility by cramming it in somewhere, using kludges, hacks and whatever might be possible without breaking too much. It remains to be seen if the kludge will be free of security problems.


The "fun fact" here is that TLS version negotiation does not really work. Which is why in TLS 1.3 it still technically is using a TLS 1.2 handshake with an extension adding the new version.

But apart from that: having the ability to introduce a new version is a completely different concept from what people usually call "cryptographic agility". The latter is usually used to describe a system where you can pick and choose from a variety of algorithms, good and bad ones, so everyone gets what they want. Which TLS unfortunately still does - TLS 1.3 could be so much simpler if they just removed any algorithm choices.


Given the plethora of downgrade attacks against TLS that have been found over the decades, I'd really hope that there is a better poster-child for cipher agility. If the canonical example of "cipher agility working as intended" is TLS, then I want off this rollercoaster ride.


TLS agility is definitely not the poster-child for a good implementation. I do not know if there is any cipher agility implementation I would really call good.

However, TLS is the prime example of why we desperately need agility anyway. Just look at the myriad ways things broke just by trying to get TLS 1.3 to work. And now imagine that without any agility mechanism at all. We would still be planning for the big introduction of TLS 1.3 a few years into the future.


>The fantasy is people think "in case of a sudden algorithm break we can quickly switch to a better algorithm".

That isn't really what people are afraid of. They are afraid of the time, money and wasted effort that comes from having to completely rewrite everything. These cryptographic hashes are sort of a worst case. They often end up being part of a defined interface of some sort like a cryptographic identity. So starting over from scratch can have a high social cost as well.

I think that when considering the lessons from the TLS case, people forget that many applications do not work at all like TLS. You have to have the wisdom to throw out wisdom that does not apply when designing these things. Git is an excellent example of such an application.


You don't have to rewrite everything though. The rest of Git doesn't need to be rewritten if you change hash functions, nor does the rest of WireGuard if you change encryption or key exchange constructions. Newer constructions of similar crypto primitives can be used in the same way to solve the same problem.

The key question is: should a single version be able to handle both the old (broken) construction and the new (not broken) construction? There is fairly good evidence that allowing that leads to lots of potential for security issues due to all sorts of downgrade tricks that force users to switch to the broken version (or even worse, use the fact there are two versions to break the crypto, as happened with DROWN). However, not allowing it has obvious consequences for usability (Python 3 comes to mind as a non-crypto example where having a hard cutover really didn't work).


This was pointed out to Linus right at the beginning of Git development [1]. SHA-256 was available and SHA-1 weaknesses were known.

Unfortunately he chose to ignore it.

[1] https://www.metzdowd.com/pipermail/cryptography/2017-Februar...


I don't think you are, and the IPFS ecosystem has already run into this problem. In IPFS, content is addressed by its hash, but different protocols are better suited to different encodings - browsers speak http, which is case insensitive.

And so they've worked out self-describing hashes! https://multiformats.io/multihash/

The basic idea is that you include the name of the algorithm as part of the hash. This allows changing the hash algorithm without breaking backwards compatibility.

It's a cool standard with implementations[0] for many languages. I don't know if it was considered for git, but it does seem likely that this issue will come up again before the end of time :)

[0]https://github.com/multiformats/multihash


> So why wasn't the protocol and the repository designed to have multiple (hash algorithm, hash) pairs?

yagni? The design without a real use case could well have been unsuitable, and git has managed to live for 15 years before this issue became a real concern. That’s a pretty good run.


SHA-1 wasn't intended to be cryptographic in Git. In Git, SHA is used to facilitate "content addressable storage". It probably would have been better if something like Murmur were used, because it would have been very clear that the function was not being used in a cryptographic sense.

There was, and is, no need for it to change. The correct solution is to sign commits with something designed for that (PGP).

Switching to SHA-256 is a dangerous change, because people will believe that it is secure in ways that it is not.

SHA-256 will reduce the likelihood of collisions. However, if SHA-1 is used for CAS, there is a higher chance of a meteorite falling through your roof than finding a collision through normal use.


> SHA-1 wasn't intended to be cryptographic in Git. In Git, SHA is used to facilitate "content addressable storage". It probably would have been better if something like Murmur were used, because it would have been very clear that the function was not being used in a cryptographic sense.

For "content addressable storage" using hashes to work, you cannot have different files with the same hash, otherwise getting that file (through its contents hash) could return a different file. The only way to make sure there are no different files with the same hash is to use a cryptographic hash, otherwise some joker will create two files with different contents but the same hash (as happened with md5, and later with sha1; luckily, these sha1 files did not have the internal blob header git uses, and git was changed to detect and reject files created through that technique).

> The correct solution is to sign commits with something designed for that (PGP).

Git commits contain only the hash of the tree, the hash of the parent(s), and some other information like dates, authors, and commit message (see it for yourself with "git cat-file commit HEAD"). That is, signing a commit gains nothing if the hash of the tree (and all hashes nested within it) is not a cryptographic hash. You might think the answer is "just sign all the tree contents", but that would be really slow (a checkout of the Linux kernel tree seems to be over one gigabyte nowadays; compare with a commit, which can be as small as 250 bytes).
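
For reference, this is roughly what `git cat-file commit HEAD` prints (hashes and names below are placeholders): a signature covers only this short text, so its value depends entirely on the tree and parent hashes being collision-resistant.

  tree 2b297e643c8...
  parent 9eabf5b5366...
  author A U Thor <author@example.com> 1592611200 +0000
  committer C O Mitter <committer@example.com> 1592611200 +0000

  commit message goes here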


I wish more people would adopt (and better formalize) multihash and related projects https://github.com/multiformats/multihash

It's a really simple, good idea that was developed in the IPFS world.


Perhaps I missed it, but what are we doing to future-proof the hash algorithm for when we need to transition away from SHA-256?

Seems like we’re going to just have the same problem in another decade if we don’t address this now?


> In closing, it is worth noting that one of the reasons this transition has been so hard is that the original Git implementation was not designed to swap out hashing algorithms. Much of the work put in to the SHA-256 implementation has been walking back this initial design flaw. With these changes almost complete, it not only provides an alternative to SHA-1, but also makes Git fundamentally indifferent to the hashing algorithm used. This should make Git more adaptable in the future should the need to replace SHA-256 with something stronger arise.

That's the last paragraph of the article.


I remember that during the DVCS wars, an argument raised for Bazaar/Mercurial was that they supported pluggable hash algorithms. Git was hard-coded to SHA-1. Now this design decision is harder to change.


Other than using a stronger hashing algorithm that produces longer hashes, would there be any advantage in storing two or more separate hashes of an object? The extra hashes could be from a different hash function, or a hash of the reversed bits/bytes.

I wonder about the difficulty of producing collisions for a single 256-bit hash function versus two 128-bit hash functions, four 64-bit hash functions and so on.


It looks like you've already got the gist of this from the second and third links in a grandchild post. Anyway!

For a bit of a hands-on view, there is a terrific cryptography coding exercise that explores this idea[0], and it takes you all the way to actually breaking the construction (although the exercise is concerned with collisions, rather than preimages). The linked exercise has you homebake your own de-fanged hash functions, so that you can run solution code within your lifetime (as well as giving you a closer view of the machinery).

Of course, the exercise applies to fairly raw, Merkle-Damgard-like constructions.

What I took away from the challenge was that I would feel better using a single, longer hash than two shorter, concatenated hashes, because you are bottlenecking your construction to your stronger hash (if they differ in strength).

An auditor or attacker with white box access can attack your smaller hash (if your two hash functions differ in bit-size) to generate a 2^(n/2)-way collision (that is, if it's feasible to collide the hash at all, there is a clever trick to generate a massive number of messages that all collide to the same output), then birthday attack the bigger hash using the messages from that collision-pool. (Importantly, this mention of "n" specifically refers to the bit-size of the larger hash, a.k.a., the bottleneck.)

Although there might be something about it at first glance that makes the messages feel not random enough (in order for the birthday principle to hold), it's far easier to tell you to just do the exercise in order to grok why it works than for me to figure out how to explain it :)

This exercise is the first of an awesome triple-whammy[1][2] of enlightening hash function challenges.

[0] https://cryptopals.com/sets/7/challenges/52 [1] https://cryptopals.com/sets/7/challenges/53 [2] https://cryptopals.com/sets/7/challenges/54


That reminds me of my phone. It is kinda special. Instead of an eleven-digit number, I have eleven one-digit phone numbers. When I write them down, I usually just write them all in one row. So the written version of my eleven phone numbers looks just like a normal eleven-digit phone number.


I will try to show the difference between one 256-bit hash function and two 128-bit hash functions.

Let h_0(x) be a 256-bit hash function. Let h_1(x) and h_2(x) be 128-bit hash functions.

Let h_3(x) = h_1(x) + h_2(x), h_4(x) = h_2(x) + h_1(x), where '+' is the concatenation operator. Thus h_3(x) and h_4(x) are also 256-bit hash functions.

Given an object x_1, find x_2 that satisfies one of the following conditions:

(1) h_0(x_1) = h_0(x_2)

(2) h_3(x_1) = h_3(x_2) AND h_4(x_1) = h_4(x_2)

Finding a collision for a single 256-bit hash function would be trying to solve (1), but finding collisions for two 128-bit hash functions would be trying to solve (2).

Also note that (2) is not equivalent to finding collisions for two 256-bit hash functions.

I think the misconception comes from the fact that the multiple hashes form an unordered set, not an ordered concatenation.

If there's any mistake in my logic, please feel free to point it out.

Edit: (2) can be simplified into h_1(x_1) = h_1(x_2) AND h_2(x_1) = h_2(x_2), as concatenation of the hashes does not turn h_3 and h_4 into "real" 256-bit hash functions.


This makes sense. However, the implementation of h_0(x) could also be foo(x) + bar(x) or whatever. In the end, you have a function that returns a fixed number of bytes.

I doubt any hash function that we use today changes the algorithm halfway through its operation, though. So your proposal might have a benefit in a security-by-obscurity way.


I suppose I missed the assumption that h_0, h_1 and h_2 cannot be trivially decomposed into concatenations of other hash functions, whereas h_3 and h_4 can.

Also, there's no obscurity since h_1 and h_2 are the actual targeted hash functions, h_3 and h_4 are just to show that concatenation doesn't change the actual problem or make it harder.


What?


A function that concatenates the results of two 64-bit hash functions _is_ a 128-bit hash function.

Just as a string of individual digits makes a larger number.


I mean, technically yes, but it would be a 128-bit hash function with the security properties of a 64-bit hash function, so it offers little advantage over just using a 64-bit hash (which I think was also the point you were trying to make?).

However, that doesn't really address the original question of how much harder cracking two 64-bit hashes would be than cracking a single 128-bit hash.

My best guess is that it's really hard to quantify as you start opening up more dimensions besides the number of bits. The gain would mostly come from protection against other properties of one of the algorithms, like a potentially hidden backdoor or an undiscovered mathematical weakness. So as long as the strength of the individual hash functions holds up, it probably makes sense to diversify between hash functions. E.g. SHA3-256 + BLAKE3-256 probably offers better long-term security properties than using SHA3-512.


> but it would be a 128-bit hash function with the security properties of a 64-bit hash function

This is not true. Consider two hash functions f and g

    f(x) = md5(x)[0..63]
    g(x) = md5(x)[64..127]
and a third function

    h(x) = f(x) || g(x)
where || is concat

So no, concatenating multiple smaller hash functions is not any weaker than using a single big one.


So your point is that if you take the output of the _same_ 128bit hash function twice, split it into 64bit parts and then put it back together, you still have the full properties of the 128bit hash function? Well, no shit.

I have to admit that I'm not the greatest cryptography whiz, but I can't imagine that this holds up for _independent_ hash functions, where you should be able to more cheaply run a preimage attack against two 64bit hash functions than one 128bit hash function.


>I have to admit that I'm not the greatest cryptography whiz, but I can't imagine that this holds up for _independent_ hash functions, where you should be able to more cheaply run a preimage attack against two 64bit hash functions than one 128bit hash function.

Performance is probably an issue. SHA256 has 256 bits and SHA1 has 160 bits (1.6x more bits), but SHA256 isn't 1.6x slower, it's only 38% slower. Benchmarks used: https://www.cryptopp.com/benchmarks.html

Back to the original question of "how secure are 2x 64bit hashes compared to 1x 128 bit hash?", I can't imagine how it could be any more secure, considering that if it were more secure, you could just make your 128 bit hash function be the concatenation of the two 64 bit hash functions. It might be equally secure, but I'm not sure why you'd use it over a properly designed 128 bit hash function.


> where you should be able to more cheaply run a preimage attack against two 64bit hash functions than one 128bit hash function.

Try doing that in my example.


Your hash functions are constructed in a way such that the concatenation has the security properties of a 128-bit hash function (because your construction is equivalent to md5). I am not sure that result holds for all concatenations of hash functions. Or for SHA concatenations.

I appreciate you making me think deeper about hash functions. Nice construction.


My guess: two paths to the same number string are the same number. I.e., two 64-bit numbers are a 128-bit number in disguise.


Hash security is a complicated beast. There were some research results that concatenating multiple hash functions isn't much more secure than the best of the concatenated functions. It's not a good way to produce more secure hashes.

Also please note that length is only one measure of hash security. SHA1 is weak for more reasons than its length. If it were a good hash function it would still have a security of 80 bits, which would not be a comfortable security margin, but still not really broken in practice. But further weaknesses reduced the attack complexity.

Likewise, if you think about SHA256, this is not just the 256-bit-instead-of-160-bit version of SHA1. It's a different function (although based on similar constructions) without the weaknesses of SHA1.


After finding the right search terms (concatenating hash functions), I found a few stackexchange discussions about this, which led me to other methods like truncating a stronger hash function[1] and chaining hash functions[2]. Apparently TLS already concatenates MD5 and SHA1 [2][3].

Given that the article is about collision attacks and not preimage resistance, that was my main thought when thinking of the issue. I'll leave it to the experts to figure out what's the best for cryptographic hash functions.

[1] https://crypto.stackexchange.com/questions/9435/is-truncatin...

[2] https://crypto.stackexchange.com/questions/270/guarding-agai...

[3] https://en.wikipedia.org/wiki/Cryptographic_hash_function#Co...


> Apparently TLS already concatenates MD5 and SHA1 [2][3].

Luckily TLS no longer does such things. This was a bad workaround in old versions of TLS, which they then replaced with a "you can use secure or insecure hash functions, you choose" in TLS 1.2 (which is hard to excuse - at the time TLS 1.2 was written, the weaknesses in SHA1 and MD5 were well known). In TLS 1.3 they finally did the right thing and only support secure hash functions.


I wonder why SHA-256, and not BLAKE2(b/s) or BLAKE3. BLAKE3 is significantly faster.


SHA-256 is faster when hardware accelerated, and was judged to be a more conservative choice. You can find their evaluation here: https://public-inbox.org/git/20180609224913.GC38834@genre.cr...


That post was made in mid 2018, and BLAKE3 wasn't published until late 2019; I don't think it was considered.

SHA256 with hardware acceleration might be competitive with BLAKE2, but not with BLAKE3 (performance graph at https://raw.githubusercontent.com/BLAKE3-team/BLAKE3/master/... ). BLAKE3 is also capable of being heavily parallelized without changing its output, as well as quickly updated without rehashing entire objects (if you keep some partial hashes around). Overall, it'd be quite desirable.

Hopefully all the work to abstract the choice of hash algorithm will make it relatively straightforward to prototype the use of other hashes.


They did consider and rule out KangarooTwelve, which has basically the same properties as BLAKE3 while being derived from Keccak (and therefore with more chance of ending up as a US federal standard).


KangarooTwelve didn't have the same performance as BLAKE3 though.


It roughly does. Note that it is not present in the comparison chart you posted...


I'd agree that it roughly does. It's omitted from that chart mainly because it's not very common, but also because the input length measured there would make K12's throughput appear lower than it should. The BLAKE3 paper contains more detailed graphs across a spectrum of input lengths.


I think that part of the performance gain of BLAKE3 is due to the fact that it uses a Merkle tree or something. SHA-3 has ParallelHash. One could use that with KangarooTwelve potentially.

I would really like an AES-based hash function, as AES-NI is much faster compared to ChaCha20. There was a function called ECHO that worked like that during the SHA-3 competition, but it was dropped before the final round.


Does performance matter in this context? Git isn't exactly hashing terabytes of data.


It does; I've encountered some cases (ingesting large amounts of data into git) where the limiting factor on performance is a combination of SHA-1 and DEFLATE.


You don't think that GitHub has to hash petabytes of their users data?


Because speed is not the only concern. Also, SHA-2 is already hardware accelerated, and may still be faster than the rest of the hashes.


I am working on a game, and version control in the gaming world is definitely weird to somebody coming from web dev. Almost nobody in this world uses Git; they all use Perforce, SVN, and Plastic. Apparently Git doesn't deal with binary assets well; even with LFS, it's hard to find hosting for your 100-500GB repositories that also supports file locking (another binary-specific feature needed for games that many hosts don't seem to support for some reason).

I wanted to try LFS, but where do you actually go to find cheap hosting for a 100-150GB repo?


When browsing lwn in firefox, "unable to connect" errors are relatively common. Is this just the server being aggressive with prioritizing which requests to serve?


I couldn't see it in the article, but why settle on SHA2-256? Why not something bigger or newer, such as SHA2-512 or SHA3-256? Or why not a variable-length hash like SHAKE256?


I believe this is discussed in https://lwn.net/Articles/811068/


Is there a way to opt out of this other than pinning your git executables to an old version or forking?

I want to make sure all my git repos are forever readable by the original git they were produced with, even if newer git versions are used for producing commits.

The article makes some troubling comments like:

> countless repositories need to be transitioned from SHA-1 to SHA-256.

Absolutely never doing this.


Maybe a naive question, but why not hash content with SHA256 and then hash the result with SHA1?


That's like saying "why not choose a 20 character password and just use the first 8 characters of it?"


So brian finally got closer? Been waiting for this for a long time.



