It can’t just store the hash of the lines otherwise it would drop lines in case ...

n4r9 · on May 29, 2019

It depends on the implementation, but typically hash tables are used to store the elements and values of associative arrays:

https://www.gnu.org/software/gawk/manual/html_node/Array-Int...

I suspect that it's designed so that hash collisions are impossible until you get to an unrealistic number of characters per line.

michaelmior · on May 29, 2019

I doubt it's designed to silently break in some cases. Unrealistic isn't realistic until one day it is and that is a bad day. I suppose it could just throw an error in the case of a hash collision, but I doubt it.

n4r9 · on May 29, 2019

But what does it do, then? The page I links states that it uses a hash table. Hash tables apply a hash function to the key. Hash functions map arbitrary input data onto data of a fixed size. It's inevitable that collisions will occur. ~~even if you use some sort of clever workaround in the case of collisions, eventually you use up all the available outputs.~~ (my bad)

I'm not claiming that it will silently break! I'd be very interested in exploring the internals a little more and finding out how hard it is to get a collision in various implementations and how they behave subsequently.

EDIT: I've read chasil's comment and agree that it must be storing raw keys in the array. I guess awk uses separate chaining or something to get around hash collisions.