Are you sure this is the case? Perhaps it's only storing the hash of the lines? ...

chasil · on May 29, 2019

The AWK program:

awk '{a[$0]++}; END{for(b in a) print b, a[b]}'

...will print every unique line in a file with the count. Obviously, that could not be done if the array index was a hash - the array index is the entire line, and the array value is the count.

The original program moves the maintenance of the array into the implicit conditional "pattern," and only prints when the array entry does not yet exist.

hk__2 · on May 29, 2019

It can’t just store the hash of the lines otherwise it would drop lines in case of hash collision.

n4r9 · on May 29, 2019

It depends on the implementation, but typically hash tables are used to store the elements and values of associative arrays:

https://www.gnu.org/software/gawk/manual/html_node/Array-Int...

I suspect that it's designed so that hash collisions are impossible until you get to an unrealistic number of characters per line.

michaelmior · on May 29, 2019

I doubt it's designed to silently break in some cases. Unrealistic isn't realistic until one day it is and that is a bad day. I suppose it could just throw an error in the case of a hash collision, but I doubt it.

n4r9 · on May 29, 2019

But what does it do, then? The page I links states that it uses a hash table. Hash tables apply a hash function to the key. Hash functions map arbitrary input data onto data of a fixed size. It's inevitable that collisions will occur. ~~even if you use some sort of clever workaround in the case of collisions, eventually you use up all the available outputs.~~ (my bad)

I'm not claiming that it will silently break! I'd be very interested in exploring the internals a little more and finding out how hard it is to get a collision in various implementations and how they behave subsequently.

EDIT: I've read chasil's comment and agree that it must be storing raw keys in the array. I guess awk uses separate chaining or something to get around hash collisions.