Are you sure this is the case? Perhaps it's only storing the hash of the lines? If not then how do you do that?
IME this one-liner can churn through 100MB of log lines in a second. Other solutions like PowerShell's "Select-Object -Unique" totally choke on the file.
...will print every unique line in a file with the count. Obviously, that could not be done if the array index was a hash - the array index is the entire line, and the array value is the count.
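To make the "index is the entire line, value is the count" point concrete, here is a rough Python sketch of the same idea. It is not awk, and `count_lines` and the sample data are made up for illustration; a Python dict stands in for the awk associative array.

```python
def count_lines(lines):
    """Map each complete line (the key) to the number of times it occurred."""
    counts = {}
    for line in lines:
        key = line.rstrip("\n")  # the whole line is the key, not a hash of it
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample input standing in for a log file.
sample = ["GET /a\n", "GET /b\n", "GET /a\n"]
for line, n in count_lines(sample).items():
    # Printing the line back out is only possible because the key is the text itself.
    print(n, line)
```

If only a fixed-size hash of each line were kept, there would be nothing to print back out at the end.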
The original program moves the maintenance of the array into the implicit conditional "pattern," and only prints when the array entry does not yet exist.
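The "print only when the entry does not yet exist" behaviour can be sketched the same way. Again this is Python rather than awk, and `unique_lines` is a hypothetical name; the point is just that the count is checked before it is incremented.

```python
def unique_lines(lines):
    """Yield each line the first time it is seen, preserving input order."""
    seen = {}
    for line in lines:
        previous = seen.get(line, 0)  # count before this occurrence
        seen[line] = previous + 1     # maintain the array on every line
        if previous == 0:             # emit only on the first sighting
            yield line


print(list(unique_lines(["a", "b", "a", "c", "b"])))
```

Memory use grows with the number of distinct lines, since every distinct line is kept as a key.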
I doubt it's designed to silently break in some cases. Unrealistic isn't realistic until one day it is and that is a bad day. I suppose it could just throw an error in the case of a hash collision, but I doubt it.
But what does it do, then? The page I linked states that it uses a hash table. Hash tables apply a hash function to the key. Hash functions map arbitrary input data onto data of a fixed size. It's inevitable that collisions will occur. ~~even if you use some sort of clever workaround in the case of collisions, eventually you use up all the available outputs.~~ (my bad)
I'm not claiming that it will silently break! I'd be very interested in exploring the internals a little more and finding out how hard it is to get a collision in various implementations and how they behave subsequently.
EDIT: I've read chasil's comment and agree that it must be storing raw keys in the array. I guess awk uses separate chaining or something to get around hash collisions.
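For anyone curious what separate chaining looks like, here is a toy Python sketch. It is an illustration of the general technique only, not a claim about how any particular awk implementation works; `ChainedTable` and its deliberately terrible hash function (the key's length) are invented so that collisions are easy to trigger.

```python
class ChainedTable:
    """Toy hash table: each bucket holds a list of [raw_key, count] entries."""

    def __init__(self, buckets=4):
        self.buckets = [[] for _ in range(buckets)]

    def _index(self, key):
        # Deliberately weak hash: keys of equal length always collide.
        return len(key) % len(self.buckets)

    def increment(self, key):
        chain = self.buckets[self._index(key)]
        for entry in chain:
            if entry[0] == key:  # compare the raw key, not just the hash
                entry[1] += 1
                return entry[1]
        chain.append([key, 1])   # collision or empty bucket: extend the chain
        return 1

    def get(self, key):
        for stored_key, count in self.buckets[self._index(key)]:
            if stored_key == key:
                return count
        return 0


table = ChainedTable()
for word in ["abc", "xyz", "abc"]:  # "abc" and "xyz" land in the same bucket
    table.increment(word)
print(table.get("abc"), table.get("xyz"), table.get("qqq"))
```

Because the raw key is stored next to the value and compared on lookup, a collision only costs a longer chain to walk; it never merges two different lines.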