The tradeoff of this solution is it stores all (unique) lines in memory. If you have a large file with a lot of unique lines you might prefer using `sort -u`, although it doesn’t keep the order.
Are you sure this is the case? Perhaps it's only storing the hash of the lines? If not then how do you do that?
IME this one-liner can churn through 100MB of log lines in a second. Other solutions like powershell's "select-object -unique" totally choke on the file.
...will print every unique line in a file with the count. Obviously, that could not be done if the array index was a hash - the array index is the entire line, and the array value is the count.
The original program moves the maintenance of the array into the implicit conditional "pattern," and only prints when the array entry does not yet exist.
I doubt it's designed to silently break in some cases. Unrealistic isn't realistic until one day it is and that is a bad day. I suppose it could just throw an error in the case of a hash collision, but I doubt it.
But what does it do, then? The page I links states that it uses a hash table. Hash tables apply a hash function to the key. Hash functions map arbitrary input data onto data of a fixed size. It's inevitable that collisions will occur. ~~even if you use some sort of clever workaround in the case of collisions, eventually you use up all the available outputs.~~ (my bad)
I'm not claiming that it will silently break! I'd be very interested in exploring the internals a little more and finding out how hard it is to get a collision in various implementations and how they behave subsequently.
EDIT: I've read chasil's comment and agree that it must be storing raw keys in the array. I guess awk uses separate chaining or something to get around hash collisions.
Doesn't sort require keeping lines in memory as well?
In fact, doesn't sort keep all lines in memory, whereas this awk solution just keeps the unique lines.
I don't believe that - how does it spill to disk? TMP directory? You can still sort if you don't have any disk space (until you run out of swap?) from what I recall.
> Or: use a 64 bit machine for a bigger address space
There are people who are still on 32-bit hardware for serious work?
(Even if I was still on that, I'd probably just fire up a RV64 virtual machine (with swap space added within the VM, of course) simply to access the convenience of a larger address space when needed.)
For values of serious work. Dedicated embedded devices may still run 32 bit CPUs. Much SOHO network kit has ridiculously small storage and memory (8 MB flash, 64 MB RAM), making memory-intensive operations, such as, say, deduplicating a 250,000 item spam and malware domains blocklist, challenging.
These also have awk, almost always via Busybox, though using OpenWRT other versions are installable.
BusyBox v1.28.4 () built-in shell (ash)
_______ ________ __
| |.-----.-----.-----.| | | |.----.| |_
| - || _ | -__| || | | || _|| _|
|_______|| __|_____|__|__||________||__| |____|
|__| W I R E L E S S F R E E D O M
-----------------------------------------------------
OpenWrt 18.06.2, r7676-cddd7b4c77
-----------------------------------------------------
root@modem:~# uname -a
Linux modem 4.9.152 #0 SMP Wed Jan 30 12:21:02 2019 mips GNU/Linux
root@modem:~# free
total used free shared buffers cached
Mem: 59136 38484 20652 1312 2520 13096
-/+ buffers/cache: 22868 36268
Swap: 0 0 0
root@modem:~# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 2560 2560 0 100% /rom
tmpfs 29568 1268 28300 4% /tmp
tmpfs 29568 44 29524 0% /tmp/root
tmpfs 512 0 512 0% /dev
/dev/mtdblock5 3520 1772 1748 50% /overlay
overlayfs:/overlay 3520 1772 1748 50% /
root@modem:~# which awk
/usr/bin/awk
root@modem:~# ls -l `which awk`
lrwxrwxrwx 1 root root 17 Jan 30 12:21 /usr/bin/awk -> ../../bin/busybox
root@modem:~# opkg list | grep awk
gawk - 4.2.0-2 - GNU awk
root@modem:~#
... though in this case, adblock runs on a larger and more capable Turris Omnia (8 GB flash, 2 GB RAM). The hourly sort still shows up on system load average plots.
The Turris does just fine. It's equivalent, mostly, to a mid-oughts desktop.
The flexibility of keeping the adblock processing self-contained, rather than processing this on another box and rigging an update mechanism, is appealing.
The 'huge" file is 6MB. That's not immense, but it taxes (overly constrained, IMO) typical SOH router resources.
The flexibility afforded, for pennies to a few dollars, of, say, > 1 GB storage and 500 MB RAM, is tremendous.
I'm not sure what Busybox's sort does, though on an earlier iteration on a mid-oughts Linksys WRT54g router running dd-wrt, sorting was infeasible. I hadn't tried the awk trick.
It did have a Busybox awk though, which proved useful.