
The tradeoff of this solution is that it stores all (unique) lines in memory. If you have a large file with a lot of unique lines, you might prefer using `sort -u`, although it doesn’t keep the order.



Are you sure this is the case? Perhaps it's only storing a hash of each line? If not, how does it do that?

IME this one-liner can churn through 100 MB of log lines in a second. Other solutions, like PowerShell's "Select-Object -Unique", totally choke on a file that size.


The AWK program:

awk '{a[$0]++}; END{for(b in a) print b, a[b]}'

...will print every unique line in a file with the count. Obviously, that could not be done if the array index was a hash - the array index is the entire line, and the array value is the count.
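
For example, on a made-up three-line input (note that `for (b in a)` walks the array in an unspecified order, so the two output lines may come out swapped):

    $ printf 'foo\nbar\nfoo\n' | awk '{a[$0]++}; END{for(b in a) print b, a[b]}'
    foo 2
    bar 1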

The original program moves the maintenance of the array into the implicit conditional "pattern," and only prints when the array entry does not yet exist.
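
Presumably the one-liner being discussed is the classic form, where the whole program is just that pattern with no action block, so the default action (print the line) fires only the first time a given line is seen ("file" is a stand-in for whatever input you feed it):

    awk '!a[$0]++' file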


It can’t just store the hash of the lines otherwise it would drop lines in case of hash collision.


It depends on the implementation, but typically hash tables are used to store the elements and values of associative arrays:

https://www.gnu.org/software/gawk/manual/html_node/Array-Int...

I suspect that it's designed so that hash collisions are impossible until you get to an unrealistic number of characters per line.


I doubt it's designed to silently break in some cases. Unrealistic isn't realistic until one day it is, and that is a bad day. I suppose it could just throw an error in the case of a hash collision, but I doubt it.


But what does it do, then? The page I linked states that it uses a hash table. Hash tables apply a hash function to the key. Hash functions map arbitrary input data onto data of a fixed size. It's inevitable that collisions will occur. ~~Even if you use some sort of clever workaround in the case of collisions, eventually you use up all the available outputs.~~ (my bad)

I'm not claiming that it will silently break! I'd be very interested in exploring the internals a little more and finding out how hard it is to get a collision in various implementations and how they behave subsequently.

EDIT: I've read chasil's comment and agree that it must be storing raw keys in the array. I guess awk uses separate chaining or something to get around hash collisions.


Doesn't sort require keeping lines in memory as well? In fact, doesn't sort keep all lines in memory, whereas this awk solution just keeps the unique lines?


No, most implementations of sort use a bounded memory buffer and an external-memory algorithm, spilling to disk.


I don't believe that - how does it spill to disk? TMP directory? You can still sort if you don't have any disk space (until you run out of swap?) from what I recall.


Your recollection is either limited in what implementations you have encountered, or faulty.

* https://unix.stackexchange.com/a/450900/5132


The man page for sort has parameters to adjust the location of the temporary files.

These are only used when the allowed memory buffer is exhausted.
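
With GNU coreutils sort, for instance, both knobs are exposed on the command line (file names here are made up):

    # cap the in-memory buffer at 64 MiB and spill intermediate runs to /var/tmp
    sort -u -S 64M -T /var/tmp big.log > big.dedup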


Even if sort overcomes the memory limitation by using disk space, it's O(n log n), whereas de-duplication through a hash table is O(n) on average.

You might be better off switching to another scripting language that has some database API for storing key/value pairs on disk.

Or: use a 64 bit machine for a bigger address space, and add temporary swap files so you have more virtual memory.


> Or: use a 64 bit machine for a bigger address space

There are people who are still on 32-bit hardware for serious work?

(Even if I was still on that, I'd probably just fire up a RV64 virtual machine (with swap space added within the VM, of course) simply to access the convenience of a larger address space when needed.)


For values of serious work. Dedicated embedded devices may still run 32-bit CPUs. Much SOHO network kit has ridiculously small storage and memory (8 MB flash, 64 MB RAM), making memory-intensive operations, such as, say, deduplicating a 250,000-item spam and malware domain blocklist, challenging.

These also have awk, almost always via BusyBox, though under OpenWrt other versions are installable.

    BusyBox v1.28.4 () built-in shell (ash)

      _______                     ________        __
     |       |.-----.-----.-----.|  |  |  |.----.|  |_
     |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
     |_______||   __|_____|__|__||________||__|  |____|
              |__| W I R E L E S S   F R E E D O M
     -----------------------------------------------------
     OpenWrt 18.06.2, r7676-cddd7b4c77
     -----------------------------------------------------
    root@modem:~# uname -a
    Linux modem 4.9.152 #0 SMP Wed Jan 30 12:21:02 2019 mips GNU/Linux
    root@modem:~# free
                 total       used       free     shared    buffers     cached
    Mem:         59136      38484      20652       1312       2520      13096
    -/+ buffers/cache:      22868      36268
    Swap:            0          0          0
    root@modem:~# df
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/root                 2560      2560         0 100% /rom
    tmpfs                    29568      1268     28300   4% /tmp
    tmpfs                    29568        44     29524   0% /tmp/root
    tmpfs                      512         0       512   0% /dev
    /dev/mtdblock5            3520      1772      1748  50% /overlay
    overlayfs:/overlay        3520      1772      1748  50% /
    root@modem:~# which awk
    /usr/bin/awk
    root@modem:~# ls -l `which awk`
    lrwxrwxrwx    1 root     root            17 Jan 30 12:21 /usr/bin/awk -> ../../bin/busybox
    root@modem:~# opkg list | grep awk
    gawk - 4.2.0-2 - GNU awk
    root@modem:~#
... though in this case, adblock runs on a larger and more capable Turris Omnia (8 GB flash, 2 GB RAM). The hourly sort still shows up on system load average plots.


Are you going to sort huge files on a router?

Will the BusyBox version of sort use temporary files when there isn't enough RAM? And will you have a big enough read/write flash partition for that?


The Turris does just fine. It's equivalent, mostly, to a mid-oughts desktop.

The flexibility of keeping the adblock processing self-contained, rather than processing this on another box and rigging an update mechanism, is appealing.

The "huge" file is 6 MB. That's not immense, but it taxes typical (overly constrained, IMO) SOHO router resources.

The flexibility afforded, for pennies to a few dollars more, by, say, >1 GB of storage and 500 MB of RAM, is tremendous.

I'm not sure what BusyBox's sort does, though on an earlier iteration, on a mid-oughts Linksys WRT54G router running DD-WRT, sorting was infeasible. I hadn't tried the awk trick.

It did have a BusyBox awk though, which proved useful.


Sorting is O(n log n), but you still have to make a second pass at the end to remove the duplicates; that extra pass is O(n), isn't it?


This is mentioned in the article, together with a method to keep the order by decorating with the line number before sorting.
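
The decorate/sort/undecorate idea looks roughly like this (a sketch assuming GNU tools, not necessarily the article's exact commands):

    # 1. prefix each line with its line number ("decorate")
    # 2. sort on the content only; -u keeps one copy per distinct line,
    #    and -s (stable) makes that the first occurrence
    # 3. sort numerically on the line number to restore the original order
    # 4. strip the line numbers again ("undecorate")
    cat -n file | sort -k2 -s -u | sort -n | cut -f2-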


The sort/uniq tradeoff is mentioned but not the awk one.



