
You need "gawk -M" here for bignum support, so that visited[$0]++ doesn't wrap back to zero; otherwise the one-liner is not correct for huge files with huge numbers of duplicates.
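For context, here is a minimal sketch of the counter-based idiom in question. The exact one-liner is an assumption on my part, since only the visited[$0]++ expression is quoted here; the variable name deduped is just for illustration.

```shell
# Assumed form of the counter-based idiom: print a line only the first
# time it appears, i.e. while its counter is still zero before the increment.
deduped=$(printf 'x\ny\nx\nx\n' | awk '!visited[$0]++')
printf '%s\n' "$deduped"
```

Every repeat of a line bumps its counter again, which is why the counter's upper limit matters at all.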

The portable one-liner that doesn't suffer from integer wraparound is actually

   awk '!($0 in seen) { seen[$0]; print }'

which can be golfed a bit:

   awk '!($0 in s); s[$0]'

The first pattern, $0 in s, tests whether the current line already exists as a key in the s[] assoc array. We negate that, so the line is printed (the default action) only if it doesn't exist yet.

Then we unconditionally evaluate s[$0] as a second pattern. It has an uninitialized value that behaves like Boolean false, so nothing is printed a second time. In awk, merely mentioning an array location materializes it, so this has the effect that "$0 in s" is now true, though s[$0] continues to have an uninitialized value.
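A quick way to convince yourself, as a rough sketch: run both variants over the same small input and check the materialization separately. The sample input and the shell variable names (input, verbose, golfed, materialized) are made up for illustration.

```shell
# Sample input with duplicates (hypothetical data).
input='a
b
a
c
b'

# The spelled-out variant and the golfed variant should agree.
verbose=$(printf '%s\n' "$input" | awk '!($0 in seen) { seen[$0]; print }')
golfed=$(printf '%s\n' "$input" | awk '!($0 in s); s[$0]')

# Merely mentioning s["x"] creates the key, although nothing is assigned to it.
materialized=$(awk 'BEGIN { s["x"]; if ("x" in s) print "present"; else print "absent" }')

printf '%s\n' "$verbose"
printf '%s\n' "$golfed"
printf '%s\n' "$materialized"
```

Both variants keep only the first occurrence of each line, in input order, and no counter is ever incremented.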




> huge files with huge numbers of duplicates

At least on the stock macOS awk, you can count up to 2^53 before arithmetic breaks. It doesn't wrap, it just stops going up, which means the one-liner still works.

    > echo '2^53-1' | bc
    9007199254740991
    > seq 1 10 | awk 'BEGIN{a[123]=9007199254740991;b=a[123]}{a[123]++}END{print a[123],b,a[123]-b}'
    9007199254740992 9007199254740991 1

Even with one character per line, you'd need an 18 PB file before you got to this limit, afaict: that is 2^53 copies of the same line at 2 bytes each (the character plus its newline).
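The back-of-the-envelope arithmetic, sketched in awk itself. This assumes 2 bytes per line (one character plus the newline) and decimal petabytes (1e15 bytes); the variable names are only illustrative.

```shell
# 2^53 duplicate lines at 2 bytes each, expressed in decimal petabytes.
pb=$(awk 'BEGIN { lines = 2^53; bytes = lines * 2; printf "%.1f", bytes / 1e15 }')
printf '%s PB\n' "$pb"
```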



