
That does not make any sense. Whether it is memory intensive depends on awk, not on whether the person is familiar with it.

So is it memory intensive or not?




The example AWK script will build an array of every unique line of text in the file.

If the file is large and mostly unique, assume that a substantial portion of the file will be loaded into memory.

If this is larger than the amount of RAM, portions of the active array will be paged out to swap space, and the drive will thrash as each new line is read, forcing a complete rescan of the array.

This is very handy for files that fit in available RAM (and zram may help greatly), but it does not scale.
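
To make the memory behaviour concrete, here is a rough Python sketch of the same idea. The AWK script itself is not reproduced in this thread, so this assumes it follows the usual "array of seen lines" pattern; the function name `unique_lines` is made up for illustration.

```python
import io

def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()  # holds one entry per distinct line, so it grows with the unique content
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

# Small in-memory stand-in for a file.
sample = io.StringIO("a\nb\na\nc\nb\n")
result = list(unique_lines(sample))
```

For a mostly unique input, `seen` ends up holding nearly the whole file, which is the scaling problem described above.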


I don't know how awk (or this particular implementation) works, but it could be done so that lines are compared only when there is a hash collision. Finding all prior lines with a given hash also need not require a complete rescan of the set of prior lines: for each hash, keep a list of the offsets of each corresponding prior line. Furthermore, if that list is an array sorted by the lines' text, then whenever you find that the current line is unique, you also know where in the array to insert its offset to keep it sorted. Alternatively, use a trie or suffix tree.
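
A minimal sketch of the hash-plus-offsets idea, in Python rather than awk, and not a claim about how any awk implementation actually works. It assumes the input is a seekable binary file; `unique_lines_by_offset` is a hypothetical name, and CRC32 is used only as a convenient stand-in for whatever hash one would really pick.

```python
import io
import zlib

def unique_lines_by_offset(f):
    """Yield unique lines from a seekable binary file.

    Only hashes and file offsets are kept in memory; the line text is
    re-read from the file when two lines share a hash.
    """
    offsets_by_hash = {}  # hash -> offsets of the unique lines with that hash
    pos = 0
    while True:
        f.seek(pos)
        line = f.readline()
        if not line:
            break
        next_pos = f.tell()
        candidates = offsets_by_hash.setdefault(zlib.crc32(line), [])
        duplicate = False
        for off in candidates:  # only lines with the same hash are compared
            f.seek(off)
            if f.readline() == line:
                duplicate = True
                break
        if not duplicate:
            candidates.append(pos)
            yield line
        pos = next_pos

data = io.BytesIO(b"a\nb\na\nc\nb\n")
result = list(unique_lines_by_offset(data))
```

Memory here still grows with the number of unique lines (one offset each), but not with their length, and the cost moves to extra disk seeks whenever a hash has been seen before.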


Sure, you only need to compare when there's a hash collision, but you still need to keep all the lines in memory for later comparison.


Sure (though they could be in a compressed form, such as a suffix tree), but that wasn't the issue I was addressing.


AWK was the first "scripting" language to implement associative arrays, which its authors say they took from SNOBOL4.

Since then, Perl and PHP have also implemented associative arrays. All three can loop over the text index of such an array and produce the original value, which a (bijective) hash cannot do.



