Removing duplicate lines from files keeping the original order with Awk (iridakos.com)
375 points by laz_arus on May 29, 2019 | 154 comments



I have a collection of such one-liners; for duplicates, including how to form a key from multiple fields, see [1]

[1] https://github.com/learnbyexample/Command-line-text-processi...


Wow this is great. I want to become more competent with text processing in the command line and this looks like a great place to start.

Thanks for linking this repo!


Wow! Thanks for this! This is exactly what I was looking for :)


This repo is awesome, great work.


Awk is wonderful. It's an odd way to write programs, but for quick one-off processing tasks it almost can't be beaten.

Somewhat related blog post which I like to refer people to: "Command-line Tools can be 235x Faster than your Hadoop Cluster" https://adamdrake.com/command-line-tools-can-be-235x-faster-...


Why do you find it odd? I find it to be the very best introduction to C-style control structures.

Chapter 2 of The AWK Programming Language has incredible benefits for a novice.

https://archive.org/download/pdfy-MgN0H1joIoDVoIC7/The_AWK_P...


They probably refer to how its top level is

  <line-condition> { <code> }
That's unusual enough among programming languages to call it odd. Being able to do stuff like

  if (/some-pattern/) { ...
and have the regex be evaluated as a condition that implicitly matches against the current line is also pretty unique.
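
For example (an untested sketch; "logfile" is just a placeholder name), these two are equivalent, the first using the implicit per-line condition and the second an explicit if inside an action:

    awk '/error/ { print }' logfile
    awk '{ if (/error/) print }' logfile
And since print is the default action, the first can be shortened further to awk '/error/' logfile.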


  if (/some-pattern/) { ...
This isn't really unique when you consider perl.

  while (<>) {
    if (/pattern/) {
This does the same. Awk simply has the implicit loop.


Yes, I think Perl based that on Awk, but those are the only two languages I know of that support something like that, which is still very unusual. As for the implicit loop, along with Ruby they're the only three languages I know of that support it. That's also pretty rare, and Awk is the only one where the implicit loop is always on rather than opt-in.


The implicit loop is uncommon among general-purpose languages, but very common among filtering-centric languages like awk and grep and sed. For a more recent example, see jq. perl 5 may be close to the only mainstream general-purpose language to have embraced that paradigm, though.


Articles like the OP's about removing duplicate lines make me want to learn awk.

But it feels like it would be a net loss based on how seldom I currently need to write one-off scripts.

Based on experience, I'd probably have a perfect use-case for it every 2-3 years.


Or: once you know you can achieve some tasks with one of these tools, you might suddenly realize that more tasks than you imagined can be solved with them.


Do you consider learning a tool you won't use a waste of time?

IMHO knowing what kinds of tools exist and how they are used for different tasks is enormously useful. Most software projects require me to create a set of tools to solve problems in a certain space efficiently. In any long-living non-trivial project there will be feature requests you couldn't have anticipated in the beginning. They tend to be painful if your program is just a bunch of features hacked together. But if you take a tools-first approach, the unexpected features can often be solved with what you have.

Of course, time is limited and you can't learn everything. But learning one of every different kind of tool is a very good use of time.

EDIT: Note that I'm not claiming you'll build a web app with awk. I'm saying you might write code that can be used similarly to awk in some abstract sense, and that might be a core part of a web app.


Even if you do, it's a better investment to learn a general high-level language with strong scripting capabilities that is also good at many other things.

Sure, on the days you need awk, you'll take 15 minutes instead of 2 to write your script. So what?

But the rest of the year, you'll have a more versatile toolbox at your disposal for automating things, testing, prototyping network processes, making quick web sites or APIs, and exploring data sets.

That being said, I can see the point of learning awk because, well, it's fun.


> Even if you do, it's a better investment to learn a general high-level language with strong scripting capabilities that is also good at many other things.

And we already have that: it's called perl. :)


I tried very hard not to name a specific language so that the point doesn't get derailed by some language war.


Are there any pipeline tools for command line stream processing? Because when you have several terabytes of data you can't exactly afford to restart due to a stray comma in your CSV file.


If you have a stray comma in your multi-TB CSV file, you probably don't _want_ it to keep going. You risk misinterpreting the mistake and producing grossly malformed output... There's no way to reliably and elegantly recover from something like that. Validation should preferably happen before processing.


And here is the ungolfed version:

    awk '{ if (! visited[$0]) { print $0; visited[$0] = 1 } }'


Yes. Examples like the one linked make it appear that awk is a write-only language, which it really isn't.

'awk' is really a very beautiful little language. It's concise enough to solve many tasks in a single line, making it easy to use interactively while still being able to grow to moderately-sized scripts. It's not supposed to replace a full-blown scripting language like Python, but for processing files line-by-line it's superb.


Or maybe better:

    awk '! visited[$0] { print $0; visited[$0] = 1 }'


yet another version:

    awk '!($0 in seen); {seen[$0]}'


Much more readable and understandable.


You can always write AWK in a file and read the script with -f, making it fully readable (and AWK is quite a pleasantly readable and surprisingly versatile language to write at that point).
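
For instance, a sketch of the dedupe one-liner as a standalone script (the file names here are made up):

    # dedupe.awk: print each input line only the first time it is seen
    !visited[$0] {
        print
        visited[$0] = 1
    }
which you'd run as awk -f dedupe.awk input.txt.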


To add to this, if you've coded the one-liner first, you can convert it to a script using the -o option.

For example:

    awk -o '{ORS = NR%2 ? " " : RS} 1'
gives (the default output file is awkprof.out):

    {
        ORS = (NR % 2 ? " " : RS)
    }

    1 {
        print $0
    }


I wasn't aware of this; it might come in handy for "one-liner edge cases".


That I did not know, great tip, thank you!


You need "gawk -M" for bignum support so that visited[$0]++ doesn't wrap back to zero; otherwise it is not correct for huge files with huge numbers of duplicates.

The portable one-liner that doesn't suffer from integer wraparound is actually

   awk '!($0 in seen) { seen[$0]; print }'
which can be golfed a bit:

   awk '!($0 in s); s[$0]'
$0 in s tests whether the line exists in the s[] assoc array. We negate that, so we print if it doesn't exist.

Then we unconditionally execute s[$0]. This has an undefined value that behaves like Boolean false. In awk if we mention an array location, it materializes, so this has the effect that "$0 in s" is now true, though s[$0] continues to have an undefined value.


> huge files with huge numbers of duplicates

At least on the stock macOS awk, you can get up to 2^53 before arithmetic breaks (it doesn't wrap, it just doesn't go up any more, which means the one-liner still works).

    > echo '2^53-1' | bc
    9007199254740991
    > seq 1 10 | awk 'BEGIN{a[123]=9007199254740991;b=a[123]}{a[123]++}END{print a[123],b,a[123]-b}'
    9007199254740992 9007199254740991 1
Even with one character per line, you'd need an 18PB file before you got to this limit, afaict.


On a previous post people were complaining that math wasn't as clear as code. I'd argue that this is exactly the kind of code-like clarity math notation provides you. It makes perfect sense, but only after two full pages describing what's going on in one line.


Where I run into trouble with awk is gawk incompatibilities with the implementation on Mac. The gawk manual really sucks at telling you what exactly is an extension to the language, and I haven't been able to find a good source -- you just have to either guess and check, or cross-check against other ones' manuals (like BSD). Otherwise it's an amazing tool...


I'd suggest just installing `gawk` from Homebrew and using it instead of having to guess which sort of crippled `awk` you're lucky to be using.

The same thing stands for GNU coreutils (`brew install coreutils`); here's macOS `cut` vs GNU `cut` as a quick example.

    ~ $ cut --help
    cut: illegal option -- -
    usage: cut -b list [-n] [file ...]
           cut -c list [file ...]
           cut -f list [-s] [-d delim] [file ...]    

    ~ $ gcut --help
    Usage: gcut OPTION... [FILE]...
    Print selected parts of lines from each FILE to standard output.    

    With no FILE, or when FILE is -, read standard input.    

    Mandatory arguments to long options are mandatory for short options too.
      -b, --bytes=LIST        select only these bytes
      -c, --characters=LIST   select only these characters
      -d, --delimiter=DELIM   use DELIM instead of TAB for field delimiter
      -f, --fields=LIST       select only these fields...
      -n                      (ignored)
          --complement        complement the set of selected bytes, characters or fields
      -s, --only-delimited    do not print lines not containing delimiters
          --output-delimiter=STRING  use STRING as the output delimiter
                                the default is to use the input delimiter
      -z, --zero-terminated    line delimiter is NUL, not newline
          --help     display this help and exit
          --version  output version information and exit


Looking at the Mac OS X source code https://opensource.apple.com/source/awk/awk-24/src/ its version of awk is the “one true awk” maintained by Brian Kernighan. For some reason Apple have deleted the README but you can find a copy at https://svnweb.freebsd.org/base/head/contrib/one-true-awk/


You should be looking at the POSIX specification, and assuming anything GNU awk documents on top of that is an extension: http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk...


Honestly, that page isn't great at showing up in search results, but in any case, POSIX can be too restrictive. I don't specifically recall an example for awk, but when every implementation supports something, it's a de-facto standard. If you follow POSIX alone, you end up over-restricted compared to any shell you'll actually encounter. It'd be really nice if there were a page that showed the actual common features between various implementations of POSIX tools, not just the official POSIX features...


> that page isn't great at showing up in search results

I have twenty years of experience in getting that page to show up in search results. :)

Currently, a good way to get to it is these search terms:

  posix issue 7
that actually takes us to the newer version; the above is issue 6.


> I have twenty years of experience in getting that page to show up in search results. :)

Haha! I love that you acknowledge this because usually people just ignore all the experience they have in getting to the right page and make it look like you're dumb for not being able to find it. Thanks for the pointer! :-)


Before POSIX merged with the Single Unix Specification, I used to search using "single unix spec version 2" type queries; and that one brings us back to 1997:

http://pubs.opengroup.org/onlinepubs/7908799/

Ha, that didn't use frames yet! Totally forgot about that.


It's the first search result for "awk posix" for me and the fourth for "awk standard".

But yes, sometimes POSIX or the C standard or whatever is too restrictive, but it's still a good starting point for figuring out how to write portable code.

You'll know anything it doesn't cover is implementation-specific, and then either decide it's not worth it to pursue it, or if it is figure out whether the implementations you're targeting support the feature.


I thought the gawk book/documentation [1] did a good job of mentioning differences between various implementations, do you have an example?

You might find this [2] helpful (oops, seems like it got deleted, see [3] - thanks @bionoid)

[1] https://www.gnu.org/software/gawk/manual/gawk.html

[2] https://www.reddit.com/r/awk/comments/4omosp/differences_bet...

[3] https://archive.is/btGky


> do you have an example?

Sure, try this:

  echo 1 2 | awk '{ print gensub(/1/, "3", "g", $1); }'
The logical thing for them to do would be to mention in bold and/or big and/or red font under gensub's documentation that it's an extension (e.g. try nawk), whereas looking through it I don't see any mention at all: https://www.gnu.org/software/gawk/manual/html_node/String-Fu...

If I may rant about this for a bit, GNU software manuals are generally rather awful (though they're neither alone in this nor is it impossible to find exceptions). They frequently make absolutely zero effort to display important information more prominently and unimportant information less so (if you're even lucky enough that they tell you the important information in the first place). Like if passing --food will accidentally blow up a nuke in your hometown, you can expect that if they documented it at all, they just casually buried it in the middle of some random paragraph. Their operating assumption seems to be that if you can't be bothered to spend the next 4 hours reading a novel before writing your one-liner then it's just obviously your fault for sucking so much.


While I agree it should be more obvious, it does say in the opening section:

> Those functions that are specific to gawk are marked with a pound sign (‘#’). They are not available in compatibility mode (see section Command-Line Options)


Oh dear lord. I've looked at that page probably twenty times in the past year and still not seen the note about that pound sign. Thanks for pointing it out. Man it's infuriating.



I was there as well, and eventually decided to just always `brew install gawk` and alias `awk` to `gawk`, because more often than not I want to rely on gawk extensions (gawk has includes, for instance!).

To be fair though, every time I have read the manual for a gawk function it clearly says "this is a gawk extension" for non-standard implementations (case in point, delete[1], which is now POSIX, although Mac's awk is too old to have that implemented).

[1] https://www.gnu.org/software/gawk/manual/html_node/Delete.ht...


brew install gawk


Yes, but I don't control every computer my code runs on...


This is a nice little run through of a real life example.

Once you realise that awk's model is 'match pattern { do actions; }' everything makes a whole lot more sense.


Awk also supports BEGIN and END actions that take place at the start and at the end of execution.

BEGIN might be used to initialise Awk variables or print initial messages, while END can be used to print a summary of actions at the end.

See [1] https://www.grymoire.com/Unix/Awk.html#uh-1
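
A small sketch (the file name and field layout are made up) that uses all three kinds of blocks, summing the second column:

    awk 'BEGIN { print "summing column 2" }
         { total += $2 }
         END { print "total:", total }' data.txt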


You can also use either BEGIN or END as the only entrypoints, essentially using AWK as you'd use any other programming language. Yes, sometimes this defeats the point of what AWK excels at, but it's good to know.


There was a twitter thread ages ago where someone had written a collection of (php?) utilities - and the twitterer posted a laughing slap down saying why write a utility when this one liner and that one liner will do?

There was a lot of pushback, and this article is a good example of why.

If I wanted to remove duplicate lines from a file I would almost certainly not use awk

I have never spent the time to get good enough with the whole new and different language of awk, and am unlikely to need to (my large-scale file processing needs seem small, and when they do come up it's almost always in the context of other processing chains, so a normal language like Python would be the natural choice).

I could whip up something like this in Python in less time than it would take to google the answer, read up on why the syntax works that way, and verify I have not mistyped anything on a few test files.

Basically, using awk takes me out of my comfort zone: for a one-off task it loses me time, and for a production-like repeat task I am going to reach for a slew of other solutions.

I mean the title of this page loses the exclamation mark - and it took me two goes to spot it.


For many years my AWK knowledge was limited to basic '{print $1}' style usage. I never bothered to learn more. I tended to use perl when I needed a custom text-processing operation. Later, as perl became less a part of my working life, I began using ruby instead - they are pretty similar in spirit.

One day, nearly 20 years after it was published, I picked up a used copy of The AWK Programming Language by Aho, Kernighan, and Weinberger. Yes, they are credited in that order on the cover... I suspect intentionally. I only read the first N chapters, but it was enough. I used AWK many times within the following month, and I continue to use AWK on a daily basis. When the task is complicated, I will still use ruby, but often enough AWK is easier.

The point: you think "Why would I learn X when I can use Y?", but you won't really know the answer until you learn X. If I had never learned perl, python, ruby, AWK, shell script, vi macros, then I would probably be editing files by hand (!) like I sometimes catch developers actually doing (!!!). For a person who doesn't know these tools, that might actually be the path of least resistance. Investing some time here and there to learn new tools pays off in the future in ways that are unpredictable.


The basic imperative Python version is much easier to remember and read though, even for not-that-experienced Python programmers. I would expect laypeople to be able to more-or-less figure out what it is supposed to do.

  seen = set()
  with open(filename, "r") as file:
    for line in file:
      if line not in seen:
        print(line, end="")  # the line already includes its trailing newline
        seen.add(line)
Often (at least in my experience) this kind of operation is either (a) part of some larger automated data processing pipeline for which it’s really nice to have version control, tests, ... or (b) part of some interactive data exploration by a programmer sitting at a repl somewhere, not just a one-off action we want to apply to one file from the command line.

In those contexts, the Python (or Ruby or Clojure or whatever general-purpose programming language) version is easy to type out more-or-less bug-free from memory, debug when it fails, slot into the rest of the project, modify as part of a team with varied experience, etc. etc.


One advantage is that

  seen.add(line)
can be changed to

  seen.add(hash(line))
which can be significantly more memory efficient for files with long lines (the membership test then needs to be `hash(line) not in seen` as well).


Or perhaps better: if needs change, the seen = set() object can be swapped out for any alternative object seen = foo that provides foo.__contains__ and foo.add methods.

This could involve saving previously seen lines in a radix tree, adding multiple layers of caching, saving infrequently seen lines to disk or over the network, etc. as appropriate for the use case.


I don't get your point; it seems like you do not often use CLI text-processing tools.

Just like Python, the CLI has its users, people who are comfortable using grep/sed/awk/sort/etc.


The point is that the people who do use such tools tend to have a derisive attitude towards those who don't, and that the derisiveness is completely unwarranted.


It's very likely that those people know both Python and awk, and thus their attitude of superiority is not unwarranted.

It's much faster to type out the awk line than to write the same in Python.


> It's very likely that those people know both Python and awk, and thus their attitude of superiority is not unwarranted.

Ok....

> It's much faster to type out the awk line than to write the same in Python.

Is there some sort of speed-typing award that's being handed out that I'm missing? If there isn't, why would they feel superior?

We're all† smug pricks, but that's no cause for celebration. And 99% of the time we're not even justified in our smugness.

† All = a huge chunk of IT people, developers especially.


No, if you're a dick you're still a dick.

I know both Python and Awk. Do I go around telling people "stop using your preferred tool, even though it's efficient enough and works fine, use this other esoteric one instead"? Hell no.


And what if instead of "stop using your preferred tool" the person says "there's this other tool I use and I find it makes these sorts of tasks easier; You might like it, too."


That's different, and obviously not a case of the "derisive attitude" I pointed out. Just because you don't do it, doesn't mean people don't do it.


To me, this sounds like coming up with an ad hominem argument to rationalize not learning something that one finds different and challenging. For the record, I do not know awk, but that's because I just haven't taken the time to learn it yet, not because I (ironically) believe it's only for people who think they're better than me.


Perhaps what I am trying to say is

One-off one-liners dashed off without syntax errors speak of long and deep usage of a command-line tool. That's cool. But continuing to use those one-liners worries me, for reasons that have nothing to do with skill.

I would worry about the balance of manual versus automated work here. I can think of many cases where a sed/awk solution will work really well, but they will almost always be part of a larger developed and supported pipeline.

But if you're using the one-liner for anything non-trivial, you are still doing too much manual work.

Trying to be even shorter: if awk is your tool, great! But... at some point (and that point is much closer today than previously) anything we do needs a suite of tools we have hacked together and rewritten and passed around, from log file analysis to whatever.

And while awk can absolutely play a role in those tools, I doubt very much that anyone is good enough to make the one-liners on the fly.

A quick example might be "show me all the logs for the request sent by user X in the last five minutes off the front web servers, but ignore the heartbeat from that app marketing put out and ..."

I want that in my path, alongside everything else I and others working on the systems think useful.

Yes, hack together your tools with any language you like. Put them in a separate repo with all the linting turned off.

But don't try to one-liner them from scratch.


I agree. My perception is that tools like awk are best used for one-off tasks, whereas anything that is part of a greater pipeline should be written in a more readable/maintainable language.

I was just pushing back against the sentiment that awk is undesirable because of attitudes that its users may have, which I don't think you were expressing :)


No? As I said elsewhere, I know and use Awk. I just don't have an attitude about it.


>The point is that the people who do use such tools tend to have a derisive attitude towards those who don't,

This has not been my experience.


I don’t think you need to learn every cli tool in depth. I don’t know awk, but I recognize its power. However, I’ve got a set of tools I find easier to use, a few that I’ve written, and Ruby or Python at the command line. I’m sure awk could replace many cli tools, but at the cost of learning a new language (and cognitive load with each use) it hasn’t been worth it.


Thank you for putting my multiple paragraphs into three sentences :-)


Got to love awk. My weapon of choice for ad hoc arbitrary text processing and data analysis. I’ve tried to replace it with more modern tools time and again but nothing else really comes close in that domain.


Completely agree. Even Python, which has a very low barrier to entry for "read file, possibly CSV, do something", still has a barrier to entry. Column projections are one-liners in AWK, and aggregates and/or some stats can be a couple of lines in an AWK script proper.
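
For instance, a quick mean over the third column of a CSV might look like this (a sketch; the file name and column are made up):

    awk -F, '{ sum += $3; n++ } END { if (n) print "rows:", n, "mean:", sum / n }' data.csv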

I've been replacing some ad-hoc bash scripts (nothing fancy, just a few if conditions and some formatting of outputs for a deployment) with some AWK, and it's so much handier to write (after 10 years I still can't remember bash's if syntax) and read (it's a proper programming language) than bash.

edit: wrong markdown style


Interestingly, Ruby (MRI anyway) has command-line options to make it act pretty similarly to awk:

-n adds an implicit "while gets ... end" loop. "-p" does the same but prints the contents of $_ at the end. "-e" lets you put an expression on the command line. "-F" specifies the field separator, like for awk. "-a" turns on auto-split mode when you use it with -n or -p, which basically adds an implicit "$F = $_.split" to the "while gets ... end" loop.

So "ruby -[p or n]a -F[some separator] -e '[expression gets run once every loop]'" is good for tasks that are suitable for "awk-like" processing but where you may need access to functionality other than what awk provides.
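
So a Ruby equivalent of the dedupe one-liner could look roughly like this (an untested sketch; "input.txt" is a placeholder):

    ruby -ne '$seen ||= {}; print $_ unless $seen[$_]; $seen[$_] = true' input.txt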


I'd say it is more similar to perl than awk for options like -F -l -a -n -e -0 etc. And perl borrowed stuff from sed, awk, etc

I have a collection for ruby one-liners too [1]

[1] https://github.com/learnbyexample/Command-line-text-processi...


Sed and Awk had a child and named her Perl. When Perl grew up she underwent an epigenetic shift and became Ruby!


You probably know it, but in case not, and for others who might not know: Ruby was influenced by Perl, and Perl was influenced by awk. (Both Ruby and Perl were influenced by other languages too.) And (relevant to this thread) Perl was influenced by C, sed, and Unix shell too.

https://en.wikipedia.org/wiki/Ruby_(programming_language)

https://en.wikipedia.org/wiki/Perl

See "Influenced by" sections at both above pages.


Thanks, I actually looked at trying to make ruby do my awk work a few years ago. I'll take a look again.


Seconded.

Just the other day I helped some colleagues clean up a text file using an awk one-liner. It seemed like magic to them.

(Even though the one-liner turned out to be a bit more difficult to write than I thought at first due to '\r' characters in the input file)


>(Even though the one-liner turned out to be a bit more difficult to write than I thought at first due to '\r' characters in the input file)

Although you solved it, another way is that one can always pipe the input through a filter like dos2unix first. It's very easy to write, and versions can be found/written for/in many languages. Essentially, you just have to read each character from stdin and write it to stdout unless it is a '\r', a.k.a. carriage return, a.k.a. ASCII character 13, in which case you don't write it.
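
If dos2unix isn't at hand, tr or awk can do the same job; a minimal sketch (the file names are made up):

    tr -d '\r' < dosfile.txt > unixfile.txt
    # or, stripping only a CR at the end of each line:
    awk '{ sub(/\r$/, "") } 1' dosfile.txt > unixfile.txt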

I've often found that beginners these days don't know what carriage return, line feed, etc. are, and their ASCII codes. Basic but important stuff for text processing.

https://www.google.com/search?q=ascii+table

https://en.wikipedia.org/wiki/Carriage_return


See also, 'nauniq' (non-adjacent uniq), an implementation of this text-processing task as a full utility, with some convenient options for reducing memory usage:

https://metacpan.org/pod/distribution/App-nauniq/script/naun...


I was wondering how awk works internally. Does it compile the script and then run it? Is it bytecode or lower level? A finite state machine?


They parse the script to a parse tree (abstract syntax tree) and then either interpret that directly (tree-walking interpreter) or compile to bytecode and then execute that. The original awk ("one true awk") uses a simple tree-walking interpreter, as does my own GoAWK implementation. gawk and mawk are slightly faster and compile to bytecode first.

If you're interested, you can read more about how GoAWK works and performs here: https://benhoyt.com/writings/goawk/


Did gawk change to a bytecode interpreter recently? As far as I can recall it used to be an AST walker; I wonder if I am misremembering things. Mawk used to be more than 'slightly' faster than gawk. Maybe the recent change to bytecode has brought their perf characteristics closer.


Nice work!


With GNU awk, scripts are compiled to bytecode then interpreted with a big switch/case loop.

edit: https://git.savannah.gnu.org/cgit/gawk.git/tree/interpret.h


The mawk version of the language will output C, if I remember correctly. It is the fastest AWK, and supports some of the GNU extensions.

"The One True AWK" from Brian Kernighan (that is still the system AWK in OpenBSD) switched from a yacc implementation to a custom parser sometime within the last decade (fairly recently).

Busybox also has an awk; I'm not sure what they do.

GNU awk is elsewhere reported to be interpreted bytecode.


Dennis Ritchie on the details of yacc history:

> In some ways the interesting thing is that the parser (probably for B, couldn't have been C based on radiocarbon dating evidence) was tiny and dead simple using recursive descent for most parts, a precedence table for expressions. But out of the intellectual culture-meets-culture encounter, an enduring tool was created.

https://yarchive.net/comp/handwritten_parse_tables.html


Looks like OpenBSD awk still uses yacc, like other copies of bwk's one true awk https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/awk/#d...


I have taken it as a rule to use awk only for trivial tasks and to switch to perl as soon as the syntax is slightly beyond my usual use cases. In perl, I would do:

    perl -nle 'print unless exists $h{$_};$h{$_}++' < your_file


There are many AWK implementations that are obscenely portable. A case in point is busybox AWK. Another might be http://unxutils.sourceforge.net

A complex AWK script is very, very easy to move to a platform that lacks any AWK parsers. It can easily be done on Windows, without administrative rights, by the placement of a single .EXE - Perl can do many things, but that is not one of them (to the best of my knowledge).


perl borrowed stuff from awk, so you could also do

    perl -ne 'print if !$seen{$_}++'


Play a round of perl golf?

    perl -pne '$_=$#$_++?$_:""'
I'm rusty at this but shaved off six chars, five if you count the 'p' added to switches.


I have never seen this $#$var trick and google is not the friend of perl operators. Do you have any explanation?

    perl -ne 'print if!++$#$_'
seems to work also


If you have array @foo in perl, $#foo is the index of the last element of the array, which is just the size of the array -1. So if @foo is undefined, $#foo is -1.

Using a variable instead of 'foo' is a symbolic reference, so this is effectively using the symbol table as the associative array. This means that this solution also gets it wrong if your file contains a line that matches the name of a built-in variable in perl. That would be tough to debug!

If your file contains

    This is the first line of the file
then during execution of

    ++$#$_
the result is the same as if you had written

    ++$#{This is the first line of the file}
So the variable @{This is the first line of the file} goes from undefined to an array of length 1, turning $#{This is the first line of the file} to 0.

Incidentally, this is why the snippet fails to work for a line repeated more than once: for each occurrence of the expression, the value returned is in the sequence -1, 0, 1, 2, 3, ... so it is only false for the second occurrence.

Using preincrement instead of postincrement means the values returned are 0, 1, 2, 3, ... which means that inverting the test makes it false for every occurrence after the first.


edit: this is failing if a line is repeated more than once, see @showdead's excellent explanation

the shortest I've got so far is

    perl -lnE'say if!++$#$_'
---

I don't understand what's happening with $#$_ but seems like something I should look into, thanks :)

You could remove the n since p is used, and it would be the same number of characters as

    perl -ne 'print if $#$_++'
You could save one more by removing the space between the e switch and the single quote.


Me: Wow, that associative array looks very powerful. Is there a way to leverage it to do something useful like convert curl-obtained JSON array of file patches from Github's API to the mbox format that `git am` expects?

Unix: No, that JSON data is too structured. But if you have a more error-prone format like CSV I can show you a neat trick to filter your bowlers by number of spares.


You have conflated the awk tool with an entire operating system. The operating system of course has many tools, and there are many more tools that one can further add to it, dealing with a range of file formats.


Well, there's jq, which is perfect for that job, but unfortunately it's not a standard utility, and I'm not sure it ever will be, because we have to stick only with stuff invented in the 90's.


Luckily for us, Unix is easy to enhance by adding new "skills" to it - a quick search brought me to https://ilya-sher.org/2018/04/10/list-of-json-tools-for-comm...


Can't you just process JSON data with jq?

https://stedolan.github.io/jq/


I'd still need to find a way to massage the data into the mbox format because that is the ancient format that git understands.

I'm not saying that there isn't a way to do that. Only that it can only be done poorly with a big ugly (and probably buggy) spaghetti script that looks nothing like what the expressive demo suggests it should look like.


If you're getting a bunch of stuff with curl from GitHub can't you just use curl to get the patches directly from github?

Append .patch to the end of a pr or commit and it spits out the mbox formatted patch.

https://github.com/jiphex/mbox/commit/f139c575e306a1691a31d8...


If you do that with curl it will redirect to the login page. Github obviously wants me to use their API.

I assume this wasn't always the case as the use case I'm referencing is a build script I'm debugging.


I just tried running curl on it now and it came back fine without redirecting me.

For a private repo you just need to set the correct options to curl.

curl -Lk --cookie "user_session=your_session_cookie_here" https://github.com/your_org/your_project/pull/123.patch


And what is the unix tool I use to automatically retrieve the session cookie and paste it there?


Naturally, you use curl. You just instruct curl to save the cookies that the login process sends back to you. Then you reuse those cookies on the subsequent request.

Found an example here.

https://gist.github.com/d48/3501047


The example link above works for me, interestingly enough.



Gron converts json into something processable.


The tradeoff of this solution is it stores all (unique) lines in memory. If you have a large file with a lot of unique lines you might prefer using `sort -u`, although it doesn’t keep the order.


Are you sure this is the case? Perhaps it's only storing the hash of the lines? If not then how do you do that?

IME this one-liner can churn through 100MB of log lines in a second. Other solutions like powershell's "select-object -unique" totally choke on the file.


The AWK program:

    awk '{a[$0]++}; END{for(b in a) print b, a[b]}'

...will print every unique line in a file with the count. Obviously, that could not be done if the array index was a hash - the array index is the entire line, and the array value is the count.

The original program moves the maintenance of the array into the implicit conditional "pattern," and only prints when the array entry does not yet exist.


It can’t just store the hash of the lines otherwise it would drop lines in case of hash collision.


It depends on the implementation, but typically hash tables are used to store the elements and values of associative arrays:

https://www.gnu.org/software/gawk/manual/html_node/Array-Int...

I suspect that it's designed so that hash collisions are impossible until you get to an unrealistic number of characters per line.


I doubt it's designed to silently break in some cases. Unrealistic isn't realistic until one day it is and that is a bad day. I suppose it could just throw an error in the case of a hash collision, but I doubt it.


But what does it do, then? The page I linked states that it uses a hash table. Hash tables apply a hash function to the key. Hash functions map arbitrary input data onto data of a fixed size. It's inevitable that collisions will occur. ~~Even if you use some sort of clever workaround in the case of collisions, eventually you use up all the available outputs.~~ (my bad)

I'm not claiming that it will silently break! I'd be very interested in exploring the internals a little more and finding out how hard it is to get a collision in various implementations and how they behave subsequently.

EDIT: I've read chasil's comment and agree that it must be storing raw keys in the array. I guess awk uses separate chaining or something to get around hash collisions.


Doesn't sort require keeping lines in memory as well? In fact, doesn't sort keep all lines in memory, whereas this awk solution just keeps the unique lines?


No, most implementations of sort use a bounded memory buffer and an external-memory algorithm, spilling to disk.


I don't believe that - how does it spill to disk? TMP directory? You can still sort if you don't have any disk space (until you run out of swap?) from what I recall.


Your recollection is either limited in what implementations you have encountered, or faulty.

* https://unix.stackexchange.com/a/450900/5132


The man page for sort has parameters to adjust the location of the temporary files.

These are only used when the allowed memory buffer is exhausted.
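
With GNU sort both are tunable; for example (a sketch, the sizes and paths are made up):

    sort -u -S 512M -T /var/tmp big_file.txt > sorted_unique.txt
where -S sets the in-memory buffer size and -T the directory for temporary spill files.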


Even if sort overcomes the memory limitation by using disk space, it's O(n log n), whereas de-duplication through a hash is O(n).

You might be better off switching to another scripting language that has some database API for storing key/value pairs on disk.

Or: use a 64 bit machine for a bigger address space, and add temporary swap files so you have more virtual memory.


> Or: use a 64 bit machine for a bigger address space

There are people who are still on 32-bit hardware for serious work?

(Even if I was still on that, I'd probably just fire up a RV64 virtual machine (with swap space added within the VM, of course) simply to access the convenience of a larger address space when needed.)


For values of serious work. Dedicated embedded devices may still run 32 bit CPUs. Much SOHO network kit has ridiculously small storage and memory (8 MB flash, 64 MB RAM), making memory-intensive operations, such as, say, deduplicating a 250,000 item spam and malware domains blocklist, challenging.

These also have awk, almost always via Busybox, though using OpenWRT other versions are installable.

    BusyBox v1.28.4 () built-in shell (ash)

      _______                     ________        __
     |       |.-----.-----.-----.|  |  |  |.----.|  |_
     |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
     |_______||   __|_____|__|__||________||__|  |____|
              |__| W I R E L E S S   F R E E D O M
     -----------------------------------------------------
     OpenWrt 18.06.2, r7676-cddd7b4c77
     -----------------------------------------------------
    root@modem:~# uname -a
    Linux modem 4.9.152 #0 SMP Wed Jan 30 12:21:02 2019 mips GNU/Linux
    root@modem:~# free
   total       used       free     shared    buffers     cached
    Mem:         59136      38484      20652       1312       2520      13096
    -/+ buffers/cache:      22868      36268
    Swap:            0          0          0
    root@modem:~# df
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/root                 2560      2560         0 100% /rom
    tmpfs                    29568      1268     28300   4% /tmp
    tmpfs                    29568        44     29524   0% /tmp/root
    tmpfs                      512         0       512   0% /dev
    /dev/mtdblock5            3520      1772      1748  50% /overlay
    overlayfs:/overlay        3520      1772      1748  50% /
    root@modem:~# which awk
    /usr/bin/awk
    root@modem:~# ls -l `which awk`
    lrwxrwxrwx    1 root     root            17 Jan 30 12:21 /usr/bin/awk -> ../../bin/busybox
    root@modem:~# opkg list | grep awk
    gawk - 4.2.0-2 - GNU awk
    root@modem:~#
... though in this case, adblock runs on a larger and more capable Turris Omnia (8 GB flash, 2 GB RAM). The hourly sort still shows up on system load average plots.


Are you going to sort huge files on a router?

Will the BusyBox version of sort use files when there isn't enough RAM? Will you have a big enough read/write flash partition for that?


The Turris does just fine. It's equivalent, mostly, to a mid-oughts desktop.

The flexibility of keeping the adblock processing self-contained, rather than processing this on another box and rigging an update mechanism, is appealing.

The "huge" file is 6MB. That's not immense, but it taxes (overly constrained, IMO) typical SOHO router resources.

The flexibility afforded, for pennies to a few dollars, of, say, > 1 GB storage and 500 MB RAM, is tremendous.

I'm not sure what Busybox's sort does, though on an earlier iteration on a mid-oughts Linksys WRT54g router running dd-wrt, sorting was infeasible. I hadn't tried the awk trick.

It did have a Busybox awk though, which proved useful.


Sorting is O(n log n), but you still have to make a second pass at the end to remove duplicates, and that pass is O(n), isn't it?


This is mentioned in the article, together with a method to keep the order by decorating with the line number before sorting.
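
Roughly, the decorate/sort/undecorate idea looks like this (a sketch; which occurrence of a duplicate survives can depend on the sort implementation's tie-breaking):

    cat -n file.txt | sort -k2 -u | sort -n | cut -f2-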


The sort/uniq tradeoff is mentioned but not the awk one.


Awk ' visited[$0]++' or how badly HN's automatic title formatter messes up awk commands :)

Hint: You can edit it after you posted.


Done, thanks for the hint :)


You are !done yet.


I think you meant !done[$bug]++ ?


I just saw the missing '!' and now I can't edit the title


aw k


The man page of (n)awk [0][1] is surprisingly short and readable.

[0] `man awk` on mac

[1] online version https://www.mankier.com/1/nawk

[2] gawk's man page works great as a reference https://www.mankier.com/1/gawk


If I had known this, I wouldn't have made https://github.com/beenotung/uniqcp


So then, what is the one liner to preserve the filename rather than get a new deduped.txt?

Also how do you apply that command to the next file using shell history?


Use sponge! https://linux.die.net/man/1/sponge

  awk '!n[$0]++' fileName | sponge fileName


    dedupe.awk <file >file.tmp && mv file.tmp file
Multi-file versions vary; I'd prefer listing them out. Alternatively you could read from a command's output (ls, find, etc.) with a 'while read; do ... done' loop:

    for f in file1 file2 file3
    do
        dedupe.awk <$f > ${f}.tmp && mv ${f}.tmp $f
    done
If you want to apply to specific files on an ad hoc basis, you could wrap the whole thing in a shell function with filename or list as a parameter.
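
A rough sketch of such a wrapper (untested, names made up):

    dedupe_in_place() {
        for f in "$@"; do
            tmp=$(mktemp) || return 1
            awk '!visited[$0]++' "$f" > "$tmp" && mv "$tmp" "$f"
        done
    }

    dedupe_in_place file1 file2 file3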

Or 'gawk -i' as suggested.

Properly using tempfile would also be an improvement.


Or using gawk, gawk -i inplace 'foo {bar}' file


open the file with emacs

menu edit -> select all

press 'escape' then 'x' (or Alt+x) and type: delete-duplicate-lines

done


This is very memory intensive, which may not matter if the data volume is small enough. But it is also a bit hard to understand, at least not so obvious at first sight. For most use cases sort -u would be ideal and way simpler to understand, if you don't mind having a sorted file as output.


> if you don't mind having a sorted file as output.

I needed to remove duplicates from a sequenced CSV file yesterday but couldn't figure out the flags for "remove duplicates, output sorted by field 1 asciily, 2 numerically, 3 numerically".

The AWK version worked perfectly.
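
For the record, GNU sort can express something close to that (an untested sketch, assuming comma-separated fields; note that -u combined with -k dedupes on the sort keys rather than whole lines):

    sort -t, -u -k1,1 -k2,2n -k3,3n file.csv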


The awk version does not order it. If your output was ordered the way you wanted, it was because the input already was.


> This is very memory intensive.

Only for ones not familiar with awk.

It would make a lot of sense after you understand how awk works (as the article explains).


That does not make any sense. Whether it is memory intensive depends on awk, not on the person being familiar with it.

So is it memory intensive or not?


The example AWK script will build an array of every unique line of text in the file.

If the file is large, and mostly unique, then assume that a substantial portion of the file will be loaded into memory.

If this is larger than the amount of ram, then portions of the active array will be paged to the swap space, then will thrash the drive as each new line is read forcing a complete rescan of the array.

This is very handy for files that fit in available ram (and zram may help greatly), but it does not scale.


I don't know how awk (or this particular implementation) works, but it could be done such that comparing lines is only necessary when there is a hash collision, and also, finding all prior lines having a given hash need not require a complete rescan of the set of prior lines - e.g. for each hash, keep a list of the offsets of each corresponding prior line. Furthermore, if that 'list' is an array sorted by the lines' text, then whenever you find the current line is unique, you also know where in the array to insert its offset to keep that array sorted - or use a trie or suffix tree.


Sure, you only need to compare when there's a hash collision, but you still need to keep all the lines in memory for later comparison.


Sure (though they could be in a compressed form, such as a suffix tree), but that wasn't the issue I was addressing.


AWK was the first "scripting" language to implement associative arrays, which they claim they took from SNOBOL4.

Since then, perl and php have also implemented associative arrays. All three can loop over the text index of such an array and produce the original value, which a (bijective) hash cannot do.


I think they're talking about your machine's memory, not human memory.


`sort` would also be memory intensive would it not?


No, sort will use intermediate temporary files instead of exhausting your ram.





