Yes, I think Perl based that on Awk, but those are the only two languages I know that support something like that, which is still quite unusual. As for the implicit loop: along with Ruby, they're the only three languages I know that support it. That's also pretty rare, and Awk is the only one where the implicit loop is a requirement.
The implicit loop is uncommon among general-purpose languages, but very common among filtering-centric languages like awk and grep and sed. For a more recent example, see jq. perl 5 may be close to the only mainstream general-purpose language to have embraced that paradigm, though.
Do you consider learning a tool you won't use a waste of time?
IMHO knowing what kinds of tools exist and how they are used for different tasks is enormously useful. Most software projects require me to create a set of tools to solve problems in a certain space efficiently. In any long-lived, non-trivial project there will be feature requests you couldn't have anticipated at the beginning. They tend to be painful if your program is just a bunch of features hacked together. But if you take a tools-first approach, the unexpected features can often be solved with what you have.
Of course, time is limited and you can't learn everything. But learning one of every different kind of tool is a very good use of time.
EDIT: Note that I'm not claiming you'll build a web app with awk. I'm saying you might write code that can be used similarly to awk in some abstract sense, and that might be a core part of a web app.
Even if you do, it's a better investment to learn a general high-level language with strong scripting capabilities that is also good at many other things.
Sure, on the days you need awk, you'll spend 15 minutes instead of 2 writing your script. So what?
But the rest of the year, you'll have a more versatile toolbox at your disposal for automating things, testing, prototyping network processes, making quick web sites or APIs, and exploring data sets.
That being said, I can see the point of learning awk because, well, it's fun.
> Even if you do, it's a better investment to learn a general high-level language with strong scripting capabilities that is also good at many other things.
Are there any pipeline tools for command line stream processing? Because when you have several terabytes of data you can't exactly afford to restart due to a stray comma in your CSV file.
If you have a stray comma in your multi-TB CSV file, you probably don't _want_ it to keep going. You risk misinterpreting the mistake and having a grossly malformed output... There's no way to reliably and elegantly recover from something like that. Validation should preferably happen before processing.
Yes. Examples like the one linked make it appear that awk is write-only language, which it really isn't.
'awk' is really a very beautiful little language. It's concise enough to solve many tasks in a single line, making it easy to use interactively while still being able to grow to moderately-sized scripts. It's not supposed to replace a full-blown scripting language like Python, but for processing files line-by-line it's superb.
You can always write AWK in a file and read the script with -f, making it fully readable (and AWK is quite a pleasantly readable and surprisingly versatile language to write at that point)
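For example (a minimal sketch; `dedupe.awk` and `input.txt` are hypothetical names), the dedupe one-liner as a commented script file:

# dedupe.awk: print each line only the first time it is seen
!($0 in seen) { seen[$0]; print }

which you'd run with:

awk -f dedupe.awk input.txt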
You need "gawk -M" for this for bignum support, so visited[$0]++ doesn't wrap back to zero, otherwise it is not correct for huge files with huge numbers of duplicates.
The portable one-liner that doesn't suffer from integer wraparound is actually
awk '!($0 in seen) { seen[$0]; print }'
which can be golfed a bit:
awk '!($0 in s); s[$0]'
$0 in s tests whether the line exists in the s[] assoc array. We negate that, so we print if it doesn't exist.
Then we unconditionally execute s[$0]. This has an undefined value that behaves like Boolean false. In awk if we mention an array location, it materializes, so this has the effect that "$0 in s" is now true, though s[$0] continues to have an undefined value.
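A quick demo of the behaviour described above:

$ printf 'a\nb\na\nc\nb\n' | awk '!($0 in s); s[$0]'
a
b
c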
At least on the stock MacOS awk, you can get up to 2^53 before arithmetic breaks (it doesn't wrap, it just doesn't go up any more, which means the one-liner still works).
On a previous post people were complaining that math wasn’t as clear as code. I’d argue that this is exactly the kind of code-like clarity math notation provides you. It makes perfect sense, but only after two full pages describing what’s going on in one line.
Where I run into trouble with awk is gawk incompatibilities with the implementation on Mac. The gawk manual really sucks at telling you what exactly is an extension to the language, and I haven't been able to find a good source -- you just have to either guess and check, or cross-check against other ones' manuals (like BSD). Otherwise it's an amazing tool...
I'd suggest just installing `gawk` from Homebrew and using it instead of having to guess which sort of crippled `awk` you're lucky to be using.
The same thing stands for GNU coreutils (`brew install coreutils`); here's macOS `cut` vs GNU `cut` as a quick example.
~ $ cut --help
cut: illegal option -- -
usage: cut -b list [-n] [file ...]
       cut -c list [file ...]
       cut -f list [-s] [-d delim] [file ...]
~ $ gcut --help
Usage: gcut OPTION... [FILE]...
Print selected parts of lines from each FILE to standard output.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
  -b, --bytes=LIST        select only these bytes
  -c, --characters=LIST   select only these characters
  -d, --delimiter=DELIM   use DELIM instead of TAB for field delimiter
  -f, --fields=LIST       select only these fields...
  -n                      (ignored)
      --complement        complement the set of selected bytes, characters or fields
  -s, --only-delimited    do not print lines not containing delimiters
      --output-delimiter=STRING  use STRING as the output delimiter
                            the default is to use the input delimiter
  -z, --zero-terminated   line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit
Honestly that page isn't great at showing up in search results, but in any case -- POSIX can be too restrictive. I don't specifically recall an example for awk, but when every implementation supports something then it's a de-facto standard. If you follow POSIX then you become over-restricted compared to any shell you'll actually encounter. It'd be really nice if there was a page that showed the actual common features between various implementations of POSIX tools, not just the POSIX official features...
> I have twenty years of experience in getting that page to show up in search results. :)
Haha! I love that you acknowledge this because usually people just ignore all the experience they have in getting to the right page and make it look like you're dumb for not being able to find it. Thanks for the pointer! :-)
Before POSIX merged with the Single Unix Specification, I used to search using "single unix spec version 2" type queries; and that one brings us back to 1997:
It's the first search result for "awk posix" for me and the fourth for "awk standard".
Yes, sometimes POSIX or the C standard or whatever is too restrictive, but it's still a good starting point for figuring out how to write portable code.
You'll know that anything it doesn't cover is implementation-specific, and can then either decide it's not worth pursuing, or, if it is, figure out whether the implementations you're targeting support the feature.
The logical thing for them to do would be to mention in bold and/or big and/or red font under gensub's documentation that it's an extension (e.g. try nawk), whereas looking through it I don't see any mention at all: https://www.gnu.org/software/gawk/manual/html_node/String-Fu...
If I may rant about this for a bit, GNU software manuals are generally rather awful (though they're neither alone in this nor is it impossible to find exceptions). They frequently make absolutely zero effort to display important information more prominently and unimportant information less so (if you're even lucky enough that they tell you the important information in the first place). Like if passing --food will accidentally blow up a nuke in your hometown, you can expect that if they documented it at all, they just casually buried it in the middle of some random paragraph. Their operating assumption seems to be that if you can't be bothered to spend the next 4 hours reading a novel before writing your one-liner then it's just obviously your fault for sucking so much.
While I agree it should be more obvious, it does say in the opening section:
> Those functions that are specific to gawk are marked with a pound sign (‘#’). They are not available in compatibility mode (see section Command-Line Options)
Oh dear lord. I've looked at that page probably twenty times in the past year and still not seen the note about that pound sign. Thanks for pointing it out. Man it's infuriating.
I was there as well, and eventually decided to just always `brew install gawk` and alias `awk` to `gawk`, because more often than not I want to rely on gawk extensions (gawk has includes, for instance!).
To be fair though, every time I have read the manual for a gawk function it clearly says "this is a gawk extension" for non-standard implementations (case in point, delete[1], which is now POSIX, although Mac's awk is too old to have that implemented).
You can also use either BEGIN or END as the only entrypoints, essentially using AWK as you'd use any other programming language. Yes, sometimes this defeats the point of what AWK excels at, but it's good to know.
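For instance, a throwaway calculation that never touches any input (a minimal sketch); with only a BEGIN block, awk exits without reading stdin:

awk 'BEGIN { for (i = 1; i <= 5; i++) sum += i * i; print "sum of squares:", sum }'   # prints 55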
There was a twitter thread ages ago where someone had written a collection of (php?) utilities - and the twitterer posted a laughing slap down saying why write a utility when this one liner and that one liner will do?
There was a lot of push back - and this article is a good example of why
If I wanted to remove duplicate lines from a file I would almost certainly not use awk
I have never spent the time to get good enough with the whole new and different language of awk, and am unlikely to need to (my large-scale file processing needs seem small, and when I do have them it's almost always in the context of other processing chains), so a normal language like python would be the natural choice.
I could whip up something like this in python in less time than it would take to google the answer, read up on why the syntax works that way, and verify I have not mistyped anything on a few test files.
Basically using awk takes me out of my comfort zone - for a one off task it loses me time, for a production like repeat task I am going to reach for a slew of other solutions.
I mean the title of this page loses the exclamation mark - and it took me two goes to spot it.
For many years my AWK knowledge was limited to basic '{print $1}' style usage. I never bothered to learn more. I tended to use perl when I needed a custom text-processing operation. Later, as perl became less a part of my working life, I began using ruby instead - they are pretty similar in spirit.
One day, nearly 20 years after it was published, I picked up a used copy of The AWK Programming Language by Aho, Kernighan, and Weinberger. Yes, they are credited in that order on the cover... I suspect intentionally. I only read the first N chapters, but it was enough. I used AWK many times within the following month, and I continue to use AWK on a daily basis. When the task is complicated, I will still use ruby, but often enough AWK is easier.
The point: you think "Why would I learn X when I can use Y?", but you won't really know the answer until you learn X. If I had never learned perl, python, ruby, AWK, shell script, vi macros, then I would probably be editing files by hand (!) like I sometimes catch developers actually doing (!!!). For a person who doesn't know these tools, that might actually be the path of least resistance. Investing some time here and there to learn new tools pays off in the future in ways that are unpredictable.
The basic imperative Python version is much easier to remember and read though, even for not-that-experienced Python programmers. I would expect laypeople to be able to more-or-less figure out what it is supposed to do.
seen = set()
with open(filename, "r") as file:
    for line in file:
        if line not in seen:
            print(line, end="")  # each line already ends with "\n"
            seen.add(line)
Often (at least in my experience) this kind of operation is either (a) part of some larger automated data processing pipeline for which it’s really nice to have version control, tests, ... or (b) part of some interactive data exploration by a programmer sitting at a repl somewhere, not just a one-off action we want to apply to one file from the command line.
In those contexts, the Python (or Ruby or Clojure or whatever general-purpose programming language) version is easy to type out more-or-less bug-free from memory, debug when it fails, slot into the rest of the project, modify as part of a team with varied experience, etc. etc.
Or perhaps better, if needs change the seen = set() object can be swapped out for any alternative object seen = foo that provides foo.__contains__ and foo.add methods.
This could involve saving previously seen lines in a radix tree, adding multiple layers of caching, saving infrequently seen lines to disk or over the network, etc. as appropriate for the use case.
The point is that the people who do use such tools tend to have a derisive attitude towards those who don't, and that the derisiveness is completely unwarranted.
I know both Python and Awk. Do I go around telling people "stop using your preferred tool, even though it's efficient enough and works fine, use this other esoteric one instead"? Hell no.
And what if instead of "stop using your preferred tool" the person says "there's this other tool I use and I find it makes these sorts of tasks easier; You might like it, too."
To me, this sounds like coming up with an ad hominem argument to rationalize not learning something that one finds different and challenging. For the record, I do not know awk, but that's because I just haven't taken the time to learn it yet, not because I (ironically) believe it's only for people who think they're better than me.
One-off one-liners dashed off without syntax errors speak of long and deep usage of a command-line tool. That's cool. But continuing to use those one-liners worries me for reasons not to do with skill.
I would worry about the manual versus automation being used here. I can think of many cases where a sed/awk solution will work really well - but they almost always will be part of a larger developed and supported pipeline.
But if you're using the one-liner for anything non-trivial you are still doing too much manual work.
Trying to be even shorter: if awk is your tool, great! But ... at some point (and that point is much closer today than previously) anything we do needs a suite of tools we have hacked together and rewritten and passed around - from log file analysis to whatever.
And while awk can absolutely play a role in those tools, I doubt very much that anyone is good enough to make the one liners on the fly.
A quick example might be "show me all the logs for the request sent by user X in the last five minutes off the front web servers but ignore the heartbeat from that app marketing put out and ..."
I want that in my path, alongside everything else I and others working on the systems think useful.
Yes, hack together your tools with any language you like. Put them in a separate repo with all the linting turned off.
I agree. My perception is that tools like awk are best used for one-off tasks, whereas anything part of a greater pipeline should be written in a more readable/maintainable language.
I was just pushing back against the sentiment that awk is undesirable because of attitudes that its users may have, which I don't think you were expressing :)
I don’t think you need to learn every cli tool in depth. I don’t know awk, but I recognize its power. However, I’ve got a set of tools I find easier to use, a few that I’ve written, and Ruby or Python at the command line. I’m sure awk could replace many cli tools, but at the cost of learning a new language (and cognitive load with each use) it hasn’t been worth it.
Got to love awk. My weapon of choice for ad hoc arbitrary text processing and data analysis. I’ve tried to replace it with more modern tools time and again but nothing else really comes close in that domain.
Completely agree. Even Python, which has a very low barrier to entry to "read file, possibly csv, do something" has a barrier to entry. Column projections are one-liners in AWK, and aggregates and/or some stats can be a couple of lines in an AWK script proper.
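For example (sketches; `data.txt` is a hypothetical whitespace-separated file):

# project the second column
awk '{ print $2 }' data.txt

# sum the third column
awk '{ total += $3 } END { print total }' data.txt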
I've been replacing some ad-hoc bash scripts (nothing fancy, just a few if conditions and some formatting of outputs for a deployment) with some AWK, and it's so much handier to write (after 10 years I still can't remember bash's if syntax) and read (it's a proper programming language) than bash.
Interestingly, Ruby (MRI anyway) has command-line options to make it act pretty similarly to awk:
"-n" adds an implicit "while gets ... end" loop. "-p" does the same but prints the contents of $_ at the end. "-e" lets you put an expression on the command line. "-F" specifies the field separator, like for awk. "-a" turns on auto-split mode when you use it with -n or -p, which basically adds an implicit "$F = $_.split" to the while gets ... end loop.
So "ruby -[p or n]a -F[some separator] -e ' [expression gets run once every loop]'" is good for tasks that are suitable for "awk-like" processing but where you may need access to other functionality than what awk provides..
You probably know it, but in case not, and for others who might not know: Ruby was influenced by Perl, and Perl was influenced by awk. (Both Ruby and Perl were influenced by other languages too.) And (relevant to this thread) Perl was influenced by C, sed, and Unix shell too.
>(Even though the one-liner turned out to be a bit more difficult to write than I thought at first due to '\r' characters in the input file)
Although you solved it, another way is that one can always pipe the input through a filter like dos2unix first. Very easy to write and versions can be found/written for/in many languages. Essentially, you just have to read each character from stdin and write it to stdout, unless it is a '\r', a.k.a. Carriage Return a.k.a. ASCII character 13, in which case you don't write it.
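If dos2unix isn't handy, the same filter is a one-liner anyway (file names are hypothetical):

tr -d '\r' < dosfile.txt > unixfile.txt
# or, staying in awk (this version only strips a trailing CR):
awk '{ sub(/\r$/, ""); print }' dosfile.txt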
I've often found that beginners these days don't know what carriage return, line feed, etc. are, and their ASCII codes. Basic but important stuff for text processing.
See also, 'nauniq' (non-adjacent uniq), an implementation of this text-processing task as a full utility, with some convenient options for reducing memory usage:
They parse the script to a parse tree (abstract syntax tree) and then either interpret that directly (tree-walking interpreter) or compile to bytecode and then execute that. The original awk ("one true awk") uses a simple tree-walking interpreter, as does my own GoAWK implementation. gawk and mawk are slightly faster and compile to bytecode first.
Did gawk change to a bytecode interpreter recently? As far as I can recall it used to be an AST walker; I wonder if I am misremembering things. Mawk used to be more than 'slightly' faster than gawk. Maybe the recent change to bytecode has brought their perf characteristics closer.
The mawk version of the language will output C, if I remember correctly. It is the fastest AWK, and supports some of the GNU extensions.
"The One True AWK" from Brian Kernighan (that is still the system AWK in OpenBSD) switched from a yacc implementation to a custom parser sometime within the last decade (fairly recently).
Busybox also has an awk; I'm not sure what they do.
GNU awk is elsewhere reported to be interpreted bytecode.
> In some ways the interesting thing is that the parser (probably for B, couldn't have been C based on radiocarbon dating evidence) was tiny and dead simple using recursive descent for most parts, a precedence table for expressions. But out of the intellectual culture-meets-culture encounter, an enduring tool was created.
I have taken it as a rule to use awk only for trivial tasks and to switch to perl as soon as the syntax is slightly beyond my usual use cases. In perl, I would do:
There are many AWK implementations that are obscenely portable. A case in point is busybox AWK. Another might be http://unxutils.sourceforge.net
A complex AWK script is very, very easy to move to a platform that lacks any AWK parsers. It can easily be done on Windows, without administrative rights, by the placement of a single .EXE - Perl can do many things, but that is not one of them (to the best of my knowledge).
If you have array @foo in perl, $#foo is the index of the last element of the array, which is just the size of the array -1. So if @foo is undefined, $#foo is -1.
Using a variable instead of 'foo' is a symbolic reference, so this is effectively using the symbol table as the associative array. This means that this solution also gets it wrong if your file contains a line that matches the name of a built-in variable in perl. That would be tough to debug!
If your file contains
This is the first line of the file
then during execution of
++$#$_
the result is the same as if you had written
++$#{This is the first line of the file}
So the variable @{This is the first line of the file} goes from undefined to an array of length 1, turning $#{This is the first line of the file} to 0.
Incidentally, this is why the snippet fails to work for a line repeated more than once: for each occurrence of the expression, the value returned is in the sequence -1, 0, 1, 2, 3, ... so it is only false for the second occurrence.
Using preincrement instead of postincrement means the values returned are 0, 1, 2, 3, ... which means that inverting the test makes it false for every occurrence after the first.
Me: Wow, that associative array looks very powerful. Is there a way to leverage it to do something useful like convert curl-obtained JSON array of file patches from Github's API to the mbox format that `git am` expects?
Unix: No, that JSON data is too structured. But if you have a more error-prone format like CSV I can show you a neat trick to filter your bowlers by number of spares.
You have conflated the awk tool with an entire operating system. The operating system of course has many tools, and there are many more tools that one can further add to it, dealing with a range of file formats.
Well, there's jq, which is perfect for that job, but unfortunately it's not a standard utility, and I'm not sure it ever will be, because we have to stick only with stuff invented in the 90's.
I'd still need to find a way to massage the data into the mbox format because that is the ancient format that git understands.
I'm not saying that there isn't a way to do that. Only that it can only be done poorly with a big ugly (and probably buggy) spaghetti script that looks nothing like what the expressive demo suggests it should look like.
Naturally, you use curl. You just instruct curl to save the cookies that the login process sends back to you. Then you reuse those cookies on the subsequent request.
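A rough sketch of that flow (the URL and form fields are placeholders):

# log in and save whatever cookies the server sets
curl -c cookies.txt -d 'user=me&pass=secret' https://example.com/login
# replay those cookies on the next request
curl -b cookies.txt https://example.com/api/data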
The tradeoff of this solution is it stores all (unique) lines in memory. If you have a large file with a lot of unique lines you might prefer using `sort -u`, although it doesn’t keep the order.
Are you sure this is the case? Perhaps it's only storing the hash of the lines? If not then how do you do that?
IME this one-liner can churn through 100MB of log lines in a second. Other solutions like powershell's "select-object -unique" totally choke on the file.
...will print every unique line in a file with the count. Obviously, that could not be done if the array index was a hash - the array index is the entire line, and the array value is the count.
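The counting program being described is presumably something along these lines:

awk '{ count[$0]++ } END { for (line in count) print count[line], line }' file.txt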
The original program moves the maintenance of the array into the implicit conditional "pattern," and only prints when the array entry does not yet exist.
I doubt it's designed to silently break in some cases. Unrealistic isn't realistic until one day it is and that is a bad day. I suppose it could just throw an error in the case of a hash collision, but I doubt it.
But what does it do, then? The page I linked states that it uses a hash table. Hash tables apply a hash function to the key. Hash functions map arbitrary input data onto data of a fixed size. It's inevitable that collisions will occur. ~~even if you use some sort of clever workaround in the case of collisions, eventually you use up all the available outputs.~~ (my bad)
I'm not claiming that it will silently break! I'd be very interested in exploring the internals a little more and finding out how hard it is to get a collision in various implementations and how they behave subsequently.
EDIT: I've read chasil's comment and agree that it must be storing raw keys in the array. I guess awk uses separate chaining or something to get around hash collisions.
Doesn't sort require keeping lines in memory as well?
In fact, doesn't sort keep all lines in memory, whereas this awk solution just keeps the unique lines?
I don't believe that - how does it spill to disk? TMP directory? You can still sort if you don't have any disk space (until you run out of swap?) from what I recall.
> Or: use a 64 bit machine for a bigger address space
There are people who are still on 32-bit hardware for serious work?
(Even if I was still on that, I'd probably just fire up a RV64 virtual machine (with swap space added within the VM, of course) simply to access the convenience of a larger address space when needed.)
For values of serious work. Dedicated embedded devices may still run 32 bit CPUs. Much SOHO network kit has ridiculously small storage and memory (8 MB flash, 64 MB RAM), making memory-intensive operations, such as, say, deduplicating a 250,000 item spam and malware domains blocklist, challenging.
These also have awk, almost always via Busybox, though using OpenWRT other versions are installable.
BusyBox v1.28.4 () built-in shell (ash)

  _______                     ________        __
 |       |.-----.-----.-----.|  |  |  |.----.|  |_
 |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
 |_______||   __|_____|__|__||________||__| |____|
          |__| W I R E L E S S   F R E E D O M
 -----------------------------------------------------
 OpenWrt 18.06.2, r7676-cddd7b4c77
 -----------------------------------------------------
root@modem:~# uname -a
Linux modem 4.9.152 #0 SMP Wed Jan 30 12:21:02 2019 mips GNU/Linux
root@modem:~# free
             total       used       free     shared    buffers     cached
Mem:         59136      38484      20652       1312       2520      13096
-/+ buffers/cache:      22868      36268
Swap:            0          0          0
root@modem:~# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/root                 2560      2560         0 100% /rom
tmpfs                    29568      1268     28300   4% /tmp
tmpfs                    29568        44     29524   0% /tmp/root
tmpfs                      512         0       512   0% /dev
/dev/mtdblock5            3520      1772      1748  50% /overlay
overlayfs:/overlay        3520      1772      1748  50% /
root@modem:~# which awk
/usr/bin/awk
root@modem:~# ls -l `which awk`
lrwxrwxrwx    1 root     root            17 Jan 30 12:21 /usr/bin/awk -> ../../bin/busybox
root@modem:~# opkg list | grep awk
gawk - 4.2.0-2 - GNU awk
root@modem:~#
... though in this case, adblock runs on a larger and more capable Turris Omnia (8 GB flash, 2 GB RAM). The hourly sort still shows up on system load average plots.
The Turris does just fine. It's equivalent, mostly, to a mid-oughts desktop.
The flexibility of keeping the adblock processing self-contained, rather than processing this on another box and rigging an update mechanism, is appealing.
The 'huge" file is 6MB. That's not immense, but it taxes (overly constrained, IMO) typical SOH router resources.
The flexibility afforded, for pennies to a few dollars, of, say, > 1 GB storage and 500 MB RAM, is tremendous.
I'm not sure what Busybox's sort does, though on an earlier iteration on a mid-oughts Linksys WRT54g router running dd-wrt, sorting was infeasible. I hadn't tried the awk trick.
It did have a Busybox awk though, which proved useful.
Multifile versions vary; I'd prefer listing them out. Alternatively you could read from a command output (ls, find, etc.) with a 'while read; do ... done' loop (sketched after the for-loop below):
for f in file1 file2 file3
do
    dedupe.awk <$f > ${f}.tmp && mv ${f}.tmp $f
done
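And the 'while read' variant mentioned above, fed from find (a sketch; it assumes file names without embedded newlines):

find . -name '*.log' | while read -r f
do
    dedupe.awk <"$f" > "${f}.tmp" && mv "${f}.tmp" "$f"
done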
If you want to apply to specific files on an ad hoc basis, you could wrap the whole thing in a shell function with filename or list as a parameter.
Or 'gawk -i' as suggested.
Properly using tempfile would also be an improvement.
This is very memory intensive, which may not matter if the data volume is small enough. But it is also a bit hard to understand, at least not so obvious at first sight. For most use cases sort -u would be ideal and way simpler to understand, if you don't mind having a sorted file as output.
> if you don't mind having a sorted file as output.
I needed to remove duplicates from a sequenced CSV file yesterday but couldn't figure out the flags for "remove duplicates, output sorted by field 1 asciily, 2 numerically, 3 numerically".
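For the record, something like this should do it with GNU sort (assuming a comma delimiter; note that with -u, lines comparing equal on the given keys count as duplicates):

sort -u -t, -k1,1 -k2,2n -k3,3n file.csv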
The example AWK script will build an array of every unique line of text in the file.
If the file is large, and mostly unique, then assume that a substantial portion of the file will be loaded into memory.
If this is larger than the amount of ram, then portions of the active array will be paged to the swap space, then will thrash the drive as each new line is read forcing a complete rescan of the array.
This is very handy for files that fit in available ram (and zram may help greatly), but it does not scale.
I don't know how awk (or this particular implementation) works, but it could be done such that comparing lines is only necessary when there is a hash collision, and also, finding all prior lines having a given hash need not require a complete rescan of the set of prior lines - e.g. for each hash, keep a list of the offsets of each corresponding prior line. Furthermore, if that 'list' is an array sorted by the lines' text, then whenever you find the current line is unique, you also know where in the array to insert its offset to keep that array sorted - or use a trie or suffix tree.
AWK was the first "scripting" language to implement associative arrays, which they claim they took from SNOBOL4.
Since then, perl and php have also implemented associative arrays. All three can loop over the text index of such an array and produce the original value, which a mere hash of the key could not do (hash functions aren't invertible).
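That key recovery is just the ordinary for-in loop (a sketch; `data.txt` is a hypothetical two-column file):

# the keys are the original first-column strings, not digests of them
awk '{ total[$1] += $2 } END { for (key in total) print key, total[key] }' data.txt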
[1] https://github.com/learnbyexample/Command-line-text-processi...