Standard awk warning: it's tempting to try to use awk on csv files. You'll even get good results on simple csv files, which will encourage you to go further. Don't.
Csv is not standardized and the quoting rules are weird (and not standardized).
If you can live with a certain amount of loss of fidelity in your output, you can get away with using awk. If you want a coarse prototype, use awk.
If you need robust, production-grade handling of csv files, use (or write) something else.
Csv files are a little bit like dates: superficially simple, with lots of corner cases. Largely for the same reason: lack of standardization.
That said, awk is awesome. It's small enough to fit in your brain, unlike Perl (maybe yours is larger than mine?). It's also pretty universally available, with few massive incompatibilities between versions, unlike shell (provided you avoid the gawk-specific features). I love it.
I have used awk a lot on csv files. I basically agree with your comments; for me the use case is preliminary exploration, not really much more. There is also the issue of one-liners not really lending themselves to good coding practice generally, but it sure is great for diving into a file quickly.
What I came here to say is that when I have a csv file I'm looking at with awk, the first thing I do is a `wc -l`, then get NF (the number of fields) for the first row of the csv (here e.g. 80), followed by `awk -F, 'NF==80' | wc -l`. If the numbers don't match, then I know it's not parsing properly.
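Spelled out, that sanity check is just (with a hypothetical data.csv whose header has 80 comma-separated fields):

    wc -l data.csv
    head -1 data.csv | awk -F, '{ print NF }'   # e.g. 80
    awk -F, 'NF == 80' data.csv | wc -l         # should match the first count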
The most common issue of course is commas in quoted strings. I have a small script that removes these, so I can still use awk easily. Anything more complex, like newlines in quoted strings, and maybe it's time to question whether awk is still worth it.
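A rough sketch of that kind of scrubber in plain awk (only a sketch: it assumes no escaped "" quotes and no newlines inside quoted fields) might look like:

    awk '{
      out = ""; inq = 0
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "\"") inq = !inq            # toggle quoted state
        else if (c == "," && inq) c = ""     # drop commas inside quotes
        out = out c
      }
      print out
    }' data.csv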
My first big kid job was taking over ownership of a Perl-based Oracle-backed data warehouse. Mostly it was SQL queries wrapped up with a Perl script executed by cron that output excel workbooks or csvs and emailed or dumped to a file server.
Most of the pivots and reporting tables were actually generated in Perl because it was just nicer to work with than Excel.
It was wonderful, I learned so much, mostly how to love Perl and CPAN.
We merged our telco billing system with our new parent company with some Perl, cron, a couple SQL queries and an FTP server.
I have said it for years and I will keep saying it: if you could snap your fingers and delete all the Perl code in the world, your lights would turn off.
Most CSV files do not follow this standard of course. But you could normalize all CSV files to RFC4180 (or any other consistent format) as the first step of your processing pipeline.
The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago. If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.
> The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago.
I worked on a team that used CSV somewhat extensively. For the data we generated, it was RFC compliant. It's pretty trivial to get RFC-compliant CSVs; most languages have a library, and ours was in the standard library.
We also had a ("terrible", as we joked) idea to create a subset of CSV that would contain typing information in a required header row. (We never did it, and it is a bad idea.)
> If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.
…but this is absolutely true. We also consumed CSVs from external sources and contractors, and this was an absolute drain on our productivity. I've also worked with engineers of this caliber, and changing CSV wouldn't change the terrible output. I've seen folks approach email and HTTP with a cavalier "oh, it's a trivial text format, I don't need a library!" attitude, and inevitably get it wrong. Pointing out the flaws in their implementation, and that a library would fulfill their use case just fine, is just met with more hacks (not fixes) to try to further munge the output into shape. It is decidedly not software engineering. I've seen this even with JSON.
But yeah, even with RFC standard CSV, you shouldn't be parsing it with awk. It is the wrong tool.
I generally went for turning them into either tab-separated files or used the ASCII codes for record separator and its brethren depending on the job. I never wanted to touch CSV again after parsing it once.
But if you’re writing code to normalize before sending to awk, why not just process in the normalization program instead of using awk’s bizarre syntax?
Depends on what you're using to normalize I suppose. Maybe that's Awk too! Maybe Awk is easier for exploration when the data is already clean but you have to write some parsing layer in Java or something and that's not conducive to one-liner exploration.
> Csv is not standardized and the quoting rules are weird (and not standardized)
There is almost always some normalisation ("scrubbing") needed to prep a CSV. But CSV is viable and awk can rip through massive amounts of data. It is a brilliant and powerful tool.
> It's small enough to fit in your brain, unlike Perl
Perl is also brilliant and powerful. Setting aside the bigotry of people who dislike sigils and using braces for scope, many people who fail to learn Perl well have not tried to use Perl-OOP objects as primitives. Once you do this, Perl's versatility and speed are hard to beat.
I suffered with Perl until I stopped assuming I could learn it by reading code, and read the Camel book. At about page 85 it says something approximating: "Everything in Perl happens in a context (scalar or Array) and until you realize that, you will be miserable." It still didn't come easy: I've realized we all have different minds. My mind doesn't find Perl natural, but loves Ruby. YMMV.
Tangent, but another interesting problem with date parsing besides lack of standardization is that it's ridiculously context-sensitive.
For example, if I schedule an event for Oct 1-Jan 1 right now, the most likely (least surprising) parse would be Oct 1, 2021 through January 1, 2022. Which is surprising because parsing Oct 1 depends on the current date, because the parse of Jan 1 depends on the parse of Oct 1, and because both of these depend on the fact that you're (in all likelihood) scheduling a future event rather than describing a past one. So whatever date parser you use (no matter how good it is at handling different notations) will return garbage if it doesn't take all of this into account.
Sure :-) And even then, there's more fun... imagine if the next item was Sep 1-Dec 1 and the list was known to be sorted! The parser would need to be stateful to be able to disambiguate!!
Even worse (well, maybe similar, but more surprising): imagine if on Feb 28, you parse "Feb 29-Mar 1"... both of those dates could land in an entirely different year than "Feb 28-Mar 1" would, depending on whether the current year is a leap year...
I dare say I have not yet seen a single parser in my life that handles such issues. In fact I don't think I've seen a parser that can parse a date interval, or that can even do "parse this string assuming it is after that date", or anything like that.
And all of these problems are before we even consider time zones, leap seconds, daylight savings, syntactic ambiguities, etc... not just how they affect individual dates, but also ordered dates (/intervals) like above...
I would say that you are mixing up "parsing" and "calendaring" (or something of the sort). As far as I understand, parsing is syntactic analysis, i.e. going from a linear structure to a more complex structure (usually a tree); it should not add to the tree anything that was not in the linear structure, and it shouldn't consult a semantic context (such as the current date) to produce an AST.
By parsing I mean it in the usual sense for a date... strptime, Date.Parse, etc... i.e. turning a string to a date (or multiple dates). You can call it something else if you'd like.
> I dare say I have not yet seen a single parser in my life that handles such issues. In fact I don't think I've seen a parser that can parse a date interval, or that can even do "parse this string assuming it is after that date", or anything like that.
You can google "nlp date extraction" maybe? I've used libraries in the past that do this.
Note: they're far from being perfect. I ended up not using any, as each had their weird corner cases.
Parsing anything that was originally developed by humans writing it on paper (or clay tablet, or whatever) is a nightmare. Natural means chaotic.
If your CSV file contains any field entered by humans, AWK isn't going to be powerful enough to parse it at scale. Someone somewhere is going to have the name 'Mbat"a, Sho,dlo' in some bizarre ass romanization (and this assumes you're not accepting Unicode, which is a whole other can of worms that AWK is not prepared to deal with) that breaks your parser.
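To illustrate with a made-up record, a naive comma split miscounts the fields immediately:

    $ printf 'id,name\n1,"Mbat""a, Sho,dlo"\n' | awk -F, '{ print NF }'
    2
    4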
I would love to have a command-line tool that reads CSV and has a ton of features to cover different quirks and errors, which can output cleaner formats that I can pipe into other command-line tools.
csvkit [0] might be that tool; I discovered it after my last painful encounter with CSV files and haven't used it in anger yet. Among other things, it translates CSV to JSON, so you can compose it with jq.
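For example, csvkit's csvjson turns a (hypothetical) data.csv into a JSON array of records that jq can consume directly:

    csvjson data.csv | jq '.[0]'    # first row as a JSON object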
I love these TSV utilities: https://github.com/eBay/tsv-utils Granted they're for "T"sv files, not "C"sv, but there's a handy `csv2tsv` utility included.
At my last employer, I built a filter program, creatively called CSVTools[0], to do something like this. One piece of the project parses CSVs and replaces the commas/newlines (in an escaping- and multiline-aware manner, of course) with ASCII record/unit separator characters[1] (0x1E and 0x1F); the other piece converts that format back into well-formed CSV files. I usually used this with GNU awk, and reconfigured RS[2] and FS[3] appropriately. Or you can just set the input separators (IRS/IFS) and produce plaintext output from AWK.
Good idea! Looks similar to something I wrote called csvquote https://github.com/dbro/csvquote , which enables awk and other command line text tools to work with CSV data that contains embedded commas and newlines.
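The basic pattern is to sandwich the awk step between the substitute pass and the restore pass, roughly (the -u flag does the restore):

    csvquote data.csv | awk -F, '{ print $3 }' | csvquote -u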
csvtool is also nice.[0][1] csvkit is very flexible and can certainly be used in anger, but is a bit finicky; you almost always want to use the -I (--no-inference) option. Additionally, I wrote a tiny Perl script for quick awk-like oneliners.[2]
> As of version 2.0.9, there's no need for any external dependency. Python itself (3.7), and any needed libraries are self-contained inside the installation, isolated from the rest of your system.
I just fixed a bug at work, parsing a rather opinionated csv file of products for a web shop. It had mostly good quoting and headers for columns - and used semicolons for field separation (so, not technically csv, but..).
Funny thing was, a lot of the product names contained an ampersand (no problem there). But one product had an html entity encoded ampersand (&amp;). I have no idea how that semicolon escaped, eh, escaping - but that one line suddenly had most of the columns off by one...
I can see how the entity got into the db (probably errant cutnpaste) - but I wonder at the csv writer that gleefully copied the extra separator to the csv export...
Did it multiple times and it works great, but companies you send raw data to often get huffy. So, back it goes into CSV or some poorly defined fixed width text file.
CSV is the default interchange in many fields (e.g. a lot of machine learning or academia). You don't get to choose what formats your colleagues send to you.
Just a heads-up, if you're interested. Only SC-IM was fit for the non-database-format files I need to work with. But it does not have a search function! So I'm writing an app in Python for browsing spreadsheets:
https://github.com/dotancohen/osheet
> 5. https://github.com/andmarti1424/sc-im if I want a TUI. This is closest to what you were asking for.
Thank you! Yes, this seems to be almost perfect. I cannot believe that there is no "arbitrary string search" feature, but I can grep in another terminal window at least.
Thank you. I've used sqlite, but being able to query the data is not the same as being able to browse that data.
For instance, consider a project that I'm currently working on. Google Docs sheet exported to CSV. The file documents dozens of servers used as load balancers, websites, and database servers. The file has gone from MS Excel to Google Docs, and been maintained by four admins over 12 years. There are no columns per se, rather "sections of rows", each of which isn't even consistent in itself. One section is double columned, with the top row listing public IP address, login name, etc. and the second row listing LAN IP address, login password, etc.
sqlite is useless on these files, and frankly I've seen these things more often than well-formatted spreadsheets at every company I've ever worked with. Nobody but software developers treats spreadsheet columns as database fields.
Just a reply to myself, for anybody else looking for an application to browse spreadsheets on the CLI.
There are a few CLI applications for working with data stored in a CSV file, so long as the data is formed as a database: each row a data entry, each column a database field. The single CLI app that runs on Linux for browsing a spreadsheet that is not formed as a database, SC-IM, does not support searching for arbitrary text!
Therefore I am writing osheet: A Linux CLI app for browsing a spreadsheet. It currently supports only XLS files, because that is what I need. It currently crashes a lot. It currently has no features other than searching. It won't even scroll yet to display all the rows and columns. I just banged it out last night when I should have been sleeping. But I'm working on it, you are invited to test it, report bugs, and pull requests are of course welcome. It's written in Python and Curses.
If I'm deep enough into a csv file that I'm worried about everything I noted above, I usually just load the data into Postgres and trust that their devs got it right. Pretty presentation is free at that point :-)
Postgres or even sqlite are great if the CSV is formatted like a database, with each column a distinct field. Alas, most spreadsheets created by non-software-devs are not so strict.
I need to browse the cells in something like curses.
I just attempted to run some government data through Python using the csv module. It worked like a charm, until it told me that a field had exceeded the maximum allowable length. A close look showed that a field began but did not end with a quotation mark. Using the csv.QUOTE_NONE flag resolved that, but did not remedy such quirks as short or long records.
CSV files, depending on who generates them, are a bit like dates would be if a year's status as a leap year depended on whether the date of Easter were 0 mod 4.
My gripe with Python's csv.reader is that it has no skip_lines parameter. Ill-formatted lines at the top or bottom of a CSV file are quite common.
What format of data file is best for working with Awk? I assume some kind of tab-delimited file? With a reliable ingestion layer you can convert all your CSV-ish files to something ergonomic for Awk scripting.
JSON data files seem to be a similar deal. Sometimes they are actually properly formed JSON arrays and sometimes they are individual line-delimited objects.
Data files are a mess, is there a command line tool that is well suited to taking in many inconsistent formats and outputting something ergonomic?
CSV standards are whatever the application implemented. We'd pull a "CSV" out of one system, then have to massage it before uploading to system A and then make other changes so it would be recognized by system B.
I'm torn. On one hand, it's easy. It lets people work with the data, albeit using error-prone tools like NotePad++ and Excel.
On the other hand, it's just flat out lazy to use.
It depends what you are doing. In my experience, when you write a script to do something with a CSV, it's been created by a specific piece of software and you only feed that csv to the script. I doubt anyone is writing anything generic in awk for parsing a multitude of different csvs.
CSV to TSV via a robust program, use AWK, and convert back if needed. Then you only have to remember the \t, \n, and \r escapes, not the whole quoting shenanigans.
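A sketch of that round trip, using Miller (mentioned elsewhere in the thread) as the robust converter, with a hypothetical data.csv and a column-picking awk step in the middle:

    mlr --icsv --otsv cat data.csv \
      | awk -F'\t' -v OFS='\t' '{ print $1, $3 }' \
      | mlr --itsv --ocsv cat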
I have been using AWK at work and found it cumbersome for my use-case (run-once analysis/manipulation of 200GB csv datasets). I found out about Miller[1] a year or so back and have been using that instead. I don't know how it stacks up in terms of performance, but for my money, named arguments and one-shot statistics is all I need.
For example, analysing the number of people per year over multiple differently formatted files is as easy as `mlr uniq -f pid,year then count -g year`. It has filtering, very extensive manipulation capabilities and nice documentation.
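In full, with the input format flag and pretty-printed output, that's something like (assuming a directory of CSV exports sharing the pid and year columns):

    mlr --icsv --opprint uniq -f pid,year then count -g year exports/*.csv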
John has also responded very quickly to feature requests I've had (so quickly that I haven't yet gotten around to using them).
+1 for Miller. It's a really slick tool for exploring large csv datasets. I've typically used it to do some prototyping and exploration of 200-500GB csv datasets before doing more hefty work in Java+PigLatin (our use case is more long-term than a single analysis run, so that's the reason for moving out of just Miller). It's great to get a feel for the data and run some initial analysis before diving into the larger, more cumbersome system and tooling.
Nice introduction (based on both tutorials). One suggestion would be to use
-v FPAT='[^,]*|"[^"]+"'
instead of
BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
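Either way, a complete (gawk-only) invocation ends up looking like this, e.g. printing the second field of a hypothetical data.csv while respecting simple quoting:

    gawk -v FPAT='[^,]*|"[^"]+"' '{ print $2 }' data.csv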
>If you get the awk programming language manual…you’ll read it in about two hours and then you’re done. That’s it. You know all of awk.
I can't wrap my head around this quote. That's a ridiculous claim. Even for an experienced programmer, learning a new programming language in 2 weeks, let alone 2 hours, would be nothing short of a miracle. I've been using awk for the past 2-3 years or so and I wrote a book on GNU awk one-liners earlier this year (https://learnbyexample.github.io/learn_gnuawk/). I'm nowhere close to knowing all of awk.
GNU Awk has lots of extensions and is a relatively big language, but I definitely think you can learn most of the original Awk in two weeks - and you can certainly learn enough to be useful in two hours. It's been a decade now, but my recollection is that I spent a day or two idly reading the Awk book and was able to immediately employ it in simple shell pipelines after not that many pages. I think Awk is useful once you've learned just a bit about it, and I don't think learning a lot about Awk is very useful - there are languages that are better suited for complex tasks.
Fair enough; someone else made a similar point. I had GNU awk in mind when I said "all of awk".
And yeah, it's possible to start using basic field processing and regexp features (if you already know regexps) in 2 hours and to learn most of it in 2 weeks. But that quote could've been something like "you could get started in 2 hours" instead of claiming one could know all of awk.
This HN post inspired me to pick up and work through "The AWK Programming Language", so I have been doing that for the past two days. I'm currently at the beginning of chapter 5 (of 8).
That quote is definitely ridiculous. He may have been referring to just chapter 2, in which the "whole language" is outlined. But the later chapters also take time; there is even an exercise about writing an assembler in AWK, as well as implementations of multiple algorithms.
Some stuff was already somewhat familiar to me from studying/practicing C earlier this year.
Overall a 10/10 book so far, but takes time and patience for sure!
I don't agree that you can learn AWK in its entirety in two hours, but I do (strongly) agree that you can teach yourself the basics in two hours and those two hours are extremely well spent considering that AWK is useful for so many things.
Surely handling the intricacies of CSV can be a pain, but let's face it: most tasks are simple and their results are easily verified for correctness. Obviously, I don't write my production webserver in AWK -- I write simple scripts to parse data (locally), find semantic issues in CSV files, analyze log data, etc.
You can also trivially come across data designed for consumption with AWK. Whenever I write a script or utility to dump some data for later analysis, I take into consideration if I can shape the output in a way that will make it easy to work with in AWK.
Because I won't know ahead of time what all the interesting values or summaries will be, but if I know I can get it through AWK it will be cheap to extract or compute those values and summaries later, once I know what they are.
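For instance, if a tool dumps one whitespace-separated record per line, say a hypothetical `timestamp status bytes`, the later summaries are one-liners:

    awk '$2 == 200 { sum += $3 } END { print sum }' dump.log          # total bytes for status 200
    awk '{ n[$2]++ } END { for (s in n) print s, n[s] }' dump.log     # count per status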
The biggest problem with awk is that unless you use it regularly, you'll easily forget how to use it. So it's important to keep a running cheatsheet like you did. Nothing more frustrating than forgetting the awk command and searching for it in your history only to find it aged out.
> He said, “If you get the awk programming language manual…you’ll read it in about two hours and then you’re done. That’s it. You know all of awk.”
It's hyperlinked to the Gawk manual, but it seems likely he actually meant A, W & K's The Awk Programming Language (1988), which you could conceivably read in 2 hours, as it's a joy to read. I used it and The C Programming Language as exemplars of great documentation when writing my own.
It's one of my favorite technical books, with many examples of the application of the language, including the design of small languages and experimentation with algorithms.
The O'Reilly book is "Sed & Awk" published in the 90's (https://www.oreilly.com/library/view/sed-awk/1565922255/). The languages work together as a team (like lex and yacc). And the book takes about 2 hours to read. Unix is built on the premise of "lots of simple tools that fit together with pipes".
Although this is like the 1,000th article on how easy it is to learn awk. Why write yet another one?
The reason to write one is to record what you learn, to further embed it into your brain. Why post it? Whoever posted it found it interesting.
I had not seen the others so this is an intro to awk for me. But in general if something has made it to the front page of HN it was interesting enough for enough people to put it there.
Instead of the gawk manual, I recommend the OpenBSD manpage: https://man.openbsd.org/awk.1 It's concise and well written (as most OpenBSD manpages), so you can learn AWK even in less than 2 hours. Of course it handles only AWK, not the gawk extensions.
I always link to this blog post [0] by Jonathan Palardy. It was one of those eye opening articles for me that really helped me 'grok' it - and this is after having had a couple of stabs at learning it just from reading the text books on it.
Awk for me is part of the layers of tools from grep/sed -> AWK -> perl one liner -> full on script. The further down the ladder you go, the more restricted you are, but this means that programs are terser and usually stay readable despite the terseness.
I would also recommend Perl One liners [1] which did the same thing for me for Perl one liners and, subsequently, full Perl.
My favourite bit of gawk (ok, an extended version of awk, but it's probably what you are running) is the network support; the manual even includes writing simple HTTP servers with it. I kid you not.
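A sketch in the spirit of the manual's example (gawk only; the /inet special filename and the |& coprocess operator do the networking):

    gawk 'BEGIN {
      RS = ORS = "\r\n"
      svc  = "/inet/tcp/8080/0/0"              # listen on local port 8080
      body = "<html><body>hello from gawk</body></html>"
      len  = length(body) + length(ORS)
      print "HTTP/1.0 200 OK"          |& svc
      print "Content-Length: " len ORS |& svc
      print body                       |& svc
      while ((svc |& getline) > 0)             # drain the request
        continue
      close(svc)
    }'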
I love Awk. The idea of a programming language that effectively facilitates first-class state machines over text streams fits so well with the Unix pipeline model.
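A tiny example of that: a two-state machine that prints only the lines between (hypothetical) START and END markers:

    awk '/^START$/ { inblock = 1; next }
         /^END$/   { inblock = 0 }
         inblock' notes.txt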
My story about learning the power of awk: I was working on a dataset and had written a page or so worth of Perl... and kept bungling it. I emailed the scientist who wrote the paper whose technique I was emulating, and he said, why not use awk, and gave me a basic example one-liner; with a bit of modification it worked! Of course all that really means is I suck at Perl, but it was an eye opener for how powerful awk is.
I learned awk before Perl and it helped a lot. I still use awk (and sed) regularly but for more complex tasks involving regexes and such Perl is my favourite.
Here's a basic network calculator I wrote in awk once because I had to do these calculations on a minimal SunOS system that didn't have any other way to do math from the shell.
Awk and shell scripting is a rabbit hole that is best avoided IMHO. Often it's just easier to crank out a "real" program in your most familiar programming language, or at the very least Python, instead of starting to work with shell or awk scripting.
Awk is a full-blown programming language though, comparable to Python but allowing for much more compact code. Though I completely agree on avoiding shell; the only good shell I've seen is rc, but relying on external utilities is a double-edged sword: forking is too costly over time.