Standard awk warning: it's tempting to try to use awk on csv files. You'll even get good results on simple csv files, which will encourage you to go further. Don't.
Csv is not standardized and the quoting rules are weird (and not standardized).
If you can live with a certain amount of loss of fidelity in your output, you can get away with using awk. If you want a coarse prototype, use awk.
If you need robust, production-grade handling of csv files, use (or write) something else.
Csv files are a little bit like dates: superficially simple, with lots of corner cases. Largely for the same reason: lack of standardization.
That said, awk is awesome. It's small enough to fit in your brain, unlike Perl (maybe yours is larger than mine?). It's also pretty universally available, with few massive incompatibilities between versions, unlike shell (provided you avoid the gawk-specific features). I love it.
I have used awk a lot on csv files. I basically agree with your comments; for me the use case is preliminary exploration, not really much more. There is also the issue of one-liners not really lending themselves to good coding practice generally, but it sure is great for diving into a file quickly.
What I came here to say is that when I have a csv file I'm looking at with awk, the first thing I do is a `wc -l`, then get NF (the number of fields) for the first row of the csv (here e.g. 80), followed by `awk -F, 'NF==80' | wc -l`. If the numbers don't match, then I know it's not parsing properly.
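Spelled out, that sanity check is just (with a hypothetical data.csv whose header has 80 comma-separated fields):

    wc -l data.csv
    head -1 data.csv | awk -F, '{ print NF }'   # e.g. 80
    awk -F, 'NF == 80' data.csv | wc -l         # should match the first count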
The most common issue of course is commas in quoted strings. I have a small script that removes these, so I can still use awk easily. Anything more complex, like newlines in quoted strings, and maybe it's time to question whether awk is still worth it.
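A rough sketch of that kind of scrubber in plain awk (only a sketch: it assumes no escaped "" quotes and no newlines inside quoted fields) might look like:

    awk '{
      out = ""; inq = 0
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "\"") inq = !inq            # toggle quoted state
        else if (c == "," && inq) c = ""     # drop commas inside quotes
        out = out c
      }
      print out
    }' data.csv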
My first big kid job was taking over ownership of a Perl-based Oracle-backed data warehouse. Mostly it was SQL queries wrapped up with a Perl script executed by cron that output excel workbooks or csvs and emailed or dumped to a file server.
Most of the pivots and reporting tables were actually generated in Perl because it was just nicer to work with than Excel.
It was wonderful, I learned so much, mostly how to love Perl and CPAN.
We merged our telco billing system with our new parent company with some Perl, cron, a couple SQL queries and an FTP server.
I have said it for years and I will keep saying it: if you could snap your fingers and delete all the Perl code in the world, your lights would turn off.
Most CSV files do not follow this standard of course. But you could normalize all CSV files to RFC4180 (or any other consistent format) as the first step of your processing pipeline.
The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago. If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.
> The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago.
I worked on a team that used CSV somewhat extensively. For the data we generated, it was RFC compliant. It's pretty trivial to get RFC-compliant CSVs; most languages have a library, and ours was in the standard library.
We also had a ("terrible", as we joked) idea to create a subset of CSV that would contain typing information in a required header row. (We never did it, and it is a bad idea.)
> If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.
…but this is absolutely true. We also consumed CSVs from external sources and contractors, and this was an absolute drain on our productivity. I've also worked with engineers of this caliber, and changing CSV wouldn't change the terrible output. I've seen folks approach email and HTTP with a cavalier "oh, it's a trivial text format, I don't need a library!" attitude, and inevitably get it wrong. Pointing out the flaws in their implementation, and that a library would fulfill their use case just fine, is just met with more hacks (not fixes) to try to further munge the output into shape. It is decidedly not software engineering. I've seen this even with JSON.
But yeah, even with RFC standard CSV, you shouldn't be parsing it with awk. It is the wrong tool.
I generally went for turning them into either tab-separated files or used the ASCII codes for record separator and its brethren depending on the job. I never wanted to touch CSV again after parsing it once.
But if you’re writing code to normalize before sending to awk, why not just process in the normalization program instead of using awk’s bizarre syntax?
Depends on what you're using to normalize I suppose. Maybe that's Awk too! Maybe Awk is easier for exploration when the data is already clean but you have to write some parsing layer in Java or something and that's not conducive to one-liner exploration.
> Csv is not standardized and the quoting rules are weird (and not standardized)
There is almost always some normalisation ("scrubbing") needed to prep a CSV. But CSV is viable and awk can rip through massive amounts of data. It is a brilliant and powerful tool.
> It's small enough to fit in your brain, unlike Perl
Perl is also brilliant and powerful. Setting aside the bigotry of people who dislike sigils and using braces for scope, many people who fail to learn Perl well have not tried to use Perl-OOP objects as primitives. Once you do this, Perl's versatility and speed are hard to beat.
I suffered with Perl until I stopped assuming I could learn it by reading code, and read the Camel book. At about page 85 it says something approximating: "Everything in Perl happens in a context (scalar or Array) and until you realize that, you will be miserable." It still didn't come easy: I've realized we all have different minds. My mind doesn't find Perl natural, but loves Ruby. YMMV.
Tangent, but another interesting problem with date parsing besides lack of standardization is that it's ridiculously context-sensitive.
For example, if I schedule an event for Oct 1-Jan 1 right now, the most likely (least surprising) parse would be Oct 1, 2021 through January 1, 2022. Which is surprising because parsing Oct 1 depends on the current date, because the parse of Jan 1 depends on the parse of Oct 1, and because both of these depend on the fact that you're (in all likelihood) scheduling a future event rather than describing a past one. So whatever date parser you use (no matter how good it is at handling different notations) will return garbage if it doesn't take all of this into account.
Sure :-) And even then, there's more fun... imagine if the next item was Sep 1-Dec 1 and the list was known to be sorted! The parser would need to be stateful to be able to disambiguate!!
Even worse (well, maybe similar, but more surprising): imagine if on Feb 28, you parse "Feb 29-Mar 1"... both of those dates could land in an entirely different year than "Feb 28-Mar 1" would, depending on whether the current year is a leap year...
I dare say I have not yet seen a single parser in my life that handles such issues. In fact I don't think I've seen a parser that can parse a date interval, or that can even do "parse this string assuming it is after that date", or anything like that.
And all of these problems are before we even consider time zones, leap seconds, daylight savings, syntactic ambiguities, etc... not just how they affect individual dates, but also ordered dates (/intervals) like above...
I would say that you are mixing up "parsing" and "calendaring" (or something of the sort). As far as I understand, parsing is syntactic analysis, i.e. going from a linear structure to a more complex structure (usually a tree); it should not add to the tree anything that was not in the linear structure, and it shouldn't consult a semantic context (such as the current date) to produce an AST.
By parsing I mean it in the usual sense for a date... strptime, Date.Parse, etc... i.e. turning a string to a date (or multiple dates). You can call it something else if you'd like.
> I dare say I have not yet seen a single parser in my life that handles such issues. In fact I don't think I've seen a parser that can parse a date interval, or that can even do "parse this string assuming it is after that date", or anything like that.
You can google "nlp date extraction" maybe? I've used libraries in the past that do this.
Note: they're far from being perfect. I ended up not using any, as each had their weird corner cases.
Parsing anything that was originally developed by humans writing it on paper (or clay tablet, or whatever) is a nightmare. Natural means chaotic.
If your CSV file contains any field entered by humans, AWK isn't going to be powerful enough to parse it at scale. Someone somewhere is going to have the name 'Mbat"a, Sho,dlo' in some bizarre ass romanization (and this assumes you're not accepting Unicode, which is a whole other can of worms that AWK is not prepared to deal with) that breaks your parser.
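To illustrate with a made-up record, a naive comma split miscounts the fields immediately:

    $ printf 'id,name\n1,"Mbat""a, Sho,dlo"\n' | awk -F, '{ print NF }'
    2
    4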
I would love to have a command-line tool that reads CSV and has a ton of features to cover different quirks and errors, which can output cleaner formats that I can pipe into other command-line tools.
csvkit [0] might be that tool; I discovered it after my last painful encounter with CSV files and haven't used it in anger yet. Among other things, it translates CSV to JSON, so you can compose it with jq.
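For example, csvkit's csvjson turns a (hypothetical) data.csv into a JSON array of records that jq can consume directly:

    csvjson data.csv | jq '.[0]'    # first row as a JSON object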
I love these TSV utilities: https://github.com/eBay/tsv-utils Granted they're for "T"sv files, not "C"sv, but there's a handy `csv2tsv` utility included.
At my last employer, I built a filter program, creatively called CSVTools[0], to do something like this. One piece of the project parses CSVs and replaces the commas/newlines (in an escaping- and multiline-aware manner, of course) with ASCII record/unit separator characters[1] (0x1E and 0x1F); the other piece converts that format back into well-formed CSV files. I usually used this with GNU awk, and reconfigured RS[2] and FS[3] appropriately. Or you can just set the input separators (IRS/IFS) and produce plaintext output from AWK.
Good idea! Looks similar to something I wrote called csvquote https://github.com/dbro/csvquote , which enables awk and other command line text tools to work with CSV data that contains embedded commas and newlines.
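The basic pattern is to sandwich the awk step between the substitute pass and the restore pass, roughly (the -u flag does the restore):

    csvquote data.csv | awk -F, '{ print $3 }' | csvquote -u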
csvtool is also nice.[0][1] csvkit is very flexible and can certainly be used in anger, but is a bit finicky; you almost always want to use the -I (--no-inference) option. Additionally, I wrote a tiny Perl script for quick awk-like oneliners.[2]
> As of version 2.0.9, there's no need for any external dependency. Python itself (3.7), and any needed libraries are self-contained inside the installation, isolated from the rest of your system.
I just fixed a bug at work, parsing a rather opinionated csv file of products for a web shop. It had mostly good quoting and headers for columns - and used semicolons for field separation (so, not technically csv, but..).
Funny thing was, a lot of the product names contained an ampersand (no problem there). But one product had an html entity encoded ampersand (&amp;). I have no idea how that semicolon escaped, eh, escaping - but that one line suddenly had most of the columns off by one...
I can see how the entity got into the db (probably errant cutnpaste) - but I wonder at the csv writer that gleefully copied the extra separator to the csv export...
Did it multiple times and it works great, but companies you send raw data to often get huffy. So, back it goes into CSV or some poorly defined fixed width text file.
CSV is the default interchange in many fields (e.g. a lot of machine learning or academia). You don't get to choose what formats your colleagues send to you.
Just a heads-up, if you're interested. Only SC-IM was fit for the non-database-format files I need to work with. But it does not have a search function! So I'm writing an app in Python for browsing spreadsheets:
https://github.com/dotancohen/osheet
> 5. https://github.com/andmarti1424/sc-im if I want a TUI. This is closest to what you were asking for.
Thank you! Yes, this seems to be almost perfect. I cannot believe that there is no "arbitrary string search" feature, but I can grep in another terminal window at least.
Thank you. I've used sqlite, but being able to query the data is not the same as being able to browse that data.
For instance, consider a project that I'm currently working on. Google Docs sheet exported to CSV. The file documents dozens of servers used as load balancers, websites, and database servers. The file has gone from MS Excel to Google Docs, and been maintained by four admins over 12 years. There are no columns per se, rather "sections of rows", each of which isn't even consistent in itself. One section is double columned, with the top row listing public IP address, login name, etc. and the second row listing LAN IP address, login password, etc.
sqlite is useless on these files, and frankly I've seen these things more often than well-formatted spreadsheets at every company I've ever worked with. Nobody but software developers treats spreadsheet columns as database fields.
Just a reply to myself, for anybody else looking for an application to browse spreadsheets on the CLI.
There are a few CLI applications for working with data stored in a CSV file, so long as the data is formed as a database: each row a data entry, each column a database field. The single CLI app that runs on Linux for browsing a spreadsheet that is not formed as a database, SC-IM, does not support searching for arbitrary text!
Therefore I am writing osheet: A Linux CLI app for browsing a spreadsheet. It currently supports only XLS files, because that is what I need. It currently crashes a lot. It currently has no features other than searching. It won't even scroll yet to display all the rows and columns. I just banged it out last night when I should have been sleeping. But I'm working on it, you are invited to test it, report bugs, and pull requests are of course welcome. It's written in Python and Curses.
If I'm deep enough into a csv file that I'm worried about everything I noted above, I usually just load the data into Postgres and trust that their devs got it right. Pretty presentation is free at that point :-)
Postgres or even sqlite are great if the CSV is formatted like a database, with each column a distinct field. Alas, most spreadsheets created by non-software-devs are not so strict.
I need to browse the cells in something like curses.
I just attempted to run some government data through Python using the csv module. It worked like a charm, until it told me that a field had exceeded the maximum allowable length. A close look showed that a field began but did not end with a quotation mark. Using the csv.QUOTE_NONE flag resolved that, but did not remedy such quirks as short or long records.
CSV files, depending on who generates them, are a bit like dates would be if a year's status as a leap year depended on whether the date of Easter were 0 mod 4.
My gripe with Python's csv.reader is that it has no skip_lines parameter. Ill-formatted lines at the top or bottom of a CSV file are quite common.
What format of data file is best for working with Awk? I assume some kind of tab-delimited file? With a reliable ingestion layer you can convert all your CSV-ish files to something ergonomic for Awk scripting.
JSON data files seem to be a similar deal. Sometimes they are actually properly formed JSON arrays and sometimes they are individual line-delimited objects.
Data files are a mess, is there a command line tool that is well suited to taking in many inconsistent formats and outputting something ergonomic?
CSV standards are whatever the application implemented. We'd pull a "CSV" out of one system, then have to massage it before uploading to system A and then make other changes so it would be recognized by system B.
I'm torn. On one hand, it's easy. It lets people work with the data, albeit using error-prone tools like NotePad++ and Excel.
On the other hand, it's just flat out lazy to use.
It depends what you are doing. In my experience, when you write a script to do something with a CSV, it's been created by a specific piece of software and you only feed that csv to the script. I doubt anyone is writing anything generic in awk for parsing a multitude of different csvs.
CSV to TSV via a robust program, use AWK, and convert back if needed. Then you only have to remember the \t, \n, and \r escapes, not the whole quoting shenanigans.
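A sketch of that round trip, using Miller (mentioned elsewhere in the thread) as the robust converter, with a hypothetical data.csv and a column-picking awk step in the middle:

    mlr --icsv --otsv cat data.csv \
      | awk -F'\t' -v OFS='\t' '{ print $1, $3 }' \
      | mlr --itsv --ocsv cat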
I have been using AWK at work and found it cumbersome for my use-case (run-once analysis/manipulation of 200GB csv datasets). I found out about Miller[1] a year or so back and have been using that instead. I don't know how it stacks up in terms of performance, but for my money, named arguments and one-shot statistics is all I need.
For example, analysing the number of people per year over multiple differently formatted files is as easy as `mlr uniq -f pid,year then count -g year`. It has filtering, very extensive manipulation capabilities and nice documentation.
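In full, with the input format flag and pretty-printed output, that's something like (assuming a directory of CSV exports sharing the pid and year columns):

    mlr --icsv --opprint uniq -f pid,year then count -g year exports/*.csv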
John has also responded very quickly to feature requests I've had (so quickly that I haven't yet gotten around to using them).
+1 for Miller. It's a really slick tool for exploring large csv datasets. I've typically used it to do some prototyping and exploration of 200-500GB csv datasets before doing more hefty work in Java+PigLatin (our use case is more long-term than a single analysis run, so that's the reason for moving out of just Miller). It's great to get a feel for the data and run some initial analysis before diving into the larger, more cumbersome system and tooling.
Nice introduction (based on both tutorials). One suggestion would be to use
-v FPAT='[^,]*|"[^"]+"'
instead of
BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
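Either way, a complete (gawk-only) invocation ends up looking like this, e.g. printing the second field of a hypothetical data.csv while respecting simple quoting:

    gawk -v FPAT='[^,]*|"[^"]+"' '{ print $2 }' data.csv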
>If you get the awk programming language manual…you’ll read it in about two hours and then you’re done. That’s it. You know all of awk.
I can't wrap my head around this quote. That's a ridiculous claim. Even for an experienced programmer, learning a new programming language in 2 weeks, let alone 2 hours, would be nothing short of a miracle. I've been using awk for the past 2-3 years or so and I wrote a book on GNU awk one-liners earlier this year (https://learnbyexample.github.io/learn_gnuawk/). I'm nowhere close to knowing all of awk.
GNU Awk has lots of extensions and is a relatively big language, but I definitely think you can learn most of the original Awk in two weeks - and you can certainly learn enough to be useful in two hours. It's been a decade now, but my recollection is that I spent a day or two idly reading the Awk book and was able to immediately employ it in simple shell pipelines after not that many pages. I think Awk is useful once you've learned just a bit about it, and I don't think learning a lot about Awk is very useful - there are languages that are better suited for complex tasks.
Fair enough; someone else made a similar point. I had GNU awk in mind when I said "all of awk".
And yeah, it's possible to start using basic field processing and regexp features (if you already know regexps) in 2 hours and to learn most of it in 2 weeks. But that quote could've been something like "you could get started in 2 hours" instead of claiming one could know all of awk.
This HN post inspired me to pick up and work through "The AWK Programming Language", so I have been doing that for the past two days. I'm currently at the beginning of chapter 5 (of 8).
That quote is definitely ridiculous. He may have been referring to just chapter 2, in which the "whole language" is outlined. But the later chapters also take time; there is even an exercise about writing an assembler in AWK, as well as implementations of multiple algorithms.
Some stuff was already somewhat familiar to me from studying/practicing C earlier this year.
Overall a 10/10 book so far, but takes time and patience for sure!
I don't agree that you can learn AWK in its entirety in two hours, but I do (strongly) agree that you can teach yourself the basics in two hours and those two hours are extremely well spent considering that AWK is useful for so many things.
Surely handling the intricacies of CSV can be a pain, but let's face it: most tasks are simple and their results are easily verified for correctness. Obviously, I don't write my production webserver in AWK -- I write simple scripts to parse data (locally), find semantic issues in CSV files, analyze log data, etc.
You can also trivially come across data designed for consumption with AWK. Whenever I write a script or utility to dump some data for later analysis, I take into consideration if I can shape the output in a way that will make it easy to work with in AWK.
Because I won't know ahead of time what all the interesting values or summaries will be, but if I know I can get it through AWK it will be cheap to extract or compute those values and summaries later, once I know what they are.
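For instance, if a tool dumps one whitespace-separated record per line, say a hypothetical `timestamp status bytes`, the later summaries are one-liners:

    awk '$2 == 200 { sum += $3 } END { print sum }' dump.log          # total bytes for status 200
    awk '{ n[$2]++ } END { for (s in n) print s, n[s] }' dump.log     # count per status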
The biggest problem with awk is that unless you use it regularly, you'll easily forget how to use it. So it's important to keep a running cheatsheet like you did. Nothing more frustrating than forgetting the awk command and searching for it in your history only to find it aged out.
> He said, “If you get the awk programming language manual…you’ll read it in about two hours and then you’re done. That’s it. You know all of awk.”
It's hyperlinked to the Gawk manual, but it seems likely he actually meant A, W & K's The Awk Programming Language (1988), which you could conceivably read in 2 hours, as it's a joy to read. I used it and The C Programming Language as exemplars of great documentation when writing my own.
It's one of my favorite technical books, with many examples of the application of the language, including the design of small languages and experimentation with algorithms.
The O'Reilly book is "Sed & Awk" published in the 90's (https://www.oreilly.com/library/view/sed-awk/1565922255/). The languages work together as a team (like lex and yacc). And the book takes about 2 hours to read. Unix is built on the premise of "lots of simple tools that fit together with pipes".
Although this is like the 1,000th article on how easy it is to learn awk. Why write yet another one?
The reason to write one is to record what you learn, to further embed it into your brain. Why post it? Whoever posted it found it interesting.
I had not seen the others so this is an intro to awk for me. But in general if something has made it to the front page of HN it was interesting enough for enough people to put it there.
Instead of the gawk manual, I recommend the OpenBSD manpage: https://man.openbsd.org/awk.1 It's concise and well written (as most OpenBSD manpages), so you can learn AWK even in less than 2 hours. Of course it handles only AWK, not the gawk extensions.
I always link to this blog post [0] by Jonathan Palardy. It was one of those eye opening articles for me that really helped me 'grok' it - and this is after having had a couple of stabs at learning it just from reading the text books on it.
Awk for me is part of the layers of tools from grep/sed -> AWK -> perl one liner -> full on script. The further down the ladder you go, the more restricted you are, but this means that programs are terser and usually stay readable despite the terseness.
I would also recommend Perl One liners [1] which did the same thing for me for Perl one liners and, subsequently, full Perl.
My favourite bit of gawk (ok, an extended version of awk, but it's probably what you are running) is the network support; the manual even includes writing simple HTTP servers with it. I kid you not.
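A sketch in the spirit of the manual's example (gawk only; the /inet special filename and the |& coprocess operator do the networking):

    gawk 'BEGIN {
      RS = ORS = "\r\n"
      svc  = "/inet/tcp/8080/0/0"              # listen on local port 8080
      body = "<html><body>hello from gawk</body></html>"
      len  = length(body) + length(ORS)
      print "HTTP/1.0 200 OK"          |& svc
      print "Content-Length: " len ORS |& svc
      print body                       |& svc
      while ((svc |& getline) > 0)             # drain the request
        continue
      close(svc)
    }'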
I love Awk. The idea of a programming language that effectively facilitates first-class state machines over text streams fits so well with the Unix pipeline model.
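A tiny example of that: a two-state machine that prints only the lines between (hypothetical) START and END markers:

    awk '/^START$/ { inblock = 1; next }
         /^END$/   { inblock = 0 }
         inblock' notes.txt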
My story about learning the power of awk: I was working on a dataset and had written a page or so worth of Perl... and kept bungling it. I emailed the scientist who wrote the paper whose technique I was emulating, and he said, why not use awk, and gave me a basic example one-liner; with a bit of modification it worked! Of course all that really means is I suck at Perl, but it was an eye opener for how powerful awk is.
I learned awk before Perl and it helped a lot. I still use awk (and sed) regularly but for more complex tasks involving regexes and such Perl is my favourite.
Here's a basic network calculator I wrote in awk once because I had to do these calculations on a minimal SunOS system that didn't have any other way to do math from the shell.
Awk and shell scripting is a rabbit hole that is best avoided IMHO. Often it's just easier to crank out a "real" program in your most familiar programming language, or at the very least Python, instead of starting to work with shell or awk scripting.
Awk is a full-blown programming language though, comparable to Python but allowing for much more compact code. Though I completely agree on avoiding shell; the only good shell I've seen is rc, but relying on external utilities is a double-edged sword: forking is too costly over time.