
Parsing anything that was originally developed by humans writing it on paper (or clay tablet, or whatever) is a nightmare. Natural means chaotic.

If your CSV file contains any field entered by humans, AWK isn't going to be powerful enough to parse it at scale. Someone somewhere is going to have the name 'Mbat"a, Sho,dlo' in some bizarre-ass romanization that breaks your parser (and this assumes you're not accepting Unicode, which is a whole other can of worms that AWK is not prepared to deal with).
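To make the failure concrete, here is a minimal sketch in Python (not AWK) of what happens to that name. It assumes the file was written with standard double-quote escaping, which is the default dialect of Python's `csv` module; the second column `42` is a made-up value for illustration.

```python
import csv
import io

name = 'Mbat"a, Sho,dlo'

# Serialize one record using the csv module's default dialect:
# fields containing commas or quotes are wrapped in quotes,
# and embedded quotes are doubled.
buf = io.StringIO()
csv.writer(buf).writerow([name, "42"])  # "42" is a hypothetical second column
line = buf.getvalue().rstrip("\r\n")

# Naive approach: split on every comma, which is roughly what a
# one-liner with a comma field separator does.
naive_fields = line.split(",")

# Quote-aware approach: let a real CSV parser handle the quoting.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split sees four fields where the record only has two, while the quote-aware reader recovers the original name intact.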




I'm saving that as a test case name but making some small adjustments.

'Mbaät\"a, Sho,dló'

"a" followed by "ä" because some suggest encoding umlauts as double characters. When you decode that, does it go first or second?

Answer: use Unicode.

Throw an escape character in there before the quote character, just to make it interesting.
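A rough sketch of how that name could be used as a round-trip test case, in Python. The assumptions here are mine: the data is stored as UTF-8, the backslash is meant as a literal character in the name, and the writer and reader both use the `csv` module's default dialect (no escape character, embedded quotes doubled).

```python
import csv
import io

# Raw string so the backslash stays a literal character in the name.
name = r'Mbaät\"a, Sho,dló'

# Write the name as a single-field CSV record, then encode to bytes
# the way it would sit on disk (UTF-8 is an assumption).
buf = io.StringIO()
csv.writer(buf).writerow([name])
encoded = buf.getvalue().encode("utf-8")

# Decode and parse it back with the same dialect.
decoded = encoded.decode("utf-8")
round_tripped = next(csv.reader(io.StringIO(decoded)))[0]

print(round_tripped == name)
```

If the reader were configured with a backslash escape character while the writer was not, the backslash would be consumed on the way back in, which is the kind of mismatch this name is good at exposing.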

This is a good time for everyone to review: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...



