STOP.

People insist on making data exchange formats, which are used to transfer data between software 99.99% of the time, "easy for humans" for the 0.001% of the time a human actually reads them, at the expense of the common case. This inevitably results in squishy, inefficient formats like HTML and JSON that become performance quagmires and security land mines: http://seriot.ch/projects/parsing_json.html

CSV is not even easy for humans to process, with or without tools like grep. Can you successfully handle quote escaping and line-wrapping?

    This ,"is ",    valid 
    "Hello",123,"''from the
    not,what,you,
    ""think"",next line"""
That's one record, by the way.

Your level one challenge is: Extract the unescaped value of the "valid" column. Write your robust, tested, "grep" command-line below.

Level two: Make that parse consistently across a dozen tools from a dozen vendors. Try ETL tools from Oracle, Sybase, and SAP. Try Java, PowerShell, and Excel. Then import it via an ODBC text driver, and also into PowerBI. I wish you luck.
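
For what it's worth, here is what a stateful parser makes of the sample above. This is just a quick sketch with Python's csv module, assuming the four lines are saved exactly as shown; grep, being line-oriented, can't track the quoting state across lines.

    import csv, io

    # The sample from above, byte for byte (note the trailing/leading spaces).
    sample = (
        'This ,"is ",    valid \n'
        '"Hello",123,"\'\'from the\n'
        'not,what,you,\n'
        '""think"",next line"""\n'
    )

    rows = list(csv.reader(io.StringIO(sample)))
    print(len(rows))     # 2: a header row and one record spanning three physical lines
    print(rows[1][2])    # the "valid" column, with its embedded newlines, commas and quotes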



This is disingenuous. There isn't really a CSV standard that defines the precise grammar of CSV.

But suppose there were such a standard: then some of the problems you've described would become non-problems (e.g. support across multiple tools). Potentially, if such a standard disallowed line breaks, it would also make the use of grep easier.

I actually really wish there was such a standard...

Your other assertion about "squishy" formats is also... well, not really true. The formats you listed became popular due to the data they encoded, not because of their quality. It's very plausible that better formats exist; they are probably out there already. The problem is that we set the bar too low with popular formats, and that they often seem easy to implement, which results in a lot of junk formats floating around.


> There isn't really a CSV standard that defines the precise grammar of CSV.

Did you read the link going to a page literally titled: "Parsing JSON is a Minefield."?

JSON has a "precise" grammar intended to be human readable. The end result is a mess, vulnerable to attacks due to dissimilarities between different implementations.
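
To make the "dissimilarities between different implementations" concrete, here is a small sketch using Python's json module; the JavaScript comparison in the comment is standard double rounding, not something specific to the linked article.

    import json

    # Duplicate keys are allowed by the grammar; parsers disagree on the result.
    # Python's json keeps the last value; other parsers keep the first, or reject it.
    print(json.loads('{"a": 1, "a": 2}'))    # {'a': 2}

    # Large integers: Python parses them exactly, while JavaScript's JSON.parse
    # rounds them to the nearest double (here, 9007199254740992).
    print(json.loads('9007199254740993'))    # 9007199254740993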

Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

Why bother, you ask? Why would anyone bother to make floating point number parsing super efficient?

JSON.


> Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

It's not a parsing library but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double; it's not by Google, but it was indeed motivated by parsing JSON fast: https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...


> Did you read the link going to a page literally titled: "Parsing JSON is a Minefield."?

No, I didn't. But... why is this relevant to CSV?

I think I might have read an article with a similar name before. And, in general, I am not a big fan of JSON, but I didn't think this link contained a standard for CSV...

Also, JSON has a defective grammar, with bad semantics, so, yeah, having a standard didn't save it... but it could have saved it. Not having a standard would've been even worse.


JSON is well defined but hard to parse. CSV is not well defined, but you still have to parse and print it in a way that is compatible with every other CSV implementation; otherwise, what is the point? Use xlsx spreadsheets instead; those actually work, even if they are a bit bloated.


>There isn't really a CSV standard that defines the precise grammar of CSV.

If you don't implement the pseudo-standard that pretty much every library uses, then whatever format you have can be called CSV, but you can bet it will be compatible with nothing else, since you have implemented a bespoke file format with no guarantee of compatibility. If it works at all, it is an accident, and it probably shouldn't.


One problem that needs to be solved and standardized is escaping.

It's extremely common for text formats to store text. Some characters always need to be escaped, but the set of characters, and the escaping mechanism, is unique to each host format.
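
As an illustration, the same string comes out escaped quite differently from Python's standard json and csv modules (library defaults assumed):

    import csv, io, json

    value = 'He said "hi",\nthen left'

    # JSON escapes with backslashes: \" for the quote, \n for the newline.
    print(json.dumps(value))

    # CSV wraps the field in quotes and doubles the embedded quotes;
    # the newline is written as-is inside the quoted field.
    buf = io.StringIO()
    csv.writer(buf).writerow([value])
    print(buf.getvalue())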


I downvoted you, not because I disagree with your point (although I do disagree with it), but because of your peremptory and dismissive usage of "STOP" and "level one challenge."

This issue is not so straightforward that you should be rudely stating that people with other views should just stop and go back to the drawing board. Not that I think one should ever do that, even if the issue were simple.


Excel is ubiquitous and makes CSV easy for humans to read.

Machines handle the 99.99% of the time that things are working. The 0.01% when humans need to get into the loop is when something goes wrong, and life is better when you can read the data to figure out what went wrong. JSON is amazing for this. CSV not so much, but it's still better than a binary format.

Machine efficiency is not the most valuable metric. Machines are built for human use. The easier a system is for humans to use the better. Even if it's orders of magnitude less efficient.


Excel does things to CSV data that shouldn’t be done.

Encodings are troublesome (see ftfy, a miracle library for fixing them). Everything might be a date, even genes. Leading zeros get dropped, even when significant (like routing numbers). Long numeric strings get turned into scientific notation (like account numbers).


Excel doesn't make CSV easy to read.

Type information has to be communicated out of band with CSV. If you want to view the data in Excel, just use xlsx files. Compared to CSV, they are a joy to work with.


Accounting wants to work with a portion of the Parquet database. Your options are: give them data in CSV/Excel formats, or hire a dev to work for accounting full time and still have them constantly complaining about having to wait on the dev to get them the data they need.

Programmers, data scientists, and other technical staff should never process data in Excel, but you can't expect that of the whole organization, so exporting to it is important. At this point, I'd recommend XLSX as the export format because it's harder to mess up the encoding, but libraries exist for CSV and others as well.
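
A minimal sketch of that export path, assuming pandas with pyarrow and openpyxl installed; the file and column names here are made up:

    import pandas as pd

    # Pull just the slice accounting cares about out of the Parquet data...
    df = pd.read_parquet("sales.parquet", columns=["invoice_id", "amount", "booked_at"])

    # ...and hand it over as XLSX, which carries types and encoding along with it.
    df.to_excel("sales_for_accounting.xlsx", index=False)

    # CSV export is one line too, with all the usual encoding/typing caveats:
    # df.to_csv("sales_for_accounting.csv", index=False)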


After being constantly frustrated by the way Excel was mangling data I'd output as CSV, but being unwilling to implement real XLSX, I discovered that I can actually output a UTF-8 encoded web page which contains nothing but a charset declaration, a title, and an HTML table, lie that it's application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, and Excel will open it without problems. So will LibreOffice.
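
Roughly that trick, sketched in Python; the table contents are made up, and over HTTP it's the Content-Type header that does the lying:

    import html

    rows = [("name", "amount"), ("Alice", "1,200"), ("Bob", "950")]

    table = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    page = ("<html><head><meta charset='utf-8'><title>Export</title></head>"
            f"<body><table>{table}</table></body></html>")

    # Serve this with Content-Type
    # application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    # (or just save it under an .xlsx name) and Excel / LibreOffice will open it.
    with open("export.xlsx", "w", encoding="utf-8") as f:
        f.write(page)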


You can give them anything from a SQL interface to a BI tool like Superset to a no-code tool like Thoughtspot. They might still export some of it to CSV, but having a GUI is much better than having CSV.


I think the issue is more one of recovering misformatted or incomplete data when something goes wrong, or otherwise understanding what's in the file if reading it in one format doesn't work as intended (e.g. if the file format goes out of use and maintainers are unfamiliar with it). CSV has issues for sure, and yet you have posted content that's readable at some level as text. It's not a quasirandom selection of Unicode characters that looks like some private key.

I think these types of examples can be misleading because they take an example of pathological CSV and don't present the comparison case, which is typically a binary format. Sure, your example is a problem, but what if it were in a nested binary format? Would that be easier for humans to read if the file were mangled? (It's also often more difficult to resolve formatting errors from just one record, so that's misleading too.)

I disagree with the idea that humans might want to look in the file using text or other software "0.001%" of the time. Sure in some settings it's rare but in others it's pretty common.

CSV is a bit of a misnomer anyway, because the field separator can really be anything, so you should use a separator that is extremely unlikely to appear in a field.
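
(For what it's worth, the ASCII unit separator control character, 0x1F, was designed for exactly this. A quick sketch with Python's csv module, assuming no field ever contains that byte:)

    import csv, io

    buf = io.StringIO()
    # 0x1F as the delimiter: commas in the data no longer need any quoting.
    csv.writer(buf, delimiter="\x1f").writerow(["Hello, world", "1,200", "42"])
    print(repr(buf.getvalue()))    # 'Hello, world\x1f1,200\x1f42\r\n'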

Anyway, I'm not actually trying to defend CSV as the ideal, as I think it is way, way overused, in part because of the issues you raise. But I also think that being easy for humans to read, in a failsafe file format (usually text), is an important criterion.

I guess in general I haven't been convinced that any of the data format proposals that get some level of publicity are especially better than the others. For some use cases, probably, but not for others. Sure, that's how it works, but I wish there were better options getting traction or attention. It seems like when I do see something promising, it often doesn't get off the ground, and there isn't a robust competition of ideas in the space. Often the reading and writing software drives data format use; that's not necessarily my preference, but it seems to often be the case.


> recovering misformatted or incomplete data when something goes wrong

For CSV, what goes wrong is usually people editing the file as text or tools consisting of "a couple of lines of anything" that don't handle escaping and delimiters correctly.
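
The classic failure mode of those couple-of-lines tools, sketched in Python:

    import csv, io

    line = '"Smith, John",42'

    # Naive split on commas ignores quoting and gets the field count wrong:
    print(line.split(","))                        # ['"Smith', ' John"', '42']

    # A real CSV parser tracks the quoting state:
    print(next(csv.reader(io.StringIO(line))))    # ['Smith, John', '42']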


HTML and JSON are not good examples of text formats designed with security or performance in mind.


That honor belongs to XML


Sarcasm, I hope



