Hacker News

That's because the intended purpose is either useless (for machine control characters) or useless and logically impossible (for delimiters).

What do you do if you have a record that includes a record separator character? Given that you have this problem anyway, why do you want a character dedicated to achieving the same thing that a comma achieves?



The record separator isn't on people's keyboards, so it's less likely to show up where it's not expected. Also it's less likely to legitimately occur in something like a name, so there are many users of CSVs who can say they will never need to consider data containing a record separator, and they will be right more often than those who never consider data containing a comma.

Of course, the fact that record separators aren't on keyboards is probably why CSVs use commas.


In the DOS days, you could "type" control characters by pressing Ctrl and the corresponding letter key: Ctrl+M is Carriage Return, Ctrl+H is Backspace, Ctrl+Z is End Of File, etc.

It was probably possible to type an RS with Ctrl+Shift+. and the others with similar combos.


In a desktop Linux terminal, Ctrl-^ or Ctrl-~ work for me. In a tty, I need to press Ctrl-V before them.


Yeah Linux still works exactly this way. The modern WIN32 API even works that way too. When you ReadConsoleInput() it gives you teletypewriter style keyboard codes. When I wrote a termios driver for Cosmopolitan to have a Linux-style shell in CMD it really didn't take much to translate them into the Linux style. We're all still using glorified teletypes at the end of the day. It will always be the substrate of our world. One system built upon another older system.


I think it's worth mentioning that Ctrl-A is ASCII 1, Ctrl-B is ASCII 2, etc., as it is in Unix today.
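To make that mapping concrete, here is a minimal sketch of the classic convention, where Ctrl keeps only the low five bits of the key's ASCII code. The helper name `ctrl` is made up for illustration, and real terminals may intercept or remap some of these combinations.

```python
def ctrl(key: str) -> int:
    """Return the ASCII code that Ctrl+<key> produces under the
    teletype-style convention: keep only the low five bits.

    `ctrl` is a hypothetical helper name, not part of any library.
    """
    return ord(key.upper()) & 0x1F


print(ctrl("A"))   # 1  (SOH)
print(ctrl("H"))   # 8  (Backspace)
print(ctrl("M"))   # 13 (Carriage Return)
print(ctrl("Z"))   # 26 (SUB, used as End Of File on DOS)
print(ctrl("\\"))  # 28 (FS, file separator)
print(ctrl("]"))   # 29 (GS, group separator)
print(ctrl("^"))   # 30 (RS, record separator)
print(ctrl("_"))   # 31 (US, unit separator)
```

By the same arithmetic, `~` (0x7E) also masks down to 30, which would be consistent with Ctrl-~ producing RS on some terminals.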


You can still type them: Alt+030 (for instance) on the keypad will insert the RS character. In Windows at least; not sure about other OSes.


On Linux terminals entering control characters is done with the control key, Ctrl-G for example, but they will often be intercepted by the program that is running.

Bash will insert the control character (rather than interpret it) if you prefix it with Ctrl-V.


> Also it's less likely to legitimately occur in something like a name, so there are many users of CSVs who can say they will never need to consider data containing a record separator, and they will be right more often than those who never consider data containing a comma.

No, they'll be right exactly as often, 0% of the time.

But their mistake will show up less frequently, causing more problems when it does.

As soon as it's possible for some of your data to come from someone else's dataset, you're guaranteed to have to accommodate record separators within your data as well as within the metadata. You're better off using a system that plans for this inevitability than one that pretends it can't happen at all.


> No, they'll be right exactly as often, 0% of the time.

> But their mistake will show up less frequently, causing more problems when it does.

Enough people use CSVs (and have limited, small-scale use-cases) that I'd be willing to bet "less frequently" means never for at least 1% of people who use CSVs.

I don't know whether the chance of no problems is worth the increased difficulty of problems that do occur - considering that balance feels a bit silly because if you're aware there could be a problem in a context where you could choose between commas and unit separators, you could just add validation or escaping.


> considering that balance feels a bit silly because if you're aware there could be a problem in a context where you could choose between commas and record separators, you could just add validation or escaping.

As soon as you have validation or escaping, having a record separator character loses its entire purpose. The existence of the character is predicated on the idea that you don't have to do that, and that idea is false.

That's why the character is never used. It's a conceptual mistake that was accidentally enshrined in a series of encoding standards that had enough free space to accommodate it.


> As soon as you have validation or escaping, having a record separator character loses its entire purpose. The existence of the character is predicated on the idea that you don't have to do that, and that idea is false.

I disagree with this - the data needs to be stored somehow, and while other characters (like comma) can be used, having a dedicated character can help - for example if the data might legitimately contain commas or newlines but not unit separators or record separators, then escaping isn't needed if you use unit/record separators (although validation is still necessary).
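A rough sketch of that "validate, don't escape" approach, assuming fields are plain Python strings. The function names `encode` and `decode` are made up for illustration; this is not an established format.

```python
US = "\x1f"  # ASCII unit separator, used here between fields
RS = "\x1e"  # ASCII record separator, used here between records


def encode(rows):
    """Join rows into one string, rejecting any field that contains a separator."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError(f"field contains a separator character: {field!r}")
    return RS.join(US.join(row) for row in rows)


def decode(data):
    """Split the string back into rows. No unescaping is needed,
    because encode() guaranteed the separators never occur inside a field."""
    return [record.split(US) for record in data.split(RS)]


rows = [["Doe, Jane", "line one\nline two"], ["Smith, Bob", ""]]
assert decode(encode(rows)) == rows  # commas and newlines pass through untouched
```

The validation step is the part that cannot be skipped: without it, a field that happens to contain 0x1E would silently split into two records on decode.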


I agree.

TSV is widely used, but lacks a way to escape the tab and newline characters. RS-V is the same, but allows including tabs and newlines in records.


> As soon as you have validation or escaping, having a record separator character loses its entire purpose.

Not true. Validation is easier than escaping.


I can’t think of a case where someone would write a control character like that into something intended for text on purpose. So you might as well disallow it.


The situation that comes up most often, and that you need to consider, is when someone embeds the same sort of file into itself, or chunks of the same sort of file into itself. If using the ASCII characters to delimit fields were common, you'd need to consider that over the course of some moderately interesting system's lifetime, the odds of someone copying and pasting something from an encoded file into the spreadsheet application and picking up the ASCII control characters with it are basically 100%. And while we may be able to say with some confidence that nobody is going to embed a CSV file into a CSV file (and I say only some confidence; the world is weird and I'm sure someone will read this who has actually seen someone do this), there are other situations like HTML-in-HTML (for example, every HTML tutorial ever) that are guaranteed by their nature.

It is still valid to disallow the ASCII control characters, one just has to make sure that it is done comprehensively, in all places users may input them. But that's not created by using ASCII control characters, that's a consequence of the "ban the control characters entirely" approach regardless of what the control characters are.

It's neat when you can get away with it, but I generally prefer to define a robust encoding scheme instead. A minimal one like "replace backslash with double-backslash, replace control characters with backslashed characters" and "replace backslash sequences with their control characters, including backslash-backslash as a single backslash" can be inserted almost anywhere in just a few lines of string replace (or stream processing if you need the speed). The only tricky bit is you need to make sure you get the order correct or you corrupt data, and while I've done this enough to have it almost memorized now I do recall feeling like the correct order is backwards from what I naturally wanted the first few times. But it is simple and robust if you get it right.
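A minimal sketch of a backslash scheme along those lines, under some assumptions of mine: the letters used after the backslash (`u`, `r`, `g`, `f`) are an arbitrary choice, and the function names are hypothetical. Encoding uses plain string replaces with the backslash handled first. For decoding I've swapped in a single left-to-right regex pass rather than a chain of replaces, since that sidesteps the ordering problem entirely.

```python
import re

US, RS = "\x1f", "\x1e"

# Hypothetical mapping chosen for illustration: control character -> escape letter.
_ESCAPES = {"\x1f": "u", "\x1e": "r", "\x1d": "g", "\x1c": "f"}
_UNESCAPES = {letter: ch for ch, letter in _ESCAPES.items()}
_UNESCAPES["\\"] = "\\"  # backslash-backslash decodes to a single backslash


def escape(text):
    # Double the backslashes FIRST; doing it after the control-character
    # replacements would corrupt the escape sequences just produced.
    out = text.replace("\\", "\\\\")
    for ch, letter in _ESCAPES.items():
        out = out.replace(ch, "\\" + letter)
    return out


def unescape(text):
    def repl(match):
        code = match.group(1)
        if code not in _UNESCAPES:
            raise ValueError(f"unknown escape sequence: \\{code}")
        return _UNESCAPES[code]

    # One pass: each backslash consumes exactly the character that follows it.
    return re.sub(r"\\(.)", repl, text, flags=re.DOTALL)


def dump(rows):
    return RS.join(US.join(escape(f) for f in row) for row in rows)


def load(data):
    return [[unescape(f) for f in rec.split(US)] for rec in data.split(RS)]


rows = [["a\x1fb", "c\\d"], ["x\x1ey", "\\r"]]
assert load(dump(rows)) == rows
```

The last field is the interesting case: a literal backslash followed by `r` has to survive the round trip without turning into a record separator.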


Someday I will create both formats: a control-characters-are-banned format (where they are never accepted) and one where they are escaped. That ought to be good enough for all needs!

(A trivial evening project for some; not for all of us)


> What do you do if you have a record that includes a record separator character?

This comes up every time. Options:

1. You disallow it. And you might as well disallow all the control codes except the carriage return, line feed, and other “spacing” characters. Because what are they doing in the data proper? They are in-band signals.

2. You use the Escape character to escape them

3. Weirdest option: if you really want to nest in a limited way you can still use the group and file separator characters
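Option 1 can be sketched in a few lines. This assumes "spacing" characters means tab, line feed, and carriage return; that set is my guess rather than anything specified above, and the function names are made up for illustration.

```python
# Assumption: these are the control codes allowed to remain in the data.
ALLOWED_SPACING = {"\t", "\n", "\r"}


def find_control_chars(text):
    """Return (index, character) pairs for every disallowed ASCII control code."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if (ord(ch) < 0x20 or ord(ch) == 0x7F) and ch not in ALLOWED_SPACING
    ]


def require_clean(text):
    """Return text unchanged, or raise if it carries in-band control codes."""
    bad = find_control_chars(text)
    if bad:
        raise ValueError(f"control characters not allowed in data: {bad!r}")
    return text


require_clean("name\tvalue\r\n")          # fine: only spacing characters
print(find_control_chars("a\x1eb\x07"))   # [(1, '\x1e'), (3, '\x07')]
```

With every field passed through a check like this at input time, the separators in the stored file can only ever be ones the writer put there.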


> What do you do if you have a record that includes a record separator character?

You use the ASCII escape character (0x1B), which is designed for exactly that purpose.


Well, that's what an escape is for. Are we really having a serious discussion in 2024, where someone is suggesting that it's not the responsibility of the software engineer to sanitize inputs before chucking the data into some sort of database?



