Please stop spreading bs. It's not throwing away structure. Piping text doesn't ...

dang · on March 3, 2017

> Please stop spreading bs.

This breaks the HN guidelines: https://news.ycombinator.com/newsguidelines.html. Please edit out such bits. This would be a fine comment without it.

TeMPOraL · on March 3, 2017

Piping isn't the culprit. It's what you pipe that is, and Unix Philosophy says "pipe whatever the hell you want, and let the users upstream sort it out".

It's not about encoding everything in exactly the same way. It's about providing the basic, shared protocol for representing structure. With typical Unix tools, you don't have "simpler encodings", you have no encoding at all. Each tool outputs whatever its particular author felt like (and this changes between Unix systems), each tool parses things in whatever way its author felt like, and as a user your job is to keep transforming unstructured blobs of text to glue them together.

jstimpfle · on March 3, 2017

[flagged]

TeMPOraL · on March 3, 2017

> Name a well-thought out text file format that can't correctly be parsed e.g. by a Python one-liner with basic string operations. And please don't include: JSON, XML, YAML, sexps, because it's not possible, at least not without a library.

Well, because this library should be a part of the OS API.

A set of conrete cases where existing practice is bad is Unix itself (and its descendants). Think of every time a script breaks, does something unexpected, or introduces a security vulnerability, because every program has to contain its own, half-assed parser for textual data.

jstimpfle · on March 3, 2017

> because every program has to contain its own, half-assed parser for textual data.

As I said, name me a format that I can't parse correctly as a Python one-liner.

I work as a systems administrator and my scripts (mostly shell, python) don't break. I'm not kidding you.

Of course when writing shell scripts (which I think you imply) I need to know how to write non-broken shell scripts (in a clean and straightforward way) and I will freely admit that it's not easy to learn how to do it. Partly because shell has some flaws, but more because of the insane amount of broken scripts and tutorials out there.

But it's not even about defending shell. We are talking about text representation.

> Well, because this library should be a part of the OS API.

You are free to postulate that but it's won't make it less work. By the way "OS API" is ridiculous. These libraries have to be implemented for every language (and they have been, for most popular languages).

swhipple · on March 3, 2017

> As I said, name me a format that I can't parse correctly as a Python one-liner.

mboxo? [1] It is a popular text format that cannot be unambiguously parsed.

More generally, most Unix tools' output is also not able to be unambiguously parsed. For example, use gcc to compile a file, then collect the warnings? The regex "^.+:\d+:\d+: warning.*" will be right most of the time, but there's no 'correct' way to parse gcc output (there is not a surjective mapping of output to input).

There are various ways to work around the problem: mboxrd format uses an escape sequence to work around the earlier problem mentioned with mboxo. `ls -l --dired' (GNU) will allow you to parse ls by appending filename byte offsets to the output. `wc --libxo xml` (FreeBSD) will give the output in XML, which is unambiguous as well. multipart/form-data (RFC2388) is used to embed binary data in a text format, by using a byte sequence which doesn't appear in the data.

Binary formats present their own set of issues, but "accidentally unparseable" is more common in text-based formats (or ad-hoc text output).

[1] https://jdebp.eu/FGA/mail-mbox-formats.html

jstimpfle · on March 3, 2017

Thanks!

It's true that filenames with whitespace or newlines are bad for interoperability ("make" is another example). There are three simple options: escaping filenames, making filenames NUL-terminated or declare such filenames as invalid. The latter way seems to have won for practical reasons, and it's a pity that "safe filenames" were never standardized (but C-identifier plus extension should be safe everywhere).

Mbox is definitely broken (for example body lines that start with "From" are changed to "> From"). I don't think it is ambiguous today (all software I know interprets "From " at the beginning of a line as a new mail), but it clearly was not much designed at all. It still has some precious properties which is why it's still in use today. For example, appending a new email (Mail server) is very fast. Crude interactive text search works also very well in practice, although automation can't really be done without a library.

Email is complex data (not line- or record-oriented), so various storage formats achieving various tradeoffs are absolutely justified.

> Binary formats present their own set of issues, but "accidentally unparseable" is more common in text-based formats.

It's true, especially with formats from the 70s where the maxime was "be liberal in what you accept", and where some file formats weren't really designed at all.

On the other hand, "accidentally unextendable" (for example, fixed-width integers) and "accidental data loss" is much more common in binary formats.

TeMPOraL · on March 3, 2017

> As I said, name me a format that I can't parse correctly as a Python one-liner.

Sorry, I misread that in your previous comment as "name me a format that I can parse correctly with a Python one-liner, without special libraries".

Anyway, the original article contains numerous examples of the issues I'm talking about; scroll to "Let’s begin with a teaser" and read from there. The point being, it's very difficult to correctly parse output in general case, because unstructured text doesn't reliably tell you when various data items begin and end. Most people thus won't bother with ensuring their ad-hoc parsing is correct.

> By the way "OS API" is ridiculous. These libraries have to be implemented for every language (and they have been, for most popular languages).

Sure each language has to implement its own bindings to the OS. My point is that there should be a structured format defined as standard on the system level, so that all CLI programs could use the same parser and generator instead of each rolling their own.

jstimpfle · on March 3, 2017

> Let’s begin with a teaser. How can we recursively find all the files with \ name in a folder foo? The correct answer is: find foo -name '\\\\'

He doesn't know shell quoting (or has problems with the blogging software). It's '\\' and there is nothing wrong with that (-name accepts a pattern, not a fixed string)

> How to touch all files in foo (and its subfolders)? At first glance, we could do it like this: find foo | while read A; do touch $A; done

No.

     find foo -exec touch {} \;
     #or
     find foo -print0 | xargs -0 touch

These examples only prove that the author is not proficient at shell.

And we are not talking about shell (which does have flaws) but text representation. You still haven't provided the text format I asked for.

> To argue for the OP, consider the case of passwd being parsed on every system call. That is simply sub-optimal.

As you know there are various encoding schemes, but mostly character separated (space, newline, NUL, colon, whatever) or record-oriented (two separator levels, often newline and space/colon/comma.

In most places, only identifiers are allowed ("and" ("there" ("is") ("no" "fucking" "point") "in" ("wrapping" "everything" "in" ("quotes" "and" "parens"))). Just write things like this, and parsing won't be any harder than splitting on whitespace. Was that so hard?

pdimitar · on March 3, 2017

> Author is not proficient at shell

> Partly because shell has some flaws, but more because of the insane amount of broken scripts and tutorials out there.

So what are you saying then? Basically, "git gud"? I am struggling to find your exact argument here. I wonder if you keep saying "it's not broken, you're just using it wrong", or "you must be proficient and if you're not, it's nobody's fault", or what exactly?

The main argument here is IMO that unstructured text which can be parsed with space/tab delimiters in mind is NOT good enough. You say it is. I disagree; I've had numerous cases in my career where any random dev never takes that into account and just throws almost-native-English files into a Linux VM expecting a 1970s system tool to be able to parse it and make sense of it.

Their fault? Absolutely and definitely. But it's the job of the tech to slap you through the wrist if you are not obeying to standards. Computers are not AI and they need protocols / standards. Are there standards in piping things between processes in UNIX/Linux? No.

Then what's the point of technology at all, I ask.

jstimpfle · on March 3, 2017

I clearly said I'm not defending shell. Even when the author is responsible himself for wanting to put a fixed string where a pattern is expected.

But this is about text formats. Text is simple. It's only the overengineering farts who think they have to wrap everything in three levels of parens. It doesn't make a difference.

> Their fault? Absolutely and definitely. But it's the job of the tech to slap you through the wrist if you are not obeying to standards. Computers are not AI and they need protocols / standards. Are there standards in piping things between processes in UNIX/Linux? No.

I just don't get why people keep thinking just because it's "text" it's somehow not standardized (enough), or why putting things in parens would help.

Please, stop with this vague FUD. Give an actual example.

> Are there standards in piping things between processes in UNIX/Linux? No.

That's called separation of concerns. That the kernel doesn't care doesn't mean that the endpoints don't care.

pdimitar · on March 3, 2017

> Text is simple.

Sigh. I am not here to argue with your out-of-context sweeping generalizations. So I won't.

BTW, do you have a particular gripe with S-expressions / LISP? You ranted twice about parens in your comment towards me.

And no -- me, the OP, and several others in this thread will definitely not stop with this "vague FUD", "bs", "trolling" -- all your quotes from other comments -- simply because it's something we struggle with regularly.

We all have day jobs. When we stumble upon a piping problem -- be it unable to find an erroring process easily and quickly (sometimes not at all), or unable to understand an exit code, or having to actually look for signal values, or stumbling upon a bug in an older version we're stuck with -- we try our best to get the problem out of the way and move on. Most non-tech-savvy bosses would react extremely bad if you told them you're spending hours or days on a problem they perceive as one small piece of the glue you're using to put a painting together, and especially when they find out that you're not even at the part where you must hang the painting on the wall (example: deployment). And that's a fact of the daily life of many devs. You can call that a vague FUD if you wish. <shrugs>

So forgive all of us working folk who don't keep Star Trek-like exact logs on every problem we ever bump into. /s

The negative impressions build up with time. You can try calling for 100% scientific method on this but I can bet my neck that if I've known every single minute of your life, I'd catch you with your pants down on many occasions that you don't keep a detailed record on everything that has ever frustrated you. Can you deny this? If not, then I don't understand why you are holding on to a strictly scientific approach on things people bump daily into but can never excuse spending huge amounts of time on, in front of their bosses. Peace?

TL;DR:

Since we have jobs and we must go on about it relatively quickly, most of us never spend the effort to write down every single instance where the UNIX shell semantics have made our lives harder but we managed to pull through via a workaround and just went on about our business minutes or hours later.

jstimpfle · on March 3, 2017

Again, you have ignored that this discussion is not about shell (which I know, including its few flaws, and can easily deal with, but am in no way trying to describe as easy to learn given that there are so many broken scripts and tutorials out there. It's hard to just learn the quoting rules for once, and browse through "bash pitfalls" once, simply because people don't know where to look for good resources. And I have freely admitted it was hard for me as well. Nevertheless I seriously recommend learning it rigorously because it has tremendous practical benefits).

This discussion is about text representations. Why do you keep claiming that text formats are broken when you can't give a single example?

> BTW, do you have a particular gripe with S-expressions / LISP? You ranted twice about parens in your comment towards me.

I will rant again until people stop making stupid claims.

I actually like LISP as a programming language. There is just zero benefit from writing record- or even word-oriented data in a (random) free-form syntax that is meant for representing trees. If I wanted I could parse /etc/passwd format like this:

  struct_passwd = namedtuple("passwd", "pw_name pw_passwd pw_uid pw_gid pw_gecos pw_dir pw_shell")
  passwd = [struct_passwd(* line.rstrip('\n').split(':')) for line in open('/etc/passwd')]

That's it. It works. There, I even made a nice datatype for you. And there's already more integrity checking in these two lines compared to a json.parse() or similar.

It works so nicely that I'm even making a text format for such databases with strong schema support that can still quite easily be used with generic text tools (git, grep, sed, awk, diff...). http://jstimpfle.de/projects/wsl/main.html

> So forgive all of us working folk who don't keep Star Trek-like exact logs on every problem we ever bump into. /s

Never asked for that. Give a single reasonable example why text file practice is bad, to get any credibility. It can't be that hard.

> ... And that's a fact of the daily life of many devs. You can call that a vague FUD if you wish. <shrugs>

Well, it's a bit less vague now that you have actually described a little better. But there is no connection to text representations. Sorry, you replied to the wrong thread.

Too · on March 3, 2017

> There exists a shared protocol. It's called "explain it". But that's typically not even needed, the user can just look at the data and figure it out.

This is the root cause of 99% of all parse errors and security holes in the world.

If you just "look" on the output of ls in some arbitrary directory there is nothing there telling you that a file name can contain a newline that will mess up the output. Write your parser with this assumption and it's broken. (See OP)

If i had a penny for every csv-"parser" I've seen that is just data=input.split(','); i would be a rich man now. Because the developer when looking at their data had no comma in any cell. Doesn't mean the customer don't have it.

jstimpfle · on March 3, 2017

I'm pretty sure most security errors come from implementations of complex binary formats. (Okay, there is the web world and I hear people still haven't learnt to escape their SQL queries).

ls is only for human consumption. I said this elsewhere in this thread.

CSV is utterly broken (at least was RFC'ed at some point, but the escaping rules are still shit. We have known for decades how to do it better).

JoeAltmaier · on March 3, 2017

I call "No True Scotsman"

jstimpfle · on March 3, 2017

I don't understand (sorry). Could you explain?

JoeAltmaier · on March 3, 2017

"All things like A are in category X. Except this long list I wrote, but they aren't really A, because I need my syllogism to work."

jstimpfle · on March 3, 2017

You missed my cynism. I was opposed to these formats in the first place.

JoeAltmaier · on March 3, 2017

Which seems at odds with the thesis, at least as far as I can figure it out.

jstimpfle · on March 3, 2017

Look again. For example I wrote "space-separated" in multiple places.