
I'll add "reduce code size and complexity" to the list of benefits. A Python library to calculate a simhash, or track changes on a Django model, or auto-generate test fixtures, will often be 90% configuration cruft for other use cases and 10% the code your app actually cares about. Reading the library, then extracting and fine-tuning the core logic, makes you responsible for the bugs in the 10%, but no longer affected by bugs in the 90%.
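For illustration, the "core logic" of a simhash really is that small. This is a rough sketch, not any particular library's code; MD5 and 64 bits are arbitrary choices here:

    import hashlib

    def simhash(tokens, bits=64):
        # Each token's hash votes on every bit of the fingerprint.
        votes = [0] * bits
        for tok in tokens:
            h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, v in enumerate(votes) if v > 0)

    def hamming(a, b):
        # Similar token sets yield fingerprints with a small Hamming distance.
        return bin(a ^ b).count("1")

    # hamming(simhash("the quick brown fox".split()),
    #         simhash("the quick brown dog".split()))  # -> small

Everything else a full library ships (tokenizers, storage backends, index structures) is the 90% you may not need.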


Hard agree. A library should not inflict the complexity of complex use cases on simple ones, but sometimes it does, either because it's poorly designed or because it's overkill for your use case. But often I see pain and complexity excused with "this is the library that everybody else uses."

Sometimes a simple bespoke solution minimizes costs compared to the complexity of using a massive hairball with a ton of power that you don't need.

One big caveat to this: there's a tendency to underestimate the cost and complexity of a solution that you, personally, developed. If new developers coming onto the project disagree, they're probably right.


The big caveat is a big one. Choose your battles wisely!

There are plenty of things that look simpler than an established library at first glance (I/O of specialized formats comes to mind quickly). However, a lot of the complexity of that established library can wind up being edge cases that you actually _do_ care about, you just don't realize it yet.

It's easy to wind up blind to the maintenance burden of "just a quick add to the in-house version" repeated over and over again, until you wind up with something that has all of the complexity of the widely used library you were trying to avoid.

With that said, I still agree that it's good to write things from scratch and avoid complex dependencies where possible! I just think choosing the right cases to do so can be a bit of an art. It's a good one to hone.


> I/O of specialized formats comes to mind quickly

The classic "I'll write my own csv parser - how hard can it be?"


> The classic "I'll write my own csv parser - how hard can it be?"

I did as part of my work. It was easy.

To be very clear: the CSV files in question are outputs from another tool, so they are much more "well-behaved" and "well-defined" (e.g. no escaping, in particular no escaped newlines; well-known separators; well-known encoding; ...) than many CSV files you find on the internet.

On the other hand, some columns need a little "special" handling (you could also do this as a post-processing step, but it is faster to attach a handler to a column so the handling happens directly during parsing).

Under these circumstances (very well-behaved CSV files, but also wanting the ability to do some processing as part of the CSV reading), any existing CSV-parsing library would likely either be a sledgehammer to crack a nut, or would have to be modified to suit the requirements.

So writing our own (very simple) CSV reader was the right choice.
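For a sense of scale, under those constraints the whole reader can look roughly like this (a sketch with made-up names, not the actual code; the per-column handlers are the "special" handling mentioned above):

    def read_rows(path, handlers=None, sep=",", encoding="utf-8"):
        # Assumes well-behaved input: no quoting, no escaped separators,
        # no embedded newlines. Handlers run per column during parsing.
        handlers = handlers or {}
        with open(path, encoding=encoding) as f:
            header = next(f).rstrip("\r\n").split(sep)
            for line in f:
                values = line.rstrip("\r\n").split(sep)
                yield {col: handlers.get(col, lambda v: v)(val)
                       for col, val in zip(header, values)}

    # e.g. rows = list(read_rows("export.csv", handlers={"value": float}))

None of this survives contact with arbitrary CSV from the wild, but that was never the requirement.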


> very well-behaved CSV files

You were incredibly lucky. I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.


> I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.

To be fair: problematic CSV files do occur. But for the functionality the program provides, it suffices that in such a situation an error message is shown that helps the user track down where the problem with the CSV file is. Or, if the reading does not fail, the user can see in the visualization of the imported data where the error in the CSV file was.

In other words, the program is not expected to gracefully

- detect the "intended" CSV dialect (column separators, encoding, escaping, ...) automatically, or

- correct invalid input files automatically.


CSV is _way_ hairier than folks think it is!!

And for anyone who's not convinced by CSV, consider parsing XML with a regex. "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

I've said it many times myself and been eventually burned by it each time. I'm not saying it's always wrong, but stop and think whether or not you can _really_ trust that "little piece of data" not to grow...
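A toy illustration of how that tends to blow up (contrived input, not from any real incident): one attribute and one commented-out element are enough for the regex and a real parser to disagree.

    import re
    import xml.etree.ElementTree as ET

    doc = '<order><!-- <id>OLD</id> --><id type="internal">42</id></order>'

    # The "lightweight" regex grabs the commented-out value...
    print(re.search(r"<id>(.*?)</id>", doc).group(1))  # OLD

    # ...while an actual parser skips comments and tolerates attributes.
    print(ET.fromstring(doc).findtext("id"))           # 42

And that's before namespaces, CDATA, or entities enter the picture.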


> "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

relevant:

> ruby-saml was using two different XML parsers during the code path of signature verification. Namely, REXML and Nokogiri

where "REXML" does exactly what you described, and hilarity ensued

Sign in as anyone: Bypassing SAML SSO authentication with parser differentials - https://news.ycombinator.com/item?id=43374519 - March 2025 (126 comments)


A plural of regex is regrets...


What are some footguns? It does seem easy


It's easy if the fields are all numbers and you have a good handle on whether any of them will be negative, in scientific notation, etc.

Once strings are in play, it quickly gets very hairy though, with quoting and escaping that's all over the place.

Badly formed, damaged, or truncated files are another caution area— are you allowed to bail, or required to? Is it up to your parser to flag when something looks hinky so a human can check it out? Or to make a judgment call about how hinky is hinky enough that the whole process needs to abort?
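To make the quoting point concrete, here's a tiny made-up example: one field containing a comma, escaped quotes, and a newline is enough to defeat split-on-commas.

    import csv, io

    data = 'id,comment\n1,"Hello, ""world""\nline two"\n'

    # Naive parsing tears the record apart at the embedded comma and newline.
    print([row.split(",") for row in data.splitlines()][1])
    # ['1', '"Hello', ' ""world""']

    # An RFC 4180-aware reader keeps the field intact.
    print(list(csv.reader(io.StringIO(data)))[1])
    # ['1', 'Hello, "world"\nline two']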


Even with numbers, some locales use a comma `,` as the decimal separator and some use the dot `.`, so that can cause headaches out of the box.


Beyond the basic implementation of quoting and escaping, those are things you also have to worry about if you use someone else's csv parser.

And if you implement your own, you get to choose the answers you want.


What do you mean "allowed to bail"?

Regardless of the format, if you're parsing something and encounter an error, there are very few circumstances where the correct action is to return mangled data.


Maybe? If the dataset is large and the stakes are low, maybe you just drop the affected records, or mark them as incomplete somehow. Or generate a failures spool on the side for manual review after the fact. Certainly in a lot of research settings it could be enough to just call out that 3% of your input records had to be excluded due to data validation issues, and then move on with whatever the analysis is.

It's not usually realistic to force your data source into compliance, nor is manually fixing it along the way typically a worthwhile pursuit.


multiline values, comma vs semicolon, value delimiter escaping


At my current workplace the word "bespoke" is used to mean anything that is "business logic", and everyone is very much discouraged from working on such things. On the other hand, we've got a fantastic set of home-made tooling and libraries, all impressive software engineering, almost as good as the off-the-shelf alternatives.



