
I'll add "reduce code size and complexity" to the list of benefits. A Python library to calculate a simhash, or track changes on a Django model, or auto-generate test fixtures, will often be 90% configuration cruft for other use cases and 10% the code your app actually cares about. Reading the library, then extracting and fine-tuning the core logic, makes you responsible for the bugs in the 10%, but no longer affected by bugs in the 90%.
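For illustration, the "core logic" of a simhash really is that small. This is a rough sketch, not any particular library's code; MD5 and 64 bits are arbitrary choices here:

    import hashlib

    def simhash(tokens, bits=64):
        # Each token's hash votes on every bit of the fingerprint.
        votes = [0] * bits
        for tok in tokens:
            h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, v in enumerate(votes) if v > 0)

    def hamming(a, b):
        # Similar token sets yield fingerprints with a small Hamming distance.
        return bin(a ^ b).count("1")

    # hamming(simhash("the quick brown fox".split()),
    #         simhash("the quick brown dog".split()))  # -> small

Everything else a full library ships (tokenizers, storage backends, index structures) is the 90% you may not need.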


Hard agree. A library should not inflict the complexity of complex use cases on simple ones, but sometimes it does, either because it's poorly designed or because it's overkill for your use case. But often I see pain and complexity excused with "this is the library that everybody else uses."

Sometimes a simple bespoke solution minimizes costs compared to the complexity of using a massive hairball with a ton of power that you don't need.

One big caveat to this: there's a tendency to underestimate the cost and complexity of a solution that you, personally, developed. If new developers coming onto the project disagree, they're probably right.


The big caveat is a big one. Choose your battles wisely!

There are plenty of things that look simpler than an established library at first glance (I/O of specialized formats comes to mind quickly). However, a lot of the complexity of that established library can wind up being edge cases that you actually _do_ care about, you just don't realize it yet.

It's easy to wind up blind to the maintenance burden of "just a quick add to the in-house version" repeated over and over again, until you wind up with something that has all of the complexity of the widely used library you were trying to avoid.

With that said, I still agree that it's good to write things from scratch and avoid complex dependencies where possible! I just think choosing the right cases to do so can be a bit of an art. It's a good one to hone.


> I/O of specialized formats comes to mind quickly

The classic "I'll write my own csv parser - how hard can it be?"


> The classic "I'll write my own csv parser - how hard can it be?"

I did as part of my work. It was easy.

To be very clear: the CSV files in question are outputs from another tool, so they are much more "well-behaved" and "well-defined" (e.g. no escaping, in particular no escaped newlines; well-known separators; well-known encoding; ...) than many CSV files you find on the internet.

On the other hand, some columns need a little "special" handling (you could also do this as a post-processing step, but it is faster to attach a handler to a column so the handling happens directly during parsing).

Under these circumstances (very well-behaved CSV files, but also wanting the ability to do some processing as part of the CSV reading), any existing CSV-parsing library would likely either be a sledgehammer to crack a nut, or would have to be modified to suit the requirements.

So writing our own (very simple) CSV reader was the right choice.
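For a sense of scale, under those constraints the whole reader can look roughly like this (a sketch with made-up names, not the actual code; the per-column handlers are the "special" handling mentioned above):

    def read_rows(path, handlers=None, sep=",", encoding="utf-8"):
        # Assumes well-behaved input: no quoting, no escaped separators,
        # no embedded newlines. Handlers run per column during parsing.
        handlers = handlers or {}
        with open(path, encoding=encoding) as f:
            header = next(f).rstrip("\r\n").split(sep)
            for line in f:
                values = line.rstrip("\r\n").split(sep)
                yield {col: handlers.get(col, lambda v: v)(val)
                       for col, val in zip(header, values)}

    # e.g. rows = list(read_rows("export.csv", handlers={"value": float}))

None of this survives contact with arbitrary CSV from the wild, but that was never the requirement.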


> very well-behaved CSV files

You were incredibly lucky. I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.


> I've never heard of anyone who insisted on integrating via CSV files who was also capable of consistently providing valid CSV files.

To be fair: problematic CSV files do occur. But for the functionality the program provides, it suffices that in such a situation an error message is shown that helps the user track down where the problem with the CSV file is. Or, if the reading does not fail, the user can see in the visualization of the imported data where the error in the CSV file was.

In other words, the program is not expected to gracefully

- detect the "intended" CSV dialect (column separators, encoding, escaping, ...) automatically, or

- correct invalid input files automatically.


CSV is _way_ hairier than folks think it is!!

And for anyone who's not convinced by CSV, consider parsing XML with a regex. "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

I've said it many times myself and been eventually burned by it each time. I'm not saying it's always wrong, but stop and think whether or not you can _really_ trust that "little piece of data" not to grow...
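A toy illustration of how that tends to blow up (contrived input, not from any real incident): one attribute and one commented-out element are enough for the regex and a real parser to disagree.

    import re
    import xml.etree.ElementTree as ET

    doc = '<order><!-- <id>OLD</id> --><id type="internal">42</id></order>'

    # The "lightweight" regex grabs the commented-out value...
    print(re.search(r"<id>(.*?)</id>", doc).group(1))  # OLD

    # ...while an actual parser skips comments and tolerates attributes.
    print(ET.fromstring(doc).findtext("id"))           # 42

And that's before namespaces, CDATA, or entities enter the picture.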


> "I don't need a full XML parser, I just need this little piece of data! Let's keep things lightweight. This can just be a regex..."

relevant:

> ruby-saml was using two different XML parsers during the code path of signature verification. Namely, REXML and Nokogiri

where "REXML" does exactly what you described, and hilarity ensued

Sign in as anyone: Bypassing SAML SSO authentication with parser differentials - https://news.ycombinator.com/item?id=43374519 - March 2025 (126 comments)


A plural of regex is regrets...


What are some footguns? It does seem easy


It's easy if the fields are all numbers and you have a good handle on whether any of them will be negative, in scientific notation, etc.

Once strings are in play, it quickly gets very hairy though, with quoting and escaping that's all over the place.

Badly formed, damaged, or truncated files are another caution area— are you allowed to bail, or required to? Is it up to your parser to flag when something looks hinky so a human can check it out? Or to make a judgment call about how hinky is hinky enough that the whole process needs to abort?
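To make the quoting point concrete, here's a tiny made-up example: one field containing a comma, escaped quotes, and a newline is enough to defeat split-on-commas.

    import csv, io

    data = 'id,comment\n1,"Hello, ""world""\nline two"\n'

    # Naive parsing tears the record apart at the embedded comma and newline.
    print([row.split(",") for row in data.splitlines()][1])
    # ['1', '"Hello', ' ""world""']

    # An RFC 4180-aware reader keeps the field intact.
    print(list(csv.reader(io.StringIO(data)))[1])
    # ['1', 'Hello, "world"\nline two']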


Even with numbers, some locales use a comma `,` as the decimal separator and some use the dot `.`, so that can cause headaches out of the box.


Beyond the basic implementation of quoting and escaping, those are things you also have to worry about if you use someone else's csv parser.

And if you implement your own, you get to choose the answers you want.


What do you mean "allowed to bail"?

Regardless of the format, if you're parsing something and encounter an error, there are very few circumstances where the correct action is to return mangled data.


Maybe? If the dataset is large and the stakes are low, maybe you just drop the affected records, or mark them as incomplete somehow. Or generate a failures spool on the side for manual review after the fact. Certainly in a lot of research settings it could be enough to just call out that 3% of your input records had to be excluded due to data validation issues, and then move on with whatever the analysis is.

It's not usually realistic to force your data source into compliance, nor is manually fixing it along the way typically a worthwhile pursuit.


multiline values, comma vs semicolon, value delimiter escaping


At my current workplace the word "bespoke" is used to mean anything that is "business logic", and everyone is very much discouraged from working on such things. On the other hand, we've got a fantastic set of home-made tooling and libraries, all impressive software engineering, almost as good as the off-the-shelf alternatives.



