Applying “make invalid states unrepresentable” (kevinmahoney.co.uk)
375 points by fanf2 on Oct 5, 2020 | 183 comments



In general I agree that it's nice to make invalid states unrepresentable, but I'm not sure I agree that this counts as a fundamental "invalid state". There is nothing about contracts that requires that you can only have one active at the same time, or that the current one must be open-ended.

From a practical point of view it might be advantageous if you maintain only a single contract with a customer at all times, but that is a business requirement that might change in the future.

I mention this mostly from experience: Multiple times I've designed systems where I've reduced the representable states to the minimum, and when some requirements change I realize I have to re-design the full system.

The newly presented representation might make sense in this situation, but I'd be very wary of taking current business practices and making all other alternatives impossible to represent. It's a balancing act of course, as you can go in the opposite direction and make it way too flexible.

> This poor choice was not just a theoretical problem - gaps in contracts were found on more than one occasion, requiring hours of engineering effort to hunt down and fix.

I'd like to hear more about what happened here. Was the problem that the default contract was not re-applied correctly? If so, changing the representation might not actually solve any problems — it may actually make it _worse_. A renewal of a contract typically involves some automated process where other services are involved (payment, invoicing, emails). The previous representation (with explicit start/end dates) made it possible for you to verify that everything was correct and lined up.
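
For what it's worth, that kind of verification is a short script when the start/end dates are explicit. A rough sketch (the row layout here is invented):

    from datetime import date

    # Hypothetical contract rows with explicit start/end dates, sorted by start.
    contracts = [
        (date(2020, 1, 1), date(2020, 3, 1)),
        (date(2020, 3, 1), date(2020, 6, 1)),
        (date(2020, 7, 1), date(2020, 9, 1)),  # gap before this one
    ]

    def find_gaps_and_overlaps(rows):
        problems = []
        for (s1, e1), (s2, e2) in zip(rows, rows[1:]):
            if e1 < s2:
                problems.append(("gap", e1, s2))
            elif e1 > s2:
                problems.append(("overlap", s2, e1))
        return problems

    print(find_gaps_and_overlaps(contracts))
    # [('gap', datetime.date(2020, 6, 1), datetime.date(2020, 7, 1))]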


> Multiple times I've designed systems where I've reduced the representable states to the minimum, and when some requirements change I realize I have to re-design the full system.

Yes, if requirements change, you change the design and code to support the new requirements.

Compromising the consistency and maintainability of the current design to accommodate a hypothetical future requirement change is a bad trade-off IMHO, since you can't predict the future. A requirement change may happen in a completely different direction than the one you anticipated, and then you have the worst of both worlds.

It is better to make code maintainable than to make it flexible.


> since you can't predict the future

That's one of those statements that makes sense, but is often not true. Only rarely does a client come to me with a feature request that requires a pretty significant design change; most of the time they're changes that were foreseen.

Using this current article as an example, I love the way that they're storing the intervals to guarantee that they can't overlap. That's awesome! What I would likely end up doing, though, is use that as the underlying representation but still return individual interval objects through the query API with a start and end date on each interval. That way, if the "only one at a time" rule changes, the changes required are localized.


> Using this current article as an example, I love the way that they're storing the intervals to guarantee that they can't overlap. That's awesome! What I would likely end up doing, though, is use that as the underlying representation but still return individual interval objects through the query API with a start and end date on each interval.

The article addresses that concept:

> It is sometimes still useful to represent the periods as a sequence of start and end dates. It is trivial to project the set of dates in to this form. As long as the canonical representation is the set, the constraints will still hold.
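
For illustration, that projection really is only a few lines; a rough sketch in Python, assuming the canonical set holds the period boundary dates:

    from datetime import date

    boundaries = {date(2020, 1, 1), date(2020, 4, 1), date(2020, 9, 1)}

    def to_intervals(dates):
        # Pair each boundary date with the next; the final period is open-ended.
        ordered = sorted(dates)
        return list(zip(ordered, ordered[1:] + [None]))

    print(to_intervals(boundaries))
    # three (start, end) pairs, the last one ending in None (open-ended)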


> What I would likely end up doing, though, is use that as the underlying representation but still return individual interval objects through the query API with a start and end date on each interval.

How the model you present to the user is represented in the database is an implementation detail.


I like to call this "speculative complexity". I've seen many cases where speculative complexity was added, and persisted for a long time, for reasons that fundamentally mispredicted the way the system would evolve and actually inhibited that evolution.


I've seen this too. And I've also seen the converse--complexity that was added because engineers refused anything but the most myopic designs, using thought-terminating cliches like "YAGNI" or "that's hypothetical".

"Never think ahead" is obviously not good advice. There's no silver bullet here--we have to think about how likely future scenarios are, and plan for them based on the business context and needs. Many of them are unlikely or too costly to do anything about...and many of them aren't.


But then you'd have to have domain knowledge, and why would you learn domain knowledge when the ideal career is a new job in a new industry every few months? </HN>


Under what circumstances did following YAGNI lead to added complexity?


A real-world case I recently ran into, where thankfully we realized we would need it and added it, is versioning. A struct you pass to or receive from an API that has a version in it means the difference between being able to make changes to the internals without changing the externals and not being able to. We didn't need the version field until the second version was released. Had we just said, "Oh, we aren't going to need it," on the first version we would have been boned.
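
As a minimal sketch of the kind of thing I mean (the struct and field names here are made up), even a dumb integer version field gives you a dispatch point later:

    from dataclasses import dataclass

    @dataclass
    class WidgetRequest:
        version: int  # carried from v1, even though v1 never inspects it
        name: str
        size: int

    def handle(req: WidgetRequest) -> None:
        if req.version >= 2:
            size_mm = req.size          # v2 sends millimetres directly
        else:
            size_mm = req.size * 25     # v1 sent inches; rough conversion
        print(f"{req.name}: {size_mm}mm")

    handle(WidgetRequest(version=1, name="bolt", size=2))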


That's slightly different: that is not predicting future requirement changes but making it easier to make breaking changes when needed.


A common pattern is writing code as a series of isolated cases, when taking some time to design the general case would greatly reduce the amount of code. You add a bool parameter to a function to modify one small bit of what it does, then another one, and you add some new return value, and before long, you've got a class with several getters and instance variables represented as code in a single function, with parameters controlling which actual method is run.


What you describe is a development style where the first iteration is well-designed but subsequent modifications are applied as a series of hacks and kludges. This way any code will turn into a big ball of mud over time. This is a problem.

But trying to anticipate everything in the first iteration is not the solution. The solution is to write maintainable code and apply each change with the same discipline and thought which was used in the initial iteration. Follow the boy scout rule: After any change, the code should be in a better state than before.


I think the problem is if you misinterpret YAGNI as “you don’t need to change this later”, so your code becomes rigidly hard-coded instead of having easily configurable variables and arguments. The over-engineered solution (the real YAGNI case) was an interface, an object, methods and fields, all only serving a very niche purpose with a lot of boilerplate.


Not OP, but it would add complexity because that method you "weren't going to need" turns out to actually be needed.

Now you have to work around your simplified design because you decided that you didn't need anything more.


That is not adding complexity, since the end result (the added method) is the same. You are just postponing some work until it is needed, which is prudent anyway due to opportunity cost.


Correctly judging this tradeoff is what makes the difference between a good architect and a great architect. There are definitely cases where I have put in a bit of effort (a week or so's worth of programming) to make things flexible because due to the requirements I knew they would be necessary in the 9-12 month timeframe (I've also been wrong about architectural decisions). Then when the time came around, it was painless to make the transition.

I suppose if you were cynical, you could claim that if it's painless no one sees how important you are. And then you wind up leaving the company, because they think everything is easy and don't provide you with the autonomy to achieve what you need to make their system work. And then they discover that it's actually hard.


> Yes, if requirements change, you change the design and code to support the new requirements.

Code and representation (i.e. schema) are vastly different. In my experience it takes an order of magnitude longer to change a representation than to change code. Once there are multiple services/tools which work with a representation you typically have to support both the new and the old representation at the same time (since you can't roll out everything simultaneously).

> Compromising the consistency and maintainability of the current design to accommodate a hypothetical future requirement change is a bad trade-off IMHO

Designing a representation which can handle possible requirement changes does not necessarily mean "compromising the consistency and maintainability". We have great tools for ensuring consistency (e.g. transactions and constraints in SQL databases), and I don't exactly see how this new representation is more "maintainable" than the old one (although we don't have all the information in this article).


Getting good at data migrations (with tools and processes to do this) can pay off. It's a more general way of preparing for the future than attempting to anticipate specific changes.

On the other hand, some may say YAGNI.


Then the code will be changed. Making up business requirements is the number one reason for instant legacy code. Code is not set in concrete; you can add that flexibility later when it is needed, but making everything overly generic to make it easier to "implement new requirements" only leads to code that is hard to change, in my experience. Also don't forget that this is only an example.


A few basic invariants about how your code is structured do a lot more than making it excessively "generic" would. Writing simple components that actually adhere to the single-responsibility principle (and composing them into more complex logic) doesn't just help out the mysterious future. It makes your existing codebase easier to understand and to validate up front.


The YAGNI (you ain't gonna need it) principle overrides the DRY principle, imo.


This is partly why I've found it helpful to wait until there are at least 3 identical (not nearly identical) implementations of something before trying to make a more generic/abstracted version of it.


I think these rules of thumb are... okay. But I think it's more helpful to go back to how DRY was initially defined ("Every piece of knowledge must have a single, unambiguous, authoritative representation within a system") and ask whether what I'm dealing with is actually "a piece of knowledge." There can be 10 identical copies, and if they just happen to be identical but they represent different things that might change independently, they should probably remain 10 identical, independent copies. Alternatively, if there are exactly two places something occurs in the code, but if they're out of sync the result is a broken system, you should think about unifying.

DRY is too often treated as purely (or principally) syntactic, when that's actually much less useful.


This is a great way to look at it


Computers have been widespread for between 30 and 70 years (there is reason to debate, but whatever). Nearly everything has been done before: what you are doing isn't fundamentally new. There is lots of opportunity to add minor new features, reliability, or better user interfaces, but the fundamentals of what you are doing aren't new anymore. You can look at your past versions and at what competitors do for guidance on what you will probably need next. If you have any broad knowledge of your problem domain you can make reasonable guesses as to what you will need and what you won't need. When replacing a subsystem I know whether there will be 100 users of it in the future or if it is a leaf with 1 user, because I know what the old crufty subsystem has (in the first case I spend days thinking about the interface; in the second I design the interface when I integrate the one subsystem).


Yes - as does AHA (Avoid Hasty Abstractions). IMHO.


The requirements here aren't clear, but I'm guessing the requirement is to model the contracts the company actually has with customers.

The business also tells you that there are never two contracts running at the same time. But are you actually going to believe that? Is this condition really "impossible?"

A vital and necessary factor here is whether the system being designed has complete control of the creation of contracts. This is perhaps taken for granted by the author, but it's too important to leave implicit. You have three choices in a situation like this: make it impossible for contracts to overlap, model it, or don't model it and accept the consequences. Depending on the frequency and the consequences of the assumption being incorrect, maybe it's acceptable not to model it. Maybe not. My point is that you can't assume something is impossible unless you can actually prevent it from happening, and the author should not have sidestepped this part of the analysis (though possibly they meant for it to be understood that contract creation happens through this data model.)

> Also don't forget that this is only an example.

The problem is that this is an example meant to illustrate and justify a rule of thumb, but it's extremely, extremely simple. How often do you deal with requirements that are this simple, this mathematical? Is this really the kind of example you want to build a rule of thumb from?

Realistically, when I hear requirements like this, I assume they're wrong (very common at the beginning of a project) and I get together with the product manager and ideally a domain expert representing the customer (if the product manager isn't too territorial about that being their job) and figure out what the hell the actual requirements are. What if a customer has a contract to rent 10 units of space at $5 and in the middle of that contract needs 5 more units, but the price has gone up to $10? Do you tell them they have to cancel the existing contract at $5 and pay $10 for all their units if they want to add some? Or give them the new units at the old price? Or is it okay to represent the same customer by distinct customer records?

I do like the principle of making invalid states unrepresentable, but I would like to add two supplementary principles:

1. Oftentimes what the business tells you about the data they produce is purely aspirational.

"There will never be overlapping contracts," often means, "We swear we're going to stop creating overlapping contracts, and this time we really mean it." You have to follow up with questions like, "How often have we had overlapping contracts in the past? When was the most recent occurrence?" You should even ask, "When do we anticipate signing the next one?" A logically-oriented software developer might expect someone to take offense if you respond to "we don't sign contracts like that" with "Do you have any currently in the pipeline?" but this is a totally normal kind of question to ask.

2. When users give you a rule in their business requirements, they often take it for granted that the software will handle exceptions to the rule gracefully.

They don't necessarily appreciate how bad things can go in software when something "impossible" happens. When they say, "Contracts will never overlap," you have to say, "What should happen when they do?" If you are talking to a mathematician or a programmer this might come off as questioning their competence, but most people will not find it unusual at all or at least will appreciate that the question is motivated by experience rather than disrespect. It's not like a math problem in school; it is legitimate to question the givens.


I once did weeks of work trying to unpack what my employer meant by the term "customer" - if your customers are large companies with hundreds of legal entities and hundreds of locations across the world things can get pretty complex pretty fast e.g. does "never two contracts running at the same time" mean that you can't have a contract with a subsidiary and another separate subsidiary of the same parent (which might make sense for credit checking purposes)? What about subsidiaries in different countries? What about partly owned subsidiaries...


A method that works well for this kind of analysis, in my experience, is to start by looking at the data together with a business person and then ask the needed questions until you can model it. "Oh, so I see we have customers that are sites of large corporations, so that is a requirement", "Oh, and these rows look like subsidiaries", etc.


There are trade-offs, of course, but I'm generally not a fan of using implicit defaults for business applications (i.e., the application infers the default when there's no data).

If things go well, business data outlives business applications. After years or decades, it can be a major pain to figure out all the "secret" values that aren't actually in the data.


I think this depends on whether you are talking about figuring out what contract a company had some time ago (say 18 months), or about the current default value. The latter should be very easy to find in the code, and if you are migrating to a new system the information should be transferred in some way. For historical things you obviously have to store the data somewhere and then just figure out a nice way to do that. In this case it could just be a table of default contracts with some fields and a startdate. Then you can query for any point in the past and get the default contract that was active at that point.
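
A sketch of that last lookup (the table contents are invented): keep the history as (start date, contract) rows and bisect for the default active at any given date.

    from bisect import bisect_right
    from datetime import date

    # Hypothetical history of default contracts, sorted by start date.
    default_contracts = [
        (date(2018, 1, 1), "default-v1"),
        (date(2019, 6, 1), "default-v2"),
        (date(2020, 3, 1), "default-v3"),
    ]

    def default_contract_at(when):
        starts = [start for start, _ in default_contracts]
        i = bisect_right(starts, when) - 1
        return default_contracts[i][1] if i >= 0 else None

    print(default_contract_at(date(2019, 12, 24)))  # default-v2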


> There is nothing about contracts that requires that you can only have one active at the same time, or that the current one must be open-ended.

Agreed, and yet for many systems that flexibility still eludes them.

One example I personally experienced was changing a phone contract. The contract had run for many years, so it could be cancelled at any time with one month's notice. The new plan and contract were much better, and yet a limitation played out in doing this. It ended up that the systems at the telco were unable to activate the new contract until the old contract had ended. Whilst a new contract could be physically signed in a shop with a start date of the day of signing, and logged into the system, the provisioning backend was unable to activate it until the old contract had ceased, as you can't have two contracts for the same phone number.

That, I do believe, is a case of: whilst some things can run in parallel, others are locked to a single dependent resource.

But every rule has an exception; it is with good design that you limit those exceptions' impact.


I think the fundamental problem here is that the table/entity is incredibly badly named. What kind of contract has only three fields?! This isn't a case of YAGNI; no real-world "contract" is this simple.

It appears to actually be some sort of contract_period or contract_duration, and probably has a link to a real "contract" object somewhere that contains the real meat of the concept. But it's hard to tell what's literal and what's the author trying to simplify the example for us.


I agree very much with your comment.

I've found the fundamental principle that helps to keep the system extensible is: make your system model the real world accurately.

This involves building the system concepts closely mapped onto real world concepts without taking shortcuts.

That way when the requirements change, all the fundamental pieces of the system stay valid, and only the piece that is changing tends to need updating.

This helps to avoid the problem you mentioned of needing to re-design the full system, keeping the system extensible.


Absent business requirements, I would love to see what you think fundamental invalid states for a contract would be. Every property of a contract I can come up with seems like a business requirement.


Hard agree. The type system is no place for business logic. If you want to get fancy, maybe some database constraints you can change later. Business logic is constraints over the representation... not the representation itself.

Also, in general, the principle of “make invalid states unrepresentable” ends with the realization that only Idris can properly do it, which is not pragmatically useful.


I suspect there might exist a fundamental difference in the way automated-testing, static-type-leveraging software developers view software design and requirements compared to those who don't.

I have no idea about the commenter here, but in my direct personal experience, the "those who don't" developers tend to regard software design and requirements to be toward the "write-only, then try to never change" end of the spectrum.

I get a lot of "are you crazy? We can't change that!" type of pushback from the those-who-don'ts, regarding functional changes that should be reasonably manageable to my testing/typing mind.


This.

Basically this is what event sourcing tries to solve: it lets you change your state representation to reflect new requirements, because you can always rebuild everything from the event log.


A large fraction of the comments are about how if you do this and then someday your requirements change you might have to redo your underlying data structures or databases, with the implication being that you should therefore make those as general and flexible as possible.

That reminds me of an interesting point I saw in a book whose title and author escape me. He said one of the reason you encounter so many bad designs in Java programs is that many new Java programmers look at the design of Swing to learn good design.

He wasn't saying that Swing is badly designed--but Swing is a framework/library, not an application. What it takes to be a good framework/library is different than what it takes to be a good application.

If you are writing an inventory management application you can design your tables and data structures and interfaces around things inventory management applications need. If you are writing a medical billing application you can design around what medical billing needs.

If you are writing a framework or library that might be used by inventory management applications and medical billing applications and all the other nearly infinite kinds of applications people will write you need to keep it very general and flexible...but you also have to keep it fast and not too bloated. It's a much harder design problem, with different best practices for what is good design and what is not.


There is a trick that I wish all senior staff knew but we find ourselves having to teach as a matter of routine.

1) Don't advertise what you're not selling

2) Don't sell everything that you've made

It's possible to write software that has a public contract that allows only a subset of states that the internal system allows. You can use this to support one customer that has a requirement that is mutually incompatible with another customer's, but you can also use it for migrations. One should be able to create an API where the legal values in the system are strictly limited, but the data structures and storage format may have affordances to support migration.
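
As a toy sketch of the idea (all names here are invented), the stored shape can stay permissive for migration purposes while the public surface only sells the states you actually want to allow:

    from dataclasses import dataclass
    from datetime import date
    from typing import List, Optional

    @dataclass
    class StoredContract:
        # Storage shape: deliberately loose, e.g. the end date may be
        # missing while a migration is in flight.
        start: date
        end: Optional[date]

    class ContractService:
        """Public contract: only sells closed, well-ordered periods."""

        def __init__(self) -> None:
            self._rows: List[StoredContract] = []

        def add_period(self, start: date, end: date) -> None:
            # Refuse states the storage format could technically hold.
            if end <= start:
                raise ValueError("end must be after start")
            self._rows.append(StoredContract(start, end))

    svc = ContractService()
    svc.add_period(date(2020, 1, 1), date(2020, 6, 1))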

That doesn't necessarily solve the problem of communication between services and migrating these changes into a running system, but it's a useful tool. If you've ever had a coworker who insists that your service call diagrams look like a tree or an acyclic graph, this problem is certainly one of many reasons they may be insisting on this. With a DAG there is an order in which I can deploy things that has a prayer (but not a guarantee) of letting all of the systems understand each other during each increment of deployment.

People have come up with alternative solutions for this problem by employing sophisticated sets of feature toggles, and in some ways this is superior, but it trades the number of steps (each of which has a potential for human error, and consumes calendar time) for increased reliability on average.


While this is great if you know exactly what you want to achieve, it does “lock you in” to those constraints on a more fundamental level. More times than I can count I’ve seen business requirements change to require those “unrepresentable” states, and since you’ve now designed your whole data model around it you need to add awful hacks to make it work.

The timeline example is actually very telling. A lot of times you’d actually want to encode overlapping time periods at the edges.

You’d be laughed out of a meeting where a business asks about this and you smugly explain how it is unrepresentable.

I guess what I’m saying is that it might be worth over-designing your system a bit to leave you some wiggle room, unless you have hard guarantees that something should be “impossible”.


Even "hard guarantees" are worthless. All it takes is one client with a checkbook to change business requirements.


I agree that this is an issue, but I think the answer is simply using a very flexible basic data representation (which admits invalid state) and then using predicates to refine it. e.g. starting with a list of (start,end) intervals and then adding predicates for valid intervals (start <= end), ordered, non-overlapping and continuous.

If any of the requirements change, it's easy to either add more predicates or relax/even outright remove them.
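
Something like this, roughly: the representation stays a plain list of (start, end) pairs, and each rule is a separate predicate that can be added or dropped as requirements change.

    def valid_interval(iv):
        start, end = iv
        return start <= end

    def ordered(ivs):
        return all(a[0] <= b[0] for a, b in zip(ivs, ivs[1:]))

    def non_overlapping(ivs):
        return all(a[1] <= b[0] for a, b in zip(ivs, ivs[1:]))

    def continuous(ivs):
        return all(a[1] == b[0] for a, b in zip(ivs, ivs[1:]))

    timeline = [(1, 3), (3, 7), (7, 9)]
    checks = [ordered, non_overlapping, continuous]
    print(all(valid_interval(iv) for iv in timeline)
          and all(check(timeline) for check in checks))  # True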


If you play your cards right, you can even get your type system to alert you to every place in your code that needs to change when you change the constraints.

I don't deny there's a certain art to that, and I can't explain it all myself. But I don't even necessarily mean amazing clever type tricks like you might see in Haskell or something, I mean that something as simple as "I've changed the definition of what a 'Customer' is in some fundamental manner, so I'm going to rename that class to 'CustomerNew', use the compiler to point me at every single place that breaks, audit it, change the local name to CustomerNew, and then, once everything is fixed, use my IDE's rename feature to rename CustomerNew back to Customer before my final commit". Many times you can get by just by renaming a field or something to similar effect, but in the worst case you may need to audit everything.

It's one of the more tedious bits of the job sometimes, but net-net this can still be a timesaver, if you account for the full cost of the trickle of bugs this sort of thing can prevent.


You could always split your types into frontend and backend types where the backend ones are more open and the frontend ones are more restricted. I don't necessarily mean FE/BE as on the web. A lot of code is only interested in shuffling around data anyway, the shape is fairly uninteresting.


I would put it the other way around: make your basic representation restricted, but present a more permissive API. This way, the data model helps enforce your constraints, but you don't need to redesign the API when the requirements (inevitably) change.


If you make your front-end interface more complex than is needed to represent current data, you may save clients a migration in the future. But you force your clients to deal with the complexity right now, and you could wind up needing to change things in a different way in the future. Neither of these concerns clearly dominates all the time - it's a question of just how much effort you're saving them, just how likely various changes are, etc.

On the other hand, if you make your internal representation support more complexity than what you hand out to your users, there arises the question of how you simplify (probably lossily!) or break when you wind up with data that can't be expressed with the simpler interface. But you might well be saving yourself a costly data migration down the line. Also a decision with trade-offs that could potentially go either way.


Kind of like internal data/algorithms and external interface in OOP?


That could be fixed with 1 additional concept. Instead of

    Customer: 1 - 1: Set(Dates)
What if it was

    Customer 1 - *: Contract: 1 - 1: Set(Dates)
Overlap achieved.
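
Sketched in code (field names invented), that's one extra level of nesting:

    from dataclasses import dataclass, field
    from datetime import date
    from typing import List, Set

    @dataclass
    class Contract:
        # Each contract keeps the article's representation internally,
        # so its own periods can never overlap.
        boundaries: Set[date] = field(default_factory=set)

    @dataclass
    class Customer:
        # ...but a customer may hold several contracts, so periods from
        # different contracts are allowed to overlap.
        contracts: List[Contract] = field(default_factory=list)

    c = Customer(contracts=[
        Contract({date(2020, 1, 1), date(2020, 6, 1)}),
        Contract({date(2020, 3, 1)}),  # overlaps the first contract's period
    ])
    print(len(c.contracts))  # 2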


Additionally, turning (startDate, endDate) into a set of dates will make the code more complex in some places. Before:

    SELECT event FROM events WHERE endDate<"2021"
After:

    Whatever additional complexity you
    add to your codebase to query end
    dates.


It's not hugely complex although definitely more than the first example. I guess it depends whether it's offset by the benefits...

https://dbfiddle.uk/?rdbms=postgres_11&fiddle=50e6a963cd1db0...

(YMMV, obvs.)


It's not that much more complex: if you do

    SELECT event FROM events WHERE startDate < "2021"
then all but one of the results (the one with the greatest `startDate`) will also have an implicit end-date prior to 2021.


This is fine for an open-ended query like the one given, because you still receive all the relevant data. But if you're looking at a range, for the same reason you have one extra at the end, you also have one missing at the beginning. And you can't just filter away missing data.


Doing all the logic in SQL requires more complexity using subqueries.

It gets uglier if you need to find the contract valid on a certain date based off of a join.

These issues can be covered up with code as it will be easier to have reusable functions, but it makes the job of data analysts much more difficult and error-prone.


Uglier than having to find the record after the one you're inserting (so you can determine your new record's end date from the subsequent record's start date) and the record before (so you can modify its end date to match your new record's start date)?


Until business tells you that endDate is not necessarily greater than startDate. <- real world experience


Sounds interesting, can you explain more about it?


I think you are vastly overstating the risks of changing requirements. It is usually easy to go from a less permissive model to a more permissive one, but the opposite is often difficult.


This is a good introduction on a conceptual level.

I think a large contributor to the problem is story-oriented development, where all that matters in the sprint is "getting it done" and not looking at the broader context.

To make unrepresentable states practical, Scott Wlaschin has an excellent write-up here [0]. His book (plugged in that article) is also excellent.

[0] https://fsharpforfunandprofit.com/posts/designing-with-types...


> I think a large contributor to the problem is story-oriented development, where all that matters in the sprint is "getting it done" and not looking at the broader context.

I think you have a point here. This design offers much better safety, comparable to "parsing instead of validating". But it requires up-front design. And that is indeed "verboten" in modern software development management style.

Why is it "verboten"? I think it has to do with two fundamental concepts of scrum et al.:

1. Stories that are "ready" just need to be "implemented". The implication is that the developer does not design and everything is orthogonal. There are no interdependencies, maintenance effort or non-functional requirements.

2. A story that is "ready" focuses solely on the desired outcome for some selected examples. There is no generalization of the examples and consequently little to no abstraction.

I think these issues stem from the fact that Scrum et al. are intrinsically tools for managers to isolate themselves from the complexities of software engineering. Every metric of scrum, for instance, like "progress" or "definition of ready", is essentially empty of meaning for software engineering.


I like how you've said it here, but one thing that scrum doesn't have in it is relief from your professional duty as an engineer.

If a system needs to be designed in a particular way, do so. That's how long it takes and that's why in planning you discuss how it will be designed.

The design of how you're going to build the system is taken care of before the task is split into easily digestible bits that meet a definition of ready.

If you need to build it before you know how to build it, scrum allows for research spikes into this. A focused, timeboxed interval so that you have the ability to estimate the actual difficulty of the work to be completed.


I get the impression that anything management touches gets turned (one might say corrupted) into a tool for detaching from engineering details. That's not what sole was meant to be about, and it's not what waterfall was about, but it happens anyway.

Details will always matter, but management's paycheck depends on not understanding that.


+1, favorited.


> I think a large contributor to the problem is story-oriented development, where all that matters in the sprint is "getting it done" and not looking at the broader context.

I think that's exactly backwards. This kind of overcomplicated representation usually happens because people put too much effort into designing their representations up front. If you follow story-oriented development and only implement the parts they actually need to get the current task done, you never end up with these wasteful extra states because you never actually needed them. But people think that planning before coding is somehow virtuous, and then they're tied to following those plans.


To be fair, both are possible scenarios:

- A team spends too much time over-engineering a representation for something that could be more easily maintained using a simple model

- A team spends too little time considering the edge cases with a representation because they feel pressure to deliver the feature within a short space of time


And they often happen at the same time, too. I've seen situations where developers over-plan a system in advance, then spend ages hacking smaller features and changes into it to meet quick turnaround times, instead of taking a proper step back and being aware of when their original design needs to be revised.


The representational debt the parent comment is referring to is akin to that quip, "if I had more time, I would have written a shorter letter."

I find that both types of representational debt are common.


I assure you that I have wasted enough time fixing story-oriented development with major refactorings, because some edge cases couldn't easily be accommodated by the existing code.

Also have experience fixing story-oriented development with dirty workarounds, because major refactorings were also required, but not desired by whoever was paying for the stories.

Either way I didn't care; it was money in the bank anyway.


Our database gets shit added on an ‘as necessary’ basis, and I guarantee you it’s not a great thing.


the flip side of this is that sometimes your initial designs need to be expanded to account for something new and that's hard, leading to tech debt and hacks.

personally I think that's a better trade-off than implementing something complicated up-front that you don't know will work and ending up with flexibility in the wrong places, leading to tech debt and hacks.


Sum types are one of the main things I miss when working in Python. Is anyone aware of any good ways of adding sum types to Python?


The entire point of sum types is that they be statically checked. Without static typing, they don't seem really useful: Erlang doesn't have sum types, and it doesn't really have a use for them until it gets a type system which can leverage them. Instead it models "sum types" as tagged tuples e.g.

    {ok, Ok} | {err, Err}
however Erlang has very good pattern matching. Python… doesn't.

There's a PEP but I stopped following it because the discussion was a mess. And it's apparently now split into 3 different PEPs; I don't know whether that's an improvement or not though.

Furthermore, Python's dislike of HOFs means you can't really do "monadic" processing as you'd do in, say, Smalltalk, where your "variants" would really be subtypes with cool higher-order messages. So you're mostly just adding indirections.
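
That said, if you do run a checker like mypy over the code, the usual workaround is a tagged union of dataclasses with isinstance dispatch; a rough sketch:

    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Ok:
        value: int

    @dataclass
    class Err:
        reason: str

    Result = Union[Ok, Err]

    def describe(r: Result) -> str:
        # mypy narrows the type in each branch; without a checker this is
        # just runtime isinstance dispatch.
        if isinstance(r, Ok):
            return f"ok: {r.value}"
        if isinstance(r, Err):
            return f"error: {r.reason}"
        raise TypeError(f"unexpected variant: {r!r}")

    print(describe(Ok(42)))  # ok: 42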


I'm not sure you would consider this a good way, but one can implement (unchecked) sum and product types in any language with lexical closures via Scott encoding[1]:

    # data Pair a b = Pair a b
    def pair(x, y):
      return lambda f: f(x, y)

    # data Either a b = Left a | Right b
    def left(x):
      return lambda l, r: l(x)
    def right(y):
      return lambda l, r: r(y)

    v0 = pair(2, 3)
    v1 = left(7)
    v2 = right(9)

    # case v0 of { Pair x y -> x + y }
    print(v0(lambda x, y: x + y))

    # case v1 of { Left x -> x + 1; Right y -> y * 2 }
    print(v1(lambda x: x + 1, lambda y: y * 2))
[1] https://en.wikipedia.org/wiki/Mogensen%E2%80%93Scott_encodin...


Wow, this looks clever. (maybe a bit too clever)

I'll need to read more about it, thanks for the pointer!


There's a library: https://sumtypes.readthedocs.io/en/latest/

It's serviceable, and even does static checking of total coverage of cases. I will say that it makes linters freak out.


It doesn't sound to me like the kind of thing that any particular WoW (way of working) is going to fix. As if it would suddenly naturally dawn on someone to do this given enough time. It's just lack of knowledge and/or discipline.


So Google's protocol buffers have this feature called "required" fields, which enforce schema in the type system. You should never use it. Never. It's one of those things that sound good until you're a few years into the project. Similar to how you should never be using meaningful IDs as primary keys for objects, always use meaningless fingerprint-like integers. Or how all integers should be signed unless you're dead sure the number is unsigned (like a fingerprint). And how many integers should actually be strings, unless you're dead sure this is a number (externally provided IDs, such as for example customer account IDs, are not numbers). Or how you should be careful to use bytes rather than unicode strings.

Make your schema permissive and your code paranoid; it will pay off later. Build a data linter if necessary, but don't tie the schema.


> And how many integers should actually be strings, unless you're dead sure this is a number.

I phrase this as: If it doesn't make sense to do math on it, it's not a number. What does adding one to a customer account number mean? Absolutely nothing -- you get a completely different account number. So it's not a number, but a numeric string.


It's not a string either though: in the same way integer addition (almost always) does not make sense, string concatenation (almost always) does not make sense either. The proper type would allow for equality check and explicit string (de)serialization only.
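
In Python that might look roughly like this (a sketch; the class and method names are made up):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AccountId:
        # Opaque identifier: equality and hashing work, but there is no
        # arithmetic and no accidental string concatenation.
        _raw: str

        def serialize(self) -> str:
            return self._raw

        @classmethod
        def deserialize(cls, raw: str) -> "AccountId":
            return cls(raw)

    a = AccountId.deserialize("0042-A")
    print(a == AccountId.deserialize("0042-A"))  # True
    print(a.serialize())                         # 0042-A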


I meant string in a more casual sense. From a more technical sense, it would be a symbol. I'll be more precise in the future.


I may be in the minority, but after happily using protobuf for years, I believe that there's nothing inherently wrong with required fields - instead, what's "wrong" is the protocol buffer API.

Namely, when constructing a protobuf, theoretically, there might be two different ways: (A) first gather all the fields, and then construct the protobuf from these fields; (B) first construct an empty protobuf, and fill in the fields as necessary. The actual protobuf uses (B) - which is convenient in most cases, because when you start constructing a protobuf usually you don't have all the data ready yet.

However, with required fields, this means when you construct the protobuf it starts with all required fields missing - i.e., an invalid state!

I'm not sure what's the best way to fix it, because it would be infeasible to rewrite all the code to gather all the fields and then construct the protobuf - also it will be hugely inefficient in many cases. However, I feel the "no required fields" rule is essentially a null pointer (the "billion dollar mistake") in disguise - the actual problem is that the API doesn't enforce type safety.


This isn't actually the issue with required fields (some languages, like java and (usually) python, use a construct-once style).

Imagine you have an innocent `required` field. You have a producer and a consumer of that field that communicate over the wire. (or instead of the wire, imagine a database).

You send or store an instance of that protobuf. Now let's say that you want to make the field optional (or remove it). With an already-optional field, this is easy. You stop setting it, and maybe eventually you clean it up.

With a required field, however, you can't do that. If any of your clients don't have the newest schema version, you can't unset the field (so imagine that you support mobile clients who may never update). Or if there's middleware you don't know about that introspects your proto. Even if you do the dance right and update your server and client before not setting the new field, you could crash outdated middleware that you didn't know about. Whoops!

Or with the database, you now need to dual write or something complex because if you need to roll-back to an older version, you'd be unable to read the protos that don't include the required field.

Required doesn't do well over time. It has nothing to do with setting the values.


If you have old clients that expect that field you are removing to be there in a meaningful way, you still have to update all clients before you can stop setting it. Having the protobuf schema itself use optional or required doesn't change that, it just makes the dependency explicit there, instead of only in the code at the endpoints.

Changing required to optional isn't a magic fix for protocol compatibility. If it were (for your limited use case) you can just make that change to the protobuf client side as it doesn't affect the wire representation/interpretation.


> If you have old clients that expect that field you are removing to be there in a meaningful way

Right, there's the rub. `required` means that anyone who deserializes your proto falls into this category. That's a much larger group than "anyone who reads a specific field". So the list of clients now is forced to include any and all middleware that may read your proto (imagine a routing layer or some kind of analytics system or whatnot).

(Note also that there's lots of ways to make reading a field that is empty fallback to doing some reasonable non-catastrophic behavior, required doesn't let you do those things).


What about changing it to optional on the server side, and clients that don't upgrade will always send that required payload and you will know to parse it but then ignore it. Updated clients know it's optional and also ignore it and don't set it. Once you have a low frequency of messages containing that required field you can do the final cleanup.


I think the problem is when outdated clients try to parse responses with a missing, used-to-be-required field.


This might make sense for a transport schema because you can receive messages from the past or the future but it does not translate to internal program state or database schemas where this is not the case.

Making invalid states unrepresentable is basically the process of taking human-checked invariants and turning them into type-checked invariants. This reduces the likelihood of bugs and guides humans to use the system correctly.


"required" is an example of making an invalid state representable and having it ruin your program.


Yea, a better example of making invalid state unrepresentable in Google’s protocol buffers is to use the “oneof” feature to mark that a set of fields are mutually exclusive. If A, B, and C are mutually exclusive you can put them in a oneof, which also saves space in the binary representation. If in future you discover that A and B but not C needs to be a valid state, you can add a 4th AB option inside of the oneof.


Can you elaborate more on the "required" fields point? We've been using a similar feature for several years now in APIs at my work and haven't run into any issues, though we do only use it very sparingly for fields that logically can never be missing. At some point a client has to make the call for what they consider essential, so pushing it in the schema makes this less ambiguous from what I've seen. Maybe it's fine for our use-case (mostly static APIs), whereas what you're saying is good advice in general.


https://capnproto.org/faq.html#how-do-i-make-a-field-require...

Required now means required forever, because people can't migrate safely. But technically you can change a protocol descriptor from required to optional, which is invalid (usually, in a distributed non-transactional system, the common kind), but nothing stops you from doing it. So why not make required forever? Well, do you really want to commit to anything forever?


After reading that article my take-away is not that "required" is bad and should never be used ever, but rather it was bad with how Google wanted to use it. And since this is Google's project, it makes sense for them to remove the feature if it's causing data center outages, it's not worth the risk at that point.

For example, in the case of the message bus they say "And even though the message bus doesn’t care about message content", and later on "The right answer is for applications to do validation as-needed in application-level code." Strict schema and validation is most helpful for application developers, not some middleware routing code. Was it not possible for them to write a parser that doesn't fully validate the message for use-cases like this?


Protocol buffers already require you to commit to some things forever, like the type of a field, or whether two fields belong in a oneof together. I’m not saying that “required” was a great feature, but it’s not exactly unique.


No they don't. An optional field can be deprecated and replaced with a different field. This can be done to change the type (also some types can be changed, although you probably shouldn't).

Required usually cannot be deprecated.


You can deprecate an optional field and reserve the field number, but if you reintroduce use of that field number with a field of a different type that is not backward compatible. Types can be changed if they are binary compatible, but in that case they haven’t actually changed, because the binary format is the canonical format.


Right, I'm not sure what your point is. You can always add more fields, so being unable to change the type of a field isn't a problem, since you can introduce a new field and start using it. You cannot, however, stop using a required field. The best you can do is set it to a nonsense value and leave a comment saying "well, we need to set this to something, but we don't actually use it anywhere".

Because to be sure that you can remove it, you need to be sure that every storage system and every piece of middleware and every single thing that links your proto anywhere in the world that you might care about is upgraded; otherwise, if they encounter a new message they'll crash.

If you only have a single client and server, and you control both, this is doable. If you don't have that though, you cannot.


I feel that this advice is almost opposite to that given in the article. By making all fields optional, your data model no longer helps in making invalid states unrepresentable.


Exactly! Because today's invalid states are commonplace in the future.


That feature's been dropped in proto3, all fields are optional.


This is a good article. But the second example seems to suffer from the defect of the first. Removing default contracts and representing fixed contracts as intervals leaves it possible that these fixed contracts can overlap....which is probably...undesirable?

In that case, applying the remedy of the first example (a set of dates, and inferring that every 2nd (even zero length, to account for adjacent fixed) interval will be default,) introduces another bug where if you lop off any random date in that set or list, you invert everything.

I love the concept represented here, it is akin to normalization as in...simplify the representation so no redundancy is introduced as this usually leads to better results...but it seems it's no guarantee of better results.

But maybe that's just because the "model" we are simplifying from was not an optimal representation. Perhaps there's a better model of the second example that doesn't end up with the defect of the first example.

I really like this article but am struck by how something that I wanted to be almost a silver bullet trick for modeling, ends up being a mass of compromises mired in tradeoffs that doesn't show any clear way forward in the general case. Still, probably a good rule of thumb, but I guess this rule is not optimal...as it can have so many unworkable misinterpretations/misapplications.

It would be cool to see a list of, like, "Programming Heuristics", ranked by decreasing general applicability, of which this rule was a member somewhere far down the list.


> In that case, applying the remedy of the first example (a set of dates, and inferring that every 2nd (even zero length, to account for adjacent fixed) interval will be default,) introduces another bug where if you lop off any random date in that set or list, you invert everything.

This is a good point, and something I usually describe as recoverable problems versus non-recoverable problems. If I use start/end dates in the first example instead of just a set of start dates, then I can always create application-level or database-level constraints that don't allow either overlapping or incomplete segments. When business rules change, I can delete the constraints and update the business logic as necessary with no change to underlying data structures.

Furthermore, if I miss implementing a constraint and it erroneously allows overlapping or incomplete segments, I can easily run a query to identify all such invalid entries. Then I can investigate and decide how to fix them.

However, if I go with the start-date-only set-based approach, and miss implementing a constraint, and it leads to a deleted date creating incomplete segments... I'm screwed. There will be no query you can run to identify incomplete segments to investigate or fix, because all segments are assumed to extend to the next start date. You can irreversibly lengthen one segment by deleting another, due to a forgotten constraint that would have prevented you from making the change.

These could both be errors on the developer's part, depending on the requirements at the time, but one data design may lead to more non-recoverable issues than the other. Add in the flexibility of the former approach, and I'd probably be more likely to implement the former approach than the one proposed.


"It would be cool to see a list of, like, "Programming Heuristics", ranked by decreasing general applicability, of which this rule was a member somewhere far down the list."

+1 for this! Anyone got good links to share?


I like the concept but I’ve seen a fair few examples of where the developers and users clearly had differing opinions about which states are invalid!

Dates are a rich vein of examples. Some users will happily consider “25th December” to be a date, without any year, because it might be the name of a folder in which they store their Christmas stuff. More seriously, genealogists or historians may want to record “25th December” (again without a year) as the date of a photograph because they can clearly see it was taken on a Christmas, they just don’t know which one. The naive developer would just slap a DateTime type onto the system and feel good about having avoided malformed input.


Also in genealogy:

- estimated dates

- calculated dates (e.g. someone was 30 in 1870, so he was born in "calculated 1840")

- unreadable or unavailable months or days (typically recorded as 1980-00-13)

- time ranges with all of the above as boundaries e.g. "after 1760-03-00 and before calculated 1800"

- plainly incorrect dates, but that's what the document says (1865-02-30)

- no dates (some software tries to enforce putting in some data in for whatever reason)

- dates with unknown calendar


All plausible and scary. It sounds like a recipe for a system where the different qualities of dates are their own entity, and anybody doing a date-search needs to provide some criteria on the degree of specificity or certainty they require.

For that matter, someone might want to do a text-based search on dates: "This damaged photo shows 196_-_1, which could be at least 20 different months..."


There's also another dimension to that uncertainty. Dates are typically attributes of events involving one or more people in various roles, and these events are (hopefully) attached to one or more sources. The type of sources you use affects the confidence of various attributes of the event (like dates or participants).

In my area of research death records are typically a bad predictor for date and location of birth. When the records were managed by the churches (until after WWII pretty much), priests did not want to pester grieving family for the exact date of birth, so they relied on approximate age. Naturally the subject of the death record would not typically point out errors. On the other hand marriage records required looking at the actual birth records (sometimes mailed across the country), as these contained notes on all marriages of the individual (this was done to ensure monogamy).

Is it on the genealogist to consider this confidence when specifying the date of the event, or should the software intervene? Due to complexity, the former is the industry standard. But maybe there are some brave (or stupid) people who will try to take it into account in the future.


I implemented a [private] genealogical data entry system based on the GenTech data model, which has a "surety scheme" entity, designed to capture perceived uncertainty. I also added a "Fuzzy Date", where every date-like value was decomposed into all its constituent components (year, month, day, hour, ...), all optional. It could also capture a range of such fuzzy dates, so it was possible to enter a "date" such as "The first of a month no earlier than 1950 and before June 1990". There was a loooong list of validation constraints to attempt to prevent contradictions being entered, and I think I caught all the cases, but...
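
For illustration only, the "all components optional" part might be sketched roughly like this (this is not the GenTech model itself; the names are invented):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FuzzyDate:
        # Every component is optional; a missing one means "unknown".
        year: Optional[int] = None
        month: Optional[int] = None
        day: Optional[int] = None

    @dataclass
    class FuzzyDateRange:
        # "No earlier than X and before Y", either bound possibly fuzzy or absent.
        earliest: Optional[FuzzyDate] = None
        latest: Optional[FuzzyDate] = None

    # "The first of a month no earlier than 1950 and before June 1990"
    example = FuzzyDateRange(
        earliest=FuzzyDate(year=1950, day=1),
        latest=FuzzyDate(year=1990, month=6),
    )
    print(example)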


If you only have a day and month like 25th December you should represent it with a type that contains just that information. Java for example has the MonthDay class (https://docs.oracle.com/javase/8/docs/api/java/time/MonthDay...) which can be converted into a local date when the additional information is available. If your users want to refer to that as a date then that should be handled in the UI but should not lead to ambiguity in the internal representation.


But then you're not dealing with dates at all, just categories which happen to have names that look like dates, no?

Like you say, a "folder". But the photo file's metadata will either have a complete DateTime, or none at all, unless there is some sort of camera that is able to know what day of the year it is without knowing the year! Which due to things like leap years is impossible.


Importantly, these “date-like” things have important date-semantics. That is, it may be reasonable to expect your software to be able to handle varieties of precision or completeness of date metadata, in which case your date representation may need to be able to interact with DateTime fluidly without actually being one itself, despite it being tempting to remove these possibilities by making incomplete date information unrepresentable.


> But then you're not dealing with dates at all, just categories which happen to have names that look like dates, no?

If you and your client disagree about what constitutes a date, that doesn't mean "your client is wrong", it means "you need better communication". You can't fix this problem by requiring everyone to use consistent definitions. Instead the solution is to check assumptions as often as possible.


> some sort of camera that is able to know what day of the year it is without knowing the year! Which due to things like leap years is impossible.

Just a small note, as we're talking about assumptions, but this is clearly untrue. Consider - for decades, humans wore watches that knew the day of the month, but not the month - you were just expected to turn the day forward on the 1st day of months following non-31-day months. Similarly, we can imagine disposable cameras that ask the user for the current day on first use and simply assume leap years don't exist, and require the user to correct the date for any leap years. You might call this a silly design, but systems often have to interact with external systems that have silly designs. I believe I actually owned a toy PDA (a device for a child, not an adult) that did not have a year, back in 2003 or so.


This is so obvious now that you’ve said it. I can see so many possibilities. I feel as though my eyes have been opened.


The time period example seems to miss an obvious weakness in the described Time Period 'object' - it's implied that the end date should be >= the start, but if you are representing a time period with ( Date, Date ) then you are still allowing invalid states to be represented - yet this is what the writer is trying to avoid. Likewise, a timeline split into contiguous periods can still represent out-of-order Dates.

A Time Period object of (Date, Duration) would fix the first issue, and a TimeLine of (Date, Duration, Duration, ...) would fix the second one (assuming unsigned Durations!)
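
Something like this, as a rough TypeScript sketch (durations in whole days; TypeScript can't express "unsigned" in the type itself, so a runtime check would still have to guard it):

    // A period is a start plus a non-negative length, so "end before start"
    // cannot be written down.
    interface TimePeriod {
        start: Date;
        days: number;       // must be >= 0; enforce in a constructor/factory
    }

    // A timeline is a start plus a run of consecutive durations, so the
    // periods are contiguous and ordered by construction.
    interface TimeLine {
        start: Date;
        runs: number[];     // each >= 0; period i begins after runs[0..i-1] days
    }

    function periodStart(t: TimeLine, i: number): Date {
        const offsetDays = t.runs.slice(0, i).reduce((a, b) => a + b, 0);
        const d = new Date(t.start);
        d.setDate(d.getDate() + offsetDays);
        return d;
    }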


I think you misunderstood the author's point: time periods aren't explicitly represented as (Date: start, Date: end). Instead they're a set of dates. The time periods are then implied by the set, making "end date before start date" impossible.


It also deals with the problem of the bounds on the ranges (open-closed, open-open etc) implicitly in a way which is harder to mess up.

One complication it's potentially missing is exactly whose days we are talking about, i.e. is it days starting in GMT or UTC or EST, or whatever the supplier's or the customer's timezone is, or are we actually talking about some day concept that perhaps starts at the start of business? Representing this as a datetime start/end certainly makes it possible to represent these concepts, if perhaps not making anything else particularly easier.


I was going to say that this has a weakness in testing, because you can't `assert(end > start)` without knowing which is which (`assert(later > sooner)` will always be true), but then I guess you'd have input (and other) validations before it got to that point anyway.


You should re-read the article. The author explicitly mentions using sets of dates. Hence the ordering is implicit. For an interval you can do the same and use either a set or an unordered pair.


A set doesn't necessarily imply an ordering (unless this article is about some specific programming language, but it seemed fairly generic to me). e.g. Java and C++ have many Set implementations, some sorted (e.g. a TreeSet) and some not (e.g. HashSet)


That's the point. There's no ordering of the items. So the representation can never have an error like [yesterday, tomorrow, today]. It's just a set of (yesterday, tomorrow, today).


Doesn't that just push all the responsibility onto every piece of code that uses the datatype? 'Remember to sort the contents of this set every time before using it' sounds like it is asking for trouble.


The fact that it's a set can be hidden, and any queries that depend upon an ordering can be presented as a sorted view or whatnot.


So finding out a contract type from a timestamp is linear?


The set, plus an ordering over the elements, is sufficient. It perfectly defines and represents the intervals mathematically. What implementation of a set you use is a practical detail.

When implementing this in practice, you probably want to use a set implementation that gives fast ordered access. But that is a performance consideration, not a data-representation consideration.
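
For example (a TypeScript sketch, assuming the boundary dates are kept as a sorted array): finding the period containing a timestamp is then a binary search rather than a linear scan.

    // boundaries is sorted ascending; each period i covers
    // [boundaries[i], boundaries[i + 1]).
    function periodIndex(boundaries: Date[], t: Date): number {
        let lo = 0;
        let hi = boundaries.length - 1;
        let answer = -1;                    // -1: before the first boundary
        while (lo <= hi) {
            const mid = (lo + hi) >> 1;
            if (boundaries[mid] <= t) {     // Date comparison works via valueOf()
                answer = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return answer;
    }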


I think the example's last visualization is confusing.

The timeline is {date1, date2, date3, date4}. Let's say you have 2 periods, date1 - date3, and date2 - date4. Period 1 can be represented as {date1, date2, date3}. Period 2 can be {date2, date3, date4}.

Am I understanding this correctly?


This line raised a huge red flag for me:

"If the customer doesn’t have a fixed contract, it is assumed they are on a default contract"

No. Don't assume, specify. Explicitly.

If this is part of your specification, have a DefaultContract entity of some kind somewhere. And don't call this table just "Contracts", make it clear that those exist in addition to or overlay a default contract.

It might sound like overkill, but in my experience in business application development, one of the largest and most painful sources of errors and refactoring headaches is implicit assumptions in the data model.


> have a DefaultContract entity of some kind somewhere

This is the OO mindset described at the bottom - the odd compulsion to have a reified entity for every concept.

A lot of people have missed that the representation you persist doesn't have to match the representation you present. In this case, don't store default contracts, but present them e.g. via a database view.


But none of this has to do with OO. This is an argument at the relational level.

And yes, if there's a concept as fundamental to the business model as a default contract that's in effect when no other contracts overrule it, then IMO it damn well should be represented explicitly in the persisted data model.

I didn't talk about the specific nature of the representation. The important part is that the intent of the data should be explicit - data lives longer than code.

I've seen far too many DB schemas that lean too hard on implicit assumptions and inferred information, which leads to hard-to-understand data models, unnecessarily complex (and hard-to-optimize) data access (no matter the paradigm), and, well, lots of errors.


> If this is part of your specification, have a DefaultContract entity of some kind somewhere.

Yes, but not in the database table for contracts. That's how I read this part of the post. I would expect the assumption that if the customer doesn't have a fixed contract, they are on a default contract to be encoded in business logic somewhere in an application that uses this database.


> I would expect the assumption that if the customer doesn't have a fixed contract, they are on a default contract to be encoded in business logic somewhere in an application that uses this database.

That's exactly the kind of harmful assumption I'm talking about. Harmful in that people act on that assumption and actually implement it that way.

How a default contract might be represented may vary, and of course it is in no way required or even sensible to be stored as a row in the contracts table.

But to think that it is such a fundamentally different kind of data that it should be represented apart from the rest, in a different system, in a different layer, in a completely different form of representation - this is where madness lies.


> How a default contract might be represented may vary, and of course it is in no way required or even sensible to be stored as a row in the contracts table.

But you're saying it does need to be represented somewhere in the database? That putting it anywhere else is harmful?

Can you elaborate? Why is it harmful?


> to think that it is such a fundamentally different kind of data that it should be represented apart from the rest, in a different system

The concept of a "contract" is a business logic concept to begin with. That concept is represented in the application already. How (or whether) the data associated with a particular contract is stored in the database is an implementation detail.


Yeah, there's a lot to be said for being explicit. For example, what if even on a default contract you want to start tracking payment adherence, or maybe sales information, attributing it to the sales team's numbers?

Or maybe you start reporting and want reports on contracts vs default contracts.

And then things get messy. Because the real world gets messy.


This violates one of my core, learned-the-hard-way, database design constraints: querying of rows should never have to rely on any linear dependence on any other row in the same table. If you ever have to do some sort of inner join of a table on itself to bring in the single "next" row to tell you information about the current row, then you are stuffed. Query complexity explodes. Performance takes a nose dive.

The end date of a contract is a property of that contract, not any other.

Forgetting all other contracts for a moment, what do you need to know about one contract? There should be a straight forward way to query that contract on its own, with a query that represents a tree through tables in the database. It should not become a graph, with the potential for cycles that graphs allow.

And I get the business requirements could need no overlaps, but gaps are clearly possible if a customer leaves for a while and then comes back later. Does that person need to then become a new "customer", because you don't allow gaps? And then are your customers' PII only allowed to be registered to a single account? Comcast has been a huge pain in the ass in years past because of moving, gaps, and email address reuse.


Not certain I get the needless attack on OOP in this text. The error could just as well have happened in any alternative to OO. What is needed is the realization that there is a schedule that needs to have full control over the times to avoid coordination issues. That is a realization that is utterly independent of the OOP-ness of the eventual solution.


The problem with a narrow OO mindset is that it encourages encapsulation and atomic objects with hidden state. Simply gluing those pieces together can create suboptimal representations - a more holistic thought process is better.


Not necessarily. Some people can't see the forest for the trees no matter what kind of programming language they're using.

What's needed here is to sit down and think REALLY hard about how the system works and what some of the terms are that are used to describe things. And also to have an intuition for when someone is using imprecise language. GP cited use of a "Schedule" object as a way to enforce the invariant. That might represent a block of intervals by a sequence of times.

The less precise you are about what's needed and how the system should work, the more of a soupy mess you're going to build.

People gravitate toward OO because OO makes it feel like you've got a lot of conceptual clarity. You can get back to that familiar "subject/object" dual we love in English. But the problem is actually that, although people can pick subjects and objects readily from a sentence, they might not have much luck picking the important subjects and objects from a sentence. And that's the problem we really need to solve.


It's as if OOP code negates putting thought into designing a solution; an absolute non sequitur for me.


This is part of an approach to programming that is more popular in the functional world.

You take the requirements and make the system fit them exactly, and don't bother with any other concerns.

When you design this close to the requirements you get better, faster and more elegant code that's easier to understand - but when requirements change you have to do much more work to adapt. Suddenly a state which was previously invalid is valid, or a part of the system that only needed one kind of input needs 4 different inputs from separate parts of your code. Have fun basically rewriting your program.

That's IMHO the main motivation behind the differences between functional and OO programming: how close to the requirements you want to design your code.


This kind of approach (type-driven development) was already popular in Algol-derived languages, hence why cowboy coders would accuse those of us on the Ada, Modula-2, Object Pascal side of the fence of programming in a straitjacket.


I don't think it maps 1-1 with strong vs weak typing.

C (a very loosely typed language) code is usually pretty close to the requirements, while very strongly typed functional languages (like Haskell) often go for "make a DSL, write the specification in it, then run it".

Meanwhile object-oriented languages often have pretty strong type systems and cultures of using them extensively, but they also encourage designing with margin for change (and thus using lots of layers of abstraction instead of just implementing the specification as elegantly as possible).


There's a great talk from Richard Feldman that talks about this in the context of Elm. https://www.youtube.com/watch?v=IcgmSRJHu_8


My favourite real-world example is something like: you have a React component which accepts the boolean props `showFoo`, `showBar` and `showBaz` to control its mode. The intention is that exactly one of `foo`, `bar` or `baz` will be shown at any given time, but that invariant is maintained only loosely, e.g. by a parent component which holds three flags in its state and mutates them in tandem inside various event handlers.

The obvious problem is that those three props can easily get out of sync if you make a mistake in updating them, and the equally obvious solution is to replace them with a single prop (e.g. `show` or `mode`) which contains a value from some enumeration — maybe just the name of the thing to show, or a numeric ID that is given meaning elsewhere, or (galaxy brain) a component. That way the invariant is maintained strongly and automatically by the representation itself.

This example sounds vacuous — who would ever use three booleans in the first place? — but in practice it’s very easy for UI code to incrementally get into this mess over time without anyone noticing. The situation is also often more subtle, e.g. the invariant is more complex than “exactly one flag is true”, which makes it harder to spot that you can model all the valid states with a finite enumeration.
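
In TypeScript the difference shows up in the prop types alone (a sketch using the names from the example above, not from any particular codebase):

    // Loose version: 8 representable combinations, only 3 of them valid.
    interface LoosePanelProps {
        showFoo: boolean;
        showBar: boolean;
        showBaz: boolean;
    }

    // Tight version: exactly one thing shown, by construction.
    type Mode = "foo" | "bar" | "baz";

    interface PanelProps {
        mode: Mode;
    }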


A great talk on this subject is "Making Impossible States Impossible" by Richard Feldman from Elm Conf 2016

https://www.youtube.com/watch?v=IcgmSRJHu_8


I was pleased and a bit surprised to see this post talk about a database-level approach to this problem. As important an idea as it is at the application code level where most posts discuss it, especially in the context of type systems like Haskell's, I think it gets neglected when it comes to persistence.

For those of us developers who are mere CRUD peons I think it's the most important factor in system stability that is mostly neglected; either in favour of speed of iteration (NoSQL) or checks at the application code layer.

As I'm increasingly coming to appreciate, systems without enforced integrity at the database level are a breeding ground for bugs. You can add checks in application code but all it takes is 1 bad commit, or 1 check that slipped your notice and now you have bad data and all future code in the system needs to support and work around the bad data. With foundations of sand even the most elegant structure in application code is doomed to a short and catastrophic future.

As other commenters mention hindsight is 20:20 and you won't always know what the constraints should have been until after the fact, or the constraints might be wrong. But the 'trendy' development practices treat good old fashioned SQL constraints and data integrity as decidedly unsexy, to the detriment of a lot of systems.

MySQL didn't even have check constraints (well, didn't actually apply them) until version 8, which shows how ignored these things are. I appreciate the post is more about the fundamental design of the stored data, but people are also forgetting unique constraints, foreign keys and all the other tried and tested tools which protect the most important part of most CRUD systems, the data, from devolving into an awful mess.


I don't see how the contracts example is simplifying things while staying realistic. What if I need separate kinds of default contracts for different classes of customers? What if I need to modify the details of a certain type of default contract for all customers that are using it? And on and on. If storing them in the contracts table is not good (which I don't really agree with), then where should we store this variety of default contracts? How do I join them with the contracts table to know which customers have default contracts? Or group customers by type of default contract?

Keep the model sensible. Contracts belong to contracts. Add basic sanity to the model. The service that manages the data guards the data beyond basic data model sanity checks. Also, explicit is better than implicit.


Presuming you are 100% sure what "invalid" is: it is possible that being _unable_ to represent a logic error means the error shows up as a "valid but incorrect" value instead, which is even more dangerous.

Take the example of storing a continuous series of date ranges. If I only store the first date of each pair, I can never accidentally have an overlap or gap. But if my code has a logic error that incorrectly calculates a range, being able to represent the bad range means validation could catch it and throw an error. If that code error translates to an incorrect break-point instead, I haven't prevented a bug, I've hidden it.


Ah, good old design for systems with no special business cases. Heard of them, but never seen one.


Related video on this subject: https://www.youtube.com/watch?v=IcgmSRJHu_8


I thought about that video today when integrating with an old SOAP API. I need to find the name and some other property of a set of persons. Instead of getting a list of [(name1, prop1), (name2, prop2), ...], I get two separate lists, [name1, name2, ...] and [prop1, prop2, ...]. In practice I think the lists will always match. But there is nothing stopping them from being different lengths, or even worse, one having a gap...


Put an assertion. It's better to throw early rather than dealing with wrong data later.


“Make invalid states unrepresentable” is a type of fool’s gold. You don’t care that invalid states are unrepresentable, you only ever care that a specific instance of your running program is very unlikely to enter an invalid state - and the difference between formally disallowing invalid states vs. test coverage that proves a reasonable likelihood of avoiding invalid states is huge.

The extra code and conceptual complexity spent to make type designs that disallow invalid cases is a liability, it comes with its own bugs, maintenance and huge risks of premature abstraction and brittleness in the face of changing requirements.

If it takes anything more than a simple enum-style menu of permitted options, then it's a code smell. Things like Scala case classes (especially with sealed behavior), or pattern matching against type constructors, or phantom types - these are all very bad ideas, where the costs far outweigh the benefits.

Most of the time you can just ignore enforcement of assumptions, and add a few assert statements plus lightweight unit tests and integration tests that generate an abundance of real world example cases - and achieve all the safety you need for a fraction of the code & conceptual complexity and tech debt incurred by false promises of enforcing correctness with type system designs.


Surely the data model is helpful not only in keeping data integrity intact by disallowing invalid states, but also in helping you think about your data and discover simplifications and subtle rules that improve your model.

I attacked the old 8-queens problem years ago as part of a contest in Byte Magazine. All the published solutions modelled the board as an 8x8 array with a 1 or a 0 to indicate the presence of a queen. They all ran slowly and suffered from invalid game states confusing the algorithms.

My solution was to observe that only one queen could be in each column (all solutions require that queens cannot capture one another, and they can capture vertically). So I represented the board as an array of 8 values, where the value at each index is the row of the queen in that column.

Further, since only one queen can be in each row, the values were the numbers 1-8.

My solution, then, was to seed the array with the values 1, 2, 3, 4, 5, 6, 7, 8, then test whether array[i]-array[j] == (i-j) or (j-i), which would mean a diagonal capture.

Simply permuting the values searched a subset of board states that had to contain all possible solutions. And the permutation tree could be truncated as soon as the test failed for any (i, j).

Anyway, the program was tiny and finished in negligible time. A pity I didn't enter the contest!
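
For the curious, the same representation trick in a few lines of TypeScript (a backtracking sketch rather than the original permutation program, but the board is still just an array mapping column to row):

    // rows[c] is the row of the queen in column c. Restricting the board to
    // a permutation of 0..7 makes "two queens in the same row or column"
    // unrepresentable; only diagonal captures need checking.
    function solve(rows: number[] = [], n = 8): number[][] {
        if (rows.length === n) return [rows];
        const solutions: number[][] = [];
        const col = rows.length;
        for (let row = 0; row < n; row++) {
            if (rows.includes(row)) continue;   // one queen per row
            // prune the search as soon as a diagonal capture appears
            const diagonal = rows.some((r, c) => Math.abs(r - row) === col - c);
            if (!diagonal) solutions.push(...solve([...rows, row], n));
        }
        return solutions;
    }

    // solve().length === 92 for the standard 8x8 board.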


This is very much like database normalization, in that it has the benefit of making invalid data impossible, but the drawback of often making queries into the data much more cumbersome and usually also inefficient.

As with database normalization, it is a good idea to first do it as much as possible, and then denormalize again until it is fast enough.


I have always had the idea of a database that does the denormalizations you want automatically for you.

Essentially, you keep the DB in a normalized state. You define views of the DB that you want. Then the DB keeps those views as tables for you, and the DB does all of the hard work of keeping those view tables consistent with the normalized data. Essentially the DB does the atomicity, cache-invalidation, and cache-updating for you.

You get performance, and you get the certainty that invalid states are un-representable.

I guess the biggest blocker here is in automatically determining what fields do and do not need to be updated automatically?




I'm not sure that this is what I think about, when I think of "Making Invalid States representable" (A concept that I practice).

That said, it's an excellent, commonsense article that describes a highly usable approach to information architecture.

I also agree that OOP programmers have always considered their designs to "represent the 'Real World'™." In my experience, I use OOP constructs to represent many things that should never be exposed to the user (like messages, adapters, states, and state transitions).

There's the classic usability concept of the "Mental Model." That is the model that the user builds in their head, as they navigate the UX. These mental models can be drastically different from what happens internally, and a good UX designer can reinforce a desired model (which the user may then ignore).


Um..."unrepresentable". :P


>I think this happens because of atomistic, object-orientated thinking.

If you think storing a list of date tuples is "OOP thinking", you have no clue what OOP really is. Educate yourself by listening to people who invented it, not Java consultants or FP zealots.

OOP is about interacting with things via interfaces and messages, rather than data. An OOP solution to inconsistencies of this sort would be an interface that either automatically corrects inconsistencies or throws errors when you try to introduce them. The whole point of the OOP approach is that you're not locked into a single data representation, so, for example, you can improve how you store data without re-engineering everything in your system that relies on that data.


The OP's conclusions are askew. If you remove the default contracts from the contracts table you are left with fixed term contracts. This means customers without fixed term contracts are assumed to have default contracts. Agree so far. But removing the default contracts removes the contract start date. Meaning the default start date HAS to be the customer start date and not a separate date. This changes the functionality of the system, not the state representation.


I appreciate what people are getting at with “make invalid states unrepresentable”, but it can't be the best description of what the true objective is. After all, the first language to make invalid states unrepresentable was probably TECO https://en.wikipedia.org/wiki/TECO_(text_editor)#As_a_progra... ...


Regarding "make invalid states unrepresentable", I'm curious what others think of FSM (Finite State Machines), eg use of XState (vs eg Redux) in the FE webapp state mgmt space.


Another application of this principle is in data representation, and why I think text-based formats are horrible in general for communication between software that doesn't involve a human reading it the majority of the time: there, the "invalid states" not only cause complexity in the parser, but they also waste space (think of storing a 4-byte integer as the 4 bytes directly, vs. a string of variable-length ASCII text.)
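
A quick illustration of the size difference (a TypeScript-on-Node sketch):

    const n = 305419896;                            // 0x12345678

    const binary = Buffer.alloc(4);
    binary.writeUInt32BE(n);                        // always exactly 4 bytes

    const text = Buffer.from(String(n), "ascii");   // 9 bytes here, and
                                                    // variable length in general

    console.log(binary.length, text.length);        // 4 9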


This is a nice idea, but as mentioned in other comments, I think the most important goal when designing a schema is to make the queries you'll frequently be making simple and efficient. Extensibility comes second, and if you can make some invalid states impossible to represent, then that's definitely a bonus.


Customer signed a document? Great, job one is to get all the data from those filled-in blanks into the database accurately.

Whatever the customer signed is the authoritative contract, not some imaginary entry in a schema intentionally constructed to make it impossible to represent scenarios not envisioned by the developer.


If you enjoyed this blog post and want to go into more depth on how to model your domain using types, check out

https://fsharpforfunandprofit.com/series/designing-with-type...


Just use the language of the business/domain to write your model/API. No need to reinvent names and rules that already exist. Do whatever you want with the underlying implementation, your DB, etc.


It looks like a variation of Run Length Encoding to me, where the 'Run' is a duration instead of a count.


You could also see this as a form of single point of truth.


This is sound advice for sure, and I think it applies much more broadly (or do I mean deeply?) than just for databases.

For instance, one micro-application of it that makes a lot of sense to me is the const-ness of variables in languages like C. Since a normal variable can be overwritten, and that affects the use and semantics of that variable, marking them as const whenever possible really helps in my opinion.

For instance, take this micro-snippet of code from Redis [1]:

    int time_independent_strcmp(char *a, char *b) {
        char bufa[CONFIG_AUTHPASS_MAX_LEN], bufb[CONFIG_AUTHPASS_MAX_LEN];
        /* The above two strlen perform len(a) + len(b) operations where either
         * a or b are fixed (our password) length, and the difference is only
         * relative to the length of the user provided string, so no information
         * leak is possible in the following two lines of code. */
        unsigned int alen = strlen(a);
        unsigned int blen = strlen(b);
        unsigned int j;
        int diff = 0;
Here, it seems quite important that the values of 'alen' and 'blen' do not change during the execution of the function, since the function iterates over them. The 'diff' variable, on the other hand, is intended to change as a function of all the characters in both strings; that's the whole purpose of the function.

So, I think the middle two lines should be:

    const size_t alen = strlen(a);
    const size_t blen = strlen(b);
That "locks" the values in, so you know that for the rest of the function at least these two values stay the same. Since changing either length mid-function would represent an invalid state, I think this is close to the OP's point.

Also please note that I have massive amounts of respect for Redis and Antirez, I'm not trying to say that the code is bad or anything, it was simply the first file in the first high-profile open source project that came to mind. Obviously this code works and has probably been more tested than most things I've written, again I'm NOT trying to somehow paint that program(mer) in a bad light.

Btw, changing the type (to me) to size_t is also an obvious, free improvement, since it frees the reader from having to worry about why the type was unsigned int to begin with. Also, 'int' can be narrower than 'size_t', which again is probably not a problem in practice since CONFIG_AUTHPASS_MAX_LEN is probably always going to be even less, but still. It's pointless complexity that triggers anxiety in people like me. :)

[1]: https://github.com/redis/redis/blob/unstable/src/acl.c


Software engineer here.

Does anyone have any questions?


lol, you should post this in every thread


It’s not going well


Wait, is kevinmahoney.co.uk your blog?


No, just a software engineer


Dude, this is HN - we're all either software engineers, or people pretending to be software engineers

edit: or related hangers-on, like entrepreneurs or "thought leaders"


I can answer questions for VCs then I guess


It was a joke, son.


But jokes aren't allowed here


Sure: what's the value in $, engineer time, number of clients calling, etc, where this is a good tradeoff or a bad tradeoff?


Lol, nice one.



