Some items from my “reliability list” (rachelbythebay.com)
567 points by luu on July 25, 2019 | 169 comments



Required reading. Have you worked as an SRE?

> Item: Rollbacks need to be possible

This is the dirty secret to keeping life as an SRE unexciting. If you can't roll it back, re-engineer it with the dev team until you can. When there's no alternative, you find one anyway.

(When you really and truly cross-my-heart-and-hope-to-die can't re-engineer around it fully, isolate the non-rollbackable pieces, break them into small pieces, and deploy them in isolation. That way if you're going to break something, you break as little as possible and you know exactly where the problem is.)

Try having a postmortem, even informal, for every rollback. If you were confident enough to push to prod but it didn't work, figure out why that happened and what you can do to avoid it next time.

> Item: New states (enums) need to be forward compatible

Our internal Protobuf style guides strongly encourage this. In fact, some of the most backward-compatibility-breaking features of protobuf v2 were changed for v3.

> Item: more than one person should be able to ship a given binary.

Easy to take this one for granted when it's true, but it 100% needs to be true. Also includes:

* ACLs need to be wide enough that multiple people can perform every canonical step.

* Release logic/scripts needs to be accessible. That includes "that one" script "that guy" runs during the push that "is kind of a hack, I'll fix it later". Check. It. In. Anyway.

* Release process needs to be understood by multiple people. It doesn't matter that they have access to perform the release if they don't know how to do it.

> Item: if one of our systems emits something, and another one of our systems can't consume it, that's an error, even if the consumer is running on someone's cell phone far far away.

Easy first step is to monitor 4xx response codes (or some RPC equivalent). I've rolled back releases because of an uptick in 4xxs. Even better is to get feedback from the clients. Having a client->server logging endpoint is one option.

And if a release broke a client, rollback and see the first point. Postmortem should include why it wasn't caught in smoke/integration testing.


> Release logic/scripts needs to be accessible

More than accessible: all scripts must be in git just like source code, because they are also source code.

Building releases should be done using a fresh VM. Creating and configuring that VM should be done only by a script which is also in git.

Everything needs to be automated using scripts. If you type "apt-get" on the command line you have lost reliability. When multiple people are involved, such manual setups will be a problem at some point: people make mistakes. Manual steps also mean you have lost good testing of the build process.


> Required reading. Have you worked as an SRE?

She was both a long-time SRE at Google and a long-time PE at Facebook.


I suspected.

Couldn't find a resume on her site to confirm it.


Her RSS is very much worth a subscription.


> Rollbacks need to be possible

I always feel like people who write these never faced SQL schema changes or dataset updates. I wonder what rollback plans are in place for complete MySQL replication chains, for example.


This is addressed in the article (in passing). You split the changes up -

First make sure your code handles both old and new schemas

Then introduce the schema change

Then introduce the code which depends on the schema

Each of these steps is performed separately and monitored for problems, with rollback possible at each stage. The most painful thing to roll back is the database, but it is possible, though rarely necessary if done in isolation and tested against the old code first.

There may be a few more steps after if you want to tidy up the schema by removing old data etc. It's trading off complexity in development/deploy for reliability.
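
A rough sketch of the schema half of that sequence (table and column names made up):

  -- phase 2: additive change only; harmless if the new code gets rolled back
  ALTER TABLE customer ADD COLUMN display_name VARCHAR(255) NULL;

  -- phase 3 is a code deploy that starts reading/writing display_name;
  -- that deploy can be rolled back on its own while the schema stays put

  -- much later, once nothing references the old column any more:
  ALTER TABLE customer DROP COLUMN legacy_name;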


Pretty much this.

Writing code that can handle both old and new schema is admittedly annoying. But it's safe and forces a thoughtful rollout.

Even data mutations can be rolled back. Dual storage writes, snapshots, etc.

The goal isn't to eliminate risk, it's to reduce the risk and make it calculated and bounded. Heck, having a rollback plan that says "we'll restore a snapshot and lose X minutes worth of data mutations" is fine in some cases. It's tradeoffs.

(I've seen a case where literally adding a column caused an outage, because the presence of the column triggered a new code path. Rollback was to delete the column.)


And to add to this, one of the brutal anti-rollback gotchas I’ve seen on smaller projects (not distributed systems failures, but code bugs that put you in a painful spot): not having transactional DDL for your schema changes.

The typical scenario I’ve had to help with: MySQL database that’s been running in prod for a while. Because of the lax validation checks early in the project, weird data ends up in a column for just a couple rows. Later on you go to make a change to that column in a migration, or use the data from that column as part of a migration. Migration blows up in prod (it worked fine in dev and staging due to that particular odd data not being present), and due to the lack of transactional DDL, your migration is half applied and has to be either manually rolled back or rolled forward.


In my experience this is rare, but I suppose it depends what you mean by "Migration blows up in prod". Do you mean the ALTER TABLE is interrupted (query killed, or server shutdown unexpectedly)? Or do you mean you're bundling multiple schema changes together, and one of them failed, which is causing issues due to lack of ability to do multiple DDL in a single transaction?

The former case is infrequent in real life. Users with smaller tables don't encounter it, because DDL is fast with small tables. Meanwhile users with larger tables tend to use online schema change tools (e.g. Percona's pt-osc, GitHub's gh-ost, Facebook's fb-osc). These tools can be interrupted without harming the original table.

Additionally, DDL in MySQL 8.0 is now atomic. No risk of partial application if the server is shut down in the middle of a schema change.

The latter case (multiple schema changes bundled together) tends to be problematic only if you're using foreign keys, or if your application is immediately using new columns without a separate code push occurring. It's avoidable with operational practice. I do see your point regarding it being painful on smaller projects though.


Generally in my experience it's been bundled schema changes. As a (dumb) example in SQL-like pseudocode (I've been doing EE work recently; apologies for imperfect syntax):

  ALTER TABLE user ADD COLUMN country; -- oops
  CREATE TABLE user_email (blah blah); 
  -- process here to move data from user.email column to user_email table. This blows up because someone has a 256-byte long email address and the user_email table has a smaller varchar()
  ALTER TABLE user ADD state_province;
Yes, this is sloppy and should be cleaner. The net result is still the same though; you have the user_email table and country column created, and due to MySQL DDL auto-commit, they're persisted even though the data copy process failed. The state_province column does not exist, and now if you want to re-run this migration after fixing the problem, you need to go drop the user_email table and country column.

With e.g. Postgres, you wrap the whole thing in a transaction and be done with it. It gets committed if it succeeds, or gets rolled back if it fails.
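
A sketch, reusing the made-up tables from the example above:

  BEGIN;
  ALTER TABLE "user" ADD COLUMN country varchar(2);
  CREATE TABLE user_email (user_id bigint, email varchar(255));
  INSERT INTO user_email (user_id, email) SELECT id, email FROM "user";
  ALTER TABLE "user" ADD COLUMN state_province varchar(100);
  COMMIT;
  -- if the INSERT blows up, the transaction aborts and nothing above it
  -- persists, DDL included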


Makes sense, thanks. fwiw, generally this won't matter in larger-scale environments, as it eventually becomes impractical to bundle DDL and DML in a single system/process/transaction. In other words, the schema management system and the bulk row data copy/migration system tend to be separated after a certain point.

Otherwise, if the table is quite large, the DML step to copy row data would result in a huge long-running transaction. This tends to be painful in all MVCC databases (MySQL/InnoDB and iiuc Postgres even more so) if there are other concurrent UPDATE operations... old row versions will accumulate and then even simple reads slow down considerably.

A common solution is to have the application double-write to the old and new locations, while a backfill process copies row data in chunks, one transaction per chunk. But this inherently means the DDL and DML are separate, and cannot be atomically rolled back anyway.
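
A rough sketch of the backfill half (names made up; the application is double-writing to both places while this runs):

  -- one chunk per transaction; re-run with an advancing :last_id placeholder
  -- until no rows are copied. INSERT IGNORE because the double-writes may
  -- have already created some of these rows.
  INSERT IGNORE INTO user_email (user_id, email)
  SELECT id, email FROM user
  WHERE id > :last_id
  ORDER BY id
  LIMIT 1000;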


Yeah, definitely for larger scale systems the process gets quite a bit different, although "large scale" will remain undefined :).

It's admittedly been a long time since I've used MySQL for anything significant, but I feel like teams in the past have run into issues where DDL operations on their own have succeeded in dev/staging and failed in prod, even though they're running on the same schema. Simply due to there being "weird" data in the rows. I don't know that for sure though. If I'm remembering right, one of those had something to do with the default "latin1_swedish_ci" vs utf8 thing...


Ideally, if you aren't sure any of your environments matches prod, you clone prod first, apply your schema migration there, install your applications, and run their tests.


This, absolutely! I used to see the high-level plans at BT for what was at the time the biggest IBM system (DB2, in this case).

They planned changes to the system years in advance, in lock step with changes to the physical plant, and incrementally introduced changes.

e.g. releases 55 and 56 would introduce stuff that would be used 6 months later in release 58 ("Caller ID works on every phone in the country").


I have a simple mantra that covers this and other situations: “always release readers before writers”.


Building datasets up via migration sets, not manual updates, and storing source data will solve both of those issues. If you also need fresh data, store snapshots, which you should be doing anyway. Put briefly, with a thought-out approach I don't see this as a bigger issue than anything else on the list.


fwiw, in this case the author (Rachel) was a very senior production engineer on Facebook's cross-functional Web Foundation team, which worked closely with the db ops team on many occasions. A huge amount of automation existed around both schema changes and dataset migrations, so rollbacks generally weren't a problem.


Databases suck for this, but don't forget "restore from backup" can be a viable rollback option.

Apart from that you can do stupid SRE tricks like copying tables around in-memory before schema changes.


My answer to this is that it's why we have backups.


SRE: Site Reliability Engineer


Second step would be to have a way for the client to report an error. I’ve seen a number of systems where this isn’t the case — just looking at the server logs everything looks okay, until a week later the client calls to say they haven’t gotten a file in a week and wonders if something is wrong.


"Check. It. In. Anyway."

It's consistently a source of sorrow to me how many bugs exist and how much inefficiency there is because people don't want to be embarrassed by the code they quickly wrote.


Check. It. In. Anyway.

Amen


> Also, if you are literally having HTTP 400s internally, why aren't you using some kind of actual RPC mechanism? Do you like pain?

I just had a discussion about this yesterday where we have an internal JSON API that auths a credit card, and if the card is declined it returns a status and a message. Another developer wanted it to return a 4xx error, but that made me uneasy. I think you could make a good argument either way, but to me that isn't a failure you'd present at the HTTP layer. 4xx is better than 5xx, but I was still worried how intermediate devices would interfere. (E.g. an AWS ELB will take your node out of service if it gives too many 5xxs, and IIS can do some crazy things if your app returns a 401.) Also I don't want declined cards to show up in system-level monitoring. But what do other folks think? I believe smart people can make a case either way.

EDIT: Btw based on these Stack Overflow answers I'm in the minority: https://stackoverflow.com/questions/9381520/what-is-the-appr...


In my opinion, a rejection is an expected outcome, and therefore should have a response code of 200. You're not asking, "Does this card exist?", and sending a 404 if you have no record. You're asking a remote system to do a job for you, and if that job completes successfully, it's a "Success" in the HTTP world.

At that point, you rely on the body content to tell you what the service correctly determined for you. A result that the user doesn't like is way different than a result that comes about because something was done wrong at the client side (4xx) or a failure on the server side (5xx).
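
Something along the lines of (field names invented here):

  HTTP/1.1 200 OK
  Content-Type: application/json

  {"result": "declined", "decline_code": "insufficient_funds"}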


It's the difference between believing you are using HTTP as a transport mechanism and believing you are using it to create a DSL protocol.

Personally I prefer to think of HTTP as a transport mechanism, with business logic at a higher level. The question "If I replaced this http transport with, say, FTP or a dvd drop, would it require changes at the business logic level?" feels important to me.

That preference puts me on the wrong side of history at the moment though, and I'm going with history. Let's express our business logic through the error system provided by our transport mechanism instead....


> You're asking a remote system to do a job for you, and if that job completes successfully, it's a "Success" in the HTTP world.

But a provided credit card number not existing is not success, it is unambiguously a failure.

HTTP is a transport but its response codes are clearly designed to map somehow to the "business logic" of the thing underneath it. We have to twist ourselves up in knots to map most of our business systems to the set of status codes that make sense for the default HTTP-fronted application (Fielding's thesis stuff), but that's what we sign up for when we decide to expose our business services via HTTP. Other transports require different contortions.


>But a provided credit card number not existing is not success, it is unambiguously a failure.

But the failure is ambiguous. If I get a 404 back I can't rely on that to mean the credit card doesn't exist. It could mean that I have the wrong URL or that the service application screwed up their routing code.


You wouldn't put a card number in the path anyway (for obvious reasons). Far more sensible to put that in the request body.

And who's to say you can't put the reason in the body and still keep the code? What are you hurting by sending back 400? Unless you have lb's taking out nodes because of excessive 4xx's (which sounds like insanity) I don't see a reason _not_ to send 4xx's. At the very least it's a useful heuristic tool.


What are the obvious reasons? I'll presume you are referring to disclosure of the card number.

I had this discussion recently about 'security' with regard to X-Header versus ?query=param. Either it's http all plaintext on the network or it's http with tls all cyphertext on the network. Every bit in the http request and response is equivalent - verb, path, headers, body, etc - agree?

You could represent the card number as cyphertext in the request body, that's a good practice regardless of tls, but of course don't roll your own crypto. You could put that cyphertext in the path as well but if the cyphertext isn't stable that makes for a huge mess of paths.

You could make a case for trad 'combined' access logs situation with the path disclosed in log files. I can appreciate keeping uris 'clean' makes it safe to integrate a world of http monitoring tools, I would make this argument. In the case of the card represented in a stable cyphertext it's kinda cool to expose it safely to those tools.

Anything else?


If you grab something from an external service (say, a CDN) then I believe by default the referrer will contain the URL + query params, but not your X-Supersecret header. Bit me once.


>And who's to say you can't put the reason in the body and still keep the code?

Often systems will have application-wide error handling to catch that and handle it in a systemic way. It can be a pain to short circuit that in customizable off-the-shelf applications like Salesforce.

Philosophically, 4XX means the client did something wrong. Sending invalid data to a validation service is not doing something wrong.


> Philosophically, 4XX means the client did something wrong. Sending invalid data to a validation service is not doing something wrong.

The precise definition of 400 Bad Request is

    The 400 (Bad Request) status code indicates that the server cannot or
    will not process the request due to something that is perceived to be
    a client error (e.g., malformed request syntax, invalid request
    message framing, or deceptive request routing).
There's some room for interpretation for the phrase "process the request". If the only job of the service is to validate the request and return Valid or Invalid, then I guess yeah even an invalid request will have been processed, and 400 Bad Request may not be totally right. I think you could make a case for it! But I think you could also make a case for 200 OK with a more detailed error message and code in the body.

But if the service both validates the request and _then_ does something more substantial with it, I think 400 Bad Request is probably the most appropriate response to something that fails at step one.


Exactly. The parent is thinking along the lines of “but when the error is thrown successfully, it actually means success!” and at the same time missing out on an opportunity to leverage a well-understood error handling standard in HTTP.


Agreed. What else would you use 400 for? I don't see how sending back 400 is going to hurt.

The payoff of using 400 is you can watch your 400 rates with almost no effort (HTTP is well established and there's many many tools out there.) If you somehow start accidentally munging the card number sometimes or if your card processor starts doing wacky stuff you'll see a spike in 400 rates.

If it was really that troubling that declined cards are expected, I would personally at least want to see 200 come from the internal API and 400 go out to the client.

And if your "intermediate devices" start doing goofy stuff to 400's then you've got bigger problems... 4xx's shouldn't be taking nodes out of prod. That's wack.


In my opinion 404 should really mean the API endpoint doesn't exist and therefore the requested action did nothing. Using HTTP for business logic is like using ethernet frames with parts of IP. It is another level, HTTP is just a transport. 404 - not found, 5xx server problem, 301 - endpoint is now elsewhere etc. HTTP is for browser pages but you are actually making API requests using your JSON(or urlencoding or whatever)-based application protocol.


> 4xx's shouldn't be taking nodes out of prod. That's wack.

Welcome to AWS, where this is actually part of standard procedure (at least Elastic Beanstalk does this, not sure if it’s actually Elastic Loadbalancer under the hood).


Don't you love it when all configured backends/origins for a service are marked unhealthy by healthchecks, and Amazon's load balancers just route traffic to all origins anyway?


I mean, the alternative is routing it nowhere. Might as well throw it at the wall and see if it sticks.


>What else would you use 400 for?

400 is "bad request"; you might use it if the request body was not valid JSON.


It is a failure, but not of the HTTP service. It's a failure of the RPC call.

404 means "this API doesn't exist", not "the API exists but it returns an error"


404 does not mean "this API doesn't exist". It means, quote,

    The 404 (Not Found) status code indicates that the origin
    server did not find a current representation for the target 
    resource or is not willing to disclose that one exists.
Resource here is a transport-level (HTTP) representation of a business-logic concept, in most cases a document, but if your HTTP server happens to front a credit card validation service, potentially a credit card. Or a user. Or a recipe.


I typed up a big spiel about the difference between "expected in business logic failures" and "infrastructure failures", but then I looked at the wikipedia page for HTTP error codes and it does look like some HTTP 400 level errors are intended for cases like clients sending data that fails business logic level validation and the like.


That's right.

And let me say it's totally valid to just ignore that advice, and say that here, in this company, we use 200 OK for everything that isn't a parse failure. And that's not totally wrong! It won't break anything. You're not really leveraging any of the power of your transport but maybe that's fine for your situation.


I think this is the exact right way of looking at it. 4xx/5xx are for layers that are below the business logic (like "is the web server able to route this path", not "did the application successfully validate the user's input data and allow them to continue on to the next step in a process").

It can get a little fuzzy for 401 and 403 errors, though; those seem to be returned by APIs pretty often. Not entirely sure how I feel about that, but I think those are a bit more sensible.


The HTTP IETF RFC (2616) states:

> The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. There are 5 values for the first digit:

> - 4xx: Client Error - The request contains bad syntax or cannot be fulfilled

What do you think status codes like 404, 405, 406 are for? You say they shouldn't be for "did the application successfully validate the user's input data" but status code 400 is explicitly for bad requests. In your view should a HTTP server ever return 4xx?


It depends how you define validate. If the HTTP headers are malformed, sure. If a JSON or XML API receives malformed JSON or XML, sure. But if it's "register this account" and the username already exists, or "pay with this credit card" but it's not a valid credit card number, or "process this invoice" but the total is missing, then I think it should be a 200 status with an error message. It's validating the request was received and loaded by the application vs. validating the business logic requirements for the request's data payload.

Obviously if you send a TAIL method request, you should get a 405, and if you send Accept: eggs/*, you should get a 406. If a route doesn't exist, you should get a 404. If you fail HTTP basic auth, you should get a 401 (but why are you using HTTP basic auth?). If you want certain paths to never be accessed for some reason, you should return a 403.


> You're not asking, "Does this card exist?", and sending a 404 if you have no record.

Doesn't this depend on the nature of the REST API? (Though putting a credit card number on the path doesn't seem like a good practice.)


I think you're right. 4XX means "client made an error" at the application layer. Something like a malformed request (say, bad JSON passed in): absolutely, 4XX is the way to go. However, if the client sent a good message whose contents happen to be rejected, that's a 200 (and whatever error goes in the response body). The whole "did this card work" question happens in a layer above the application layer, and HTTP status codes are only meant for the application layer itself.


> HTTP status codes are only meant for the application layer itself.

You seem to be using a strange definition of "application", but assuming you mean "transport" — I mean, this is obviously not true, right?

If you sent a perfectly semantically valid HTTP request for a resource that doesn't exist on the remote, that's a 404. But by your logic, you're saying it should be a 200, because there wasn't a failure in (essentially) nginx or the HTTP library of your application server?

HTTP is a transport designed for a specific type of application (document-oriented resources) and its status codes (and lots of other things) reflect that initial coupling, so it's always gonna be a pain when you swap out the application (to a credit card processing service or whatever) and have to think about how the status code mapping will work. And you can certainly "cheat" and say that every request that doesn't crash the server is 200 OK, I guess, but that's pretty clearly not what the Hyper Text Transfer Protocol wants you to do. Its set of status codes clearly reflect details of the underlying application.


Transport is TCP/UDP/etc. HTTP is part of the application layer, see https://en.wikipedia.org/wiki/OSI_model and https://en.wikipedia.org/wiki/Application_layer.

>If you sent a perfectly semantically valid HTTP request for a resource that doesn't exist on the remote

Requesting a path (via the application layer) that doesn't exist is a client error, hence the 4XX. But, say, if your website has a search bar and a client searches for "blahblah-thisdoesntexist" and gets no results, the webpage should return a 200 (the request was valid: the client sent form data to an existing resource and it was processed successfully) with something like "no results found" in the body -- because the search function itself is above the application layer. This is what I mean about "is the CC is authorized or not" being above the application layer.

>Its set of status codes clearly reflect details of the underlying application.

This is where I think you're wrong: HTTP status codes reflect the result of the HTTP request itself, not the next application in the chain, if one exists.

This can also be expressed if we look lower down the chain. If my HTTP request gave me a 404, should an error be shown in the TCP stack? Well no, because TCP doesn't know or care about HTTP, as far as it's concerned the TCP connection is established and happy. Likewise, in this example, the API HTTP stack doesn't care about whether the data was a CC or something else or what the result was: as far as the HTTP stack is concerned, some JSON got passed in and parsed properly, and some output got sent back, and everything went well so "200".


> OSI

I understand OSI, but it's not a useful model in this circumstance, because it draws no distinction between the HTTP layer and the business logic served over HTTP. The complexities of that interface are what we're talking about.

> HTTP status codes reflect the result of the HTTP request itself, not the next application in the chain, if one exists.

Let's say I request /robotz.txt and I get 404 Not Found.

How do you maintain that 404 is a result of "the HTTP request itself" and not the HTTP server handing the request to the component that looks on the filesystem (or in the cache, or in MongoDB, or...) and not finding the resource?


Interesting question. My understanding is that a 400 response indicates that there's something malformed about the request itself such that the server can't/won't process it. Given that in order to decline the card the service has to actually process the request, I'd agree that a 400 is inappropriate.


The previous definition was 400 means the server can't understand the request. Current one is

>The 400 (Bad Request) status code indicates that the server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).

Of course, a lot of things could be said to be a client error.


I would agree with not using the http status codes to return a card processing error.

Specifically, you want to be able to easily distinguish between "your url is wrong" or "your authentication credentials are wrong" or "the API endpoint threw an exception" type errors and "credit card processing failed" errors.

It's easier long run to put business logic errors somewhere separate from protocol/routing layer (i.e. even a http header would be better) so that you can tell what is Rails/Flask/whatever failing vs. logic failing. This also gives you more flexibility to do stuff at the hardware layer (another commenter mentioned ELB) without interfering with the application layer.


4xx errors are no better or worse than 5xx. They just mean different things.

Generally, 4xx errors means 'Client screwed up'. There's probably no point sending this message again, it isn't going to work until something is fixed. That might mean it will work if, for example, the account is funded or card unlocked. But something needs to be done on behalf of the client.

5xx errors mean 'Server screwed up'. It probably _is_ worth having another attempt at sending the request. Maybe it was a temporary glitch, or maybe a new release of the server has to happen. Regardless, there was probably nothing inherently wrong with the actual request.
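
That distinction is also what generic retry logic keys off. A sketch (assumes a fetch-capable JS environment):

  // retry 5xx with backoff; treat 4xx as "fix the request, don't resend it"
  async function postWithRetry(url, body, attempts = 3) {
    for (let i = 0; i < attempts; i++) {
      const res = await fetch(url, { method: "POST", body: JSON.stringify(body) });
      if (res.status < 500) return res;                     // 2xx or 4xx: done, no retry
      await new Promise(r => setTimeout(r, 1000 * 2 ** i)); // back off, then retry the 5xx
    }
    throw new Error("still failing with 5xx after " + attempts + " attempts");
  }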


Using HTTP for RPC lends itself to this sort of endless discussion. Authorization is even worse.

You could argue either way. What's important is being consistent, at the very least throughout an API but preferably throughout the whole organization.

(Personally I'd probably lean towards a 4xx of some kind, just make sure it doesn't clash with something that you care about.)

Along the same lines, and arguably more important, is how to log the operations where a transaction completed successfully but with a negative answer. If you log expected negatives as errors you can get error blindness.


Stripe returns HTTP 402 ("Payment Required") when a card is declined.


I think the key is

> A "HTTP 400 bad request" is only the sort of thing you get to ignore when it's coming from the great unwashed masses who are lobbing random crazy crap at you, trying to break you. Internally, that's something entirely different.

If this is an internal-only API then most people look at 400s as their bug. Sounds like you are in the same boat.

If it's an external API, it's not their bug.


I always thought, if there’s a code that officially signals what we’re trying to signal, we should use it, and rely on it as “the signal”.

But at the end of the day, your app must target a subset of available APIs. If you use every available API the surface will be too big and you will have snowballing complexity. So if you are already using another API which can signal what you’re trying to signal then feel free to institute a “everything we can expect returns 200” policy. That kind of decision is critical for putting an upper bound on your system complexity.

I want to institute a “no default exports” policy in the JavaScript app I’m working on. It’s not that there aren’t places a default export makes sense—there are. It’s that prohibiting them will decrease the number of things our coders have to think about.


> I always thought, if there’s a code that officially signals what we’re trying to signal, we should use it, and rely on it as “the signal”.

This is a line I hear a lot in regards to this general situation, and the interesting part is that the 400 code really does not signal the thing the developer is trying to signal, not at all. But a lot of devs have a weird hang-up about this for some reason, and will go to any lengths to not send a 200 response. I blame the REST cultists.


Well, 4XX are for client errors, so if the client fails authing to the API, it would be appropriate there. If the API fails authenticating to an API further down the chain, that shouldn't be returned as a 4XX status code, because it could be confused with the immediate API's auth.

HTTP status codes can be used to good effect if done thoughtfully, because there's often already a lot of logic in the useragent to handle some cases (such as redirects, etc). Not using anything except for 200 is limiting. Using a bunch of status codes to refer to things that make no sense is confusing. Defining a sane standard for how your application will choose between them that allows for the benefits they offer without confusing clients as to what is actually being communicated is the best of both though.


For the HTTP API my company has been building, we're returning 400 status codes for most errors. It is intended as a strong suggestion that client code needs to handle errors rather than ignore them. However, we're aware that some client libraries are limited in their ability to handle status codes other than 2xx, so we allow an optional request parameter that asks the server to return 200 instead of 400 for most errors. As you said, smart people disagree about this, so we let them choose. I think we've hit a good balance.


That sounds like an excellent approach. If a consumer chooses "always 200 on error", how do you return the API response status? HTTP Response Header or in a custom value in the response body?


Thanks. Error responses from our API always generate a JSON object with top level error and error_description attributes, while non-errors never contain a top level error attribute. It might be interesting to add a response header to disambiguate even more.


I came up with a good analogy: imagine sending that card number via an HTTP form to an HTML target URL. You wouldn't return a 4xx code if a card is not found. The user would see your 404 page (or the browser's 404 page) and think the page doesn't exist! An API shouldn't be different.


I think the pain comes from not controlling the full stack. But managing the full stack is also painful. Having several abstraction layers will affect performance. Personally I agree that it's not a good idea to reuse (HTTP) error codes in your app layer à la REST.


IMHO returning status 400 is only reasonable if the request itself is malformed. A request to check whether a CC is valid does not become malformed depending on whether the CC is valid or not.


It sounds like a lot of the problems you describe come from using software that assumes normal “HTTP web server” applications when you really have a custom web application.


The one I've seen missed most often in startups is directly implied by a lot of the other points and obvious to anyone with long experience: Take the time and put a lot of thought into how to break up your big transitions into smaller stages, each of which are functional. It's usually possible to at least narrow down the risky parts to a few finer grained steps and when something fails only rolling back one part to get to a good state is almost always faster and safer.

It's very easy to get absorbed into the awareness of the high level change you're making and miss the details of the process. Even just sitting down together and outlining what you think is actually going to go on (and then breaking those down into what they each are comprised of) can make it really clear that you don't have to run as many giant risks. I'm occasionally amazed how brilliant people (including some with big names in devops) can forget it's an option.

It's like taking small steps from stable to stable when you're going across a steep scree slope and only jumping when you have to - sometimes it feels riskier to take lots of small steps, but if you start to slide it can be a lot easier to recover from. Your chance of dying taking a big leap isn't the sum of the equivalent small steps. Perhaps complex computer systems have the equivalent of an Angle of Repose?


On JSON:

"if you only need 53 bits of your 64 bit numbers"

JSON numbers are arbitrary precision.

"blowing CPU on ridiculously inefficient marshaling and unmarshaling steps"

On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.


> JSON numbers are arbitrary precision.

> On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.

What a pleasant surprise it will be for you when you find out that jq silently corrupts integers with more than 53 bits.
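
(jq, like JavaScript, keeps numbers as IEEE 754 doubles, so anything past 2^53 gets rounded. You can see the same thing in a JS console:)

  > JSON.parse('{"id": 9007199254740993}').id
  9007199254740992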


It's perhaps less a limitation of JSON and more a limitation of JavaScript itself[1].

But still, given that easy consumption from JavaScript is the ultimate primordial reason for choosing JSON over other formats, it seems like trying to transmit integers with more than 53 bits of precision over JSON is asking for trouble. Because it's only a matter of time until someone will want to do something like write a new service in Node, and the JavaScript parsers for other formats are at least somewhat more likely to guide people toward using BigInt for large integers.

[1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


In TC39, we specified BigInt to not participate in JSON by default, precisely because emitting it as a native JSON number would not be able to be read back by JSON.parse() and many other environments would also not be able to parse it without extra logic if they simply use IEEE754 double.

I explicitly asked for and achieved step 2. in the modified SerializeJSONProperty algorithm[1] so that users could decide and opt-in to serializing BigInts as strings if they so choose, with or without some sigil that could be interpreted by a reviver function. e.g.:

  > JSON.stringify(BigInt(1))
  TypeError: Do not know how to serialize a BigInt
  ...
  > BigInt.prototype.toJSON = function() { return this.toString(); }
  > JSON.stringify(BigInt(1))
  '"1"'
[1]: https://tc39.es/proposal-bigint/#sec-serializejsonproperty


Or they'd just wrap it as a string and move on with life, much as people do when they send UUIDs (which should only take 128 bits) as wide strings. The cases I've encountered that actually fundamentally required 64-bit ints are completely overshadowed by the cases where some well-intentioned developer decides to pack 64-bit ints with information to turn them into a UUID as well, so I really hope the consternation over this is from people who work a lot with mathematical apis/systems and not devs doing more heinous things with 64bit ints.


> I really hope the consternation over this is from people who work a lot with mathematical apis/systems and not devs doing more heinous things with 64bit ints.

No, it's typically IDs. And my observation is that inexperienced devs start out serializing 64bit IDs as ints and get burned first before they move to text.

There are several reasons why you end up with IDs > 2^53 even if 2^53 is good for enumerating ~1M things per earthling:

Many cloud providers will give you IDs that are specified to be 64 bit uints, and you'd better not corrupt some fraction of them. Generating unique integers strictly serially does not scale, so you might end up with something like Twitter's snowflake. Or you might want to tag additional info (reserve a few of the low bits for the type of id, or reserve particular ranges for particular types of accounts etc etc).

I can promise you I have run into the issue in real life more than once and it's not just javascript and jq that like to silently corrupt ids by truncating them to double precision. For example pandas is really great at it as well.

What makes it especially fun that often only a very small fraction of IDs will be affected (< ~1 in 2000 if uniformly distributed).


And this is why I say, if you don't do math on it, it's not a number! IDs are always strings. That we often choose to represent them more compactly as binary numbers is an implementation detail that should not be leaking out of your API.


You need some integer to make 53 bits though...


I don't understand why you are being downvoted. This struck me not as something that starts to "emerge after you've dealt with enough failures in a given domain" as the author claims, but more like a pet peeve.

The fact that JS has 53 bit precision will be a JS problem whether you use protobuf or anything else. On the other hand, if you are not using JS, it will almost certainly parse numbers of the precision that your language offers.


The original article's complaint wasn't (just) about 53 bits, but about using inefficient formats requiring constant serialization and deserialization for backend-to-backend communication.


Protobuf or Thrift would require serialization as well? I still wouldn't do that (or, in fact, I am currently not doing that) before I have an incentive, i.e. performance problems. Of course, my services are all Nodejs at the moment, so JSON feels more natural there than it does in other environments.


One potentially nice thing about formats like Cap'n Proto is that they don't eagerly parse the message. You can just lazily grab the values you care about, which can be a big performance win if you typically only care about a few fields out of a larger message.

That's impossible with text-based formats like XML and JSON, because everything is fractally variable-length.


It seems like it's more "problems you deal with when you work at scale", like the author has at Google & FB.

At smaller scales, I think the (machine) benefits of using something like protobuf don't nearly outweigh the human benefits of just using JSON.


I’m not sure how unpopular this opinion is, but I’d actually argue that there are underrated human benefits to using “just protobuf”, which are not shared by using “just JSON”.

1. Single source of truth for certain types, enums.

2. (assuming you’re using gRPC too) A separate, minimalist, and language-agnostic definition of your service interface, which you can document with comments to your heart’s content, and which (unlike other documentation) can NOT be out of sync with the actual service. I don’t have to read C++ to understand how your C++ service should work.

3. Protobufs encourage message type / enum reuse, by allowing you to import other definitions. This might seem trivial, but it’s super important in mediumish orgs that everyone is using the same definition of time, geography, etc. It all adds up to fewer surprises when you open up a new .proto file.

The kicker is not that you can’t somehow get these things with JSON-over-HTTP, too. It’s that protobufs-over-gRPC won’t work without them. The trade off is that you can’t inspect raw requests unless you have built some tooling around it.


I think the missing assumption is that doing "just protobuf" is not always a possibility when you ultimately have an open API facing the internet.

So now there is a point in your codebase where you are receiving data in JSON form. At this point, having protobuf elsewhere is not choosing between JSON and protobuf, it's choosing between "json+protobuf" and "json".


Not really. If you are sane you carefully parse all data from the internet and sanitize it first thing. json+protobuf is in fact an advantage, because by keeping your internal and external data formats different you eliminate one class of mistakes where external data makes it into internal systems. (This isn't perfect, you can change format without doing sanity checking, but it makes that harder to do accidentally, and easier to find in an audit.)


In addition to what my sibling comment says about "sanitizing" external data – which I wholeheartedly agree with – the "open API front-facing the internet" is in the process of changing, if you're talking about normal websites.

gRPC-web isn't quite feature complete relative to normal gRPC, but it is getting pretty close, and the gains of avoiding JSON (de-)serialization would be big. I think once the protobuf story has a complete chapter for the front-end, bigger engineering orgs will roll it out much like they're rolling out typescript today.

If you're talking about actual developer/public-facing APIs, those will probably remain in JSON land for a while.


#1 and #3 can be solved by using TypeScript types to define the schema of your JSON.

IMO they're readable enough to serve #2 too.

Before someone mentions "JSON schema" - I've tried to use them, and they're difficult to write and impossible to read. Reading/writing TypeScript types are a dream in comparison.
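
For example, a (made-up) message shape:

  // hypothetical JSON payload described as TypeScript types
  type OrderStatus = "pending" | "shipped" | "cancelled";

  interface Order {
    id: string;                                      // IDs as strings, not numbers
    status: OrderStatus;                             // single source of truth for the enum
    createdAt: string;                               // ISO 8601 timestamp
    lineItems: { sku: string; quantity: number }[];
  }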


I find that the human benefits of using something like protobuf outweigh the human benefits of using JSON. The machine benefits are just a cherry on top.

Protobuf - or, more specifically, proto files - gives you a central place where you can define and also document your formats. You can throw an ASCII-art UML sequence diagram in there, if you need to. And it's right there in the single file that everyone will use to communicate the protocol, and the protocol at least can't change in any structural way without editing that file, so it's got a much higher chance of being kept up-to-date, and of being read by the people who need to read it, than any of the available options for documenting JSON-based protocols.

All JSON gives you is human readability, and browsers can read it without a library. The 2nd, I don't care about with back-end services. The 1st I don't really care about at all, because command-line utilities and library functions for dumping protobuf datagrams to a text format are a dime a dozen.


You don't even have to get close to Google or FB before running into issues with JSON. At a previous job we were working on REST APIs that topped out at around 100qps. Performance wasn't what we wanted and we were initially pointing the finger at Python and the database. We started profiling the application and found that most time was spent in JSON serialisation - which is actually implemented in C. Unless you're serving the browser there's really no excuse for not using a binary format.


The excuse is that there are not enough seasoned software developers, and what used to be a relatively meritocratic discipline is now a shit show.

Problems started somewhere in early '00s when "software architect" became a dirty word, and one hit wonders like Paul Graham pontificated that "young is smarter".

And here you are.


I've never seen JSON used in a project of any size at all where we didn't end up trying to hack constraints and types on top of it with some ugly and gross thing or another—looking at you, JSON Schema.

The more parts of your stack you tighten up, the fewer errors you'll hit and the more flexibility you'll have to use less-strict tools when it really matters. That's true at all but maybe the smallest scale. Worrying about the costs of having to document what you intend in a way your machines can verify—I mean, shouldn't you be doing that anyway?—is baffling to me. You don't need anything like Google scale to see the benefits of it. It's basic communication as far as I'm concerned.


This is not a binary, though. E.g. Thrift has multiple transport/protocol settings, such that you can use JSON over HTTP for debug/testing purposes while still using a nice, compact binary representation over a raw TCP socket for production.


They both exist for a reason. JSON is really nice for your quick proof-of-concept where you have relatively few developers working on a small system. Up to a point, this also scales well.

On the other hand, if you have many developers, and the problem space starts converging toward scale, then certainly the problem space is different. Encoding of data cannot follow the language with the weakest type zoo, and efficiency starts to matter.

The hard part is to know when to make the switch, or when to anticipate growth in advance, such that you pick the right tool for the job.

It isn't just a question of the machine. For complex messaging, the added value of a well-defined message typing, provided by protobuf, will help. It will also remove a lot of problems if you have multiple different languages in the stack, talking to each other.


>JSON is really nice for your quick proof-of-concept where you have relatively few developers working on a small system.

This is really, really bad thinking and will always cause you a timebomb that will 100% explode on you in the long run.

"just a quick POC" will always end up being the actual product, and once you start down this path you end up with "let's just use mongo, it's schema-less, it'll be faster for devs and we can use a proper db later"... "let's just use nodejs for the POC, all the devs already know JS so it'll be great"... a few years later you end up with some gargantuan monstrosity that nobody wants to maintain, your "quick" language/db-du-jour is dead and unmaintained, you're EOL on four different software fronts and you now get to explain to the bosses why you need to rewrite the last four years of dev work.


Once you hit the wall with JSON what format do you use? At a past job we were stuck using SOAP/XML middleware. We hit walls but I’m not sure if a different format would work. We ended up rate limiting lower priority traffic to get past our issues.


Protobuf/gRPC.


CSV.

Just kidding, but I suspect it would be a lot faster.



That is not true of JSON numbers either practically or technically.

  This specification allows implementations to set limits on the range and precision of numbers accepted.
https://tools.ietf.org/html/rfc7159#section-6


Yes it is. All that part of the spec says is that it allows implementations to set/have limits on ranges and precision, which is obvious because for example javascript only has 53 bit numbers.

Other implementations can have other limits, but that does not mean that JSON itself or all implementations have the specific limits of javascript.


It is not the case that JSON numbers are "arbitrary precision". JSON numbers are in fact "implementation defined" precision. The technical and practical implication is exactly what was described in the article: Since precision is implementation defined (not uniform and with no lower-bound), JSON numbers make a poor vessel for inter-service communication.


Not anymore. Try writing 10000000000000000000000000000000000000000000000000000000000000000000000000000000n + 1n into your browser console.


Firefox:

    >JSON.stringify(1n)
    TypeError: BigInt value can't be serialized in JSON
Chrome:

    >JSON.stringify(1n)
    Uncaught TypeError: Do not know how to serialize a BigInt
    at JSON.stringify (<anonymous>)
    at <anonymous>:1:6
I don't think we're quite there yet.


It was counterpoint to "that does not mean that JSON itself or all implementations have the specific limits of javascript."

Javascript can do arbitrary precision integers now. So the last part is not true. You can generate your JSON strings with string concatenation in JS if you like.

Just because JSON.stringify has some limit doesn't mean JS has.
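
For instance:

  > const big = 10n ** 30n;
  > `{"value": ${big}}`
  '{"value": 1000000000000000000000000000000}'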


See my comment elsewhere in the thread that shows how to properly use BigInts with JSON.stringify, if you choose to support them in your environment.


Well yeah, those are bigints, but JSON does not support bigints, so the point still stands.


If by "JSON" you mean "JSON parsers in web browsers", then yes.


I'm assuming the percentage of problems where the difference between 64-bit and 53-bit precision matters is really, really small (and you should be paying attention then, because you're bound to lose precision in other parts and not only in JSON).

For everything else, it is really a non-issue.


Javascript cannot handle 64-bit unsigned integers and the largest exact integral value is 2^53-1, but yeah JSON itself has no such restrictions.


Javascript got BigInt:

   > 9007199254740991n * 9007199254740991n
   > 81129638414606663681390495662081n


You can also just dump them into strings :)


> On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.

Because learning how to use correct tool for the job is so difficult and so outside of dev/qa job descriptions, that it's much better to waste compute on your servers and deal with performance and scalability problems.


So the old "computer time is cheap, human time is not" is not true anymore?


I think that line is mostly propaganda from defenders of poor performing language runtimes. It’s not a dichotomy - both are important. Eg AWS bills can easily eclipse engineer salaries if you use it inefficiently.


> propaganda from defenders of poor performing language runtimes

Your comment is unnecessarily inflammatory.

> AWS bills can easily eclipse engineer salaries if you use it inefficiently.

If this happens, it is more due to architectural and algorithms problems than to the runtime of the language. There are many large applications written in Python, Ruby and PHP that run just fine.


Yes, sorry, it was a bit over the top. I've been burned a few too many times by people who use that argument to justify poor tech decisions.


It is. OTOH, "these tools have been sufficient ten years ago, so I'm sticking to them forever" leads to wasting human time (by ignoring newer, potentially more efficient, tools).


Strongly typed, schema'd data interchange leads to less human time in the long run.


Yep, you just have to put the time into ensuring that convenient <Serialisation X> utilities exist that offer the (near) same functionality of curl/jq etc.

It's a pain, but it's a short term pain, and necessary.


It is, but it's also true that spending a little bit human time on using the right tool for the job (and getting devs do their jobs) now might be cheaper in the long run that letting shoddy work slowly bleed your finances. Compute isn't free.


Depends on how many computers doesn’t it? It’s pretty easy for one developer to put code on many thousands of servers.


> On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.

Is your turnover so high that learning a single tool is a significant portion of employee costs?


Why not just use SOAP?

I think recent experience should tell us that the easiest thing plainly wins out. JSON can be terrible, Javascript can be terrible, but limitations on large integers isn't the worst hurdle to take.


I love item #2, as it talks about writing code that can safely handle "future" enums and values as a result of rolled back code. Maybe we should call it the Sarah Connor Pattern.

I haven't heard enough people discuss the deployment management of growing enums or state machine evolution. This is a problem more particular to software than hardware, as once hardware is shipped it's usually set in silicon, but growing of the state garden is an expectation in many software architectures.


One of the fun challenges I have in my current job is that we provide releases to customers according to the customers' schedules (which is related to needing hours of downtime because it's a creaky old system).

Some customers will skip releases altogether, making strategies like "add a new column, back-populate it online, then have the next release use the new value" impossible.

I guess that point is slightly moot when it'd take 2-3 releases to achieve the end goal and each release cycle is about a month.


You're not in a great situation, but there are still lessons from the post. For example, you could break upgrades into a series of changes that can be individually rolled back, and run smoke tests in between each one.

But your biggest reliability improvement would come from getting this system moved to continuous deployment without downtime. Now you can make one change at a time and roll back if it doesn't work.


If they're in anything like the last situation I was in that seems similar, continuous deployment may not be possible – not technically, but with respect to other considerations, e.g. management or even the business's finances.

There's still a lot of small software companies maintaining, and selling, on-premises installations of systems that use, e.g. Microsoft Access as a front-end client. Even then, continuous deployment is possible, and all-else-equal, a huge improvement for the developers and support staff, but also something that lots of management or owners may be (reasonably) averse to committing to implementing.


On the plus side for you, it seems you have each customer isolated from one another.


This is a great list! One thing I would add, once you have everything on this list, is a way to experiment on your changes. Instead of flipping a flag, seeing your error rate jump, and flipping it back, you run an experiment where you flip the flag on 0.1% of requests. Now you can compare this traffic to a control group, and aren't stuck wondering "did errors go up by 5% during our rollout because we broke things, or by chance?". If things look as expected at 0.1% you can ramp to 1%, then 10%, before releasing.


> using something with a solid schema that handles field names and types and gives you some way to find out when it's not even there [...] ex: protobuf

In proto3 all fields are optional, and have default values, so it becomes impossible to detect the absence of data unless you explicitly encode an empty/null state in your values.


This is true for primitives but not [completely] true for messages. Messages still have the "hasMessage" semantics. So if you truly need to differentiate between unset and default for primitives, you can box them in messages.
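
You can see the difference with one of the proto3 well-known types that ships in the Python protobuf runtime (google.protobuf.Api is just a convenient stock message that has both a scalar field and a message-typed field):

    from google.protobuf import api_pb2  # google.protobuf.Api, a stock proto3 message

    msg = api_pb2.Api()

    # Scalar fields have no presence: unset and "" are indistinguishable.
    print(msg.name)                        # "" (empty, or never set? can't tell)
    # msg.HasField("name")                 # raises ValueError for proto3 scalars

    # Message-typed fields do track presence.
    print(msg.HasField("source_context"))  # False
    msg.source_context.file_name = "a.proto"
    print(msg.HasField("source_context"))  # True

    # Hence the boxing trick: wrap scalars you care about in messages, e.g. the
    # standard wrappers (google.protobuf.StringValue, Int64Value, ...), or use
    # proto3's `optional` keyword in newer protobuf releases to get presence back.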


Go structs behave the same way. I wonder if one influenced the other.


> impossible to detect the absence of data

There are other ways:

- Check if the value is different from the default, e.g., empty string

- If your data is repeated, then check number of data elements != 0


> Check if the value is different from the default, e.g., empty string

which is a worse version of the GP's

> explicitly encode an empty/null state in your values


Ew, what was the reason for that?


Because in proto2 everyone used "optional" anyway.


I managed releases for one of Norway's largest hospitals, and when everything on the 'reliability list' is checked off and you have frequent releases, the real headache is 'cross-system rollback' between several systems/companies. Add to that that this is done with the whole hospital in emergency procedure...


Seems like a typical article from (I assume) a GAFA employee: good advice mixed with "how to be Google even if you don't need to" advice.


I don't know anything about databases. How do you roll back a significant schema change?


I believe the strategy described avoids needing to do that. You start by releasing a version of the software which can run with the old or new schema, then you apply the new schema, then you release a version of the software which actually uses the new schema. If you discover a problem at that point, you roll back to the previous version of the software, but leave the schema as it is. You then have time to figure out what to do, which may involve changing the schema again, but that will be as a forward change, rather than as a rollback.

Some migration tools do support rollback scripts for schema changes, but unless you're actually testing these before release (deploy the new version in staging, accumulate representative data in the new schema, roll back the schema, deploy an old version of the app, test that it is doing the right thing), then they aren't really something you can rely on in production.


A few strategies used by large companies, e.g. Facebook where the author worked for some time:

* Use external online schema change tooling which operates on a shadow table, so the tooling can be interrupted without affecting the original table. (Generally all of the open source MySQL online schema change tools work this way.)

* Use declarative schema management (e.g. tool operates on repo of CREATE TABLE statements), so that humans never need to bother writing "down" migrations. Want to roll something back? Use `git revert` to create a new commit restoring the CREATE TABLE to its previous state, and then push that out in your schema management system. (Full disclosure, I spend my time developing an open source system in this area, https://skeema.io)

* Ensure that your ORMs / data access layers don't interact with brand new columns until a code push occurs or a feature flag is flipped.


Skeema looks cool, and I've long been a fan of a similar tool that's specific to SQL Server – DB Ghost. I even implemented some pretty handy additional automation on top of it at various companies where I worked. There were, though, inevitably a small number of schema changes that needed to be encapsulated as 'migrations', and writing and testing them in a way that supported automatic rollbacks could be arbitrarily complex and difficult.


It really depends on the task, but the general pattern is to split it up into separate steps, each of which is low risk and/or easily reversible.

Let's say you want to add a new non-nullable foreign-key column to replace an old non-nullable foreign-key column that references a different table, one that's obsolete and needs to be deleted. (A sketch of the DDL side of this follows the list.)

1) update the code to be ok with a new nullable column. Rollback: deploy previous version of code.

2) create the new column in the DB with its desired constraint, but make it nullable. Rollback: delete the column.

3) have the code start populating the new column as well as the old. Rollback: deploy previous version of code.

4) start backfilling historical entries with the new column. Rollback: you can’t roll this back!

5) make new column non nullable. Rollback: make it nullable.

6) update code to read from new column, continuing to write to both. Rollback: deploy previous version of code.

7) make old column nullable. Rollback: you can’t roll this back!

8) stop writing to old column. Rollback: deploy previous version of code.

9) once you're satisfied the old column is no longer used and the version of code from the previous step will never be deployed again, drop the old column. Rollback: you can't roll this back!

10) rename the obsoleted table and see if anything breaks. Rollback: rename it back to its original name.

11) delete the obsoleted, renamed table. Rollback: you can't roll this back!
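
A sketch of just the DDL half of that sequence (table/column names are invented, DDL is Postgres-flavored, and the code-deploy steps in between aren't shown). Pairing every forward step with its rollback, or an explicit None, keeps the "you can't roll this back" points visible in review:

    # (description, forward DDL, rollback DDL or None where it's irreversible)
    MIGRATION_STEPS = [
        ("step 2: add the new FK column, nullable for now",
         "ALTER TABLE orders ADD COLUMN supplier_id BIGINT REFERENCES suppliers (id)",
         "ALTER TABLE orders DROP COLUMN supplier_id"),
        ("step 5: enforce NOT NULL once the backfill is done",
         "ALTER TABLE orders ALTER COLUMN supplier_id SET NOT NULL",
         "ALTER TABLE orders ALTER COLUMN supplier_id DROP NOT NULL"),
        ("step 7: relax the old column before code stops writing it",
         "ALTER TABLE orders ALTER COLUMN legacy_supplier_id DROP NOT NULL",
         None),
        ("step 9: drop the old column once nothing reads or writes it",
         "ALTER TABLE orders DROP COLUMN legacy_supplier_id",
         None),
    ]

    def apply(conn) -> None:
        """Run the forward steps over any DB-API connection, committing after each
        so a failure stops the rollout at a known boundary."""
        cur = conn.cursor()
        for description, forward, _rollback in MIGRATION_STEPS:
            print(f"applying {description}")
            cur.execute(forward)
            conn.commit()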


In your migration file(s) you have both forward and backward steps. If operations are reducers or otherwise destructive, the forward step backs that data up to separate "recovery" tables. The n+1 migration may delete recovery tables. That's how I do it most of the time. Sometimes with full snapshots. Ymmv


For a complex system, with many tables, rolling back a large change is very difficult.


>In this case, you need to make sure you can recognize the new value and not explode just from seeing it, then get that shipped everywhere. Then you have to be able to do something reasonable when it occurs, and get that shipped everywhere. Finally, you can flip whatever flag lets you start actually emitting that new value and see what happens.

Can't those first 2 steps be combined together? Why do they need to be shipped separately?


Depends on your update architecture. If you have a central backend service, but clients get scheduled patches or the user has to update manually, that might become a problem. There could also be the case that client and backend can be updated independently and one might not be done before the other. The issue might only exist for a few minutes, or an hour, but it would still have a huge impact on the software.


Sorry, I still don't understand. The post says you need to do

1. Have the code recognize the new value. Get that shipped everywhere.

2. Have the code do something reasonable when the new value appears. Get that shipped everywhere.

3. Start emitting the new value.

My proposal is

1. Have the code recognize the new value. Have the code do something reasonable when the new value appears. Get that shipped everywhere.

2. Start emitting the new value.

If the user has to manually update, and might refuse to do so, the problem exists under both the 3 step and the 2 step process. So nothing is gained by using the 3 step process.

If the client and backend are updated independently, that should still be fine with the 2 step process. The 2 step process says "Get that shipped everywhere", meaning shipped to both the client and the backend.


I would rephrase this as

1. Support the new value in your schema

2. Support the new value in the client

3. Emit the new value from the server

Combining 1 and 2 is probably possible, but not a great idea. Imagine if you end up rolling back 2 and 3, but there are still potentially new values in flight. If 1 is still there, you're good. If 1 and 2 are combined, you rollback all 3, but the new value is still in flight and your client crashes.


The advice seemed to be to, first, support (i.e. correctly ignore) any possibly new value in both the client and server (i.e. any client of the 'schema').


If 2. breaks when you do 3., you roll back to 1. If you rolled back to the one before 1, then the values that 3 emitted may still be in the system, causing it to break. When you do 3, you're really testing 2 and 1 is the fallback. If you did 1 and 2 together, then you can't roll back just 2 on its own.


The important part was somewhat implied in the OP.

You have to

* Ship the new version

* Wait until you are sure you don't need a rollback to an old version

* flip the switch

If you don't wait, you might need to roll back to an old version that doesn't support the new state, which then blows up.


I get that there need to be 2 rollouts. What I don't get is that the post says there need to be 3 rollouts.


Because if you do it in two rollouts, you don't find out that the first one is broken until you do the second one... which means the first isn't necessarily a safe rollback target for the second. But the one before the first isn't safe either once the second has been rolled out because it doesn't have any logic to not explode upon seeing new values.

The third can be a flag flip instead of a code push, but it still needs to be a discrete event to start generating the new value that triggers new behavior.


So you're saying that with the 3-step rollout, if rollout 3/3 goes bad you should roll back straight to 1/3? I guess that does make a bit of sense in the case of a database where the newly created values can stick around.

But also there could have been a bug in 1/3 that wasn't exposed until step 3/3. Meaning even rolling back to 1/3 won't solve the problem.

The 2 step rollout is safe if we assume there are no bugs in the code, but problematic if there are bugs. But the 3 step rollout is also problematic if there are bugs. I guess the 3 step rollout has the advantage that it partitions off a section that might be buggier than other sections so that it can be individually disabled. But that'll only help sometimes, and I'm not sure if the additional complexity is worth it.

Flag flip vs code push doesn't seem to make a difference to me. All 3 rollouts could be flag flips enabling code that was written much earlier but hidden behind flags.

Another problem is that doing a double rollback like that seems a bit risky to me, because other features that were being deployed simultaneously might not have been designed to handle a double rollback, only a single rollback, so they could break. If we want to allow double rollbacks, we must require that all development handle not just single rollbacks smoothly but double rollbacks too.


You shouldn’t be deploying other features simultaneously.


She has written many very interesting posts. Also one about "The One".


Hot patching is scary, but it makes you optimize for the right things: code that's easy to understand, easy to update, errors that are detectable early, and changes that are easy to recover from or roll back. And it makes you think and understand before writing the actual code.


I love this and it makes me feel both excited and scared since I am suddenly realizing the ways my org is not in compliance with this good advice.

Are there any good books that are full of more rules of thumb like these?


The Google SRE book (free online or available as dead tree):

https://landing.google.com/sre/books/


What is the load-balancing story for the RPC services thus recommended? It was completely glossed over as if it were not even relevant; I know gRPC uses HTTP/2 and persists connections, so it's not as simple as throwing a proxy in front.

That seems like a non-trivial point of friction when it comes to "just using solid storage/RPC formats" or whatever.


For gRPC there is Envoy, but I have not used it so I can’t give you more info on it.

However, I think the point in the article is that you need to have well-defined schemas for inter-service messaging. Something like protobuf or thrift or flatbuffers. Whether you layer gRPC on top of that is a separate concern. For example, I have used Protobufs extensively at work but never gRPC, since we mostly have point-to-point connections. We check the message schemas into their own repo and all users across the company pull from it. We have Python and C++ codebases sharing the same schemas; it's quite wonderful.


I think something that's missing from this list is a really good and consistent QA cycle. You need to have rollbacks, I agree, but even better is when bugs can't even make it into your production build. Having automated testing, (actually correctly done) code reviews, and quality gates in place can save you a lot of time rolling back your code. Catch the bug before it goes live.


But you will never catch them all.


Sure, it's just another net the bug has to fall through though. And one that catches a lot of errors, at least for us.

The biggest impacts for our code health are

- Typescript

- Automated Testing through Gitlab CI


Of course; your comment just sounded like, if you had good QA, rollbacks would not be needed.


Oh no, that wasn't at all what I meant. I just wanted to mention another tool that can reduce issues with your production code. Rollbacks need to be there in any case.



