
The same conversation at Netflix 10 years ago:

I want to serve 5TB of data.

Ok, spin up an instance in AWS and put it there.

I want it production ready.

Ok, replicate it to a second instance. If it breaks we'll page you to fix it.

The funny thing is, for important stuff, we ended up doing similar things to what you see in this video, but for unimportant things, we didn't. I think it was a better system, and it was amusing when we hired people from Google who were confused by the lack of process and approvals.



> I want to serve 5TB of data. Ok, spin up an instance in AWS and put it there... it was amusing when we hired people from Google who were confused by the lack of process and approvals.

Quoting from "Velocity in Software Engineering" (https://queue.acm.org/detail.cfm?id=3352692):

In 2003, at a time in Amazon's history when we were particularly frustrated by our speed of software engineering, we turned to Matt Round, an engineering leader who was a most interesting squeaky wheel in that his team appeared to get more done than any other, yet he remained deeply impatient and complained loudly and with great clarity about how hard it was to get anything done. He wrote a six-pager that had a great hook in the first paragraph: "To many of us Amazon feels more like a tectonic plate than an F-16."

Matt's paper had many recommendations... including the maximization of autonomy for teams and for the services operated by those teams by the adoption of REST-style interfaces, platform standardization, removal of roadblocks or gatekeepers (high-friction bureaucracy), and continuous deployment of isolated components. He also called... for an enduring performance indicator based on the percentage of their time that software engineers spent building rather than doing other tasks. Builders want to build, and Matt's timely recommendations influenced the forging of Amazon's technology brand as "the best place where builders can build."

...leading up to the creation of AWS.


The "approval paralysis" thing happens at a lot of companies, large and small, not just GiantTech. It creeps up on you slowly: 1. A big problem happens that gains the attention of leadership. 2. The problem is root-caused to some risky thing an employee did trying to accomplish XYZ. 3. To correct this, a process is put in place that must be followed when one wants to do XYZ, and (critically) gatekeepers are anointed who must approve the activity. 4. These gatekeepers are inevitably senior already-busy people who become bottlenecks. Now we can't do this critical thing without hounding approvers. 5. Some other big problem happens and the above cycle starts all over again.

Before you know it, every even slightly risky task you need to do in the course of your job requires the blessing of approvers who are well-intentioned, but so overloaded they don't even answer their email or chats. They sometimes need to be physically grabbed in the hallway in order to unblock your project. Progress grinds to a halt, and production problems still haven't stopped -- only those particular classes of problems that the approval processes caught.

EDIT: Not sure what the right solution is, but it must be one that doesn't rely on a particular overloaded human doing something. Maybe an automated approval system that produces a paper trail (to help with postmortems and corrective action later), combined with making sure all changes can be rolled back effortlessly. Easier said than done, obviously.


I've read on HN that "processes are organizational scar tissue"; I think it applies here.


Yep. A wise engineer once told me: "Runbooks [written SOPs] are just solving bugs with people instead of code."


That's an excellent phrase. It reminds me of the navy saying "regulations are written in blood"


It's actually super related, given that (at least in the medical software sector) you won't get anything approved by the FDA before spelling out the entire software development operation in processes.


Makes sense in the medical domain where people are way more likely to be injured or die.

But most software isn't that critical.


What is the solution?

I've worked at big companies that are mired in process because they would rather spend more time dotting i's and crossing t's than risk breaking something. I can see why.

And I've worked at smaller companies where the clients are small and it's easy to fix things that break. Move fast and break things at a small scale maybe.

But how do you grow to be a big company and still operate like a small company? I can't seem to see an answer.


Self-service approvals.

Instead of appointing a senior eng to be the approver, task that same senior eng with writing down their decision criteria (as text, or where it makes sense, even as code).

This has advantages for everyone:

1. It lets the engineers who need approval move at their own speed and plan time for it as a predictable work item like any other, instead of depending on an approver for whom the approval is usually a lower priority that arrives mid-sprint.

2. For the approval policy writer, it turns this into a one-time effort with a defined scope that can be planned and prioritized in their own backlog, instead of open-ended toil that can come at any time, take any amount of time, and not clearly relate to their own current priorities.

3. For the company, writing down the policy brings consistent decision making.

Obviously this requires trust that employees can and will say "no, can't do" when they're tasked with something that is not approvable, which can be culturally difficult (business and otherwise). Checklists (literally a list of checkboxes to click on, "I confirm that...") can help with this.

(As an example of writing down the policy as code: that's any CI/CD pipeline. But it's not limited to engineering decision making - for example, we're using a well-known open source license management tool that promises auto-approval for open source library use depending on policies configured by legal. This works only moderately well because this particular tool is not great, but the idea is sound. We still made it work: legal wrote down their policies, trained a large number of engineers on them, and those engineers are now empowered to make the decisions.)
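To make that concrete, here's a minimal sketch of "decision criteria written down as code" - not any particular company's tooling, and the request fields and thresholds are invented:

    # Hypothetical self-service approval check; policy fields are made up.
    from dataclasses import dataclass

    @dataclass
    class DeployRequest:
        service: str
        replicas: int
        has_rollback_plan: bool
        load_tested: bool
        data_classification: str  # e.g. "public", "internal", "pii"

    # The senior eng writes this once instead of reviewing every request.
    POLICY = {"min_replicas": 2, "auto_approvable_data": {"public", "internal"}}

    def evaluate(req: DeployRequest) -> tuple[bool, list[str]]:
        """Return (approved, reasons); the reasons double as the paper trail."""
        reasons = []
        if req.replicas < POLICY["min_replicas"]:
            reasons.append(f"needs >= {POLICY['min_replicas']} replicas")
        if not req.has_rollback_plan:
            reasons.append("no rollback plan")
        if not req.load_tested:
            reasons.append("not load tested")
        if req.data_classification not in POLICY["auto_approvable_data"]:
            reasons.append("data classification requires human review")
        return (not reasons, reasons)

    approved, reasons = evaluate(DeployRequest(
        "cat-pictures", replicas=2, has_rollback_plan=True,
        load_tested=True, data_classification="public"))
    print("approved" if approved else f"blocked: {reasons}")

Every rejection explains itself, so the requester can fix and resubmit without waiting on anyone, and the recorded reasons give you the audit trail for later.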


There are many, but the problems are more subtle than this video really gives credit for.

I worked at Google at the time this video was made, and empathized (in fact I had been an SRE for years by that point). Nonetheless, there are flip sides that the video maker obviously didn't consider.

Firstly, why did everything at Google have to be replicated up the wazoo? Why was so much time spent talking about PCRs? The reason is, Google had consciously established a culture up front in which individual clusters were considered "unreliable" and everyone had to engineer around that. This was a move specifically intended to increase the velocity of the datacenter engineering groups, by ensuring they did not have to get a billion approvals to do changes.

Consider how slow it'd be to get approval from every user of a Google cluster, let alone an entire datacenter, to take things offline for maintenance. These things had tens of thousands of machines per cluster, and that was over a decade ago. They'd easily be running hundreds of thousands of processes, managed by dozens of different groups. Getting them all to synchronize and approve things would be impossible. So Google said: no approvals are necessary. If the SRE/NetOps/HWOPS teams want to take a cluster or even an entire datacenter offline, they simply announce they're going to do it in advance, and everyone else has to just handle it.

This was fantastic for Google's datacenter tech velocity. They had incredibly advanced facilities years ahead of anyone else, partly due to the frenetic pace of upgrades this system allowed them to achieve. The downside: software engineers have to run their services in >1 cluster, unless they're willing to tolerate downtime.
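(To make the cost concrete for the service owner: under the "any single cluster may go away for maintenance" rule, even a trivial read-only service ends up carrying failover logic along these lines - a generic sketch, not Google's actual stack, and the endpoints are placeholders.)

    # Generic sketch: serve from >1 cluster when any one of them may be taken
    # down for maintenance without asking you first. Endpoints are placeholders.
    import urllib.error
    import urllib.request

    REPLICA_ENDPOINTS = [
        "https://cluster-a.example.com/blob/my-dataset",
        "https://cluster-b.example.com/blob/my-dataset",
    ]

    def fetch_with_failover(path: str, timeout_s: float = 2.0) -> bytes:
        """Try each replica in turn; a cluster down for maintenance just
        looks like one more failed attempt."""
        last_error = None
        for endpoint in REPLICA_ENDPOINTS:
            try:
                with urllib.request.urlopen(endpoint + path, timeout=timeout_s) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError) as err:
                last_error = err  # this replica is unavailable; try the next one
        raise RuntimeError(f"all replicas unavailable: {last_error}")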

Secondly, why couldn't cat woman just run a single replica and accept some downtime? Mostly because Google had a brand to maintain. When she "just" wanted to serve 5TB, that wasn't really true. She "just" wanted to do it under the Google brand, advertised as a Google service, with all the benefits that brought her. One of the aspects of that brand that we take for granted is Google's insane levels of reliability. Nobody, and I mean nobody, spends serious time planning for "what if Google is down", even though massive companies routinely outsource all their corporate email and other critical infrastructure to them.

Now imagine how hard it'd be to maintain that brand if random services kept going offline for long periods without Google employees even noticing? They could say, sure, this particular service just wasn't important enough for us to replicate or monitor and the DC is under maintenance, we think it'll be back in 3 days, sorry. But customers and users would freak out, and rightly so. How on earth could they guess what Google would or would not find worthy of proper production quality? That would be opaque to them, yet Google has thousands of services. It'd destroy the brand to have some parts that are reliable and others not according to basically random factors nobody outside the firm can understand. The only solution is to ensure every externally visible service is reliable to the same very high degree.

Indeed, "don't trust that service because Google might kill it" is one of the worst problems the brand has, and that's partly due to efforts to avoid corporate slowdown and launch bureaucracy. Google signed off on a lot of dud uncompetitive services that had serious problems, specifically because they hated the idea of becoming a slow moving behemoth that couldn't innovate. Yet it trashed their brand in the end.

A lot of corporate process engineering is like this. It often boils down to tradeoffs consciously made by executives that the individual employee may not care about, value, or even know about, but which are good for the group as a whole. Was Google wrong to take an unreliable-DC-but-reliable-services approach? I don't know, but I really doubt it. Most of the stuff that SWEs were super impatient to launch and got bitchy about bureaucracy over wasn't actually world-changing stuff, and a lot of it ended up not achieving any kind of escape velocity.


This is a great explanation, thank you.

(I've never worked at Google, and maybe this isn't a problem anymore, but) it seems like the "solution" here would be to do for infra what Go did for concurrency: build an abstraction with sane defaults, and rubber-stamp anything that doesn't stray from those defaults. Anything that does requires further scrutiny.

For example, at the companies where I've been responsible for infrastructure (admittedly much smaller than Google), I've done exactly that (with Kubernetes-specific things like PodDisruptionBudgets and defaulting to 2 replicas), and if users stick to the default Helm chart values, they can ship their service by themselves.
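Something like this, say - a hypothetical "rubber stamp" check that auto-approves anything sticking to the blessed defaults and flags the rest for review (the keys and defaults are invented, not a real chart's schema):

    # Hypothetical check over already-rendered chart values; keys are made up.
    BLESSED_DEFAULTS = {
        "replicaCount": 2,
        "podDisruptionBudget.enabled": True,
        "resources.limits.memory": "512Mi",
    }

    def deviations_from_defaults(values: dict) -> list[str]:
        """Empty list means auto-approve; otherwise a human takes a look."""
        out = []
        for key, default in BLESSED_DEFAULTS.items():
            if values.get(key, default) != default:
                out.append(f"{key}={values[key]!r} (default {default!r})")
        return out

    devs = deviations_from_defaults({"replicaCount": 1})
    print("auto-approved" if not devs else f"needs review: {devs}")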


They did a lot of stuff like that, but the work to launch a new service wasn't only technical, and some of the non-automated work was there partly to enforce checkpoints on the other stuff. For example, getting your service mapped through the load balancers required you to prove you'd been approved for launch by executives, so the process required filing a ticket. It's probably all different now though.

I should also note that "launch" in Google-speak means "make visible to the world". If you only wanted your service to be available to Googlers, it was dramatically easier, and the infrastructure was entirely automatic, with zero approvals easily possible.


Trust your people! Otherwise don't hire them in the first place.

You want to deploy something to prod? Okay, either call an API or fill out a webform (or message a Slack bot idk) - but the contents will be a checklist of stuff you need to have done.

1. Did you [load test, integration test, whatever]

2. Did your local architect look over and approve the high-level design? (NB: notice how we aren't requiring an architect to sign off, we trust the developer. Because if they lie they're fired lol.)

3. Other stuff. Maybe some taxonomic stuff like tags associated with deployed infrastructure? Swagger endpoints? Go nuts as long as it's stuff actually needed by central planning - the documentation and paper trail aspect is covered here

This is picked up to be ingested into databases, wikis, emails, wherever.
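(Roughly, the submitted checklist could just be a structured record - the fields and validation below are hypothetical - so the same object gets checked at submit time and archived as the paper trail.)

    # Hypothetical shape of the self-attested checklist: submitted via API,
    # webform or Slack bot, validated, then persisted as the paper trail.
    import json
    from dataclasses import dataclass, asdict, field
    from datetime import datetime, timezone

    @dataclass
    class DeploymentChecklist:
        service: str
        submitted_by: str
        load_tested: bool
        integration_tested: bool
        architect_reviewed_design: bool  # trusted self-attestation, not a sign-off
        infra_tags: dict = field(default_factory=dict)
        submitted_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def submit(checklist: DeploymentChecklist) -> bool:
        unchecked = [name for name in
                     ("load_tested", "integration_tested", "architect_reviewed_design")
                     if not getattr(checklist, name)]
        if unchecked:
            print(f"blocked, unchecked items: {unchecked}")
            return False
        print("recorded:", json.dumps(asdict(checklist)))  # stand-in for real ingestion
        return True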

Compare that with my last large corp, where we had change approval boards: 25+ people sat in a long meeting, essentially just asked whether you'd done the above, and you'd then be greenlit to go to prod (at the time, deployments were pretty manual and required scheduling as well, which is obviously suboptimal). I'm just about to move from a small consulting company to a startup/scaleup, so it's going to be interesting to see how things work there...


Autonomy.

A solution to such org woes is discussed, in part, by Clayton Christensen in his work The Innovator's Solution (http://web.mit.edu/6.933/www/Fall2000/teradyne/clay.html): even after correctly identifying potentially disruptive technologies, firms still must circumvent their own hierarchy and bureaucracy, which can stifle the free pursuit of creative ideas. Christensen suggests that firms need to give experimental groups within the company a freer rein. "With a few exceptions, the only instances in which mainstream firms have successfully established a timely position in a disruptive technology were those in which the firms' managers set up an autonomous organization charged with building a new and independent business around the disruptive technology." This autonomous organization is then able to choose the customers it answers to, how much profit it needs to make, and how to run its business.

---

Amazon and Cloudflare are good examples of big orgs trying their best to implement the late Prof. Christensen's ideas.

Andy Jassy on Amazon's approach to innovation (https://www.hbs.edu/forum-for-growth-and-innovation/podcasts...): "And then, if we like the answers to those first four elements, we ask: can we put a group of single-threaded, focused people on this initiative, even if it seems like we're overwhelming it with strong senior people? If you ask really busy people to do both the existing business and the big new idea, they will always favor the existing business because it's a surer bet. So we want to peel people away from the existing business and put them just on the new initiative."

Pace of innovation at Cloudflare https://blog.cloudflare.com/the-secret-to-cloudflare-pace-of...: ...it is not unusual for an initial product idea to start with a team small enough to split a pack of Twinkies and for the initial proof of concept to go from whiteboard to rolled out in days. We intentionally staff and structure our teams and our backlogs so that we have flexibility to pivot and innovate. Our Emerging Technology and Incubation team is a small group of product managers and engineers solely dedicated to exploring new products for new markets. Our Research team is dedicated to thinking deeply and partnering with organizations across the globe to define new standards and new ways to tackle some of the hardest challenges.

---

Also read: Clayton Christensen and Stephen Kaufman on "Resources, Process, and Priorities": https://personal.utdallas.edu/~chasteen/Christensen%20-%202n...


In the change management literature, they argue that companies tend to purposely slow down change over time to become more predictable and to lock in on the "successful route". That certainly mirrors my experience. The only thing I don't understand is why you hire so many people when you let a handful of people gate everything. You might just as well fire 80% of the workforce.


You cannot have both an organization that fastidiously protects the privacy and security of user data, and one that requires no process to build and launch software. It's just not possible.

Anyway the video is just a joke. I've never worked anywhere where it was as easy to just serve 5TB of static data as at Google. Googlers who want to just host junk under their own authority do not need to shop for quota, set up borgmon, etc.


Right, and looking back, they're setting up a production, user-facing service. If I just want to store a 5TB blob somewhere, I think that fits in freebie CNS, so I don't even have to provision resources; I just cat the file or whatever (granted, 5TB was a bit bigger 10 years ago).

Having a rule that "your user-facing service needs to be replicated" is a good rule. Replication being difficult was the problem.


Automate as much as possible. Approval gates are there to prevent obvious issues from continuing down the pipeline. If you can automate checks for the known issues you want to prevent, you should be able to add them as a test step. Then, on failure, log why it failed and point the dev at the documentation.

Manual processes suck for everyone involved.
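As a tiny sketch of that automated-check idea (the checks and documentation URLs below are made up): each known issue becomes a pipeline step that fails fast, logs why, and points at the docs instead of routing through a human approver.

    # Made-up automated gate: checks and doc URLs are illustrative only.
    import logging
    import sys

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("deploy-checks")

    def check_replicas(config: dict) -> tuple[bool, str]:
        return (config.get("replicas", 0) >= 2,
                "single replica; see https://wiki.example.com/why-two-replicas")

    def check_rollback(config: dict) -> tuple[bool, str]:
        return (bool(config.get("rollback_command")),
                "no rollback command; see https://wiki.example.com/rollbacks")

    def run_checks(config: dict) -> int:
        failures = 0
        for check in (check_replicas, check_rollback):
            ok, hint = check(config)
            if not ok:
                failures += 1
                log.error("%s failed: %s", check.__name__, hint)
        return failures

    if __name__ == "__main__":
        sys.exit(1 if run_checks({"replicas": 1}) else 0)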


> we turned to Matt Round, an engineering leader who was a most interesting squeaky wheel in that his team appeared to get more done than any other

Matt went on to study theology, and he's started a church community in Scotland: https://www.linkedin.com/in/mattround/

"Leader Company Name: Hope City Church Edinburgh Dates Employed: Sep 2017 – Present Driving a new church start-up."


What does "platform standardization" mean in this context?



