#4 I work at Facebook. I worked at Twitter. I worked at CloudFlare. The answer i...

fossuser · on July 11, 2019

Yep, this also matches what I've heard through the grapevine.

Pushing bad regex to production, chaos monkey code causing cascading network failure, etc.

They're just different accidents for different reasons. Maybe it's summer and people are taking more vacation?

degenerate · on July 11, 2019

I actually like the summer vacation hypothesis. Makes the most sense to me - backup devs handling some things they are not used to.

Balgair · on July 11, 2019

So, a reverse Eternal-September? It'll get better once everyone is back from August vacations?

uber-employee · on July 12, 2019

No, because it’ll only get better until next summer.

Avamander · on July 11, 2019

These outages mean that software only gets more ~fool~ summer employee proof.

kenhwang · on July 11, 2019

I'm more partial to the summer interns hypothesis.

jmharvey · on July 11, 2019

I agree with this, but to be clear, the "summer interns hypothesis" is not "summer interns go around breaking stuff," it's "the existing population of engineers has finite resources, and when the interns and new grads show up, a lot of those resources go toward onboarding/mentoring the new people, so other stuff gets less attention."

gowld · on July 11, 2019

Pretending that junior engineers is the problem, is the problem.

rapind · on July 11, 2019

Just checking what your objection is. Is it that you think experience is overrated, or is it just that he was speculating without any evidence?

dan-0 · on July 11, 2019

Can't speak for OP, but I can tell you what mine is.

If you have an intern or a Junior Engineer, they should have a more senior engineer to monitor and mentor them.

In the situation where a Junior Engineer gets blamed for a screw up:

1. The Senior Engineer failed in their responsibility. 2. The Senior Engineer failed in their responsibility.

A Junior Engineer should be expected to write bad code, but not put it into production, that's on the Senior. If I hit approve on a Junior Engineer's PR, it's my fault if their code brings the whole system down. If a Junior Engineer had the ability to push code without a review, it's my fault for allowing that. Either way it's my fault and it shouldn't be any other way. It's a failure to properly mentor. Not saying it doesn't happen, just that it's never the Junior Engineers fault when it does.

deathanatos · on July 12, 2019

I'd caveat that slightly: only if the senior engineer is not also overburdened with other responsibilities, and the team has the capacity to take on the intern in the first place. I've been on teams where I felt like we desperately needed more FTEs, not interns. But we could hire interns, and not FTEs.

(I agree with the premise that an intern or junior eng is supposed to be mentored, and their mistakes caught. How else should they learn?)

tiredyam · on July 12, 2019

the amount of time that the summer intern / new grad eat up of seniors time is the problem. Tech debt that does not get addressed in a timely manner because of mentorship responsibilities is the problems

echelon · on July 11, 2019

If you don't train new and capable engineers, you'll eventually lose talent due to attrition and retirement. Talent can be grown in-house; engineering companies are much better environments than universities to learn how to build scalable platforms. The cost of acquisition is low, too, because junior engineers can still make valuable contributions while they learn to scale their impact.

melq · on July 12, 2019

If interns are able to take down your infrastructure, then it is the fault of the senior engineers who have designed it in a way that would allow that to happen.

bobthepanda · on July 11, 2019

Rule one of having interns and retaining your sanity is that interns get their own branch to muck around in.

jrockway · on July 11, 2019

Rule one of having a useful intern experience is to get them writing production code as quickly as possible. They check in their first change? Get that thing into production immediately. (If it's going to destabilize the system, why did you approve the CL? You two probably pair programmed the whole thing together.)

HeWhoLurksLate · on July 12, 2019

I completely agree- even if it's something small.

I'm an intern in a big company with an internal robotics and automation group, and I recently got to wire up a pretty basic control panel, install it, and watch workers use it. That was so cool, and made me appreciate what I was doing a lot more.

kenhwang · on July 11, 2019

Sure. The interns have their own branch, but it doesn't stop them from being disruptive to the human in charge of mentoring them.

vorticalbox · on July 11, 2019

All changes should be in a new branch.

devin · on July 12, 2019

I used to believe this. Having solid lower environments which are identical to production, receiving live traffic where engineers can stage changes and promote up removes some of the “all things should live on a branch” business. I know that sounds crazy, but it is possible for teams of the right size to go crazy on master as long as the safety nets and exposure to reality are high enough in lower environments.

kdelok · on July 12, 2019

I recall someone saying that holiday periods actually had better reliability for their services, because fewer people were pushing breaking changes...

I do wonder if it's that the usual maintainers of particular bits and pieces are on vacation and so others are having to step in and they're less familiar or spread too thin.

cmroanirgo · on July 12, 2019

Yes, but it always seems to come down to a very small change with far reaching consequences. For this ongoing twitter outage, it's due to an "internal configuration change"... and yet the change has wide reaching consequences.

It seems that something is being lost over time. In the old days of running on bare metal, yes servers failed for various reasons, then we added resiliency techniques whose sole purpose was to alleviate downtime. Now we're at highly complex distributed systems that have failed to keep the resiliency up there.

But the fact that all the mega-corps have had these issues seems to indicate a systemic problem rather than unconnected ones.

Perhaps a connection is the management techniques or HR hiring practices? Perhaps it's due to high turnover causing the issue? (Not that I know, of course, just throwing it out there). That is, are the people well looked after and know the systems that are being maintained? Even yourself who's 'been around the traps' with high profile companies: you have moved around a lot... Were you unhappy with those companies that caused you to move on? We've seen multiple stories here on HN about how those people in the 'maintenance' role get overlooked for promotions, etc. Is this why you move around? So, perhaps the problem is systemic and it's due to management who've got the wrong set of metrics in their spreadsheets, and aren't measuring maintenance properly?

mlinsey · on July 12, 2019

I remember all these services being far less reliable in the past. The irony of us talking about the bygone era of stability in the context of Twitter is particularly hilarious.

I do think that internet services in general are much more mission critical, and the rate of improvement hasn’t necessarily kept up. It used to be not particularly newsworthy if an AWS EBS outage took out half the consumer internet several times per year, or if Google’s index silently didn’t update for a month, or when AOL (when they were by far largest ISP in the US) was down nationwide for 19 hours, or the second-biggest messaging app in the world went down for seven days.

selestify · on July 12, 2019

Which app was down for 7 days?

bdd · on July 12, 2019

I don't see the value in lamenting the old days of a few machines where you could actually name them as Middle Earth characters, install individually, log in to one single machine to debug a site issue. The problems were smaller and individual server capacity in respect to demand was in meaningful fractions. Now the demand is so high and set of functions these big companies need to offer are so large, it's unrealistic to expect solutions that doesn't require distributed computing. It comes with "necessary evils", like but not limited to configuration management--i.e. ability to push configuration, near real time, without redeploying and restarting--, and service discovery--i.e. turning logical service names to a set of actual network and transport layer addresses, optionally with RPC protocol specifics. I refer to them as necessary evils because the logical system image of these are in fact single points of failures. Isn't it paradoxical? Not really. We then work on making these systems more resilient to the very nature of distributed systems, machine errors. Then again, we're intentionally building very powerful tools that can also enable us to take everything down with very little effort because they're all mighty powerful. Like the SPoF line above, isn't it paradoxical? Not really :) We then work on making these more resilient to human errors. We work on better developer/operator experience. Think about automated canarying of configuration, availability aware service discovery systems, simulating impact before committing these real time changes, etc. It's a lot of work and absolutely not a "solved problem" in a way single solution will work for any scale operation. We may be great at building sharp tools but we still suck at ergonomics. When I was at Twitter, a common knee-jerk comment at HN was "WTF? Why do they need 3000 engineers. I wrote a Twitter clone over the weekend". A sizable chunk of that many people work on tooling. It's hard.

You're pondering if hiring practices and turnover might be related? The answer is an absolute yes. On the other hand, these are the realities of life in large tech companies. Hiring practices change over years because there's a limited supply of of candidates experienced in such large reliability operations and industry doesn't mint many of them either. We hire people from all backgrounds and work hard on turning them to SREs or PEs. It's great for the much needed diversity (race, gender, background, everything) and I'm certain the results will be terrific but we need many more years of progress to declare success and pose in front of a mission accomplished banner on an aircraft carrier ;)

You are also wisely questioning if turnover might be contributing to these outages and prolonged recovery times. Without a single doubt, again the answer is yes but it's not the root cause. Similar to how hiring changes as company grows, tactics for handling turnover has to change too. It's not like people leave the company, but within the same company they move on and work on something else. The onus is on everyone, not just managers, directors, VPs to make sure we're building things where ownership transfer us 1) possible 2) relatively easy. This in mind, veterans in these companies approach code reviews differently. If you have tooling to remove the duty of nitpicking about frigging coding style, and applying lints, then humans can indeed give actually important feedback on complexity of operations, self describing nature of code, or even committing things along with changes to operations manual living in the same repo.

I think you're spot on with your questions but what I'm trying to say with this many words and examples is, nothing alone is the sole perpetrator of outages. A lot of issues come together and brew over time. Good news, we're getting better.

Why did I move around? Change is what makes life bearable. Joining Twitter was among the best decisions in my career. Learned a lot, made lifelong friends. They started leaving because they were yearning a change Twitter couldn't offer. I wasn't any different. Facebook was a new challenge, I met people I'd love to work with and decided give it a try. I truly enjoy life there even though I'm working on higher stress stuff. Facebook is a great place to work but I'm sure I can't convince even %1 of HN user base, so please save your keyboards' remaining butterfly switch lifetime, don't reply to tell me how much my employer sucks :) I really hope you do enjoy your startup jobs (I guess?) as much as I do my big company one.

eecc · on July 12, 2019

Not sure where you’re going, but my take is that yes, the times for calling servers individually are over.

But we’re still touching the belly of our distributed systems with very pointed tools as part of the daily workflow. That’s how accidents happen.

The analogy is clear IMHO; just as we’ve long stopped fiddling daily with the DRAM timings and clock multipliers of the Galadriel and Mordor servers, we should consider abstaining from low level “jumper switching” on distributed systems.

Of course, this also happened thanks to industry introducing PCI and automated handshaking...

wbl · on July 12, 2019

Those days of yore are when computers did things and we wrote programs that satisfied immediate needs. There was also a social element to it when there were multiple users per machine.

thrwayxyz · on July 12, 2019

[flagged]

dang · on July 12, 2019

https://news.ycombinator.com/item?id=20199143

mastratton3 · on July 11, 2019

lol yes, whats the quote on "Don't assume bad intention when incompetence is to blame"?

After seeing how people write code in the real world, I'm actually surprised there aren't more outages.

jethro_tell · on July 11, 2019

Well we have an entire profession of SRE/Systems Eng roles out there that are mostly based on limiting impact for bad code. Some of the places I've worked with the worst code/stacks had the best safety nets. I spent a while shaking my head wondering how this shit ran without an outage for so long until I realized that there was a lot of code and process involved in keeping the dumpster fire in the dumpster.

devin · on July 12, 2019

Which do you prefer? Some of the best stacks and code I’ve worked in wound up with stability issues that were a long series of changes that weren’t simple to rework. By contrast, I’ve worked in messy code, complex stacks, that gave great feedback. In the end, the answer is I want both, but I actually sort of prefer “messy” with well thought out safety nets to beautiful code and elegant design with none.

jethro_tell · on July 12, 2019

One thing that stands out from both types of stacks that I've worked with, is that most of the time, doing things simply the first time without putting in a lot of work to guess what other complications will arise later tends to produce a stack with a higher uptime even if the code gets messy later.

There are certainly some things to plan ahead for, but if you start with something complex it will never get simple again. If you start with something simple, it will get more complex as time goes by but there is a chance that the scaling problems you anticipated present in a little different way and there's a simple fix.

I like to say, 'Simple Scales' in design reviews and aim to only add complexity when absolutely necessary.

newsbinator · on July 11, 2019

Hanlon's Razor: https://en.wikipedia.org/wiki/Hanlon%27s_razor

"Never attribute to malice that which is adequately explained by stupidity."

euske · on July 11, 2019

I always thought that this cause should also include "greed". But then, greed is kinda one step closer to malice, and I'm not sure if there's a line.

rossdavidh · on July 12, 2019

Ah, but that's a lot of big corps being more stupid in the last month than last year? If it's two or three more, that's normal variation. We're now at something more like 7 or 8 more. The industry didn't get that much stupider in the last year.

aaroninsf · on July 11, 2019

I will observe, without asserting that it is actually the case,

that successful executions of #3 should be indistinguishable from #4.

(And this is maybe a consequence of #1).

Diederich · on July 11, 2019

I've also worked at a couple of the companies involved.

This is the correct analysis on every level.

kwizzt · on July 11, 2019

How does the fact you worked at those companies relate to #4?

Edit: I misread the parent and my question doesn't make a lot of sense. Please ignore it :)

bdd · on July 11, 2019

> How does the fact you worked at those companies relate to #4?

For Facebook I worked on the incident, previous Wednesday. 9.5 hours of pain...

And for my past employers, I still have friends there texting the root causes with facepalm emojis.

liberte82 · on July 11, 2019

Do tell

GrumpyNl · on July 12, 2019

Turned out to be number #1 The outage was due to an internal configuration change, which we're now fixing. Some people may be able to access Twitter again and we're working to make sure Twitter is available to everyone as quickly as possible.

captn3m0 · on July 11, 2019

Can you clarify what redefining problems would mean (with an eg)?

GuiA · on July 11, 2019

Think of computer vision tasks. Until modern deep learning approaches came around, it was built on brittle, explicitly defined pipelines that could break entirely if something minor about the input data changed.

Then the great deep learning wave of 201X happened, replacing dozens/hundreds of carefully defined steps with a more flexible, generalizable approach. The new approach still has limitations and failure cases, but it operates at a scale and efficiency the previous approaches could not even dream of.

MegaButts · on July 11, 2019

That's not redefining the problem, so much as applying a new technology to solve the same problem. Usually using the flashy new technology decreases reliability due to immature tooling, lack of testing, and just general lack of knowledge of the new approach.

Also deep learning, while incredibly powerful and useful, is not the magic cure-all to all of computer vision's problems and I have personally seen upper management's misguided belief in this ruin a company (by which I mean they can no longer retain senior staff, they have never once hit a deadline, every single one of their metrics is not where they want it to be, and a bunch of other stuff I can't say without breaking anonymity).

idlewords · on July 11, 2019

FAANG(+T)(-N)(+M)

18pfsmt · on July 12, 2019

I think we 'bumped heads' at Middlebury in '94, and I think you are in store for an "ideological reckoning" w/in 3 years.

Pinboard is a great product, so thanks for that. I am surpised you don't have your own Mastodon instance (or do you?).

gcbw2 · on July 11, 2019

since all of them happen in high profile business hours, i'd guess either #1 or #5.

For #4 to be the actual cause, outages out of business hours would be more prevalent and longer.

icebraining · on July 11, 2019

Of course it went down during business hours, that's when people are deploying stuff. It's known that services are more stable during the weekends too.

peterburkimsher · on July 12, 2019

The Archive.org outage of 26th of June was outside PST business hours.

https://twitter.com/internetarchive/status/11436045396956160...

https://twitter.com/internetarchive/status/11433789908260044...

iamtheworstdev · on July 11, 2019

Faangt = Facebook amazon Apple Netflix Google tesla?

gsich · on July 12, 2019

Gmafia

arrty88 · on July 12, 2019

Add slack to the list

Edit: and stripe

foobarbecue · on July 12, 2019

Twitter, not tesla

loblollyboy · on July 11, 2019

[flagged]

jkaplowitz · on July 11, 2019

Former employees and current employees talk via unofficial online and offline backchannels at many companies.

loblollyboy · on July 11, 2019

Ok, so maybe I overreacted

bdd · on July 11, 2019

geez, tough crowd. do you wanna ten dollar hug?

loblollyboy · on July 11, 2019

I was just polishing my bit. Not in a bad mood today so much as a bored mood. You seem like you know what you are talking about (yes, I was bored enough to stalk you, too)

bdd · on July 11, 2019

If you are bored one day and around Menlo Park, come have a coffee or ice cream at FB campus. You can troll me in person.

18pfsmt · on July 12, 2019

Isn't it interesting where this is going? We all want to meet our accusers? I don't care for FB myself, but I appreciate what you all are doing in the larger sense. Cloudflare is my fave of your former employers (since you shared that in this discussion).

dang · on July 11, 2019

Could you please stop posting unsubstantive comments to Hacker News?

hexrcs · on July 11, 2019

Life in tech is like a Quentin Tarantino movie.

_lqaf · on July 12, 2019

...except everyone is sitting at desks typing, there's no blood or surf rock or chases or self-indulgent soliloquies, and the cursing is much less creative?

jessaustin · on July 12, 2019

Maybe you're doing it wrong?

robohoe · on July 12, 2019

  cursing is much less creative?

I beg to differ.

marenkay · on July 11, 2019

Only one thing to add:

Tech debt is accrued in amounts where every VC fund would get wet pants if tech debt was worth dollars paid out.

wybiral · on July 11, 2019

I've still never seen this much downtime on these systems so it's weird to happen all at once.

It's possible that they're related without requiring any conspiracy theories or anything. Maybe these companies are just getting too big or too sloppy to maintain the same standard of uptime (compared to the past few years)? Or maybe there's some underlying issue that they're all rushing to fix which justifies the breaking prod changes within the same timeframe.

But it was weird when a it happened to two or three of them. Now we're going on something like 5 massive failures from some of the biggest services online within a little over a week...