
> The software engineering paradigms used within the company create brittle rube goldberg machines of events flowing everywhere in the company. Almost all of them are on maintenance mode, where the oncall burns out the engineers and prevents them from creating new products. There is no knowledge sharing between team members. Legacy team members guard their technical platform knowledge to solidify their place on the team.

I have never worked at Amazon. But I did work at a company that decided to implement microservices a la Amazon, even going so far as sharing Bezos' famous two-pizza-team memo. This effect essentially happened overnight. As people spun up more and more microservices, things got more and more siloed, cross-team collaboration became significantly more difficult, and everything turned into an increasingly complicated Rube Goldberg machine that just destroyed people with on-call schedules.



Software engineers having "on call" schedules at all is crazy to me. You shouldn't be writing code at 3am to fix a bug after working all day just to turn around and work the next day as well.


It works well if the company empowers engineers to write software that doesn't break all the time.

At HBO Max every incident had a full writeup and then real solutions were put in place to make the service more stable.

My team had around 3 incidents in 2 years.

If the cultural expectation is that the on call buzzer will never go off, and that it going off is a Bad Thing, then on call itself isn't a problem.

Or as I was fond of saying "my number one design criteria (for software) is that everyone gets to sleep through the night."

The customers win (stable service) and the engineers win (sleep).


What was the time to implementation for the real solutions?

Were there any cost considerations associated with prioritizing the implementation, or even with limiting the scope of the solution?


The tooling there was amazing, so a barebones service could get deployed into prod in a couple of days if need be.

That typically didn't happen because engineering reviews had to occur first.

A single command created a new repo, set up ingress/egress configs in AWS, and set up all the boilerplate to handle secrets management, environment configs, and the like.

All that was left to do was business logic.
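
To make that concrete, here is a minimal sketch of what such a one-command scaffold could look like. Everything in it (the file layout, config names, and the git call) is hypothetical illustration, not the actual HBO Max tooling, which also wired up AWS ingress/egress and real secrets management.

```python
# Hypothetical one-command service scaffold (illustrative only): create a
# repo skeleton with config/secrets boilerplate so a new service only needs
# its business logic filled in.
import argparse
import subprocess
from pathlib import Path

BOILERPLATE = {
    "README.md": "# {name}\n\nGenerated service skeleton.\n",
    "config/environments.yaml": "dev: {{}}\nstaging: {{}}\nprod: {{}}\n",
    "config/secrets.yaml.template": "# Secrets are injected at deploy time; never commit real values.\n",
    "infra/ingress.yaml": "# Placeholder for ingress/egress rules, filled in by the deploy pipeline.\n",
    "src/{name}/handler.py": "def handle(event):\n    # TODO: business logic goes here\n    raise NotImplementedError\n",
}

def scaffold(name: str, root: Path) -> None:
    project = root / name
    for rel_path, content in BOILERPLATE.items():
        path = project / rel_path.format(name=name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content.format(name=name))
    # Initialize the repo; a real tool would also register CI and deploy configs.
    subprocess.run(["git", "init", str(project)], check=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Scaffold a new service skeleton")
    parser.add_argument("name")
    parser.add_argument("--root", type=Path, default=Path("."))
    args = parser.parse_args()
    scaffold(args.name, args.root)
```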


It sounds like if you have production issues, there is a priority to put out an actual fix immediately.


It depends on the severity of the issue.

If the issue impacts tens of millions of customers, then yes, get it fixed right now. Extended outages can be front page news. Too many in a row and people leave the service.

Ideally, monitoring catches outages as soon as they start, and runbooks have steps to quickly restore service even if a full fix cannot be put in place immediately.
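
For illustration, here's a toy version of that kind of monitoring: page the on-call only after several consecutive failed health checks, and attach the runbook link so whoever gets paged can restore service without writing code at 3am. The endpoint, runbook URL, and thresholds are made up; real setups would use Prometheus/Datadog/CloudWatch alerting rather than a script like this.

```python
# Toy availability probe: alert only on sustained failure, not a single blip.
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://example.com/healthz"                    # hypothetical endpoint
RUNBOOK_URL = "https://wiki.example.com/runbooks/my-service"  # hypothetical runbook
FAILURE_THRESHOLD = 3          # consecutive failures before paging
CHECK_INTERVAL_SECONDS = 30

def healthy() -> bool:
    """Return True if the health endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def page_oncall(message: str) -> None:
    # Stand-in for a real paging integration (PagerDuty, Opsgenie, ...).
    print(f"PAGE: {message} (runbook: {RUNBOOK_URL})")

failures = 0
while True:
    failures = 0 if healthy() else failures + 1
    if failures == FAILURE_THRESHOLD:
        page_oncall(f"{HEALTH_URL} failed {failures} checks in a row")
    time.sleep(CHECK_INTERVAL_SECONDS)
```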


My experience being "on call" as an engineer has mostly not been that you need to write code at 3am. It usually comes down to restarting a machine, deploying a new machine or copy, or informing the rest of the company that some 3rd party API that you rely on is currently down.


But that's not engineering work, that's technician or operator work. The engineer comes in later to discuss what went wrong and how to prevent it next time.

Speaking as a technician who's seen 3 AM at work many a time.


I think the devops-inspired idea of engineers owning what they deploy has become fairly popular, for better or for worse.


For my own part, this wasn't a huge team. We had the knowledge to tell whether the issue was application/software based, but would pass it back to ops if it was hardware/OS related.

One possible bonus: being on call for your own software also gives you a solid incentive not to wake yourself up in the morning by writing bad code, and to fix the issues that do arise quickly.


> being on call for your own software also gives you a solid incentive not to wake yourself up in the morning by writing bad code

Unfortunately, my software interacts over network with software written by other people; if something goes wrong at 3 AM the users don't know which part caused the problem, so they wake up a random person.


if your pagerduty/equivalent is doing that, something has gone very wrong at your company


That’s an observability problem, not an on-call problem.


When I worked at Samsung Austin Semiconductor, it was absolutely the norm to call or text engineers after-hours to weigh in on machine irregularities.


As others alluded to, there's no reason for an engineer making $200k/yr+ to do that. You can document how to recover from those error states and pay someone 25% of that to handle it.


Seriously! Working weekends in retail when I was young, one of the hallmarks of a "real, professional job" was not having to work nights/weekends when your routine schedule is during the day. It was a major motivator to get through school and get skilled.

Now I see young engineers from top-tier schools working "on call" without complaint. I've found ways to avoid such roles, but it always seemed ridiculous and completely unnecessary in a world where there are software engineers around the globe who could easily work full-time support positions.


I found it to be rather the opposite; when I was off from a wage-slave job, I was actually off. If the boss called, you could just ignore it and say you missed it because you were studying or sleeping or with friends or whatever and they couldn't really say anything because they knew they didn't pay you enough to care.

When you're making real money, they own your ass.


Indeed, those who are exempt or whatever with salaries... consider it purchase/lease with at-will employment in most states.

Both salary and hourly gigs have income hanging by a thread with plenty of work, yet only one can get overtime.

Be given responsibility/salary for something (aka hired) by a particularly needy manager/org and be 'undependable'. Read: not at their call. See how it turns out.

The worst/eventual outcome: bye-bye money. Hopefully one has a more reasonable environment. Workers have little on their side.

As someone who does SRE (not AWS, elsewhere)... I would absolutely prefer pay as an hourly rate over salary. I don't like putting in more hours/making less money because Developer Kelly had a bad launch... but I have to, The 9s (and bills) Must Flow.

Fortunately, my current place takes this into account. I don't actually need bonuses or structure change... but the larger trends remain. The employer is buying you, salary opens the time box.


It is a self-fulfilling prophecy. Unrealistic schedules result in crappy code, which results in pagers/alerts going off at all hours, which results in more unrealistic schedules. Agile's answer is to reduce the scope of things delivered. You might as well spit into the wind. Deadlines are set by the business, no matter what the Agile evangelist said.


What would be more sane? Allowing things to remain broken at night? Having someone who is not a software engineer fix it?


Either pay for a separate full-time night shift or, yes, accept that things stay broken for the time being.

If your product is really so important that it can't be down, hire more engineers and pass the markup on to your customers.

I'm glad I live in a country where you have to have 11 hours between the end of one workday and the start of the next (except for special cases, afaik).


There are legitimate reasons to pull in someone after hours, but it really has to be catastrophic. I'd 100% want to be called in if I deployed something knocking out 911 service for a whole state and I was the only one with the knowledge to actually fix it in a timely manner. However, most problems are not like that and are either able to be delayed until an actual business day or can be solved by someone else.


Let's be real, we're talking about line of business apps and ecommerce stores making $5,000/day total revenue. "Critical infrastructure" has an entirely different failure model.


If the product that breaks in the night is SO important for the company, well, why is the company not paying for dedicated people (not the engineers who built the product) to take care of it when it breaks? As said above, while on call you don't write code, you just turn off feature flags, reboot machines, etc.

If the company cannot afford that, then the product is not that important and can remain broken until the morning.

Even 24h fast food places hire 3 people (each working 8h)!


If a 7-Eleven is open 24 hours a day, they usually hire three 8-hour shifts (roughly speaking).


If you want things to not break, have redundancy in hardware and failover modes that let you function in reduced capacity.

Manual fixes should never be done in a hurry, and if your system is that fragile, I really wonder about the competency of your senior employees and leadership.


> Software engineers having "on call" schedules at all is crazy to me. You shouldn't be writing code at 3am to fix a bug after working all day just to turn around and work the next day as well.

Oncall only seems crazy to someone who also believes it's totally acceptable to have whole services down for hours throughout the night.

If you understand what it takes to keep anything available 24/7, you understand damn well that you need someone who can jump on a laptop as soon as an alarm bell rings.


Well, that someone had better be someone other than me, because I'm not going to do unpaid night shifts. If you want something running 24/7, it's surely important enough to warrant hiring someone else to take care of it while I'm asleep, no?

Keep your fancy valley salary (with the ridiculous rent prices attached), and I'll keep my European workers' rights protections, including undisturbed sleep after my 8-hour workday.


It certainly shouldn't be unpaid. When I've been on call, we got half a week's salary for the week (in the EU).

Mostly things went smoothly so that's a pretty nice bonus.


Yeah, it's different in the EU. In the US, engineers are often expected to do unpaid oncall; that is, these companies usually phrase being oncall as part of your ordinary duties, without additional compensation. And even if it is compensated, sometimes you cannot opt out without seriously harming your career.

Something ridiculous like that is luckily impossible in (most?) EU countries.


What happens when you are drunk or at a concert? What happens when you do not hear the phone (or if it is off)?

Is the consequence some kind of informal harm (to your career)?

In the EU when you are on call this is a contractual thing.


Is it a US thing that on-call shifts are unpaid?

That is not common in Europe. Generous compensation and additional time off is quite typical for engineers handling on-call burdens.


Yes it is.

At one company, I was technically on call 24 hours a day 7 days a week for over ten years. Did I get called that often? No. Did I get called at the worst possible moments? Yes.


That's not a problem with the concept of being oncall. That's an entirely different problem, one that's neither technical nor operational, and not industry-specific.


Isn't the fact that you receive calls rarely, but at the worst possible moments, literally the core problem of being oncall?

And it's certainly industry-specific. Some doctors have this, firefighters too, and software engineers. Unlike the first two, software engineers usually don't save lives, just revenue.


Such a weird take.

There is a cost to having on-call. Whether it's in the extra hours you are paying your engineers or other technicians, or sleep deprivation, dwindling motivation and performance, the cost is always there.

In a business, cost is always balanced with the return on that investment.

So it trivially follows that on-call only makes sense where the return is bigger than the investment. If your $100/h engineers become $20/h engineers during the day because of the on-call rotation, and you lose $200 of sales overnight when things are down (even your customers are asleep), then you are actually investing that $80/h difference for 8 hours ($640) to recover $200, for a net loss of $440.
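
Worked through explicitly (all figures are the made-up illustrative ones above, not real data):

```python
# Cost-benefit of the hypothetical on-call scenario described above.
engineer_rate_normal = 100   # $/h when rested
engineer_rate_tired = 20     # $/h effective output after a bad night
hours_per_day = 8
overnight_sales_lost = 200   # $ lost while the service is down overnight

productivity_cost = (engineer_rate_normal - engineer_rate_tired) * hours_per_day
net = overnight_sales_lost - productivity_cost
print(productivity_cost)  # 640
print(net)                # -440: on-call is a net loss in this scenario
```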

Yes, there are cases where it's fully acceptable to simply have your service down for the night. E.g., imagine a service that reports how much energy the sun is providing at a given location (to combine with solar-farm production): is it really that bad if that's down at 2am? Sure, it might be nice to get it back up before the sun rises, but this is just a trivial example where an uptime of ~70% (it fluctuates) is perfectly acceptable.


> There is a cost to having on-call. Whether it's in the extra hours you are paying your engineers or other technicians, or sleep deprivation, dwindling motivation and performance, the cost is always there.

I don't understand your take. Every single time I've had a job with an oncall rotation, that oncall was paid. I was paid a bonus for being oncall, I was paid a bonus if a pager fired outside of office hours during my oncall, and I was paid a bonus if I was pulled into an incident response outside of my oncall rotation. There was always a cost, and we were paid for it. Being oncall represented roughly a 15% pay bump.

If that's not your case then I'm sorry but your problem is not the oncall rotation.


Getting paid for on call, especially with that kind of multi-level bonus structure, is incredibly uncommon.


That should make my point more obvious: why would a business pay you 15% more if it loses little or no money or customers when services stay down until someone comes back for their regular workday?


If what we’re talking about is a website/app/SaaS/etc, and if it needs to be up 24/7, then that almost certainly means that it’s being used globally, or at least across several timezones.

So, hire a team in another time zone.

This is a problem of management not prioritizing the health and wellness of their employees, simple as that.


It's absolutely acceptable to have your website go down for some reason overnight. Fix it in the morning.

Even if your app is critical infrastructure (it isn't, and 99% of you shaking your head and saying it is are objectively incorrect), you don't need a software engineer to fix it. You need an SRE. That's completely different.


> But I did work at a company that decided to implement microservices a la Amazon, even going so far as sharing Bezos' famous two-pizza-team memo. This effect essentially happened overnight. As people spun up more and more microservices, things got more and more siloed, cross-team collaboration became significantly more difficult, and everything turned into an increasingly complicated Rube Goldberg machine that just destroyed people with on-call schedules.

Microservices solve a problem for companies that have already reached a certain scale, where cross-team communication has become unfeasible and communicating by well-defined API contract is a better choice.

Also, ideally microservices should reduce the blast radius of outages. If an outage cannot easily be traced to the team that should root-cause it, then proper monitoring is not in place.

Sure, I've had times where on call went off and it wasn't my team's fault, but it took no more than 10 minutes to determine that, reroute the call, and go back to sleep.

The other point is that a new microservice should be designed in conjunction with its primary consumers, just like any other part of the software. Stakeholders need to be brought in and consulted.

The advantage of microservices is that new code is only going to impact direct consumers and downstream services.

It also allows you to upgrade a service in place or completely rewrite it so long as you adhere to the original contract.
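
As a small illustration of what "adhering to the contract" can look like in code (the service and method names here are hypothetical, not any particular company's API):

```python
# Sketch of "the contract is the boundary": consumers depend only on this
# interface, so the implementation behind it can be rewritten or replaced
# as long as the contract still holds.
from typing import Protocol

class InventoryService(Protocol):
    def reserve(self, sku: str, quantity: int) -> bool:
        """Reserve stock; returns True if the reservation succeeded."""
        ...

class HttpInventoryService:
    """One implementation; could be swapped for a rewritten service
    (different transport, language, or team) without touching callers."""
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def reserve(self, sku: str, quantity: int) -> bool:
        # Real code would make an HTTP call here; stubbed for the sketch.
        return quantity > 0

def place_order(inventory: InventoryService, sku: str, quantity: int) -> str:
    # The caller only knows the contract, not the implementation.
    return "confirmed" if inventory.reserve(sku, quantity) else "backordered"

print(place_order(HttpInventoryService("https://inventory.example.com"), "SKU-1", 2))
```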

I've seen impressive rollouts of security updates across a huge code base that was only possible because of a microservice-based design.

I've seen giant monoliths fall apart as multi-year long efforts are undertaken just to update the build system.

The hard part of microservices is that they require discipline in an organization, and basically assigning an engineer for 1 to 2 weeks to write runbooks and add monitoring.

Of course, those runbooks should be written no matter what paradigm someone uses!


At Amazon's scale is there any alternative to a service oriented architecture?


SOAs serve to more clearly delineate responsibilities: any tight coupling is made relatively obvious.

Nothing stops someone from simply enforcing the same division in a single large code base. Your API contract can be your public API in whatever programming language, and this would allow you to work with the same assumptions as in an SOA.

The only catch is that it's easier to break out of the recommended way of doing things, but you can provide simple tooling that does static analysis to prevent that (I remember using Zope3 security configuration to achieve exactly that with Python code in ~2006).
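
A rough sketch of what such a static check might look like today (the package layout and rule are hypothetical; real projects often reach for tools like import-linter instead): fail the build when one top-level package imports another package's internals instead of its public `api` module.

```python
# Enforce service-style boundaries inside a monolith via a simple import check.
import ast
import pathlib
import sys

def violations(src_root: pathlib.Path) -> list[str]:
    problems = []
    for path in src_root.rglob("*.py"):
        own_package = path.relative_to(src_root).parts[0]
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                targets = [node.module]
            else:
                continue
            for target in targets:
                top = target.split(".")[0]
                # Only police imports between our own top-level packages,
                # not stdlib or third-party modules.
                if top != own_package and (src_root / top).is_dir():
                    if target != f"{top}.api":
                        problems.append(f"{path}: imports {target} (use {top}.api instead)")
    return problems

if __name__ == "__main__":
    found = violations(pathlib.Path("src"))
    print("\n".join(found))
    sys.exit(1 if found else 0)
```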

If you are concerned about the performance of such a large monolith, you could use a functional language (or at least a pure functional paradigm) that allows easier infinite horizontal scaling.


> Nothing stops someone from simply enforcing the same division in a single large code base.

I'd say nothing except human nature.


When I said "enforcing", I really meant with static analysis tools.



