AWS IAM is having issues again (twitter.com/ryangartin)
226 points by rootforce on Sept 16, 2020 | hide | past | favorite | 118 comments



Wherever possible, don't use us-east-1. It's one of the older regions and parts are aging. Yes, I know there are things that are only available in the old regions but most services are globally available. I've worked with a few ex AWS SWEs and SREs. They drink the kool-aid and won't say anything bad about us-east-1 but they also won't launch net-new services there. YMMV


This is good advice, but does not help mitigate IAM issues. IAM is a global service.


IAM is mastered in us-east-1 with all writes going there.


I'd much rather use us-west-2 but Lambda@Edge is only available in us-east-1 :(


The Lambda functions have to be configured in us-east-1, but they don't actually run there - they get replicated to the Cloudfront POPs you're using.

e.g. we set up our L@E functions in us-east-1, run our actual systems in eu-west-1, but 95%+ of our traffic comes in via eu-west-2, and the L@E functions output logs for that traffic to the eu-west-2 CloudWatch.

For Cloudfront stuff, it's not very region dependent.
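For anyone chasing those regional logs, the naming is predictable. A small sketch encoding my understanding of the Lambda@Edge convention (replicas log to a group prefixed with the function's home region); the function name is made up, so treat the details as an assumption to verify:

```python
# Assumption: Lambda@Edge replicas write to a CloudWatch log group named
# "/aws/lambda/<home-region>.<function-name>" in the region that served
# the request. The function name below is invented for illustration.

def edge_log_group(function_name, home_region="us-east-1"):
    """Build the log group name a Lambda@Edge replica logs to."""
    return f"/aws/lambda/{home_region}.{function_name}"

# Traffic served out of eu-west-2 would log to this group in eu-west-2:
print(edge_log_group("viewer-request-fn"))
```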


Yes, but if you have Lambda@Edge as part of a larger CloudFormation stack, you are stuck in us-east-1 or using separate stacks.


> things that are only available in the old regions

And usually it's old things, like really really old if something is available in us-east-1 and not us-east-2. I would imagine some ancient ec2 instance types.


If that's the case, maybe AWS should raise the pricing on us-east-1 to fund improvements and/or encourage people to move to newer data centers.


They're one of the wealthiest companies in the world and you suggest they should raise prices to raise money to improve one of their older money printing locations? What planet are you from?

They can rebuild that whole datacenter a hundred times over and still not feel it in their wallet.


> They can rebuild that whole datacenter a hundred times over and still not feel it in their wallet.

Agreed.


It's about aligning incentives. If the user cost is the same, but the maintenance costs are higher, then something is broken, or you're eating the cost as a feature.


That’s interesting. As us-east-1 has most of the services, we use them primarily along with us-west-2.

Are you saying us-east-2 would be better than us-east-1?


Generally yes, since they gained experience and designed the newer data centers better. They also generally delay applying changes to us-east-1.

Although IAM is a global service, it is mastered in us-east-1, so you wouldn't be able to avoid this particular issue.


I know a couple of ex-AWS folks who sort of say the opposite. us-east-1 is supposedly less stable because they roll out updates and new services there first.


That's kind of silly :-)

I used to work for another pretty big company and we knew to roll out changes to the smaller DCs first.


If you look at it the other way: if a change works on us-east-1, it should work everywhere else.

If their goal is to reduce the number of affected locations (and not care about number of people) then deploying to us-east-1 makes more sense.


Yeah, but at that scale, they could pick region #5 in terms of traffic and probably cover 5 nines worth of edge cases :-)


So for east coast you would say go Ohio?


It’s funny how on Twitter this currently has 3 retweets and 17 likes but on HN it has 110 points!


I started seeing this while running Terraform and immediately went to Twitter to see if it is down for everyone or just me. Same experience. From now on I will go to HN :)


https://hckrnews.com is an alt-reader that sorts the front page chronologically which helps.


Maybe because the audience on Twitter and Hackernews is different?


I guess I'm going outside today.


What a perfect time to onboard some new employees.

Blah.


I modified some of our IAM policies earlier this afternoon. The pages that followed, saying some of our teams were having IAM issues, caused me great discomfort.


Congratulations, this time it wasn't you!

Maybe, possibly, hopefully...


This probably needs a better link, but the AWS status page shows everything up.

UPDATE: Status page now shows it https://status.aws.amazon.com/#


https://stop.lying.cloud if you (like me) keep getting the order of the aws, amazon, and status confused.


Is this just a gimmick run by AWS and the "honest" in the logo is just a play on the host name, or is it some third-party version that adds information that AWS isn't reporting?


It's mine. It's an old Chrome extension shoved into a Lambda@Edge function that dynamically transforms the actual status page to cut out a lot of the "sea of green bubbles." I should look into updating it; it used to automatically upgrade the severity one level as well but AWS changed something.

https://gaslighting.me is the non-editorialized version.


Do you have a writeup of “porting a chrome extension to lambda” somewhere?


Third party obviously. Check whois.


The page footer says "© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved."


Does it add/edit any information, or is it just a proxy?


probably just cname.


The AWS status page misses lots of "bump in the night" outages.


Those statuses are updated by people (after bureaucracy and only with very high level approval), not by anything based in reality


That is super messed up. That is not how a status board should work.


Blame SLA language and marketing


In addition to that, it is also cached by a CDN, so it takes a while to show up when there is an outage.


Amazon is sadly not the only company that works this way.


What’s everyone’s back up plans? Just a ‘We’ll be right back’ page?


Switching to that sort of page is probably the most cost-effective solution for small businesses. I've heard of larger companies running a completely redundant hot standby on another independent cloud platform and switching DNS over to the standby when something goes wrong. With auto-scaling, you're not paying to have the standby running at full throttle. Of course, you have to exclusively use services that have equivalents on the other cloud provider.
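For the DNS-switch part specifically, the mechanics are small compared to keeping the standby viable. A sketch of the change you might hand to a Route 53-style ChangeResourceRecordSets call (the record name, standby IP, and TTL are made up; this only builds the request, it doesn't send anything):

```python
# Build an UPSERT that repoints a record at the standby endpoint.
# Values here are invented for illustration.

def failover_change_batch(record, standby_ip, ttl=60):
    """Build a Route 53-style change batch pointing record at standby_ip."""
    return {
        "Comment": "fail over to standby",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record,
                "Type": "A",
                "TTL": ttl,  # low TTL so the switch propagates quickly
                "ResourceRecords": [{"Value": standby_ip}],
            },
        }],
    }

batch = failover_change_batch("www.example.com.", "203.0.113.10")
```

The hard parts described above (data, certificates, capacity on the standby) are untouched by this; the record change is the easy ten lines.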


> What’s everyone’s back up plans? Just a ‘We’ll be right back’ page?

I've done a lot of cloud site HA work on large sites.

Waiting out the cloud outage so far ends up being the best solution for almost all companies, from both engineering and business standpoints. Eat the outage, but continue with a known-working site afterwards. You just blame the cloud provider for the downtime.

> I've heard of larger companies running a completely redundant hot standby on another independent cloud platform and switching DNS over to the standby when something goes wrong.

In theory, that makes sense. In practice, it almost never works.

If by "independent cloud platform" you mean another AZ or region in the same cloud, that is often attempted and can work reasonably well. If you mean failover from AWS to GCP, then that's unlikely, since everything is different.

An example is whenever DynDNS goes down for 2-3 hours, and everybody builds out flaky failover tools that are less reliable than their original DNS provider - and have to be maintained forever. Might work with one or two domains, becomes a huge ongoing problem with dozens of them. Also, DNS mgmt. APIs are flaky in several dimensions (availability, versioning, parameters, etc.)

Another is that you can't failover to another location that doesn't have all your data, current certificates, monitoring, etc., and the failover site needs the capacity of the original site to work. That costs ongoing money and time, and you never know how well the failover will work or how it will perform.

An example is Heartland, one of the biggest US payment providers, who failed over to another location and took a 5-day outage. Or GitLab, who took a one-day outage because of database issues.

I have (automatically) failed over a large site for a publicly-traded company from one AWS region to another, but that took a year of work to setup, and I understood almost all aspects of the site. Afterwards, I realized that almost nobody really has time to organize that either at a conceptual or engineering level. And organizations don't recognize Herculean efforts like that, so think twice beforehand.

Key point: always involve your DBA from the beginning when doing a project like this.


Did this link get changed to a twitter post from something else?


Yes, it was a link to nitter.net (an alternative Twitter front end); it was changed due to the HN guidelines about posting links to original sources.


Noticed something odd today I think is connected to this.

The other day we started using Access Advisor, and we found some of our KMS key policies with a Principal of '*'.

It wasn't marked as globally open, so we planned to fix them a little later.

This morning we found that status had changed.

While we were in the wrong to begin with, it was a little surprising to find the interpretation of the key policy changing overnight.

Of course it became our top priority and is now fixed. Something to look out for...
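If anyone wants to audit for the same thing, the check is easy to script. A hedged sketch over plain key-policy JSON (the account ID and statements are made up, and real policies can use principal forms this doesn't cover):

```python
import json

def has_wildcard_principal(policy):
    """Return True if any Allow statement grants access to Principal '*'."""
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*" or (
            isinstance(principal, dict) and "*" in principal.values()
        ):
            return True
    return False

# The kind of policy we found (wide open):
open_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Principal": "*",
                   "Action": "kms:*", "Resource": "*"}],
}

# The fix: scope the principal to something concrete (account ID invented).
fixed_policy = json.loads(json.dumps(open_policy))
fixed_policy["Statement"][0]["Principal"] = {
    "AWS": "arn:aws:iam::111122223333:root"
}

print(has_wildcard_principal(open_policy), has_wildcard_principal(fixed_policy))
```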


I just feel bad for the oncalls...


Looks to be affecting all regions - at least within the standard aws partition. Not sure about aws-cn and aws-us-gov
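For reference, the partition is the second field of every ARN, which is an easy way to see which world a resource lives in (the account IDs below are made up):

```python
# ARN format: arn:<partition>:<service>:<region>:<account>:<resource>
ARNS = {
    "aws":        "arn:aws:iam::111122223333:role/Example",
    "aws-us-gov": "arn:aws-us-gov:iam::111122223333:role/Example",
    "aws-cn":     "arn:aws-cn:iam::111122223333:role/Example",
}

def partition_of(arn):
    """Extract the partition from an ARN."""
    return arn.split(":")[1]

print(partition_of(ARNS["aws-us-gov"]))
```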


Not seeing any issues in aws-us-gov.


TIL of aws-us-gov


Usually referred to as GovCloud.

It has its own version of everything, physically segmented, even for global systems such as IAM.

You have to be a US Citizen to work on it and you need special security clearances.


You might be getting GovCloud and AWS Secret Region confused. I've been told that I can access GovCloud if I cross the border from Canada, despite not being a US Citizen. (This came up in the context of providing FreeBSD AMIs.)


I believe the secret region is just part of GovCloud. Afaik, GovCloud is just the marketing name for aws-us-gov


GovCloud is a separate partition from Secret. Different regulatory framework alignment and customer onboarding.

GovCloud customers only need be a US person or entity, beyond that any further regulatory alignment is up to the customer. AWS does not audit the IAM user base for nationality or any compliance requirements.

Disclaimer: I am an AWS Public Sector Solutions Architect.


Ah it looks to be aws-iso (c2s.ic.gov, top secret) and aws-iso-b (sc2s.sgov.gov, secret)?


So, just to clarify: if a "customer" who happens to employ tens of thousands of people also employs non-US citizens (such as the DoD's foreign national IT contractors), then non-US citizens would have access to GovCloud.


Secret region is not a part of govcloud.


IAM is a global service. So it kind of makes sense that all regions would be affected by an outage.


Well yes but it's entirely possible for it to be locally affected


IAM seems to be run out of Northern Virginia (other than GovCloud) [0]. However, they may be doing some magic under the covers to reduce latency across the globe. It would also explain why GovCloud isn't affected.

[0] - https://aws.amazon.com/about-aws/global-infrastructure/regio...


Yeah, certainly the existence of credentials etc. must be replicated to deal with cross-region connectivity issues, which I guess explains why authentication is still working fine.

GovCloud I believe is technically separated in this regard, I don't think IAM credentials can operate across partitions.

Though it does seem to be possible to link the existence of a standard AWS account with GovCloud, the same is not true of the China partition


Gov Cloud is an entirely separate setup, you can't share anything across the two.


IAM is a global service.


Off-topic: I hadn't heard of nitter.net before, it seems pretty cool.


Sorry to disappoint, but we've changed the URL from https://nitter.net/RyanGartin/status/1306352941964701696#m to the original source, as the site guidelines ask: https://news.ycombinator.com/newsguidelines.html.


Yep, sorry about that. Thanks for the update.


I like it a lot better than the current twitter experience, and it's open source.


It makes me sad to see these reminders of how fast websites are perfectly capable of being.


It's self-hostable, too


my site is down on all environments (us-east-1). oy vey


Is your site dependent on IAM?


Pretty hard to not be dependent on IAM. Authentication and authorization are some of the most core concepts you can have


Amazon says that "The issue continues to affect create, describe, modify or delete of IAM accounts and roles. [...] Authentication using IAM accounts and roles are not affected."

So it's entirely possible to depend on IAM but not in a way which this is breaking.
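One practical consequence: reads and auth keep working, so only the mutating calls need defensive handling. A toy sketch of riding it out with capped exponential backoff (the failing `flaky_create_role` stand-in below is invented for illustration, not a real AWS call):

```python
import random
import time

def with_backoff(op, attempts=5, base=0.05):
    """Call op(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return op()
        except RuntimeError:
            if attempt == attempts - 1:
                raise
            time.sleep(min(base * 2 ** attempt, 1.0) * random.random())

calls = {"n": 0}
def flaky_create_role():
    """Simulated IAM write that fails twice before the outage clears."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("IAM write unavailable")
    return "role-created"

result = with_backoff(flaky_create_role)
print(result)  # the third attempt succeeds
```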


This makes sense. We have services that exercise IAM/STS very, very routinely, but without changing IAM properties, and no alerts popped up during this outage.


For sure, that’s just not what the parent comment asked


I thought IAM authentication was working, but the create / remove / etc steps were down. This is a HUGE impact if all IAM role access is down globally for AWS.

Please confirm and I will work to spread the news - can I cite you as the source?


https://sqwok.im, it's entirely on aws services... iam is interwoven throughout :] rather than get frustrated, I just chalk it up to 2020, will go for a walk.. oh wait there's apocalyptic smoke outside!


https://sqwok.im/ just loaded for me.


sweeeet, thanks for the heads up


Loads fine for me an hour later. Nice site! I love the design and idea behind it.


hey thank you, I appreciate the feedback! I've been working hard to bring it to life and had just pushed out a new @mentions feature right when aws went down... feel free to "@guac" if you feel like trying it out and I'll hop into the conversation


[flagged]


Please don't take HN threads into regional flamewar.

https://news.ycombinator.com/newsguidelines.html


do you mean like multi-region? I'd like to get there, one of the aws services is only available in us-east-1 right now (iot related).


Which service? I'd also like to hear more about your multi-region journey (I work at AWS on the IoT services). Any way we can connect directly? You can message me on Twitter (@timmattison) if that works.


Hey Tim, thanks for reaching out! created a post sqwok.im/p/QqyVbKne4IiIdQ to connect at.


I was referring to

> will go for a walk.. oh wait there's apocalyptic smoke outside!


I can always rely on HN for finding out why AWS is broken... Is it time to switch to GCP?


No, sometimes you see GCP failures posted here too. Your best bet is to move to Oracle Cloud. I have never seen an Oracle Cloud (not Oracle Data Cloud, which is the ad-tech product) outage posted here, so I think they must have 100% uptime.


Sarcasm detected! There's always going to be pros and cons to every provider, so the obvious choice is host everything on my home theater PC since I never turn that thing off.


Please don't consider this a counterpoint, but I have had single physically deployed servers that have significantly higher average uptime than my ec2 instances in AWS us-east-1 in the last 5 years.

(significantly higher == 100% availability, it hasn't gone down... yet)


(Not considering as a counterpoint, just want to point this out for whomever.)

When (not if) it does go down, however long it takes to replace is downtime that has to be amortized over the entire X years, to get an accurate picture. If it takes > 260 minutes to detect, replace, and fully recover, then you've already blown the four nines budget you "saved up" for the past five years, when amortizing.
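The arithmetic behind that 260-minute figure:

```python
# Downtime budget for 99.99% availability ("four nines") over five years.
minutes_in_5_years = 5 * 365.25 * 24 * 60
budget = minutes_in_5_years * (1 - 0.9999)
print(round(budget))  # roughly 263 minutes
```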

Side-note from a recent twitter thread: If you're not rigorously monitoring availability (preferably with canaries/probes), it's probably much lower than you think. (Using the royal 'you' here, not pointing any fingers.)


I actually had a server with uptime of a decade+. Then my provider (Hetzner) decided they wanted to get rid of the hardware. They gave me fair warning and dumped it. I backed everything up and only restored my DNS and MX records from BIND (because I wanted to keep getting mail). Not the database, not the blog theme, nothing. One of these days I gotta do it.

Now I host everything on managed services. It was the right decision then, I was broke. Now, though, the cost of having to deal with it is higher.


That's not at all surprising because shutting down and starting new ec2 instances is expected to be a relatively common occurrence. High availability comes from a load balancer routing traffic to an instance that is currently available, not from having a single instance that just runs forever.



This doesn't apply. Having a server fail on you doesn't preclude you from posting to HN.


But it becomes uninteresting, which causes self-censorship. How many times did you hear about someone's server running for 12 minutes? How many times about server running for over a year? Nobody posts about the first one. (Unless it was a spectacular failure for some reason)

Similar issue comes with estimating how reliable things are. People are more likely to respond "I had an issue with X too, here's my story" rather than "all good, nothing to report".


That or IBM... A friend from high school also just set up what he calls "micro-cloud in the cloud". Haven't heard of any downtime yet, maybe that's a better bet...


You've never seen my basement's outages posted here either.


How often the "____ is down" threads show up on HN is a function of the popularity and mission-critical-ness of the service. I wouldn't use it as a metric for choosing a cloud provider unless you control for popularity.


Even "controlling for popularity", frequency of HN posts is a terrible metric. If you're in the position to be making a significant investment in a cloud provider, you can probably afford the actual market research on reliability and uptime from reputable sources.


If aws is too unstable for your needs you’re in for a major adventure on gcp


I said this a bit facetiously after working with an organization that had massive troubles with GCP downtime early on, and a friend who works on their infra team and gives me reports of how often things go wrong. I guess it didn't translate ¯\_(ツ)_/¯.


Why not both?


Azure. The web panel is a bazillion times better and they know AD... pretty well.


I think this is sarcasm.

But genuinely AzureAD is really good. Authentication and Authorization systems are difficult, AD has always been begrudgingly the most well rounded (yes, it has warts) but AzureAD really is nice.

I especially like that my local admin can delegate Enterprise Apps to me so I can create SAML/OAUTH2 SSO links between stuff we use without needing the keys to the kingdom.

I recently set up Enterprise Federation with GCP and it took less than a day. (compared with many months in my last company which used on-prem AD)


Does azure AD actually do active directory yet (LDAP and all)? Or is it still just a name?


All of our windows desktop machines use it as their login controller, but I’m not 100% sure of the implementation- it is likely there needs to be a local login server because from what I remember of Windows it’s using WINS to find a logon server, and that I think must be local.


It’s actually a new-ish “credential provider” system for logging into Windows. It’s not just Azure AD either, Gsuite can be used for logging into and managing Windows now. There is no longer a need for a local domain controller either.

https://docs.microsoft.com/en-us/windows/win32/secauthn/cred...


Oof, WINS is some super antique stuff. The (traditional) process to find and select a domain controller is known as "DC Locator"--or else search for DsGetDc--and is essentially DNS-based.

Caveat, I've been out of the Windows/AD game for 7+ years!


For systems expecting LDAP and Kerberos you can use either AAD Domain Services (a read-only DCaaS) or an on-premises AD controller.


It’s definitely not AD and isn’t compatible with many things that you’d expect. They just repurposed the name


Let's all move to serverless!


Legit not sure if you're being sarcastic or not.


Maybe your sarcasm detector is broken.


I'm not a robot, I'm a person, please speak to me as anyone you'd meet in person.


It certainly is. And I came in this morning specifically to create some Lambda roles to test. Fark.


I noticed (pre outage) IAM console won't work at all if I --disable-reading-from-canvas in my launch args to prevent fingerprinting. All the other service consoles I use work. I have to have a special config for my browser just for AWS because of it. Wishful thinking, but maybe they're fixing that just for me.



