AWS IAM is having issues again (twitter.com/ryangartin)
226 points by rootforce on Sept 16, 2020 | hide | past | favorite | 118 comments



Wherever possible, don't use us-east-1. It's one of the older regions and parts are aging. Yes, I know there are things that are only available in the old regions but most services are globally available. I've worked with a few ex AWS SWEs and SREs. They drink the kool-aid and won't say anything bad about us-east-1 but they also won't launch net-new services there. YMMV


This is good advice, but does not help mitigate IAM issues. IAM is a global service.


IAM is mastered in us-east-1 with all writes going there.


I'd much rather use us-west-2 but Lambda@Edge is only available in us-east-1 :(


The Lambda functions have to be configured in us-east-1, but they don't actually run there - they get replicated to the Cloudfront POPs you're using.

e.g. we set up our L@E functions in us-east-1, run our actual systems in eu-west-1, but 95%+ of our traffic comes in via eu-west-2, and the L@E functions output logs for that traffic to the eu-west-2 CloudWatch.

For Cloudfront stuff, it's not very region dependent.
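For anyone chasing those regional logs, the naming is predictable. A small sketch encoding my understanding of the Lambda@Edge convention (replicas log to a group prefixed with the function's home region); the function name is made up, so treat the details as an assumption to verify:

```python
# Assumption: Lambda@Edge replicas write to a CloudWatch log group named
# "/aws/lambda/<home-region>.<function-name>" in the region that served
# the request. The function name below is invented for illustration.

def edge_log_group(function_name, home_region="us-east-1"):
    """Build the log group name a Lambda@Edge replica logs to."""
    return f"/aws/lambda/{home_region}.{function_name}"

# Traffic served out of eu-west-2 would log to this group in eu-west-2:
print(edge_log_group("viewer-request-fn"))
```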


Yes, but if you have Lambda@Edge as part of a larger CloudFormation stack, you are stuck in us-east-1 or using separate stacks.


> things that are only available in the old regions

And usually it's old things, like really really old if something is available in us-east-1 and not us-east-2. I would imagine some ancient ec2 instance types.


If that's the case, maybe AWS should raise the pricing on us-east-1 to fund improvements and/or encourage people to move to newer data centers.


They're one of the wealthiest companies in the world and you suggest they should raise prices to raise money to improve one of their older money printing locations? What planet are you from?

They can rebuild that whole datacenter a hundred times over and still not feel it in their wallet.


> They can rebuild that whole datacenter a hundred times over and still not feel it in their wallet.

Agreed.


It's about aligning incentives. If the user cost is the same, but the maintenance costs are higher, then something is broken, or you're eating the cost as a feature.


That’s interesting. As us-east-1 has most of the services, we use them primarily along with us-west-2.

Are you saying us-east-2 would be better than us-east-1?


Generally yes, since they gained experience and designed the newer data centers better. They also generally delay applying changes to us-east-1.

Although IAM is a global service, it is mastered in us-east-1, so you wouldn't be able to avoid this particular issue.


I know a couple of ex-AWS folks who sort of say the opposite. us-east-1 is supposedly less stable because they roll out updates and new services there first.


That's kind of silly :-)

I used to work for another pretty big company and we knew to roll out changes to the smaller DCs first.


If you look at it the other way: if a change works on us-east-1, it should work everywhere else.

If their goal is to reduce the number of affected locations (and not care about number of people) then deploying to us-east-1 makes more sense.


Yeah, but at that scale, they could pick region #5 in terms of traffic and probably cover 5 nines worth of edge cases :-)


So for east coast you would say go Ohio?


It’s funny how on Twitter this currently has 3 retweets and 17 likes but on HN it has 110 points!


I started seeing this while running Terraform and immediately went to Twitter to see if it is down for everyone or just me. Same experience. From now on I will go to HN :)


https://hckrnews.com is an alt-reader that sorts the front page chronologically which helps.


Maybe because the audience on Twitter and Hackernews is different?


I guess I'm going outside today.


What a perfect time to onboard some new employees.

Blah.


I modified some of our IAM policies earlier this afternoon. The pages that followed, saying some of our teams were having IAM issues, caused me great discomfort.


Congratulations, this time it wasn't you!

Maybe, possibly, hopefully...


This probably needs a better link, but the AWS status page shows everything up.

UPDATE: Status page now shows it https://status.aws.amazon.com/#


https://stop.lying.cloud if you (like me) keep getting the order of the aws, amazon, and status confused.


Is this just a gimmick run by AWS and the "honest" in the logo is just a play on the host name, or is it some third-party version that adds information that AWS isn't reporting?


It's mine. It's an old Chrome extension shoved into a Lambda@Edge function that dynamically transforms the actual status page to cut out a lot of the "sea of green bubbles." I should look into updating it; it used to automatically upgrade the severity one level as well but AWS changed something.

https://gaslighting.me is the non-editorialized version.


Do you have a writeup of “porting a chrome extension to lambda” somewhere?


Third party obviously. Check whois.


The page footer says "© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved."


Does it add/edit any information, or is it just a proxy?


probably just cname.


The AWS status page misses lots of "bump in the night" outages.


Those statuses are updated by people (after bureaucracy and only with very high level approval), not by anything based in reality


That is super messed up. That is not how a status board should work.


Blame SLA language and marketing


In addition to that, it is also cached by a CDN, so it takes a while to show up when there is an outage.


Amazon is sadly not the only company that works this way.


What’s everyone’s back up plans? Just a ‘We’ll be right back’ page?


Switching to that sort of page is probably the most cost-effective solution for small businesses. I've heard of larger companies running a completely redundant hot standby on another independent cloud platform and switching DNS over to the standby when something goes wrong. With auto-scaling, you're not paying to have the standby running at full throttle. Of course, you have to exclusively use services that have equivalents on the other cloud provider.
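For the DNS-switch part specifically, the mechanics are small compared to keeping the standby viable. A sketch of the change you might hand to a Route 53-style ChangeResourceRecordSets call (the record name, standby IP, and TTL are made up; this only builds the request, it doesn't send anything):

```python
# Build an UPSERT that repoints a record at the standby endpoint.
# Values here are invented for illustration.

def failover_change_batch(record, standby_ip, ttl=60):
    """Build a Route 53-style change batch pointing record at standby_ip."""
    return {
        "Comment": "fail over to standby",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record,
                "Type": "A",
                "TTL": ttl,  # low TTL so the switch propagates quickly
                "ResourceRecords": [{"Value": standby_ip}],
            },
        }],
    }

batch = failover_change_batch("www.example.com.", "203.0.113.10")
```

The hard parts described above (data, certificates, capacity on the standby) are untouched by this; the record change is the easy ten lines.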


> What’s everyone’s back up plans? Just a ‘We’ll be right back’ page?

I've done a lot of cloud site HA work on large sites.

Waiting out the cloud outage so far ends up being the best solution for almost all companies, from both engineering and business standpoints. Eat the outage, but continue with a known-working site afterwards. You just blame the cloud provider for the downtime.

> I've heard of larger companies running a completely redundant hot standby on another independent cloud platform and switching DNS over to the standby when something goes wrong.

In theory, that makes sense. In practice, it almost never works.

If by "independent cloud platform" you mean another AZ or region in the same cloud, that is often attempted and can work reasonably well. If you mean failover from AWS to GCP, then that's unlikely, since everything is different.

An example is whenever DynDNS goes down for 2-3 hours, and everybody builds out flaky failover tools that are less reliable than their original DNS provider - and have to be maintained forever. Might work with one or two domains, becomes a huge ongoing problem with dozens of them. Also, DNS mgmt. APIs are flaky in several dimensions (availability, versioning, parameters, etc.)

Another is that you can't failover to another location that doesn't have all your data, current certificates, monitoring, etc., and the failover site needs the capacity of the original site to work. That costs ongoing money and time, and you never know how well the failover will work or how it will perform.

An example is Heartland, one of the biggest US payment providers, who failed over to another location and took a 5-day outage. Or GitLab, who took a one-day outage because of database issues.

I have (automatically) failed over a large site for a publicly-traded company from one AWS region to another, but that took a year of work to setup, and I understood almost all aspects of the site. Afterwards, I realized that almost nobody really has time to organize that either at a conceptual or engineering level. And organizations don't recognize Herculean efforts like that, so think twice beforehand.

Key point: always involve your DBA from the beginning when doing a project like this.


Did this link get changed to a twitter post from something else?


Yes, it was a link to nitter.net (an alternative Twitter front end); it was changed due to the HN guidelines about posting links to original sources.


Noticed something odd today I think is connected to this.

The other day we started using Access Advisor, and we found some of our KMS key policies with a Principal of '*'.

It wasn't marked as globally open, so we planned to fix them a little later.

This morning we found that status had changed.

While we were in the wrong to begin with, it was a little surprising to find the interpretation of the key policy changing overnight.

Of course it became our top priority and is now fixed. Something to look out for...
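If anyone wants to audit for the same thing, the check is easy to script. A hedged sketch over plain key-policy JSON (the account ID and statements are made up, and real policies can use principal forms this doesn't cover):

```python
import json

def has_wildcard_principal(policy):
    """Return True if any Allow statement grants access to Principal '*'."""
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*" or (
            isinstance(principal, dict) and "*" in principal.values()
        ):
            return True
    return False

# The kind of policy we found (wide open):
open_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Principal": "*",
                   "Action": "kms:*", "Resource": "*"}],
}

# The fix: scope the principal to something concrete (account ID invented).
fixed_policy = json.loads(json.dumps(open_policy))
fixed_policy["Statement"][0]["Principal"] = {
    "AWS": "arn:aws:iam::111122223333:root"
}

print(has_wildcard_principal(open_policy), has_wildcard_principal(fixed_policy))
```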


I just feel bad for the oncalls...


Looks to be affecting all regions - at least within the standard aws partition. Not sure about aws-cn and aws-us-gov
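For reference, the partition is the second field of every ARN, which is an easy way to see which world a resource lives in (the account IDs below are made up):

```python
# ARN format: arn:<partition>:<service>:<region>:<account>:<resource>
ARNS = {
    "aws":        "arn:aws:iam::111122223333:role/Example",
    "aws-us-gov": "arn:aws-us-gov:iam::111122223333:role/Example",
    "aws-cn":     "arn:aws-cn:iam::111122223333:role/Example",
}

def partition_of(arn):
    """Extract the partition from an ARN."""
    return arn.split(":")[1]

print(partition_of(ARNS["aws-us-gov"]))
```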


Not seeing any issues in aws-us-gov.


TIL of aws-us-gov


Usually referred to as GovCloud.

It has its own version of everything, physically segmented, even for global systems such as IAM.

You have to be a US Citizen to work on it and you need special security clearances.


You might be getting GovCloud and AWS Secret Region confused. I've been told that I can access GovCloud if I cross the border from Canada, despite not being a US Citizen. (This came up in the context of providing FreeBSD AMIs.)


I believe the secret region is just part of GovCloud. Afaik, GovCloud is just the marketing name for aws-us-gov


GovCloud is a separate partition from Secret. Different regulatory framework alignment and customer onboarding.

GovCloud customers only need be a US person or entity, beyond that any further regulatory alignment is up to the customer. AWS does not audit the IAM user base for nationality or any compliance requirements.

Disclaimer: I am an AWS Public Sector Solutions Architect.


Ah it looks to be aws-iso (c2s.ic.gov, top secret) and aws-iso-b (sc2s.sgov.gov, secret)?


So, just to clarify: if a "customer" who happens to employ tens of thousands of people also employs non-US citizens (such as the DoD's foreign national IT contractors), then non-US citizens would have access to GovCloud.


Secret region is not a part of govcloud.


IAM is a global service. So it kind of makes sense that all regions would be affected by an outage.


Well yes but it's entirely possible for it to be locally affected


IAM seems to be run out of Northern Virginia (other than GovCloud) [0]. However, they may be doing some magic under the covers to reduce latency across the globe. It would also explain why GovCloud isn't affected.

[0] - https://aws.amazon.com/about-aws/global-infrastructure/regio...


Yeah, certainly the existence of credentials etc. must be replicated to deal with cross-region connectivity issues, which I guess explains why authentication is still working fine.

GovCloud I believe is technically separated in this regard, I don't think IAM credentials can operate across partitions.

Though it does seem to be possible to link the existence of a standard AWS account with GovCloud, the same is not true of the China partition


Gov Cloud is an entirely separate setup, you can't share anything across the two.


IAM is a global service.


Off-topic: I hadn't heard of nitter.net before, it seems pretty cool.


Sorry to disappoint, but we've changed the URL from https://nitter.net/RyanGartin/status/1306352941964701696#m to the original source, as the site guidelines ask: https://news.ycombinator.com/newsguidelines.html.


Yep, sorry about that. Thanks for the update.


I like it a lot better than the current twitter experience, and it's open source.


It makes me sad to see these reminders of how fast websites are perfectly capable of being.


It's self-hostable, too


my site is down on all environments (us-east-1). oy vey


Is your site dependent on IAM?


Pretty hard to not be dependent on IAM. Authentication and authorization are some of the most core concepts you can have


Amazon says that "The issue continues to affect create, describe, modify or delete of IAM accounts and roles. [...] Authentication using IAM accounts and roles are not affected."

So it's entirely possible to depend on IAM but not in a way which this is breaking.
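One practical consequence: reads and auth keep working, so only the mutating calls need defensive handling. A toy sketch of riding it out with capped exponential backoff (the failing `flaky_create_role` stand-in below is invented for illustration, not a real AWS call):

```python
import random
import time

def with_backoff(op, attempts=5, base=0.05):
    """Call op(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return op()
        except RuntimeError:
            if attempt == attempts - 1:
                raise
            time.sleep(min(base * 2 ** attempt, 1.0) * random.random())

calls = {"n": 0}
def flaky_create_role():
    """Simulated IAM write that fails twice before the outage clears."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("IAM write unavailable")
    return "role-created"

result = with_backoff(flaky_create_role)
print(result)  # the third attempt succeeds
```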


This makes sense. We have services that exercise IAM/STS very, very routinely, but without changing IAM properties, and no alerts popped up during this outage.


For sure, that’s just not what the parent comment asked


I thought IAM authentication was working, but the create / remove / etc steps were down. This is a HUGE impact if all IAM role access is down globally for AWS.

Please confirm and I will work to spread the news - can I cite you as the source?


https://sqwok.im, it's entirely on aws services... iam is interwoven throughout :] rather than get frustrated, I just chalk it up to 2020, will go for a walk.. oh wait there's apocalyptic smoke outside!


https://sqwok.im/ just loaded for me.


sweeeet, thanks for the heads up


Loads fine for me an hour later. Nice site! I love the design and idea behind it.


hey thank you, I appreciate the feedback! I've been working hard to bring it to life and had just pushed out a new @mentions feature right when aws went down... feel free to "@guac" if you feel like trying it out and I'll hop into the conversation


[flagged]


Please don't take HN threads into regional flamewar.

https://news.ycombinator.com/newsguidelines.html


do you mean like multi-region? I'd like to get there, one of the aws services is only available in us-east-1 right now (iot related).


Which service? I'd also like to hear more about your multi-region journey (I work at AWS on the IoT services). Any way we can connect directly? You can message me on Twitter (@timmattison) if that works.


Hey Tim, thanks for reaching out! created a post sqwok.im/p/QqyVbKne4IiIdQ to connect at.


I was referring to

> will go for a walk.. oh wait there's apocalyptic smoke outside!


I can always rely on HN for finding out why AWS is broken... Is it time to switch to GCP?


No, sometimes you see GCP failures posted here too. Your best bet is to move to Oracle Cloud. I have never seen an Oracle Cloud (not Oracle Data Cloud, which is the ad-tech product) outage posted here, so I think they must have 100% uptime.


Sarcasm detected! There's always going to be pros and cons to every provider, so the obvious choice is host everything on my home theater PC since I never turn that thing off.


Please don't consider this a counterpoint, but I have had single physically deployed servers that have significantly higher average uptime than my ec2 instances in AWS us-east-1 in the last 5 years.

(significantly higher == 100% availability, it hasn't gone down... yet)


(Not considering as a counterpoint, just want to point this out for whomever.)

When (not if) it does go down, however long it takes to replace is downtime that has to be amortized over the entire X years, to get an accurate picture. If it takes > 260 minutes to detect, replace, and fully recover, then you've already blown the four nines budget you "saved up" for the past five years, when amortizing.
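The arithmetic behind that 260-minute figure:

```python
# Downtime budget for 99.99% availability ("four nines") over five years.
minutes_in_5_years = 5 * 365.25 * 24 * 60
budget = minutes_in_5_years * (1 - 0.9999)
print(round(budget))  # roughly 263 minutes
```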

Side-note from a recent twitter thread: If you're not rigorously monitoring availability (preferably with canaries/probes), it's probably much lower than you think. (Using the royal 'you' here, not pointing any fingers.)


I actually had a server with uptime of a decade+. Then my provider (Hetzner) decided they wanted to get rid of the hardware. They gave me fair warning and dumped it. I backed everything up and only restored my DNS and MX records from BIND (because I wanted to keep getting mail). Not the database, not the blog theme, nothing. One of these days I gotta do it.

Now I host everything on managed services. It was the right decision then, I was broke. Now, though, the cost of having to deal with it is higher.


That's not at all surprising because shutting down and starting new ec2 instances is expected to be a relatively common occurrence. High availability comes from a load balancer routing traffic to an instance that is currently available, not from having a single instance that just runs forever.



This doesn't apply. Having a server fail on you doesn't preclude you from posting to HN.


But it becomes uninteresting, which causes self-censorship. How many times did you hear about someone's server running for 12 minutes? How many times about server running for over a year? Nobody posts about the first one. (Unless it was a spectacular failure for some reason)

Similar issue comes with estimating how reliable things are. People are more likely to respond "I had an issue with X too, here's my story" rather than "all good, nothing to report".


That or IBM... A friend from high school also just set up what he calls "micro-cloud in the cloud". Haven't heard of any downtime yet, maybe that's a better bet...


You've never seen my basement's outages posted here either.


How often the "____ is down" threads show up on HN is a function of the popularity and mission-critical-ness of the service. I wouldn't use it as a metric for choosing a cloud provider unless you control for popularity.


Even "controlling for popularity", frequency of HN posts is a terrible metric. If you're in the position to be making a significant investment in a cloud provider, you can probably afford the actual market research on reliability and uptime from reputable sources.


If aws is too unstable for your needs you’re in for a major adventure on gcp


I said this a bit facetiously after working with an organization that had massive troubles with GCP downtime early on, and a friend who works on their infra team and gives me reports of how often things go wrong. I guess it didn't translate ¯\_(ツ)_/¯.


Why not both?


Azure. The web panel is a bazillion times better and they know AD... pretty well.


I think this is sarcasm.

But genuinely AzureAD is really good. Authentication and Authorization systems are difficult, AD has always been begrudgingly the most well rounded (yes, it has warts) but AzureAD really is nice.

I especially like that my local admin can delegate Enterprise Apps to me so I can create SAML/OAUTH2 SSO links between stuff we use without needing the keys to the kingdom.

I recently set up Enterprise Federation with GCP and it took less than a day. (compared with many months in my last company which used on-prem AD)


Does azure AD actually do active directory yet (LDAP and all)? Or is it still just a name?


All of our windows desktop machines use it as their login controller, but I’m not 100% sure of the implementation- it is likely there needs to be a local login server because from what I remember of Windows it’s using WINS to find a logon server, and that I think must be local.


It’s actually a new-ish “credential provider” system for logging into Windows. It’s not just Azure AD either, Gsuite can be used for logging into and managing Windows now. There is no longer a need for a local domain controller either.

https://docs.microsoft.com/en-us/windows/win32/secauthn/cred...


Oof, WINS is some super antique stuff. The (traditional) process to find and select a domain controller is known as "DC Locator"--or else search for DsGetDc--and is essentially DNS-based.

Caveat, I've been out of the Windows/AD game for 7+ years!


For systems expecting LDAP and Kerberos you can use either AAD Domain Services (a read-only DCaaS) or an on-premises AD controller.


It’s definitely not AD and isn’t compatible with many things that you’d expect. They just repurposed the name


Let's all move to serverless!


Legit not sure if you're being sarcastic or not.


Maybe your sarcasm detector is broken.


I'm not a robot, I'm a person, please speak to me as anyone you'd meet in person.


It certainly is. And I came in this morning specifically to create some Lambda roles to test. Fark.


I noticed (pre outage) IAM console won't work at all if I --disable-reading-from-canvas in my launch args to prevent fingerprinting. All the other service consoles I use work. I have to have a special config for my browser just for AWS because of it. Wishful thinking, but maybe they're fixing that just for me.



