Cloudflare Network Performance Issues (cloudflarestatus.com)
631 points by drcongo on July 2, 2019 | 310 comments



Cloudflare: Your status page showed "all systems operational" for over 20 minutes while your primary domain was returning a 502 error. Please change this to update automatically; many other engineering teams depend on you.

https://i.imgur.com/qHBM2JW.png


Sadly this reminds me of AWS outages too, where the same applies. How is it that hundreds of developers know there's an issue before AWS do, or Cloudflare in this instance? See my blog post on similar AWS uptime reporting issues at https://www.ably.io/blog/honest-status-reporting-aws-service.

At Ably, our status site had an incident update about Cloudflare issues being worked on (by routing away from CF) before Cloudflare did: https://status.ably.io/incidents/647

We have machine-generated incidents created automatically when error rates increase beyond a certain point, stating "Our automated systems have detected a fault, we've been alerted and looking at it". See https://status.ably.io/incidents/569 for example. I think much larger companies like Cloudflare and Amazon could certainly invest a bit in similar systems to make it easier for their customers to know where the problem likely lies.
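
The automation itself doesn't have to be elaborate. A minimal sketch of the idea, assuming a pre-computed error-rate metric and a generic status-page API; the endpoint and payload below are purely illustrative, not our actual setup:

    import requests

    ERROR_RATE_THRESHOLD = 0.05                                # e.g. 5% of requests failing
    STATUS_API = "https://status.example.com/api/incidents"    # hypothetical endpoint

    def maybe_open_incident(error_rate, api_token):
        """Open an automated incident once the error rate crosses the threshold."""
        if error_rate < ERROR_RATE_THRESHOLD:
            return
        requests.post(
            STATUS_API,
            headers={"Authorization": f"Bearer {api_token}"},
            json={
                "name": "Automated fault detection",
                "status": "investigating",
                "body": "Our automated systems have detected a fault, "
                        "we've been alerted and are looking at it.",
            },
            timeout=10,
        )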


Heh, I am reminded of when the control plane at AWS went down... and we had a custom autoscaling config that would query for the number of instances running and scale appropriately... but when the AWS API died... we kept getting zero running instances...

So our system thought none were running and so it kept launching instances....
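
The failure mode looks roughly like this; a minimal sketch assuming boto3, not our actual config. The fix is to treat a control-plane failure as "unknown" rather than "zero instances running":

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    ec2 = boto3.client("ec2")

    def running_instance_count():
        """Count running instances; return None (not 0) if the control plane is unreachable."""
        try:
            paginator = ec2.get_paginator("describe_instances")
            pages = paginator.paginate(
                Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
            )
            return sum(
                len(res["Instances"]) for page in pages for res in page["Reservations"]
            )
        except (BotoCoreError, ClientError):
            # The bug described above: treating an API failure as "zero instances"
            # makes the scaler launch replacements for machines that are still fine.
            return None

    count = running_instance_count()
    if count is None:
        pass  # control plane is down: hold steady instead of scaling up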

These were SPOT instances and thus only cost like .10 per hour...

But we launched like 2500 instances which all needed to slurp down their DB and config - so it overloaded all other control plane systems...

We had to reboot the entire system. Which took forever.

The only good thing was this happened at 11am - so all team members were online and available... and then AWS refunded all costs.

---

The other fun time was when a newbie dev checked in AWS creds to git - but he created the 201st repo (we had only paid for 200) -- and as it was the next repo which wasn't paid for, it was by default public - thus slurped up by bots asap - which then used the AWS creds to launch bitcoin mining bots in every single region around the globe. Like 1700 instances.

The thing that sucked about that was it happened at like 3am and we had to rally on that one pretty fast. AWS still refunded all costs...


> but he created the 201st repo (we had only paid for 200)

That's an odd choice of a failure mode.

> AWS still refunded all costs...

Yeah they should. It was their silly design choice that led to disclosure of secrets after all.

What kind of failure mode is that, even? Failing to create the repo would have led to a better user experience for sure.

Can you imagine if S3 charged more for private objects and once you reach your count, it just makes them public and posts them on Reddit?


>Can you imagine if S3 charged more for private objects and once you reach your count, it just makes them public and posts them on Reddit?

Exactly! What a stupid UX design.


>It was their silly design choice that led to disclosure of secrets after all.

Wait, was the 200 private repos issue an AWS thing or a GitHub/GitLab/whatever thing?

What AWS product has a concept of private/public repos and limits on how many of the former you can get for a certain price?


It was a git thing.

Never post aws secrets to git.


>How is it that hundreds of developers know there's an issue before AWS do

Trust me, they know.

They know about problems we never find out about, too.


At least in the case with AWS, unfortunately there's business involved - because of their uptime guarantee, incidents that would be called downtime by a purely technical team are left as "operational" or "partly degraded". Otherwise, they might have to shell out millions or tens of millions.


You have to provide "your request logs that document the errors and corroborate your claimed outage" for the AWS Compute SLA https://aws.amazon.com/compute/sla/


slightly off-topic: Nice and clean, yet detailed enough status page. I like it. :)


Thanks :)


On the OT subject, the status logo is blurry (at least on a 4k monitor), whereas on the main website it's nice and crisp.


Ok, thanks for the heads up. Will raise a status website issue.


Are you sure it wasn't cached? For us it was showing the incident correctly...


If cloudflarestatus.com isn't serving text content as Cache-Control: no-cache, that's itself an egregious bug.


It serves Cache-Control: max-age=0, private, must-revalidate. Same thing as far as I understand it, maybe with wider support.
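
It's easy to check from a client, assuming the requests library (the exact value could of course differ per page or deployment):

    import requests

    resp = requests.head("https://www.cloudflarestatus.com/", timeout=10)
    print(resp.headers.get("Cache-Control"))
    # e.g. "max-age=0, private, must-revalidate": revalidate before every reuse,
    # and "private" keeps shared caches (like a CDN in front) from storing it at all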


Their status page is hosted on Cloudfront, perhaps adding an unintended caching layer.


Is it? From my end it looks like a plain ec2 instance from statuspage.io.


Cloudfront supports Cache-Control: no-cache


Cloudfront itself can cache the content regardless. (if its headers are showing HIT)


It might have been cached on the server but was not cached on my clients: I did a forced refresh and used a private browser window but it only showed the “almost everything is good” view throughout the duration of the incident.

The actual incident page was correct.


Can confirm that we had the same view. Operational while not working.


All I saw was a vague message at the top, all services were marked green and operational throughout.


They should derive whether something is down from how much traffic the status page gets, and have alerts tied to that as well. I think it would be pretty accurate.


We saw an update within 5 minutes of our sites going down. Are you doing caching somewhere?


Interesting, I was command+shift+R refreshing and also tried from a VPN in another region. Perhaps our CF-hosted sites returned 502s in my region sooner than yours, causing me to check the status page sooner.


Seems unlikely; I don't use Cloudflare and have never visited the status page before. I noticed several 502 pages this morning, searched for "cloudflare status page", and saw the "all systems operational".


It isn't in their best interest to have a reliable status page. They want the status page to contain information about a failure only after people know about issues through other means. For the same reason, cloud providers don't provide rebates on outages unless you ask for them explicitly.

I think the only way is to have the status page operated by an independent third party, but I don't think there's a viable business model for someone to provide such a service. Perhaps there might even be a risk of lawsuits against you.


It was showing correctly for me the first time I checked it.


Why even have a status page if it doesn't reflect reality?


Once cloudflare.com came back I decided to check out their business SLA, and it's not very encouraging:

> For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes

- https://www.cloudflare.com/business-sla/

So assuming an outage affects 100% of your users (this one seems like it did, but that's not clear), they only refund the time the service was offline? According to pingdom this outage lasted ~25 minutes, so that's 25/(31 * 24 * 60) = .056% of our bill, roughly 11 cents.
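
Spelling out the formula from the SLA, with the $200/month Business plan assumed purely for illustration (which matches the ~11 cents above):

    # Service Credit = (Outage Period minutes * Affected Customer Ratio)
    #                  / Scheduled Availability minutes, applied to the monthly fee
    outage_minutes = 25
    affected_customer_ratio = 1.0        # assume the outage hit 100% of users
    scheduled_minutes = 31 * 24 * 60     # a 31-day month
    monthly_fee = 200.00                 # Business plan, purely for illustration

    credit_fraction = outage_minutes * affected_customer_ratio / scheduled_minutes
    print(f"{credit_fraction:.4%} of the bill = ${credit_fraction * monthly_fee:.2f}")
    # 0.0560% of the bill = $0.11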

It sounds like you just don't pay for the time the service didn't work, which isn't much of a guarantee; that's just expected (of course you shouldn't pay for services not provided). Most SLAs for critical services have something like: under 99.99% uptime you get 10% of your bill back, under 99.5% you get 20% back, under 99% you get 50% back. (*Numbers completely made up to demonstrate the concept.)

Am I misreading this? Morning coffee hasn't kicked in yet so maybe I am.


This doesn't surprise me at all - SLAs are widely overrated. No SLA will cover damages incurred by lost business due to an outage. What you likely want is some kind of third-party insurance for downtime caused by outages out of your control - but I'm not even sure this exists.


I guess you’ve never worked in enterprise. SLAs for critical systems very frequently incur payback in excess of the billing on outages.


But not consequential losses which is what the parent mentions.


I'm definitely not suggesting CF should cover losses. Sorry if I gave that impression. That would effectively require them to be an insurance company since they'd have to investigate claims, and possibly charge customers differently based on risk. (i.e. you don't want to bill a customer $200 per month if 10 minutes of downtime could lose $20 million in sales.)

I mean something like Amazon EC2's SLA (https://aws.amazon.com/compute/sla/) where credits are proportional to downtime, but not 1:1. i.e. they credit 100% for >= 5% downtime. With Cloudflare's SLA, 5% downtime (1.5 days in a month) would only give you a 5% credit.


There are different types of contingent business interruption insurance available. If you're large enough (like Fortune 500), you can negotiate the terms of your policy.


This does exist. It’s just a matter of who pays for it.


Also the standard SLA you get will be wildly different from the bespoke contracts negotiated by enterprises. Just depends on your spend.


This type of insurance does exist. Speak to your broker.


They reserve the best SLA for Enterprise, naturally.

https://www.cloudflare.com/plans/enterprise/

> 100% uptime and 25x Enterprise SLA

> In the rare event of downtime, Enterprise customers receive a 25x credit against the monthly fee, in proportion to the respective disruption and affected customer ratio.


Two years of missing revenue for enterprise customers; ouch, that's going to hurt.


25x the downtime, so about 10 hours' worth of charges if the downtime was 25 minutes.


you missed the:

>in proportion to the respective disruption and affected customer ratio.

so no, they aren't losing 2 years of revenue


That's the business SLA. The Enterprise one is 100% uptime with a 25x payback so those are the ones they are focusing on keeping up. We were with Verizon before CloudFlare and their SLA was a similar pay back for outage. I think this is pretty typical, what service did you have that had a different setup for the SLA?


Amazon EC2 has an SLA like I mentioned: https://aws.amazon.com/compute/sla/


AWS CloudFront SLA refund/credit policy appears to work as you describe: https://aws.amazon.com/cloudfront/sla/

Monthly Uptime:

   * 99.0% <= uptime < 99.9%:  10% service credit
   * 95.0% <= uptime < 99.0%:  25% service credit
   *  0.0% <= uptime < 95.0%: 100% service credit
But CloudFlare's Enterprise SLA (25x credit) is similar or maybe even a little bit better (because you get to 100% at 96% instead of 95%). Of course when you are doing an Enterprise deal you can negotiate for whatever terms are mutually acceptable as long as you're willing to pay.

In any case, the function of the credit policy is to ensure there is enough pain for the provider to put in place the quality / reliability practices, process and code to protect themselves from losses. IMO most sustainable businesses pull in much more revenue per hour than 25x their CDN cost.

It would be interesting to know how CloudFlare's infra and processes differentiate free, Business and Enterprise customers.


25/(31 * 24 * 60) == 0.056%, not 5.6%.


Fixed, thanks. Like I said about the coffee. :)


That may be the public SLA for people who create an account without private contracts. Large businesses never use the standard SLA terms.


I left Cloudflare for AWS a long time ago despite CF's affordability since they didn't seem to care that much about uptime or quality. Their frontend was corrupting response bodies + caching response bodies (in retrospect, this was probably pre-discovery cloudbleed) and there was no way to get a response or help with it.


not sure what cloudflare costs but compared to the amount our business lost in those 30min, it probably does not matter much.


Throughout this outage, https://www.cloudflarestatus.com/ continued to show green, with almost all services marked 'operational' and some vague, cryptic message about users in this region being affected:

Investigating - Cloudflare is observing network performance issues. Customers may be experiencing 502 errors while accessing sites on Cloudflare. We are working to mitigate impact to Internet users in this region.

It seemed very much like a global outage affecting all services. Is this status page not automatically updated with service status, or is it just manually updated by humans? Even if manually updated, surely when posting that status message, the status of all the services should be set to degraded?


This is not my experience, I have received many updates both through email and updates to the cloudflare status page throughout the incident, except for possibly the first 10 minutes


You're probably talking about the yellow note at the top of the page (which is still there, with a little more detail now). That was updated and is fine.

I'm talking about the service indicators for each service lower down the page, which remained green throughout and appeared to be just decoration, not an actual indication of service status; they all said operational throughout the incident (I reloaded a few times).

In particular I'm thinking of the Cloudflare Sites and Services section.


A great example of why you shouldn't transfer your domain to Cloudflare Registrar if you're also using their CDN. Those who have transferred their domains cannot change DNS servers to mitigate the outage.


You can setup your domains using their CNAME method. You do not have to delegate your entire domain to them. https://support.cloudflare.com/hc/en-us/articles/36002061511...

Together with a short TTL we were able to recover without relying on their dashboards.


Only if you are a paid customer. The free service does not allow this, and this is why I do not use Cloudflare for personal use.


For personal use you also probably don't need the high availability of switching over your domain the moment they are having problems?


One thing: not every free or Pro plan on Cloudflare is personal use.

I'm running the web servers, official wiki, and external game resource portal for the most active open source video game on GitHub through Cloudflare, and we might not want our 60-million-requests-a-month website to go down when Cloudflare does.

Because I can tell you right now our $300-a-month budget (which, mind you, covers 7 game servers that can each handle 100 connected players) can't take the $80 hit just to make Cloudflare not a single point of failure.


If you don’t count this as a personal or hobby project it’s a community project. They don’t have a pricing plan for that so you either have to go pro or go to some other provider who gives you this much for free. Why should a company give you even more pro features for free if you are already getting a lot of things for free?


Pro ($20) does not give you the CNAME feature; Business ($100) does.


He's not arguing that they should; he's just pointing out the need for your own plan B.


Ah, used to be an Enterprise only feature. We have access to that, but didn't realise it's now available to Business. Perhaps time for them to consider offering it more widely!


No good if your Registrar (Namecheap) is behind CF as well!


That would seem like a solid reason to avoid Namecheap.

Edit: Wait, do they use Cloudflare?

  $ dig namecheap.com +short
  198.54.117.250
whois:

  CIDR:           198.54.112.0/20
  NetName:        NAMEC-4
  Organization:   Namecheap, Inc. (NAMEC-4)
  Updated:        2015-11-13
https://whois.arin.net/rest/net/NET-198-54-112-0-1.html


They don't seem to use CF's nameservers, but try the www. subdomain:

    $ kdig +short www.namecheap.com
    www.namecheap.com.cdn.cloudflare.net.
    104.16.99.56
    104.16.100.56


I got a Cloudflare captcha just yesterday when logging in to Namecheap.


Changing your NS records at the registry could help, but keep in mind most TLDs are serving NS records with 1-2 day TTLs, so you'll still see a lot of traffic going to the old server.

If this is something you want to be able to mitigate, you really need to be running a separate DNS infra from your hosting/CDN and use short-TTL CNAMEs to delegate hostnames to the CDN. This becomes a big challenge if you host on an apex domain (e.g. example.org instead of www.example.org), so don't do that.


I use namecheap to manage my name servers. It's also down.


Yeah, I'm regretting that very choice now.


The problem is not really just about using Cloudflare as your registrar. Even if you have a different registrar, the Cloudflare model is to have the NS (nameserver) records set up to point to Cloudflare, which then in turn resolves the DNS. You cannot really use Cloudflare without that setup. Changes to nameservers at a registrar level are rarely quick, at least not quick enough to mitigate a disaster like this. It's why we've used two completely different domains at Ably (ably.io and ably-realtime.com) for all services we provide.

We wrote about a strategy to circumvent this sort of thing a little while back https://www.ably.io/blog/routing-around-single-point-of-fail.... Given two incidents in a matter of weeks, I think a revisit of that article in light of most businesses who operate on a single domain would be useful :)


We were able to recover by submitting an API call rather than going through their UI; it was slow, but the API requests worked.


I'm guessing that they're going to be back up again before those changes can "propagate", given the relatively high TTLs most TLDs run with.


Really good point. I debated switching, and this exact reason is why I didn't. It's one thing I didn't want to have in the same bucket.


They have now released an initial statement [1]:

For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.

This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.

[1] https://blog.cloudflare.com/cloudflare-outage/


Ahh the "we test in prod" method


Yeah, except that wasn't meant to be the way things work.


> Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide.

Sounds like backtracking. If so, I'll bet there's a conversation happening about switching to re2.
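
Catastrophic backtracking is easy to reproduce; a toy example (not Cloudflare's actual rule) with nested quantifiers and an input that almost matches:

    import re, time

    pattern = re.compile(r"^(a+)+$")   # nested quantifiers: the classic trap
    subject = "a" * 24 + "b"           # almost matches, so the engine backtracks

    start = time.time()
    assert pattern.match(subject) is None   # rejected, but only after ~2^23 split attempts
    print(f"{time.time() - start:.1f}s to reject {len(subject)} characters")
    # an RE2-style engine rejects the same input in linear time, since it never backtracks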

edit: hmmm, https://github.com/cloudflare/lua-re2

I'm also curious why this rollout wasn't staged.


Sorry, I would have posted this myself but was too busy.


It will be interesting to hear later whether the process was followed. Hard to believe they deploy changes to every production location at once.


Good.

I'm tired of people not learning that trusting a single gateway with 50% of the internet is bad.

Yes, I know, free DDOS protection. There has to be another way of doing this, some mesh based DDOS protection or so.


I agree that the centralization of the internet is troubling, but Cloudflare is solving a systemic problem that nobody else is tackling. The DDoS problem was not being solved for anyone except for enterprise customers until Cloudflare came along and to this day there is very little competition in this space. Your post makes it sound like making a "mesh based DDoS" system is somehow trivial. Who is going to pay for this? How does it work? How do you ensure latency is not atrocious? Why hasn't someone made this already? Cloudflare at least has a financial model that can be sustained, and it doesn't include harvesting all of our personal data.

Without CF, many websites would not stay on-line during an attack. And they would cease to exist because many of those places would never be able to afford DDoS protection. I know so many sites, including ones I run, that I would not be able to keep on the public internet without CF DDoS protection. There really is no real competition in this space.

I think we need to consider the fact that while this outage does take a lot of sites off-line at once, it is temporary, and it is still extremely rare. And the alternative is potentially that many websites would cease to exist at all, period, without something like Cloudflare existing.


> Your post makes it sound like making a "mesh based DDoS" system is somehow trivial.

It does not. "There has to be" ends in "!" and is an expression of a wish.

> Who is going to pay for this?

All of us. Everyone. I keep circling back to the idea that we should all join a protection ring, so we'd all share the "cost" of this.

> How do you ensure latency is not atrocious?

> Why hasn't someone made this already?

First one is very technical, and the answer is most probably by localisation and by only actually turning it on when needed.

For the second: because it's a very hard technical problem and there is basically no money in it. Business value maybe, but it would need people and companies to collaborate, it would probably need committee level decisions, and so far, nobody wanted to deal with this.

Or at least that's my theory.

EDIT Maybe dat:// will eventually become a viable option, and with that, due to the distributed nature, this kind of DDOS protection is sort of built in.


This seems like a case for putting more of the internet through a single gateway. Having my downtime correlated with everyone else's means users will be more forgiving, because they'll perceive it as "the Internet's down" rather than "lkbm's site is broken". (We saw this with CloudFlare. Some users were pissed, and others jumped in with "it's not their fault; AWS is down". That doesn't happen when our stuff specifically goes down.)

We need to decentralize the Internet, but this occurrence is not the reason. It's an argument to keep consolidating.


Centralizing more of the Internet to help people align their excuses is probably the worst reasoning I've ever read


But maybe this is how "the collective" works?


It's a horribly short-sighted, irresponsible and dangerous attitude. When your service goes down on its own, its users can switch to some backup process temporarily. If half the internet goes down, they are screwed. How much they're screwed depends on how they're using your services at the moment, which you most likely can't even know.


Nobody Gets Fired For Buying IBM.


> Nobody Gets Fired For Buying IBM...

... mainframes for business critical applications.

As for buying IBM Cloud services, it's a bit fuzzier.


Who cares whose fault it is? In the end, you're not able to provide service to your users and you potentially lose money.


Business risk cares. SLAs, SLOs, SLIs: they are all about this, being able to direct the blame.


The argument being made above only holds water if Cloudflare has worse uptime than smaller providers, and if that's because they're big.

This is often the case when monopolies realize they're in a position where they can get away with sucking, but I don't believe that to be the case with Cloudflare yet.


That's a strange perspective. Users only care that a service works and there is a long history of them blaming whatever is most visible.

If your site is down, it's your fault in their eyes, not some core infrastructure provider's.


CYA buddy, CYA

(sarcasm!)


I get your point. But I'm not sure I can ever grok the mindset of someone who thinks, "good, that'll teach 'em". Not sure I'd ever want to work with that individual.


For the kind of coworker I’d like, a grumpy old cynic beats the peppy corporate cheerleaders any day of the week.


I feel like that's a false dichotomy.


Fair point. I'm wondering though, how else you make people understand something like this?


Education and positive reinforcement.

Teach people the right way to do things, and then make that behavior self-rewarding.


What if that behavior isn't self-rewarding and is actually expensive? Not using Cloudflare, Google Maps, Analytics et al. means you need to use something else, need to spend attention points somewhere, need to pay for the services. Very few people will do that because "it's the right thing to do".


I didn't say it was easy or obvious. If it were, I would be out of a job.


This is important to recognize. Victim blaming is a highly destructive practice.


Assigning blame for technical business decisions to the people who made those decisions is victim blaming?


Targeting the process or behavior that led to an event is far more effective than targeting the person that triggered it. Do not conflate the two.


I don't think there is another cheap and easy-to-administer way, or more people would be doing that. Also CF is nice for the slight protection it provides against obvious bots and mass hacking attempts. And for decreasing requests for static page resources. And it also hides your IP somewhat (as long as your website doesn't leak its own IP, i.e. not sending emails except through Received-path-scrubbing services), meaning if you're careful you can run a website from your home on a Raspberry Pi without any issues.


I agree in principle, but as an end user, I must selfishly disagree. I find life after Cloudflare to be better than before it.

There's a Cloudflare location right next to me, and that improves latency across many of the websites I use, and makes their public DNS service the fastest.

I just wish they had active competition in the same space. CDNs have network locations nearby, but they don't offer an easy UX for relatively inexperienced website owners, and DDoS protection services usually have fewer network locations than content CDNs.


With a lot of these things, if people can't agree on a standard system that does all of it just as well, then there's going to be a big company that does so. We have had this with e-mail, signed exchanges, and DDoS protection, and will probably have it with many other things unless people pull themselves together, at least after these proprietary solutions are created, and create better alternatives.


Their free CDN is my biggest draw to their service: I can handle millions of requests per day with a $5 VM, sane caching headers, unoptimized code, and Cloudflare free tier.


It is slightly ironic that the best defence for a distributed problem is to build a single point of failure.


9.6% not 50%


9.6% including providers like digitalocean. I'm skeptical about that 9.6%.


So it seems... Even https://cloudflare.com/ itself is down


More importantly, their admin dashboard is down. It's impossible to bypass their "orange cloud" proxies and send traffic directly to our hosting. That they can't flip a switch and have their nameservers send dash.cloudflare.com to a separate piece of redundant infrastructure is mind-boggling.


We were able to flip this switch on our services through their API, as we have our Cloudflare config in Terraform.

The API wasn't working perfectly, but with some retries we were able to change the config for our domains.
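
For anyone without their config in Terraform, the same switch can be flipped per DNS record; a rough sketch assuming the v4 API's PATCH endpoint for partial record updates, with placeholder zone/record IDs and token:

    import requests

    API = "https://api.cloudflare.com/client/v4"
    headers = {"Authorization": "Bearer <api-token>"}   # placeholder token with DNS edit rights

    zone_id = "<zone-id>"       # placeholder; list zones with GET {API}/zones
    record_id = "<record-id>"   # placeholder; list records with GET {API}/zones/{zone_id}/dns_records

    # Turn off the orange-cloud proxy for one record so traffic goes straight to the origin
    resp = requests.patch(
        f"{API}/zones/{zone_id}/dns_records/{record_id}",
        headers=headers,
        json={"proxied": False},
        timeout=30,
    )
    resp.raise_for_status()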


change the nameserver at the registrar to someone else


By the time that change propagates, Cloudflare will be back up


Yes, but then next time you will be able to control your DNS.


Though, only if you're using a short TTL. I'm not arguing against your position in general though.


Agreed. The first thing I tried to do was log in to Cloudflare to disable proxying. But I can't.


Yep, I'm seeing 502's all over the web as of a few minutes ago.


Great, now we can't even disable HTTP proxy to allow traffic directly to AWS :/


Point your DNS back to your actual host? Might be the only short term solution, though DNS propagation times kinda rule that out :/


That's a no-go if you transferred your domains to Cloudflare Registrar.


Something something all your eggs in one basket :)


Seems like even other registrars might rely on Cloudflare (e.g. Namecheap) so now people have to continuously ensure there’s no cross-pollination between their infra providers...


I think the only option here would be to change our name servers at registrar level to point to AWS and recreate all DNS records there, but then you have to deal with name server propagation.


We use our own service.


EDIT: the static pages load but trying to log in just times out. I think the static cache is back up but the rest of it is still down.

Seems to be back up and running for me.


So cloudflare.com runs on Cloudflare?


Turtles all the way down.

But in all frankness, if Cloudflare's own site did not run off of Cloudflare's infrastructure, why would anybody trust them with their websites?


Ironic isn't it.


For the site itself that's totally fine, and the status page is separate as it should be, too.


This is fine-ish...

In case of network issues affecting their proxy, being able to change your configuration to allow direct traffic would be really nice.


Fair. Some service recently had their status page hosted on the same server that went down, that was funny.


I guess it's like diversifying one's assets - you put your status page elsewhere than on your own infrastructure.


Discord's status page is also down haha


I'm quite surprised Discord's status page is behind Cloudflare. I thought they were using statuspage.io.


They are, but that subdomain is proxied through Cloudflare. If they'd set it to just DNS then it would have still worked.


Certainly what I would use if I was building cloudflare.com...



That number matches what I am seeing on StatusGator: Of the 438 status pages we monitor, 52 of them are showing some kind of warn or down notice right now. That's almost 12%.

Though some of them might not be because of Cloudflare, the ones I spot checked all do appear related. Medium, DigitalOcean, Shopify, CodeShip, Pingdom, and many more. The impact is staggering.


Add to that all the sites using resources hosted on Cloudflare's CDN.


9.6% of websites that have one of those CDNs


Aah, I misread; so 7.5% of all websites


How about a weighted percentage of the Alexa top 1m?


Update - Cloudflare has implemented a fix for this issue and is currently monitoring the results.

Description: Major outage impacted all Cloudflare services globally. We saw a massive spike in CPU that caused primary and secondary systems to fall over. We shut down the process that was causing the CPU spike. Service restored to normal within ~30 minutes. We’re now investigating the root cause of what happened.


https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr

(edit: thank you mods for changing the submission URL)


oh wow. that's a long list.


aka the whole world.


Don't want to go off topic, but if I want to prevent my website going down because of stuff like this in the future, will having backup DNS entries solve the problem?

I know DNS will fall back if it can't reach a service, but would a 502 trigger that?


Yeah, you'll want to keep your domain registered through a different registrar, and if CF goes down you can update your DNS name servers to point from CF to something like AWS Route 53.

This has a few drawbacks like making sure your Route53 configuration is identical to your CF config, ensuring your origin servers can cope with the additional load if CF caching isn't available and the DNS propagation time required for the Name Servers to update.

During the last outage, we were able to get into the CF dashboard and simply disable the proxy, which allowed our clients to access our origin servers directly, but this time we can't even get into the dashboard.


Yeah, if I had access to the DNS records this would be easy, but like you said, even the dashboard is down.

Ideally, I'd want something where if Cloudflare goes down, I don't have to change anything, but... 502 isn't going to trigger that without some work on my part.

Meh.


If you received an HTTP 502 then DNS must've already resolved. Browsers typically will do a DNS lookup and then try establishing a TCP connection to one of the returned hosts. It's only if it can't establish a TCP connection to a host that it will (sometimes) try another host from the DNS response.


No, you'd need to use an application-aware DNS provider like Route 53 that could detect the failure.

Backup DNS entries won't help; what you need is to use multiple DNS providers and add them all to your NS records.


I think this is the correct answer.

Since I'm mostly thinking about static sites, I could also have something local that pings the site, and if it goes down, it could update my nameservers to point elsewhere on its own.

Probably more trouble than it's worth though.


This is a good question.

I'm not sure if the browser receives all A/AAAA records from the syscall, or just one. I guess that if the browser has the whole list, and the error is in the 500 range, it could retry a different IP but I'm not sure if browsers do this.


DNS isn't handled by the kernel; it's handled by the network library runtime, and that does return a list of addresses (I think it actually has you iterate through them in C anyway).
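
Right: getaddrinfo returns the whole list and leaves retry policy to the caller. A small sketch of cycling through it using Python's wrapper of the same call (note this only helps for connection failures, not for a 502 served over a working connection):

    import socket

    def connect_any(host, port=443):
        """Try each address getaddrinfo returns until one accepts a TCP connection."""
        last_err = None
        for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM
        ):
            try:
                return socket.create_connection(sockaddr[:2], timeout=5)
            except OSError as err:
                last_err = err   # connection-level failure: try the next address
        raise last_err

    # An HTTP 502 is a successful TCP connection plus an error response,
    # which this loop won't retry; that part needs application-level logic.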


What would solve it is having your DNS records use a low TTL and then changing the DNS to point to not-Cloudflare. It would still be down for some users as long as the old (Cloudflare) DNS is cached, though.


Make sure that DNS provider offers an API and that none of their infrastructure is hosted on CF. :D


A problem with this, though, is that some registrars take hours to propagate; by the time you have it switched, the issue will likely already have been resolved. If you spread that across hundreds of customers, you'd have a bad time.


No, I don't believe it would.


No, DNS doesn't work like this; you can't have backup records.


Not necessarily backup records, but you can add multiple A/AAAA records. There is no guaranteed order though.

It is possible to have multiple A/AAAA records pointing to different loadbalancers, but I don't know how browsers would deal with this.


Browsers would deal with it just fine (assuming the site is down hard and not responding with errors). It's some of the API tools and old libraries that may not. They would need retry logic that mimics the browser cycling through multiple A records. OTOH, API tools that have retry logic would just keep trying until the errors clear up. A browser will stop retrying when something responds, unless there was javascript running in memory that had retry logic.

If there are errors, the site would need to be modified to not respond if broken and unable to proxy to a working origin. Perhaps CF have not coded their proxies in this manner.


The lookup is essentially random. If you point DNS to 2 IPs and one of those goes down, then (without going into detail) half of your requests will fail.


Yes, that is correct.

But it does make it possible for a browser to retry a different IP if a 500-range error or timeout occurs.

But like I said in my other post: I don't know if browsers actually do this.


I just changed a setting in my CloudFlare account... did I break everything?


Oi! Change it back!


Well, they can’t now :p


Touché.

On one hand I think "Maybe I should diversify my infrastructure."

And on the other I think "But one of the biggest upsells was convenience."

And it's fortunate I don't have a third hand, because I'd be thinking "Oh crap oh crap I just migrated a client website to LightSail + CloudFlare saying how super awesome and robust it would be."

But it's okay now because it looks like everything is back up!


No


After trolling Verizon, I'm curious what reasons they will come up with for this outage.


Maybe this is payback by Verizon...


Care to elaborate how they were "trolling" Verizon?



That didn't seem like trolling – just a public call for Verizon to follow internet best practices. Given that most large ISPs treat failures as a PR exercise, that's probably necessary.


Agreed.

They reached out to Verizon privately, a Tier 1 carrier with expectations and responsibilities as a good netizen, and got no response.

They attempted to reach out through Verizon's public forms of communication and got a bullshit irrelevant CS response despite requesting escalation.

They then called out Verizon before the community as a whole.

They don't have the luxury of waiting for a well prepared letter from some Verizon lawyers. Modern day customer expectations don't allow for it. You may call it trolling, but all I saw was a company asking another company to stop pissing in the public pool.


So what should the ideal redundancy plan be here? If you can't log into the CDN provider and they are down, do you just have a second one ready (and paid for) and then log into your registrar and be ready to switch to that secondary CDN provider in this scenario? Or is there some sort of load balancing / routing solution between CDNs that I don't know about / understand?


If you use Cloudflare nameservers, you have to change to new nameservers, wait for that to propagate, and then wait for clients' cached record TTLs to expire. So it will be a major disruption no matter what you do.


What about the API? If that is up, you could script it.


Can you use your own nameservers that delegate to Cloudflare or Akamai 50% each, then adjust? Is there a service more suited for this than R53?


If you're using them for TLS certs then it's an even bigger problem unless you have them provisioned elsewhere.


Unless you need EV you can just pull some wildcards from Let's Encrypt (as long as you don't use pubkey pinning). No need to automate as it's just a one-off.


The Cloudflare DNS-over-HTTPS resolver was serving up 502 errors as well, though the standard port 53 UDP resolver was working. This event definitely made me regret choosing Cloudflare as my sole DoH server.


Hear hear Mozilla!


Either downforeveryoneorjustme.com is itself served by Cloudflare too, or has been hugged to death.


Yep, we are 100% on CF workers :)

Here is a quick image of the peak downtime on downforeveryoneorjustme.com:

https://ibb.co/PZ9BMRc


Thanks for the fantastic work on the service. And thanks for sharing the stats!


It returns a "502 Bad Gateway" from cloudflare, so I assume it's not the HN hug...


Also, https://downdetector.co.uk/problems/cloudflare is currently detecting the outage in very direct fashion!


Not impressed it's serving an error page claiming the underlying host is to blame (this one from discord)

Error 502 Bad gateway

You - Browser - Working

Sydney - Cloudflare - Working

storage.googleapis.com - Host - Error

What happened? The web server reported a bad gateway error.


Seems fitting that Cloudflare spoke so aggressively against Verizon[0][1] last week and then this incident happens to them. I will be interested to read the postmortem on this situation. I really like Cloudflare, but you should be careful not to jinx yourself with blog posts like that.

[0]: https://blog.cloudflare.com/the-deep-dive-into-how-verizon-a... [1]: https://web.archive.org/web/20190628223129/https://blog.clou...


When you can't update your DNS because your registrar uses CloudFlare also....


You could use a dedicated dns provider, such as AWS


Any workarounds or solutions? I'm an on-call engineer with lots of questions coming in. I'm not sure what I can do apart from moving the domain off Cloudflare, but DNS propagation would take a few hours and by then Cloudflare might be up again.


Outages can always happen; when they do with companies like this, at least you'll know that some of the best people out there are working on the issue and that it will be resolved asap.

CloudFlare has proven in the past to be a very capable party; I don't think panicking now and trying to move everything away is a smart move. Also, a few people have been saying that even if you want to, the site to do so is not reachable, so that would be a challenge as well.


Panicking was not the plan. Asking for advice was.


My suggestion is wait. I wouldn't even consider flipping my sites over to another DNS unless the outage lasts over four hours or so. A lot of top sites use Cloudflare and this sort of outage is extremely rare for them (I can't remember a time when Cloudflare's own site and dashboard were taken offline).

Then again, it depends on the priority of your site. But there are tons of top sites on Cloudflare and I bet a lot of those places don't have plans for emergency switching over to another DNS provider / CDN on short notice as it's often a fairly disruptive change, especially now that more frontend logic for a site is implemented alongside the CDN/LB.


1.1.1.1 DNS is down, and I'm seeing a lot of 502s on Cloudflare sites.


I just changed my dns-over-https script to use quad9:

    # sudo /usr/local/bin/dnsproxy -l 127.0.0.153 -p 53 -u https://cloudflare-dns.com/dns-query -b 1.1.1.1:53
    sudo /usr/local/bin/dnsproxy -l 127.0.0.153 -p 53 -u https://9.9.9.9/dns-query
The bootstrap 1.1.1.1:53 was working but the DoH dns-query url was not. Hence my visit here :)

Edit: Now fixed, it seems. Quick work, if it stays up!


That's a biggie...


The 1.1.1.1 vpn still seems to be working


Not here it's not (NE US).


dns is working in London UK.


I thought I was going insane, googling it returned nothing, but trusty old hacker news has my back


Holy crap, Cloudflare is down, and seemingly everything it covers. Major sites such as DigitalOcean are all down, and there's no way to easily disable Cloudflare since their site is down.



The outage appears to have ended between that post and your post.


Yeah, not a long outage, but a big one.


Single point of failure - we all shouldn’t trust CF alone anymore...


Any CDN is a single point of failure and limits your availability to as low as three nines. Anycast-based CDNs like Cloudflare are much less reliable than DNS-based CDNs, though; those can do orders of magnitude better.


Why do you think anycast based CDNs are worse than DNS based ones?


They rely on a single network infrastructure as opposed to many independent networks with independent edge nodes where isolating faults is rather trivial in comparison.


I’m eagerly waiting to read their blog post: "The Configuration Mistake That Almost Broke the Internet"


What will we do without our favourite "Enterprise MiTM" SaaS provider?


Holy shit. This impacts nearly everything.


Noticed that draw.io was down. 1.1.1.1 and all other Cloudflare-backed sites as well... I'm off for today.


Cleartext DNS 1.1.1.1 is still up though, thankfully.


Looks like everyone is impacted. We are seeing 502 Bad Gateway across multiple domains and regions on https://taskade.com


Back up and running!


Damn... getting a bunch of alerts and can't even open Pingdom either... also running on CloudFlare https://my.pingdom.com/newchecks/checks#


npm and yarn are not working either... there goes JavaScript land...


Yes, seems to have impacted all CF sites. (UK here)

Status was just updated here, but it was showing everything as operational for a while: https://www.cloudflarestatus.com/


Seems like it is clearing up. Our site is back up. DigitalOcean and Patreon are up as well.


Tried to go to some usual places where people discuss outages and was met with an outage. Ouch.


A possible downstream effect of this: Pingdom appears to be alerting VERY late, at least for us. I'm guessing with 8% of the web affected, their alerting systems aren't prepared for this many simultaneous alerts.


Several sites behind CloudFlare are returning 502 errors for me as well (France).



This is frustrating - can't access CF to turn off CF to make sites accessible. There should be an emergency admin/dash access to turn off protection for cases like this.


Also getting this in the UK. Not completely down; bits and pieces of Medium come through, but very slowly and incompletely. I'm also unable to access npm.


Can’t wait to see Verizon’s blog post about this one :)


It’s better that this happens now than later. I am confident CF will put protections in place to prevent this from happening again, but also put a switch in place to provide an instant fix the next time something like this happens.

It does suck to have a service down for a bit, but what CF offers, at the price point is pretty incredible.

Good luck to CF, and I wish you the best with coming up with a robust future-proof solution.


And I'm sure that everybody hitting F5 to see if it's back is causing no problem at all, no no. Waiting anxiously for the writeup!


Also fun to notice that you have to agree to their privacy policy to receive updates, which is hosted on their website, which is down


>We are working to mitigate impact to Internet users in this region.

>This incident affects: North America (Ashburn, VA, United States - (IAD), Atlanta, GA, United States - (ATL), Boston, MA, United States - (BOS), Buffalo, NY, United States - (BUF), Calgary, AB, Canada - (YYC), Charlotte, NC, United States - (CLT), Chicago, IL, United States - (ORD), Columbus, OH, United States - (CMH), Dallas, TX, United States - (DFW), Denver, CO, United States - (DEN), Detroit, MI, United States - (DTW), Houston, TX, United States - (IAH), Indianapolis, IN, United States - (IND), Jacksonville, FL, United States - (JAX), Kansas City, MO, United States - (MCI), Las Vegas, NV, United States - (LAS), Los Angeles, CA, United States - (LAX), McAllen, TX, United States - (MFE), Memphis, TN, United States - (MEM), Miami, FL, United States - (MIA), Minneapolis, MN, United States - (MSP), Montgomery, AL, United States - (MGM), Montréal, QC, Canada - (YUL), Nashville, TN, United States - (BNA), Newark, NJ, United States - (EWR), Norfolk, VA, United States - (ORF), Omaha, NE, United States - (OMA), Phoenix, AZ, United States - (PHX), Pittsburgh, PA, United States - (PIT), Portland, OR, United States - (PDX), Queretaro, MX, Mexico - (QRO), Richmond, Virginia - (RIC), Sacramento, CA, United States - (SMF), Salt Lake City, UT, United States - (SLC), San Diego, CA, United States - (SAN), San Jose, CA, United States - (SJC), Saskatoon, SK, Canada - (YXE), Seattle, WA, United States - (SEA), St. Louis, MO, United States - (STL), Tampa, FL, United States - (TPA), Toronto, ON, Canada - (YYZ), Vancouver, BC, Canada - (YVR), Tallahassee, FL, United States - (TLH), Winnipeg, MB, Canada - (YWG)), Middle East (Amman, Jordan - (AMM), Baghdad, Iraq - (BGW), Baku, Azerbaijan - (GYD), Beirut, Lebanon - (BEY), Doha, Qatar - (DOH), Dubai, United Arab Emirates - (DXB), Kuwait City, Kuwait - (KWI), Manama, Bahrain - (BAH), Muscat, Oman - (MCT), Ramallah - (ZDM), Riyadh, Saudi Arabia - (RUH), Tel Aviv, Israel - (TLV)), Asia (Bangkok, Thailand - (BKK), Cebu, Philippines - (CEB), Chengdu, China - (CTU), Chennai, India - (MAA), Colombo, Sri Lanka - (CMB), Dongguan, China - (SZX), Foshan, China - (FUO), Fuzhou, China - (FOC), Guangzhou, China - (CAN), Hangzhou, China - (HGH), Hanoi, Vietnam - (HAN), Hengyang, China - (HNY), Ho Chi Minh City, Vietnam - (SGN), Hong Kong - (HKG), Hyderabad, India - (HYD), Islamabad, Pakistan - (ISB), Jinan, China - (TNA), Karachi, Pakistan - (KHI), Kathmandu, Nepal - (KTM), Kuala Lumpur, Malaysia - (KUL), Lahore, Pakistan - (LHE), Langfang, China - (NAY), Luoyang, China - (LYA), Macau - (MFM), Manila, Philippines - (MNL), Mumbai, India - (BOM), Nanning, China - (NNG), New Delhi, India - (DEL), Osaka, Japan - (KIX), Phnom Penh, Cambodia - (PNH), Qingdao, China - (TAO), Seoul, South Korea - (ICN), Shanghai, China - (SHA), Shenyang, China - (SHE), Shijiazhuang, China - (SJW), Singapore, Singapore - (SIN), Suzhou, China - (SZV), Taipei - (TPE), Tianjin, China - (TSN), Tokyo, Japan - (NRT), Ulaanbaatar, Mongolia - (ULN), Wuhan, China - (WUH), Wuxi, China - (WUX), Xi'an, China - (XIY), Yerevan, Armenia - (EVN), Zhengzhou, China - (CGO), Zuzhou, China - (CSX)), Africa (Cairo, Egypt - (CAI), Casablanca, Morocco - (CMN), Cape Town, South Africa - (CPT), Dar Es Salaam, Tanzania - (DAR), Djibouti City, Djibouti - (JIB), Durban, South Africa - (DUR), Johannesburg, South Africa - (JNB), Lagos, Nigeria - (LOS), Luanda, Angola - (LAD), Maputo, MZ - (MPM), Mombasa, Kenya - (MBA), Port Louis, Mauritius - (MRU), Réunion, France - (RUN), Kigali, Rwanda - (KGL)), Latin America & the Caribbean 
(Asunción, Paraguay - (ASU), Bogotá, Colombia - (BOG), Buenos Aires, Argentina - (EZE), Curitiba, Brazil - (CWB), Fortaleza, Brazil - (FOR), Lima, Peru - (LIM), Medellín, Colombia - (MDE), Mexico City, Mexico - (MEX), Panama City, Panama - (PTY), Porto Alegre, Brazil - (POA), Quito, Ecuador - (UIO), Rio de Janeiro, Brazil - (GIG), São Paulo, Brazil - (GRU), Santiago, Chile - (SCL), Willemstad, Curaçao - (CUR)), Oceania (Auckland, New Zealand - (AKL), Brisbane, QLD, Australia - (BNE), Melbourne, VIC, Australia - (MEL), Perth, WA, Australia - (PER), Sydney, NSW, Australia - (SYD)), and Europe (Amsterdam, Netherlands - (AMS), Athens, Greece - (ATH), Barcelona, Spain - (BCN), Belgrade, Serbia - (BEG), Berlin, Germany - (TXL), Brussels, Belgium - (BRU), Bucharest, Romania - (OTP), Budapest, Hungary - (BUD), Chișinău, Moldova - (KIV), Copenhagen, Denmark - (CPH), Dublin, Ireland - (DUB), Düsseldorf, Germany - (DUS), Edinburgh, United Kingdom - (EDI), Frankfurt, Germany - (FRA), Geneva, Switzerland - (GVA), Gothenburg, Sweden - (GOT), Hamburg, Germany - (HAM), Helsinki, Finland - (HEL), Istanbul, Turkey - (IST), Kyiv, Ukraine - (KBP), Lisbon, Portugal - (LIS), London, United Kingdom - (LHR), Luxembourg City, Luxembourg - (LUX), Madrid, Spain - (MAD), Manchester, United Kingdom - (MAN), Marseille, France - (MRS), Milan, Italy - (MXP), Moscow, Russia - (DME), Munich, Germany - (MUC), Nicosia, Cyprus - (LCA), Oslo, Norway - (OSL), Paris, France - (CDG), Prague, Czech Republic - (PRG), Reykjavík, Iceland - (KEF), Riga, Latvia - (RIX), Rome, Italy - (FCO), Saint Petersburg, Russia - (LED), Sofia, Bulgaria - (SOF), Stockholm, Sweden - (ARN), Tallinn, Estonia - (TLL), Thessaloniki, Greece - (SKG), Vienna, Austria - (VIE), Vilnius, Lithuania - (VNO), Warsaw, Poland - (WAW), Zagreb, Croatia - (ZAG), Zürich, Switzerland - (ZRH)).

Quite a wide region, innit?


I went to post the list in Slack and got a 'too many characters' error...


"This incident affects... EVERYONE!" (insert Gary Oldman Screaming GIF here)


Some people on Twitter are reporting that it's due to a DDoS. http://www.digitalattackmap.com/#anim=1&color=0&country=ALL&... seems to indicate Iran is the source?


The date on your link is February 21st.


Right you are, heh. Not sure why it jumped back to that date. False alarm.


It's not.


This seems like FUD; the dates are wrong, and according to that report Iran is the one being attacked.


Wrong date on that page. That's Dec. 2, not Jul. 2


I'm beginning to see the issues disappearing for me on the East Coast of the US, as of 10:14 EDT.


Wondering how we can use CloudFlare and have a fallback plan in place for situations like this. What would be a good architecture for this? So far I've read that it would be good to have the registrar outside of CloudFlare and use them as a CDN only. What else?



This article, entitled "Now Running on Cloudflare", seems to be down

https://seankilleen.com/2016/12/now-running-on-cloudflare/


I'm getting '502 Bad Gateway' on all Cloudflare sites here in London too.


"Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw."

So, it was a dirty REGEX. I can't even be mad.


My Shopify sites are down this morning; they seem to be 502-ing because of this issue.


Was hoping John Graham-Cumming would be in here to give us the behind-the-scenes :(


From https://www.cloudflarestatus.com's response header, it seems it runs on fastly, so it's not down.


Can someone from Cloudflare give us a hint at what's going on?


What's ironic is that my experience of Cloudflare Sales is that they take advantage of any downtime from their competitor to try and get people to migrate to their services...

Beyond scummy


And so it begins......


Yes - seeing lots of our sites offline as of 14:45 GMT.


The internet is designed to withstand a nuclear attack. It's so sad to see half of the internet going down because of a single system failure.


It's very convenient for me to be able to tell my clients their site is down because half the internet is down, and there is nothing I can do about it.

A smaller DNS service would not make headlines.


WAS designed to withstand a nuclear attack. These days, not so much.


Looks like it, one of my sites is cached through Cloudflare, I'm getting a 502 from them. I'm seeing other random sites down as well.


Even getting this error coming up in Android Studio trying to sync Gradle.

Anyone got a good Android NFC tutorial I can read up on with working code examples?


8chan, 4chan down as well. Damn, this is big.

https://cloudflare.com itself is gone.


that's bad news, i like 4chan being up because it keeps them all in one place and they don't start wandering around.


Someone needs to turn Cloudflare off and on again.



Title should read, "Cloudflare Outage."


Here in Liechtenstein things are working again.


Pagerduty’s main website seems to be affected.


July 2nd, 2019... The day the internet broke.


Yup, we could have "Days without an accident" counter for the whole internet...


The day the internet broke again.


This is resulting in build failures in our AWS CodePipeline as we can't pull some Docker images from Dockerhub


The outage this time seems to be more serious than last time. More websites, including Cloudflare itself, are down.


Without question. I've never seen an affected regions list that immense.


9gag is having trouble loading the memes.


Good reminder for us of why you want to mirror all code dependencies required during autoscaling :/


This seems to be impacting NPM for me.


“Network performance issues” which is to say, completely unavailable for a huge chunk of the world.


Seems to be back to normal now.


Cloudflare DNS 1.1.1.1 down too


Good thing I switched to OpenDNS literally two days ago.

edit. not that it matters, I'm just scrolling 9gag.


Looks like a flare up (sorry my attempt at humour always gets the better of me...).


Holy hell did everything explode at work, glad they got it resolved pretty quick.


Wow... our site went down so I went to cloudflare.com -> 502 bad gateway


It's what we get for centralizing this here internet, folks.


So many crypto sites are down: Coinbase, WazirX, CoinMarketCap.


Also WorldCoinIndex.


As long as Reddit is not down everything is OK.


Our website is also down, origin near Dallas.


It seems to be up for me now in AMS.


jsDelivr was the only free CDN that stayed online during the outage thanks to Multi-CDN.


The rumor is a DDoS attack

http://www.digitalattackmap.com


digitalocean.com is down as well (seems to affect cloudflare-gated websites)


Is 1.1.1.1 affected as well?


Getting 502s as well.


And it is up again.


Yay centralization!


I see it as well


Back online.


Back for me.


Same here.


oh my god,


Wasn't the internet supposed to be decentralized to avoid this very problem?


That is the mission of IPFS. However, there is a big gap between their mission and the current capability of their technology.

Be the change you want to see in this world. I guess next time I build a personal website, I have to host it on a Raspberry Pi at my home. :)


dammit


Cloudflare is too big.


omg I was not the only one.


It's back up for us now https://sleeper.app


This is just my personal guess, but it's likely China flexing again after the protest escalation yesterday in Hong Kong. A few weeks ago Telegram was attacked by China [0], and Hong Kong protesters used Telegram to communicate.

This time when CloudFlare was down, the most popular local forum among protesters, lihkg.com, was brought down as well.

[0]: https://techcrunch.com/2019/06/12/telegram-faces-ddos-attack...



