If a single regex can take down the Internet for a half hour, that's definitely not good -- for a class of errors that can be easily prevented, tested, etc.
The timing is unfortunate too, after calling out Verizon for lack of due process and negligence.
I'm sure they have an undo or rollback for deployments but probably worth investing into further.
They also need to resolve the catch-22 where people could not login and disable CloudFlare proxy ("orange cloud") since cloudflare.com itself was down.
You'd think after leaking private data for literally months less than 3 years ago (and only noticing because Google had to point it out to them) that they'd, y'know, have at least some kind of QA environment fed with sample traffic by now. Really hard to believe they're still getting caught testing in prod.
As someone working in that field, the arrogance of CloudFlare is still unbelievable to me.
After their huge Cloudbleed issue, and now this one on top of it, they continue to call out everyone else through their blog posts. And everyone seems fine with it because they are a hype company.
I don't use CloudFlare nor have any interest in them, but I don't see the arrogance. The issues CloudFlare have are things everyone takes seriously and is working very hard on. Deployment and memory safety are hard problems that happen to the best of the best. It happens to Google, Amazon and Facebook. If anything, the idea that this would be damaging because it is more public is arrogant. If CloudFlare were saying that everything is fine you might have a point, but they aren't. Just like the other companies mentioned, they seem to be improving their routines, programming and infrastructure to try and mitigate these problems.
What they are criticising, however, are things like not adopting new protocols or not taking things that affect everyone seriously. That isn't something that would happen if people were trying. And the response from some of the industry is "we know what we are doing", and shortly after the same thing happens again and again and again.
So I don't really see CloudFlare being that arrogant; if anything it's the "you are not better than us" attitude from some parts of the industry that is. The day I see CloudFlare not trying I would be happy to call them arrogant. If anything I would caution that they are as successful as they are because they try more than most.
> The issues CloudFlare have are things everyone takes seriously and is working very hard on. Deployment and memory safety are hard problems that happen to the best of the best.
Cloudflare has improved a lot. You can see just from what they're open sourcing that their usage of Go and Rust has increased significantly. And I'm sure we'll notice improvements in deployment practices.
When Cloudbleed happened I was very vocal and skeptical, but this is different. Everyone makes mistakes.
As a random outsider who really couldn't care less about the service CloudFlare provides: their responses to outages and their transparency are really great, and I wish more tech companies would do the same. It gets tiring hearing about large outages at other services/providers and only learning that they were caused by "network partitions" or other networking issues. Every company has to deal with these issues, and CloudFlare does an awesome job at letting me at least learn something about what went wrong when these incidents happen.
We’ve actually had our data leaked by one of their engineers working in his free time. He found an open database and leaked it to the press. He was probably just scanning random IP ranges and stumbled upon it, and I don’t think he was targeting CF clients in particular. Hopefully they will stay humble and fix their own issues first.
On a side note an anecdote came out of that leak...
We were then contacted by this big-name tech website asking if the data was ours, before they published the article. Unfortunately the author sent us an email via his @gmail address, which did not add to his credibility, so his email was brushed off for a day or two until we saw the article published. Can’t say if it was a dark pattern of his to not use his work email to notify us or not...
And how good your employees are... How good your review process is... How good xyz is...
If your engineers are so solid that the chance of any one of them making a mistake on a given release is 0.5%, and you have 50 engineers, the probability of nothing going wrong is about 78% (0.995^50), and of something going wrong is 1 - 0.995^50, or about 22%. Pretty low, I might say.
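A quick sanity check on that arithmetic, in case anyone wants to play with the numbers (treating each engineer's 0.5% as independent is of course a simplification):

    # Back-of-the-envelope: 50 engineers, each with an assumed 0.5% chance
    # of shipping a mistake in a given release, treated as independent.
    p_mistake = 0.005
    engineers = 50

    p_all_ok = (1 - p_mistake) ** engineers
    print(f"nothing goes wrong:   {p_all_ok:.1%}")      # ~77.8%
    print(f"something goes wrong: {1 - p_all_ok:.1%}")  # ~22.2%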
Don't do this to your engineers. 80% test coverage is a sweet spot; the rest is caught better with other approaches. There's no reason to kill engineers' productivity every time something fails in production by blaming them for tests that aren't good enough.
Since the work involved in doing a regular expression match can depend largely on the input for non-trivial expressions, one fun case (probably not the one here, though) is that a user of your system could start using a pathological case input that no amount of standard testing (synthetic or replayed traffic, staging environments, production canaries) would have caught.
Didn't take anything down, but did cause an inordinate amount of effort tracking down what was suddenly blocking the event loop without any operational changes to the system...
It's nowhere near as standardly applied as the other approaches to release verification, though.
And in complex cases (say, a large multi-tenant service with complex configuration), it can be very hard to find the combination of inputs necessary to catch this issue. If you have hundreds of customer configurations, and only one of them has this particular feature enabled (or uses this sort of expression), fuzzing is less likely to be effective.
> If a single regex can take down the Internet for a half hour, that's definitely not good
As I commented yesterday, this is due to the fact that "the Internet" thinks it needs to use Cloudflare services, although there really is no need to do so.
Stupid people making stupid decisions and then wondering why their services are down.
I'm always beyond impressed with how responsive and transparent CF is with incident and post-mortem communication. Given who the CEO and COO are, I suppose this shouldn't be surprising; nevertheless, as a customer it builds a great deal of trust. Kudos.
Yes, they do really well on this - open, transparent, posting information quickly as soon as they were fairly sure what the problem was. I always really enjoy their writing, both incident reports and writeups of new features. The only thing I think they could have managed better was their status page, which claimed they were up (every service was green) when they were not.
Kinda wonder at this point what findings exist on their Availability SOC 2, assuming they've gotten one.
The repeated outages plus the constant malicious advertising by scammy ad providers through cloudflare are slowly turning me off to the service as a potential enterprise customer. Unfortunate too since plenty of superlatively qualified people build great things there (hat tip to Nick Sullivan), but it seems like the build-fast culture may now be impeding the availability requirements of their clients.
This is also a great example of a case where SLAs are meaningless without rigorous enforcement provisions negotiated in by enterprise clients. Cloudflare advertises 100% uptime (https://www.cloudflare.com/business-sla/) but every time they fall over, they're down for what, an hour at a time? Just this one issue would've blown anyone else's 99.99% SLA out of the water -- https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr
I love the service, but if I'm to consider consuming the service, they'd do well to have the equivalent of a long term servicing branch as its own isolated environment, one where changes are only merged in once they've proven to be hyper-stable.
An SLA of 100% just means your account will be credited for any downtime. It doesn't mean that the company guarantees 100% uptime. No company signs a 100% or 99.99% SLA expecting to actually get 99.99% uptime, but rather with the understanding that they will be compensated when there is an issue.
None of the major cloud vendors actually hit 99.99% uptime.
Interesting distinction here: the 100% SLA is on responding to incoming DNS requests. The R53 console or management interfaces could be down and the SLA stays intact -- and if you can't update your DNS, serving 100% of responses with stale, incorrect data isn't very helpful.
By its very nature, an SLA of 100% is a guarantee that the service will be available 100% of the time or else the relevant penalties, explicitly stated or otherwise applicable, can be applied.
The question is whether the guarantee is meaningful by way of whether the penalties will significantly dissuade failures to meet the guarantee, and I'd argue in the case of Cloudflare, this isn't the case.
[Edit: Cloudflare's standard] penalty is a service credit defined as follows:
> 6.1 For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes
And that's woefully inadequate for any enterprise client with mission- or life-critical services.
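To put rough numbers on it (these are assumed figures for illustration, not from any actual Cloudflare contract or invoice): a roughly 30-minute global outage in a 30-day month earns a credit of well under a tenth of a percent of the monthly fee, while the same outage blows through a 99.99% monthly downtime budget several times over.

    # Hypothetical worked example of the credit formula quoted above.
    outage_minutes = 30                # assumed ~half-hour global outage
    affected_customer_ratio = 1.0      # assume every customer was affected
    scheduled_minutes = 30 * 24 * 60   # 43,200 minutes in a 30-day month

    credit = (outage_minutes * affected_customer_ratio) / scheduled_minutes
    print(f"service credit: {credit:.4%} of the monthly fee")          # ~0.0694%

    # For comparison, a 99.99% monthly SLA allows only this much downtime:
    print(f"99.99% budget: {scheduled_minutes * 0.0001:.2f} minutes")  # 4.32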
---
TL;DR: An SLA is a guarantee, by the very definition of the word "guarantee," that a service will be delivered to a specific level and that certain agreed-upon penalties will be applied to the service provider if this guarantee is not met.
As you note, unless you negotiate a custom contract, usually the "penalties" are very, very mild. It's effectively the same as there not being a penalty at all. The SLA is just a marketing nice-to-have divorced from the engineering realities
Yup, and given Cloudflare's recent performance, I'd venture that more heavy-handed contracts need to be negotiated with them to drive an improvement in performance, or at the very least a paradigm shift in how they sustain availability to the clients who really care for it.
As an engineer, I get pissed whenever I see 100% uptime, or eleven-nines, nine-nines, or other impossible targets. Like, how am I supposed to design a system with numbers like that?
The thing is, a real SLA will have things like time to detect errors and time to mitigate, time to repair, etc.
100% uptime doesn't necessarily mean nothing failed; it means the failure detection and mitigation worked within the allowed windows. In a typical internet environment, that means allowing connections to die when the server they're connected to dies. It would be possible to hand off TCP connections, but nobody does it.
If you want to get close to those numbers, you need to have a real reason, and then you need to make sure you have a good plan for everything that can go wrong. Power, routers, fiber, load balancers, switches, hosts, etc. And then do your best not to push bad software / bad configuration.
Bare metal on quality hardware with redundant networking goes a long way towards reliability, once the kinks are worked out.
If you only look at the SLOs you are a junior engineer working for someone else making the big decisions. If you are designing a system you want to look at the SLA. Engineers are not just assembly line workers that consume specs and spit out parts.
Nothing wrong with just using SLOs, but if you are a technical lead or senior engineer, you should have the big picture.
You honestly think a missile defense system will work. Backhoes are much more creative than that. You will need defense in depth, roaming patrols, as well as air and satellite based monitoring assets.
There was a point in time where that wasn't true but as people started accepting a lower quality of service it became easier to just pay than do the right thing.
There's a lot of good info here, but there are many more questions raised in my mind based on what I'm reading in the SOC3 than perhaps what you might've expected. I can ideally run through them if I catch you again at DEF CON this year. I'm also willing to sign your standard MNDA to review your SOC 2, but we can take that thread offline.
> We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents.
Good.
> Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.
Wow. This seems like a very immature operational stance.
Any deployment of any kind should be subject to the minimum deployment safety practices they claim to have.
> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.
Many large companies would have had automatic roll-back of this kind of change in less time than it took CloudFlare to (apparently) have humans decide to roll-back, and possibly before a single (usually not global) deployment had actually completed on all hosts/instances.
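For anyone curious what that looks like in practice, here is a minimal sketch of a metric-gated progressive rollout with automatic rollback (the stage names, thresholds, and the fetch_cpu_p99/deploy_to/rollback hooks are all made up for illustration; this is not Cloudflare's actual tooling):

    import time

    # Hypothetical progressive rollout: widen the blast radius stage by stage,
    # and roll back automatically if a health metric regresses during the bake.
    STAGES = ["canary", "single-pop", "single-region", "global"]
    CPU_P99_LIMIT = 0.80      # abort if 99th-percentile CPU exceeds 80%
    BAKE_SECONDS = 300        # soak time before moving to the next stage

    def progressive_deploy(version, deploy_to, fetch_cpu_p99, rollback):
        for stage in STAGES:
            deploy_to(stage, version)
            deadline = time.monotonic() + BAKE_SECONDS
            while time.monotonic() < deadline:
                if fetch_cpu_p99(stage) > CPU_P99_LIMIT:
                    rollback(stage, version)   # automatic, no human in the loop
                    return False
                time.sleep(10)
        return True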
However, what is more concerning is that it seems you shouldn't rely on CloudFlare's "WAF Managed Rulesets" at all, since they seem to be willing to turn them off instead of correctly rolling back a bad deployment, which they only did more than 43 minutes later:
> We then went on to review the offending pull request, roll back the specific rules, test the change to ensure that we were 100% certain that we had the correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.
How were they not able to trivially roll back to the previous deployment?
Which is why the entire model (found mostly in Agile environments) of "deploy to prod as soon as you can" is absolutely nuts.
If you're dev at a hipster app maybe a dozen people use to holler "yo" at each other, by all means go for it. If you're operating one of the biggest and most important chonks of Internet infra... maaaaaybe stick to established practices such as stage testing, release schedules and incremental rollouts?
I don't want to return to the old slothful release schedules of the 2000s, when features and bug fixes were mostly stagnant.
You can have staging and scheduled, QA-signed-off releases that happen every day. I have worked on some fairly large, significant services and we still released several times a day; you just did not trigger the final prod release yourself, the QAs pressed the button instead. Though usually just once a day per microservice.
I have also worked with several clients lately without QA, where devs could themselves push to prod many times a day. I am not sure these systems were that much less stable, though they were all mostly greenfield and not critical public government systems. The changes were of course a lot smaller, and quick to undo, which is the core element of the "release straight to prod" ethos.
I am sure Cloudflare has a significant QA process whilst using today's fast-moving release schedules.
What is always a grey zone is configuration changes. Even if properly versioned and on a release train with several staging environments, configuration is often very environment-sensitive. So maybe they could not test it properly in any staging environment but had to hope prod worked...
However, Cloudflare will hopefully implement some way to make sure this particular configuration, and future changes to it, are not as bottlenecked, so they can instead be rolled out gradually to a subset and region-by-region instead of in a big bang to everyone. Though canary/blue-green/etc. releases of core routing configuration are hard.
Cloudflare does have test colos that use a subset of real network traffic for testing. It's actually the primary testing methodology, and the employees are usually some of the first people forced through test updates.
This release wasn't meant to go out, and the fact it did means it would have bypassed the test environments either way.
Avoiding blame is different than acknowledging responsibility. A post mortem should be very conscious about blame - never target the engineer who deployed the change, for example. Take responsibility for the machine that allowed the unsafe change to be deployed. (Where machine could be tooling or process, as appropriate.)
> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.
So for about 50 minutes, those who relied on the WAF were open to attack?
WAF wouldn’t just prevent DDoS I assume. I’m pretty sure there are WAF rulesets that attempt to block attacks such as XSS or even remote code execution vulnerabilities.
I can confirm that there are WAF rules that block things like basic SQL injection. A client uses Akamai and if it detects certain strings in a request, like "<script>", it'll block the request before it ever gets to the application. The bad part is that some developers get complacent in their development and rely on the WAF to do their security for them.
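For anyone unfamiliar with what such a rule looks like, here is a toy sketch of the idea (nothing like Akamai's or Cloudflare's real rulesets, just the general shape of "pattern-match the request before it reaches the app"):

    import re

    # Toy WAF-style rules: block requests containing obvious attack markers.
    BLOCK_PATTERNS = [
        re.compile(r"<\s*script", re.IGNORECASE),               # naive XSS check
        re.compile(r"('|\")\s*or\s+1\s*=\s*1", re.IGNORECASE),  # naive SQLi check
    ]

    def waf_allows(raw_request: str) -> bool:
        return not any(p.search(raw_request) for p in BLOCK_PATTERNS)

    print(waf_allows("GET /?q=hello"))             # True
    print(waf_allows("GET /?q=<script>alert(1)"))  # False

The complacency problem is exactly this: if the application behind it never sanitizes input itself, turning the WAF off (as in the 'global kill' above) removes the only line of defense.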
Can you give an example of where the cost of possibly-denied could ever be higher than definitely-denied?
First Cloudflare literally denied service, then as a hotfix there was a higher-than-normal potential for denying service, and eventually the normal potential for denying service was restored. I'm trying to comprehend how the second phase could ever be worse than the first phase.
Now, if you're talking about elevating the potential for compromised confidentiality and/or integrity rather than merely availability, I'd agree, but generally [D]DoS refers to availability.
Leaning on a WAF to plug gaping vulnerabilities that can be discovered and exploited during the period of time before the WAF was restored means you have much bigger problems than uptime.
> Leaning on a WAF to plug gaping vulnerabilities that can be discovered and exploited during the period of time before the WAF was restored means you have much bigger problems than uptime.
It's also, roughly speaking, the selling point of products called "WAF". (and yes, relying on them is not great)
Peculiar that eastdakota (Cloudflare's CEO) doesn't seem to be tweeting at the Cloudflare team responsible for this, telling them they should be ashamed and are guilty of malpractice.
When it was Verizon that took down the internet he felt it was appropriate to do that to the Verizon teams, after all.
> Our team should be and is ashamed. And we deserve criticism. ...
I still don't think that publicly shaming anyone is a good leadership style nor is it a good way to motivate people to perform better in the future, but kudos for the self-awareness, at least.
Cloudflare was responsive and reasonable. Verizon was unreachable and deflected responsibility when they finally made a statement. And public shaming does often motivate companies to be more responsive to their customers.
AFAIK Cloudflare isn't in any way a "customer" of Verizon. Verizon doesn't owe Cloudflare any kind of response or devotion of resources. Verizon owes its actual customers a resolution to their problem, which they gave.
I'm not saying Verizon is perfect nor absolved of fault, but Cloudflare was/is not owed any kind of explanation or assistance by VZ, and it's absurd of CF to still be whining about that fact (as they are doing in some other tweets today). If CF wants some kind of SLA with VZ, they should engage them in a business relationship, not try to publicly shame them.
I’d say what they really need is a representative governing body over major network carriers to establish proper standards and levy fines for those that do not comply.
Kind of similar to a homes association saying “hey that trash on your lawn affects your neighbor, clean it up!”
It’s true that they are not a customer but at that level what they do affects each other, and it’s better to resolve things civilly and privately instead of publicly on twitter.
In theory this should be the job of the FCC, and in Europe the local regulatory agencies (BNetzA in Germany, for example). But properly funding them to do their jobs doesn't seem to be very high on the political agendas these days.
Basically: you can try to keep a low TTL on your DNS, but it'll mean more DNS traffic, and 5-10% of traffic takes forever to cut over because nobody respects TTLs. Worst case you have just as much downtime as before; best case most of your traffic is recovered in a few minutes.
It may be useful to note, for whoever is reading this, that a low DNS TTL only ever makes sense for records you can cut over either automatically or on short notice, not for all records. Otherwise, you are now at the mercy of outages at your DNS provider.
Just leaving it out there so one doesn't get the idea that "low TTL == Always Good"
What sort of regular expression pitfalls can cause this sort of CPU utilization? I know they're possible but I am curious about specific examples of something similar to what caused Cloudflare's issue here.
I would strongly recommend deploying such a regular expression matcher to avoid problems like this.
There are examples in the above article that you can use to test anything in your production deployment that accepts regular expressions to see how well it copes.
Part of the problem is that "regular expressions" are not really regular expressions in the Chomsky sense.
Regular languages have some very nice properties relating to how they can be evaluated. Some regular expression engines have features that pull the expressions out of being a regular language and into something more complex.
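As a concrete illustration of how a seemingly harmless pattern can eat a CPU, here is a minimal sketch of catastrophic backtracking using Python's backtracking re engine (the pattern is a textbook example, not the one from Cloudflare's ruleset):

    import re
    import time

    # Nested quantifiers like (a+)+ force a backtracking engine to try
    # exponentially many ways to split the input once the final 'b' fails.
    pattern = re.compile(r"^(a+)+b$")

    for n in (18, 20, 22, 24):
        subject = "a" * n + "c"   # almost matches, but the trailing 'c' never does
        start = time.perf_counter()
        pattern.match(subject)
        print(n, round(time.perf_counter() - start, 3), "seconds")

Each two extra characters roughly quadruples the runtime, which is why ordinary test traffic rarely surfaces it. Linear-time engines (RE2, Rust's regex crate) avoid this class of blowup by construction, at the cost of dropping features like backreferences.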
>"It doesn't cost a provider like Verizon anything to have such limits in place. And there's no good reason, other than sloppiness or laziness, that they wouldn't have such limits in place."[1]
The difference between Verizon and Cloudflare in this case is that Cloudflare generally _does_ fix their mistakes when they screw up (and generally don't make the same type again).. whereas Verizon has screwed up internet routing more times than I'd like to think about. No company is perfect, but I'd say this comparison is pretty apples to oranges.
"You have a problem, and you decide to use a regexp to solve it. Now you have two problems"
Although of course I'm just kidding and I'm sure that a good regexp probably is the right solution for what they're doing in that instance: they have a lot of bright people.
Nothing like having what should be a world-class company falling prey to the same type of screw-ups that plague 'the local guy maintaining some WordPress site on a shared server'.
Separately there is nothing that says that a company like Cloudflare has to air their dirty laundry (as the saying goes). The vast majority of 'customers' really don't care why something happened at all or the reason. All they know is that they don't have service.
Pretend that the local electric company had a power outage (and it wasn't caused by some obvious weather event). Does it really matter if they tell people that 'some hardware we deployed failed and we are making sure it never happens again'? I know tech people think these kinds of post-mortems are great, but the truth is only tech people really care to hear them. (And guess what, all it probably means is that that particular issue won't happen again...)
100% not true. All you have to do is pull a list of the daily additions and deletions and you will see that they have many customers who are not 'tech' people. Further, you are assuming all of their customers who are tech people even read and keep up with blog posts like this.
Good engineers like knowing why things break. I just started a book on the reasons why buildings collapse. It's essentially a series of post-mortems of specific events. I have zero formal architectural or civil engineering experience -- just an inquisitive disposition. For anyone interested, the book is called "Why Buildings Fall Down."
You are reading this for entertainment and perhaps to learn, but my point wasn't that it isn't potentially interesting; it was more about the actual business purpose of doing this.
Once again using the example of lost luggage: I don't really care (other than as an interesting story) why my luggage was lost; I just don't want it to happen, and an airline writing a detailed story doesn't give me any more confidence they won't have a different problem happen again. If anything it may even open the door if something I read seems like it could have been avoided (whereas if they say nothing I might not know that it could have been).
Probably, for the kind of work they are doing, avoid regex? Or at least the very complicated modern regex (simple automata that you can compile in advance might be OK).
If you're trying to do pattern matching, is there actually a widely used alternative to regex? The more I can avoid using regex for mission-critical things, the happier I will be, but I'm really not aware of anything better for this type of application.
I run a service placing bids the last few seconds on eBay. Every time this happens I lose measurable business (we place thousands of bids per day). While it doesn’t affect scheduled bids, they can’t place bids and are likely to move to a competitor. These recent outages have been costly.
From experience, using two services (active/active) is really the only way to avoid downtime. DNS can be trickier, but there are providers that can fallback automatically or split requests (dnsmadeeasy etc)
Haha, people pay you to bid-snipe on eBay for them?
It just baffles me that manual/third-party bid-sniping is still a thing. eBay has had automatic bidding for more than twenty years. You'll pay the same whether you put in the winning bid a week in advance or 5 seconds before the end. But people see that "you lost this auction" notice and they're irrationally convinced that it would have gone differently if they'd bid at the last minute, somehow.
Sniping is the only rational way to bid on eBay. You bid a single time, with the absolute maximum you would go for, and put it in the last 10 seconds. It prevents both you and your opponents from raising the price irrationally. Automatic bidding just encourages prices to go up, based on emotion. Sniping in the last few seconds removes the emotional component.
This absolutely does work. If you get outbid normally on eBay, say at least half an hour or so before the auction ends, then people get irrational and you quickly end up in a bidding war. This way, as a bidder, you set the maximum price you are willing to pay outside of a heated 'damn, I missed out' mentality and people don't have time to respond.
Also this helps when placing bids on auctions that end when you're asleep and you want to effect the above. I've tried to do it manually in the past but naturally life intervenes and you're somewhere with no signal or in the middle of something. Having tried it both ways sniping is definitely better than regular bidding.
If other bidders are irrational, then bid-sniping can work. It doesn't give others the opportunity to contemplate, "I've been out-bid, do I actually want this item more than I originally thought?"
And it's well-known since early eBay days that many bidders are irrational, including but not limited to competitive impulse to "win". Plus you sometimes have shill bidders.
Sniping approximates sealed bids, with the highest-bidder the second-highest sealed bid amount or a small increment above it. (Unfortunately for eBay, that would tend to decrease their cuts, unless the appeal of the sealed bid format brings in sufficiently more bidder activity.)
Another advantage of the software/service is that it automates. If you want to buy a Foo, you can look at the search lists, find a few Foos (possibly of varying value in the details), say how much you'll pay for each one, and let the software attempt to buy each one by its auction end until it's bought one, then it stops. If eBay implemented this itself, it might be too much headache in customer support, but third-parties could provide it to power users.
(I don't buy enough on eBay anymore to bother with anything other than conventional manual bids, but I see the appeal of automation.)
Maybe use cloudflare only for the landing page and docs but serve your bidding app frontend and backend directly? Since users are already there it will only affect first load :)