If a single regex can take down the Internet for a half hour, that's definitely not good -- for a class of errors that can be easily prevented, tested, etc.
The timing is unfortunate too, after calling out Verizon for lack of due process and negligence.
I'm sure they have an undo or rollback for deployments but probably worth investing into further.
They also need to resolve the catch-22 where people could not login and disable CloudFlare proxy ("orange cloud") since cloudflare.com itself was down.
You'd think after leaking private data for literally months less than 3 years ago (and only noticing because Google had to point it out to them) that they'd, y'know, have at least some kind of QA environment fed with sample traffic by now. Really hard to believe they're still getting caught testing in prod.
As someone working in that field, the arrogance of CloudFlare is still unbelievable to me.
After their huge Cloudbleed issue, and now this one on top of it, they continue to call out everyone else through their blog posts. And everyone seems fine with it because they are a hype company.
I don't use CloudFlare nor have any interest in them, but I don't see the arrogance. The issues CloudFlare have are things everyone takes seriously and is working very hard on. Deployment and memory safety are hard problems that happen to the best of the best. It happens to Google, Amazon and Facebook. If anything, the idea that this would be damaging because it is more public is arrogant. If CloudFlare were saying that everything is fine you might have a point, but they aren't. Just like the other companies mentioned, they seem to be improving their routines, programming and infrastructure to try and mitigate these problems.
What they are criticising, however, are things like not adopting new protocols or not taking things that affect everyone seriously. That isn't something that would happen if people were trying. And the response from some of the industry is "we know what we are doing", and shortly after the same thing happens again and again and again.
So I don't really see CloudFlare being that arrogant; if anything it's the "you are not better than us" attitude from some parts of the industry that is. The day I see CloudFlare not trying I would be happy to call them arrogant. If anything I would caution that they are as successful as they are because they try more than most.
> The issues CloudFlare have are things everyone takes seriously and is working very hard on. Deployment and memory safety are hard problems that happen to the best of the best.
Cloudflare has improved a lot. You can see just from what they're open sourcing that their usage of Go and Rust has increased significantly. And I'm sure we'll notice improvements in deployment practices.
When Cloudbleed happened I was very vocal and skeptical, but this is different. Everyone makes mistakes.
As a random outsider who really couldn't care less about the service CloudFlare provides: their responses to outages and their transparency are really great, and I wish more tech companies would do the same. It gets tiring hearing about large outages at other services/providers and only learning that they were caused by "network partitions" or other networking issues. Every company has to deal with these issues, and CloudFlare does an awesome job at letting me at least learn something about what went wrong when these incidents happen.
We’ve actually had our data leaked by one of their engineers working in his free time. He found an open database and leaked it to the press. He was probably just scanning random IP ranges and stumbled upon it, and I don’t think he was targeting CF clients in particular. Hopefully they will stay humble and fix their own issues first.
On a side note an anecdote came out of that leak...
We were then contacted by this big-name tech website asking if the data was ours, before they published the article. Unfortunately the author sent us an email via his @gmail address, which did not add to his credibility, so his email was brushed off for a day or two until we saw the article published. Can’t say if it was a dark pattern of his to not use his work email to notify us or not...
And how good your employees are... How good your review process is... How good xyz is...
If your engineers are so solid that the chance of any one of them making a mistake on a given release is 0.5%, and you have 50 engineers, the probability of nothing going wrong is about 78% (0.995^50), and of something going wrong is 1 - 0.995^50, or about 22%. Pretty low, I might say.
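A quick sanity check on that arithmetic, in case anyone wants to play with the numbers (treating each engineer's 0.5% as independent is of course a simplification):

    # Back-of-the-envelope: 50 engineers, each with an assumed 0.5% chance
    # of shipping a mistake in a given release, treated as independent.
    p_mistake = 0.005
    engineers = 50

    p_all_ok = (1 - p_mistake) ** engineers
    print(f"nothing goes wrong:   {p_all_ok:.1%}")      # ~77.8%
    print(f"something goes wrong: {1 - p_all_ok:.1%}")  # ~22.2%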
Don't do this to your engineers. 80% test coverage is a sweet spot; the rest is caught better with other approaches. There's no reason to kill engineers' productivity every time something fails in production by blaming them for tests that aren't good enough.
Since the work involved in doing a regular expression match can depend largely on the input for non-trivial expressions, one fun case (probably not the one here, though) is that a user of your system could start using a pathological case input that no amount of standard testing (synthetic or replayed traffic, staging environments, production canaries) would have caught.
Didn't take anything down, but did cause an inordinate amount of effort tracking down what was suddenly blocking the event loop without any operational changes to the system...
It's nowhere near as standardly applied as the other approaches to release verification, though.
And in complex cases (say, a large multi-tenant service with complex configuration), it can be very hard to find the combination of inputs necessary to catch this issue. If you have hundreds of customer configurations, and only one of them has this particular feature enabled (or uses this sort of expression), fuzzing is less likely to be effective.
> If a single regex can take down the Internet for a half hour, that's definitely not good
As I commented yesterday, this is due to the fact that "the Internet" thinks it needs to use Cloudflare services, although there really is no need to do so.
Stupid people making stupid decisions and then wondering why their services are down.
I'm always beyond impressed with how responsive and transparent CF is with incident and post-mortem communication. Given who the CEO and COO are, I suppose this shouldn't be surprising; nevertheless, as a customer it builds a great deal of trust. Kudos.
Yes, they do really well on this - open, transparent, posting information quickly as soon as they were fairly sure what the problem was. I always really enjoy their writing, both incident reports and writeups of new features. The only thing I think they could have managed better was their status page, which claimed they were up (every service was green) when they were not.
Kinda wonder at this point what findings exist on their Availability SOC 2, assuming they've gotten one.
The repeated outages plus the constant malicious advertising by scammy ad providers through cloudflare are slowly turning me off to the service as a potential enterprise customer. Unfortunate too since plenty of superlatively qualified people build great things there (hat tip to Nick Sullivan), but it seems like the build-fast culture may now be impeding the availability requirements of their clients.
This is also a great example of a case where SLAs are meaningless without rigorous enforcement provisions negotiated in by enterprise clients. Cloudflare advertises 100% uptime (https://www.cloudflare.com/business-sla/) but every time they fall over, they're down for what, an hour at a time? Just this one issue would've blown anyone else's 99.99% SLA out of the water -- https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr
I love the service, but if I'm to consider consuming the service, they'd do well to have the equivalent of a long term servicing branch as its own isolated environment, one where changes are only merged in once they've proven to be hyper-stable.
An SLA of 100% just means your account will be credited for any downtime. It doesn't mean that the company guarantees 100% uptime. No company signs a 100% or 99.99% SLA expecting to actually get 99.99% uptime, but rather with the understanding that they will be compensated when there is an issue.
None of the major cloud vendors actually hit 99.99% uptime.
Interesting distinction here: the 100% SLA is on responding to incoming DNS requests. The R53 console or management interfaces could be down and the SLA stays intact -- and if you can't update your DNS, serving 100% of responses with stale, incorrect data isn't very helpful.
By its very nature, an SLA of 100% is a guarantee that the service will be available 100% of the time or else the relevant penalties, explicitly stated or otherwise applicable, can be applied.
The question is whether the guarantee is meaningful by way of whether the penalties will significantly dissuade failures to meet the guarantee, and I'd argue in the case of Cloudflare, this isn't the case.
[Edit: Cloudflare's standard] penalty is a service credit defined as follows:
> 6.1 For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes
And that's woefully inadequate for any enterprise client with mission- or life-critical services.
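To put rough numbers on it (these are assumed figures for illustration, not from any actual Cloudflare contract or invoice): a roughly 30-minute global outage in a 30-day month earns a credit of well under a tenth of a percent of the monthly fee, while the same outage blows through a 99.99% monthly downtime budget several times over.

    # Hypothetical worked example of the credit formula quoted above.
    outage_minutes = 30                # assumed ~half-hour global outage
    affected_customer_ratio = 1.0      # assume every customer was affected
    scheduled_minutes = 30 * 24 * 60   # 43,200 minutes in a 30-day month

    credit = (outage_minutes * affected_customer_ratio) / scheduled_minutes
    print(f"service credit: {credit:.4%} of the monthly fee")          # ~0.0694%

    # For comparison, a 99.99% monthly SLA allows only this much downtime:
    print(f"99.99% budget: {scheduled_minutes * 0.0001:.2f} minutes")  # 4.32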
---
TL;DR: An SLA is a guarantee, by the very definition of the word "guarantee," that a service will be delivered to a specific level and that certain agreed-upon penalties will be applied to the service provider if this guarantee is not met.
As you note, unless you negotiate a custom contract, usually the "penalties" are very, very mild. It's effectively the same as there not being a penalty at all. The SLA is just a marketing nice-to-have divorced from the engineering realities
Yup, and given Cloudflare's recent performance, I'd venture that more heavy-handed contracts need to be negotiated with them to drive an improvement in performance, or at the very least a paradigm shift in how they sustain availability to the clients who really care for it.
As an engineer, I get pissed whenever I see 100% uptime, or eleven-nines, nine-nines, or other impossible targets. Like, how am I supposed to design a system with numbers like that?
The thing is, a real SLA will have things like time to detect errors and time to mitigate, time to repair, etc.
100% uptime doesn't necessarily mean nothing failed; it means the failure detection and mitigation worked within the allowed windows. In a typical internet environment, that means allowing connections to die when the server they're connected to dies. It would be possible to hand off TCP connections, but nobody does it.
If you want to get close to those numbers, you need to have a real reason, and then you need to make sure you have a good plan for everything that can go wrong. Power, routers, fiber, load balancers, switches, hosts, etc. And then do your best not to push bad software / bad configuration.
Bare metal on quality hardware with redundant networking goes a long way towards reliability, once the kinks are worked out.
If you only look at the SLOs you are a junior engineer working for someone else making the big decisions. If you are designing a system you want to look at the SLA. Engineers are not just assembly line workers that consume specs and spit out parts.
Nothing wrong with just using SLOs, but if you are a technical lead or senior engineer, you should have the big picture.
You honestly think a missile defense system will work. Backhoes are much more creative than that. You will need defense in depth, roaming patrols, as well as air and satellite based monitoring assets.
There was a point in time where that wasn't true but as people started accepting a lower quality of service it became easier to just pay than do the right thing.
There's a lot of good info here, but there are many more questions raised in my mind based on what I'm reading in the SOC3 than perhaps what you might've expected. I can ideally run through them if I catch you again at DEF CON this year. I'm also willing to sign your standard MNDA to review your SOC 2, but we can take that thread offline.
> We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents.
Good.
> Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.
Wow. This seems like a very immature operational stance.
Any deployment of any kind should be subject to the minimum deployment safety practices they claim to have.
> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.
Many large companies would have had automatic roll-back of this kind of change in less time than it took CloudFlare to (apparently) have humans decide to roll-back, and possibly before a single (usually not global) deployment had actually completed on all hosts/instances.
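For anyone curious what that looks like in practice, here is a minimal sketch of a metric-gated progressive rollout with automatic rollback (the stage names, thresholds, and the fetch_cpu_p99/deploy_to/rollback hooks are all made up for illustration; this is not Cloudflare's actual tooling):

    import time

    # Hypothetical progressive rollout: widen the blast radius stage by stage,
    # and roll back automatically if a health metric regresses during the bake.
    STAGES = ["canary", "single-pop", "single-region", "global"]
    CPU_P99_LIMIT = 0.80      # abort if 99th-percentile CPU exceeds 80%
    BAKE_SECONDS = 300        # soak time before moving to the next stage

    def progressive_deploy(version, deploy_to, fetch_cpu_p99, rollback):
        for stage in STAGES:
            deploy_to(stage, version)
            deadline = time.monotonic() + BAKE_SECONDS
            while time.monotonic() < deadline:
                if fetch_cpu_p99(stage) > CPU_P99_LIMIT:
                    rollback(stage, version)   # automatic, no human in the loop
                    return False
                time.sleep(10)
        return True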
However, what is more concerning is that it seems you shouldn't rely on CloudFlare's "WAF Managed Rulesets" at all, since they seem to be willing to turn them off instead of correctly rolling back a bad deployment, which they only did more than 43 minutes later:
> We then went on to review the offending pull request, roll back the specific rules, test the change to ensure that we were 100% certain that we had the correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.
How were they not able to trivially roll back to the previous deployment?
Which is why the entire model (found mostly in Agile environments) of "deploy to prod as soon as you can" is absolutely nuts.
If you're dev at a hipster app maybe a dozen people use to holler "yo" at each other, by all means go for it. If you're operating one of the biggest and most important chonks of Internet infra... maaaaaybe stick to established practices such as stage testing, release schedules and incremental rollouts?
I don't want to return to the old slothful release schedules of the 2000s, when features and bug fixes were mostly stagnant.
You can have staging and scheduled, QA-signed-off releases that happen every day. I have worked on some fairly large, significant services and we still released several times a day; you just did not trigger the final prod release yourself, the QAs pressed the button instead. Though usually just once a day per microservice.
I have also worked with several clients lately without QA, where devs could themselves push to prod many times a day. I am not sure these systems were that much less stable, though they were all mostly greenfield and not critical public government systems. The changes were of course a lot smaller, and quick to undo, which is the core element of the "release straight to prod" ethos.
I am sure Cloudflare has a significant QA process whilst using today's fast-moving release schedules.
What is always a grey zone is configuration changes. Even if properly versioned and on a release train with several staging environments, configuration is often very environment-sensitive. So maybe they could not test it properly in any staging environment but had to hope prod worked...
However, Cloudflare will hopefully implement some way to make sure this particular configuration, and future changes to it, are not as bottlenecked, so they can instead be rolled out gradually to a subset and region-by-region instead of in a big bang to everyone. Though canary/blue-green/etc. releases of core routing configuration are hard.
Cloudflare does have test colos that use a subset of real network traffic for testing. It's actually the primary testing methodology, and the employees are usually some of the first people forced through test updates.
This release wasn't meant to go out, and the fact it did means it would have bypassed the test environments either way.
Avoiding blame is different than acknowledging responsibility. A post mortem should be very conscious about blame - never target the engineer who deployed the change, for example. Take responsibility for the machine that allowed the unsafe change to be deployed. (Where machine could be tooling or process, as appropriate.)
> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.
So for about 50 minutes, those who relied on the WAF were open to attack?
WAF wouldn’t just prevent DDoS I assume. I’m pretty sure there are WAF rulesets that attempt to block attacks such as XSS or even remote code execution vulnerabilities.
I can confirm that there are WAF rules that block things like basic SQL injection. A client uses Akamai and if it detects certain strings in a request, like "<script>", it'll block the request before it ever gets to the application. The bad part is that some developers get complacent in their development and rely on the WAF to do their security for them.
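For anyone unfamiliar with what such a rule looks like, here is a toy sketch of the idea (nothing like Akamai's or Cloudflare's real rulesets, just the general shape of "pattern-match the request before it reaches the app"):

    import re

    # Toy WAF-style rules: block requests containing obvious attack markers.
    BLOCK_PATTERNS = [
        re.compile(r"<\s*script", re.IGNORECASE),               # naive XSS check
        re.compile(r"('|\")\s*or\s+1\s*=\s*1", re.IGNORECASE),  # naive SQLi check
    ]

    def waf_allows(raw_request: str) -> bool:
        return not any(p.search(raw_request) for p in BLOCK_PATTERNS)

    print(waf_allows("GET /?q=hello"))             # True
    print(waf_allows("GET /?q=<script>alert(1)"))  # False

The complacency problem is exactly this: if the application behind it never sanitizes input itself, turning the WAF off (as in the 'global kill' above) removes the only line of defense.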
Can you give an example of where the cost of possibly-denied could ever be higher than definitely-denied?
First Cloudflare literally denied service, then as a hotfix there was a higher-than-normal potential for denying service, and eventually the normal potential for denying service was restored. I'm trying to comprehend how the second phase could ever be worse than the first phase.
Now, if you're talking about elevating the potential for compromised confidentiality and/or integrity rather than merely availability, I'd agree, but generally [D]DoS refers to availability.
Leaning on a WAF to plug gaping vulnerabilities that can be discovered and exploited during the period of time before the WAF was restored means you have much bigger problems than uptime.
> Leaning on a WAF to plug gaping vulnerabilities that can be discovered and exploited during the period of time before the WAF was restored means you have much bigger problems than uptime.
It's also, roughly speaking, the selling point of products called "WAF". (and yes, relying on them is not great)
Peculiar that eastdakota (Cloudflare's CEO) doesn't seem to be tweeting at the Cloudflare team responsible for this, telling them they should be ashamed and are guilty of malpractice.
When it was Verizon that took down the internet he felt it was appropriate to do that to the Verizon teams, after all.
> Our team should be and is ashamed. And we deserve criticism. ...
I still don't think that publicly shaming anyone is a good leadership style nor is it a good way to motivate people to perform better in the future, but kudos for the self-awareness, at least.
Cloudflare was responsive and reasonable. Verizon was unreachable and deflected responsibility when they finally made a statement. And public shaming does often motivate companies to be more responsive to their customers.
AFAIK Cloudflare isn't in any way a "customer" of Verizon. Verizon doesn't owe Cloudflare any kind of response or devotion of resources. Verizon owes its actual customers a resolution to their problem, which they gave.
I'm not saying Verizon is perfect nor absolved of fault, but Cloudflare was/is not owed any kind of explanation or assistance by VZ, and it's absurd of CF to still be whining about that fact (as they are doing in some other tweets today). If CF wants some kind of SLA with VZ, they should engage them in a business relationship, not try to publicly shame them.
I’d say what they really need is a representative governing body over major network carriers to establish proper standards and levy fines for those that do not comply.
Kind of similar to a homes association saying “hey that trash on your lawn affects your neighbor, clean it up!”
It’s true that they are not a customer but at that level what they do affects each other, and it’s better to resolve things civilly and privately instead of publicly on twitter.
In theory this should be the job of the FCC, and in Europe the local regulatory agencies (BNetzA in Germany, for example). But properly funding them to do their jobs doesn't seem to be very high on the political agendas these days.
Basically: you can try to keep a low TTL on your DNS, but it'll mean more DNS traffic, and 5-10% of traffic takes forever to cut over because nobody respects TTLs. Worst case you have just as much downtime as before; best case most of your traffic is recovered in a few minutes.
It may be useful to note, for whoever is reading this, that a low DNS TTL only ever makes sense for records you can cut over either automatically or on short notice, not for all records. Otherwise, you are now at the mercy of outages at your DNS provider.
Just leaving it out there so one doesn't get the idea that "low TTL == Always Good"
What sort of regular expression pitfalls can cause this sort of CPU utilization? I know they're possible but I am curious about specific examples of something similar to what caused Cloudflare's issue here.
I would strongly recommend deploying such a regular expression matcher to avoid problems like this.
There are examples in the above article that you can use to test anything in your production deployment that accepts regular expressions to see how well it copes.
Part of the problem is that "regular expressions" are not really regular expressions in the Chomsky sense.
Regular languages have some very nice properties relating to how they can be evaluated. Some regular expression engines have features that pull the expressions out of being a regular language and into something more complex.
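As a concrete illustration of how a seemingly harmless pattern can eat a CPU, here is a minimal sketch of catastrophic backtracking using Python's backtracking re engine (the pattern is a textbook example, not the one from Cloudflare's ruleset):

    import re
    import time

    # Nested quantifiers like (a+)+ force a backtracking engine to try
    # exponentially many ways to split the input once the final 'b' fails.
    pattern = re.compile(r"^(a+)+b$")

    for n in (18, 20, 22, 24):
        subject = "a" * n + "c"   # almost matches, but the trailing 'c' never does
        start = time.perf_counter()
        pattern.match(subject)
        print(n, round(time.perf_counter() - start, 3), "seconds")

Each two extra characters roughly quadruples the runtime, which is why ordinary test traffic rarely surfaces it. Linear-time engines (RE2, Rust's regex crate) avoid this class of blowup by construction, at the cost of dropping features like backreferences.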
>"It doesn't cost a provider like Verizon anything to have such limits in place. And there's no good reason, other than sloppiness or laziness, that they wouldn't have such limits in place."[1]
The difference between Verizon and Cloudflare in this case is that Cloudflare generally _does_ fix their mistakes when they screw up (and generally don't make the same type again).. whereas Verizon has screwed up internet routing more times than I'd like to think about. No company is perfect, but I'd say this comparison is pretty apples to oranges.
"You have a problem, and you decide to use a regexp to solve it. Now you have two problems"
Although of course I'm just kidding and I'm sure that a good regexp probably is the right solution for what they're doing in that instance: they have a lot of bright people.
Nothing like having what should be a world-class company falling prey to the same type of screw-ups that plague 'the local guy maintaining some WordPress site on a shared server'.
Separately there is nothing that says that a company like Cloudflare has to air their dirty laundry (as the saying goes). The vast majority of 'customers' really don't care why something happened at all or the reason. All they know is that they don't have service.
Pretend that the local electric company had a power outage (and it wasn't caused by some obvious weather event). Does it really matter if they tell people that 'some hardware we deployed failed and we are making sure it never happens again'? I know tech people think these kinds of post-mortems are great, but the truth is only tech people really care to hear them. (And guess what, all it probably means is that that particular issue won't happen again...)
100% not true. All you have to do is pull a list of the daily additions and deletions and you will see that they have many customers who are not 'tech' people. Further, you are assuming all of their customers who are tech people even read and keep up with blog posts like this.
Good engineers like knowing why things break. I just started a book on the reasons why buildings collapse. It's essentially a series of post-mortems of specific events. I have zero formal architectural or civil engineering experience -- just an inquisitive disposition. For anyone interested, the book is called "Why Buildings Fall Down."
You are reading this for entertainment and perhaps to learn, but my point wasn't that it isn't potentially interesting; it was more about the actual business purpose of doing this.
Once again using the example of lost luggage: I don't really care (other than as an interesting story) why my luggage was lost; I just don't want it to happen, and an airline writing a detailed story doesn't give me any more confidence they won't have a different problem happen again. If anything it may even open the door if something I read seems like it could have been avoided (whereas if they say nothing I might not know that it could have been).
Probably, for the kind of work they are doing, avoid regex? Or at least the very complicated modern regex (simple automata that you can compile in advance might be OK).
If you're trying to do pattern matching, is there actually a widely used alternative to regex? The more I can avoid using regex for mission-critical things, the happier I will be, but I'm really not aware of anything better for this type of application.
I run a service placing bids the last few seconds on eBay. Every time this happens I lose measurable business (we place thousands of bids per day). While it doesn’t affect scheduled bids, they can’t place bids and are likely to move to a competitor. These recent outages have been costly.
From experience, using two services (active/active) is really the only way to avoid downtime. DNS can be trickier, but there are providers that can fallback automatically or split requests (dnsmadeeasy etc)
Haha, people pay you to bid-snipe on eBay for them?
It just baffles me that manual/third-party bid-sniping is still a thing. eBay has had automatic bidding for more than twenty years. You'll pay the same whether you put in the winning bid a week in advance or 5 seconds before the end. But people see that "you lost this auction" notice and they're irrationally convinced that it would have gone differently if they'd bid at the last minute, somehow.
Sniping is the only rational way to bid on eBay. You bid a single time, with the absolute maximum you would go for, and put it in the last 10 seconds. It prevents both you and your opponents from raising the price irrationally. Automatic bidding just encourages prices to go up, based on emotion. Sniping in the last few seconds removes the emotional component.
This absolutely does work. If you get outbid normally on eBay, say at least half an hour or so before the auction ends, then people get irrational and you quickly end up in a bidding war. This way, as a bidder, you set the maximum price you are willing to pay outside of a heated 'damn, I missed out' mentality and people don't have time to respond.
Also this helps when placing bids on auctions that end when you're asleep and you want to effect the above. I've tried to do it manually in the past but naturally life intervenes and you're somewhere with no signal or in the middle of something. Having tried it both ways sniping is definitely better than regular bidding.
If other bidders are irrational, then bid-sniping can work. It doesn't give others the opportunity to contemplate, "I've been out-bid, do I actually want this item more than I originally thought?"
And it's well-known since early eBay days that many bidders are irrational, including but not limited to competitive impulse to "win". Plus you sometimes have shill bidders.
Sniping approximates sealed bids, with the highest-bidder the second-highest sealed bid amount or a small increment above it. (Unfortunately for eBay, that would tend to decrease their cuts, unless the appeal of the sealed bid format brings in sufficiently more bidder activity.)
Another advantage of the software/service is that it automates. If you want to buy a Foo, you can look at the search lists, find a few Foos (possibly of varying value in the details), say how much you'll pay for each one, and let the software attempt to buy each one by its auction end until it's bought one, then it stops. If eBay implemented this itself, it might be too much headache in customer support, but third-parties could provide it to power users.
(I don't buy enough on eBay anymore to bother with anything other than conventional manual bids, but I see the appeal of automation.)
Maybe use cloudflare only for the landing page and docs but serve your bidding app frontend and backend directly? Since users are already there it will only affect first load :)